Is it possible to upgrade Apache Spark to 2.2 or 2.3?

ozzioma commented 6 years ago

Hi,

Great work! I need to use Spark 2.2 or higher in a course I'm prepping for. Is it possible to upgrade Spark for either Jupyter or Zeppelin?

Thanks

jerry1100 commented 6 years ago

Thanks @ozzioma!

I'm not entirely sure. I'll ask my teammates who know more about the images.

jerry1100 commented 6 years ago

You can choose the Python 3.5 with Spark option on the install page which comes with Spark v2.2.1.

screen shot 2018-08-20 at 11 46 30 pm

Let me know if this solved your issue.

ozzioma commented 6 years ago

@jerry1100 thanks for the update! Good to know Spark 2.2.1 is available.

However the installation has been stuck for over 4 hours at 14% "Downloading images". I cancelled and reconnected to a different wifi connection and now it's stuck at 10% for over 5 hours downloading images. Thing is I can't tell if the download process is frozen, stuck or not working etc. I have attached the logs. dsx-desktop.log

Will it be ok to suggest a progress text box or text panel of some sort is added to the installer? Sort of a progress/status indicator logging the current operations and any errors encountered by the installer. The current installer 1.2.4 is so opaque and the logs contain little or no information to help you figure something has gone wrong midway. Staring at a screen for hours on end and guessing what might be keeping the download progress at 10% is not as cool as having a screen with live updates.

Thanks!

jerry1100 commented 6 years ago

It seems like you tried to install all the images (Jupyter, RStudio, and Zeppelin) at once. Unless you have a consistent wifi connection, I recommend installing them one at a time.

Although the installer may seem opaque, we're actually showing all the information we have. Under the covers, we're using a docker pull to download the image (e.g. Jupyter), and the only feedback we get is the current size and the total size, which we use to calculate the percentage.

So if the installer is stuck, it's because the actual docker pull command is stuck.

ozzioma commented 6 years ago

Thanks, download and installation was successful. Just took a while longer than expected with some restarts in between.

I noticed the Spark version with Zeppelin notebooks is 2.0.2, while the version with Jupyter notebooks is 2.2.1. Any reason for that? How can I set the same Spark version across all notebooks?

Next concern is how do I access a local or remote Kafka cluster, or Hadoop cluster? Tried reading from a local HDFS folder in a Spark notebook and got the connection refused exception.

I'm prepping DSX desktop for an interactive hands on course, and I need to demonstrate scenarios requiring Kafka and HDFS.

Thanks.

On Thu, Aug 23, 2018, 1:21 AM Jerry Wu notifications@github.com wrote:

It seems like you tried to install all the images (Jupyter, RStudio, and Zeppelin) at once. Unless you have a consistent wifi connection, I recommend installing them one at a time.

Although the installer may seem opaque, we're actually showing all the information we have. Under the covers, we're using a docker pull to download the image (e.g. Jupyter), and the only feedback we get is the current size and the total size, which we use to calculate the percentage.

So if the installer is stuck, it's because the actual docker pull command is stuck.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/IBMDataScience/DSx-Desktop/issues/78#issuecomment-415230351, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_5xUZreUxZAVbke1cJNtT1urAqdkqkks5uTfV6gaJpZM4WEpeP .

ozzioma commented 6 years ago

Hi, I just restarted my PC and boom, DSX cannot load the docker images it took me a whole day to download! Infact when I start up DSX, I am prompted with the setup screen all over again!

Was it the new Wifi IP address? I had to do a docker-machine regenerate-certs default to enable Kitematic startup docker. How is that supposed to affect already downloaded images? My other docker containers run after the cert regen, so what happened to the DSX docker images? Wow, what could have gone wrong? Thought the docker images were persistent? What's the need to download the images ALL OVER AGAIN???

jerry1100 commented 6 years ago

Okay, let me explain how it works.

Overview

When DSX starts, it creates a docker-machine called ibm-dsx. This is a VM which contains the actual docker daemon and is where the images are being saved. So while the images are persistent, they only persist as long as the VM exists.

The problem

When you restart your computer, the docker-machine shuts down. Then when you open DSX Desktop again, DSX tries to start the docker-machine and if it fails after multiple attempts, it will remove it and create it again. When this happens, all your images are lost.

Not sure why your docker-machine failed to start up. I don't think the new ip address would affect it, nor the docker-machine regenerate-certs default command (which actually affects the default machine). If you attach your logs, I could take a quick look.

Potential workaround

As a potential workaround (meaning may/may not work), whenever you restart your computer, you first manually start the docker machine by opening VirtualBox and double-clicking ibm-dsx, then after the machine has started, open DSX Desktop.

Also regarding your question in the previous post: Zeppelin and Jupyter have different spark versions because they are different images. If you're inclined, you can modify the Zeppelin image to use the new spark version but I'm not familiar with the Zeppelin image so I can't tell you what to change.

Not sure about the Kafka or Hadoop cluster, sorry.

ozzioma commented 6 years ago

Thanks for the detailed response! I actually understand how Docker works, I found DSX Desktop more attractive over a minikube cluster I was setting up for an interactive Spark course. Had a couple of apps to setup and it would have been overwhelming for the students to follow the many steps involved in setting up Hadoop, Hive,Kafka, Spark, Toree, Jupyter etc. DSX came in handy for the interactive environment setup.

While I'm not an expert yet on Docker, I was baffled to discover DSX had no built in recovery for very common scenarios like the one I described. Why delete the whole 8GB of images and download them again? If I understand how these things work, there should something like a state/status records showing the last consistent/stable state of the docker images last time they were started. Was expecting a check on the last stable state or a run through of inbuilt expected Docker issues before deleting the entire images.

Would it be OK to suggest DSX implements recovery mechanisms for common scenarios that come up when using Docker in desktop environments? Like it's almost guaranteed most users will encounter scenarios like

Wifi connection changes, connecting to a different Wifi etc
Need to access externally hosted services like HDFS, Kafka, S3
Ability to modify the images or specific app versions (e.g Spark) in place
Mount local folders in DSX as opposed to loading files to the "asset" folder
among others...

DSX desktop really accelerates the adoption of big data processing applications for cloud vendors looking to lure (pun intended) new and old devs into buying their cloud products. The least one could expect is a well thought out entry experience with minimal surprises getting it working. DSX desktop and local stands out from other offerings, I just think it needs more fine tuning. The entire big data processing narrative boils down to processing a bunch of files somewhere or reading them off a stream. It's not a new concept. I don't think it should be that hard to setup an environment where one can read a JSON or CSV file, run the loaded data through an SQL engine or some other computation...and print out results in a chart or grid.

My gripe was about the slow bandwidth where I'm working from....took me a whole night to download the last one.

The logs are attached. dsx-desktop.log

jerry1100 commented 5 years ago

Hey @ozzioma, sorry for the late reply.

Unfortunately, this project has been discontinued and is succeeded by Watson Studio Dekstop. You can sign up for a free trial, but right now it only includes SPSS Modeler and no Notebooks or RStudio.

That said, I'm going to close this issue since there won't be any updates to DSX Desktop.

On a side note, I'm working on a troubleshooting guide for users that still use DSX Desktop, so be on the lookout for that. Right off the bat though, Windows 7 and Windows 10 Home (basically anything that doesn't support hyper-v) makes it difficult to work with Docker. So if it's possible to switch to Windows 10 Pro or Ubuntu, that would be better.

Edit: Windows 10 Education also supports hyper-v. You mentioned that you're preparing for a course, so see if you can sign up for Windows 10 Education.

IBMDataScience / DSx-Desktop