andyburgin / hadoopi

This project contains the configuration files and Chef code to configure a cluster of five Raspberry Pi 3s as a working Hadoop cluster running Hue.
http://data.andyburgin.co.uk

Hue Notebook cannot restore or start new Scala/PySpark sessions. #1

Closed mariustudor closed 7 years ago

mariustudor commented 7 years ago

On an RPi cluster with 2 masters and 3 workers, after running the startup scripts, all of the pages/features of Hue work except the notebook, which cannot restore or start Scala and PySpark sessions (in both the new-notebook and Sample Notebook scenarios). spark-shell and pyspark can be started from the command prompt and both work, but that does not fix the issue.
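(For clarity, "work from the command prompt" here means launching the shells directly against YARN with heaps small enough for the Pi's RAM. The flags and sizes below are illustrative assumptions, not taken from the issue.)

# Illustrative only: launch the shells against YARN with small driver/executor heaps.
pyspark --master yarn --driver-memory 128m --executor-memory 128m
spark-shell --master yarn --driver-memory 128m --executor-memory 128m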

Screenshots: error log

Logs: hue-logs-1492758362.44.zip livy-hduser-server.zip

andyburgin commented 7 years ago

Hmmm, let's try emulating what Hue does. On master02, restart Livy and run:

curl -X POST --data '{"kind": "pyspark", "executorMemory": "128M", "driverMemory": "128M"}' -H "Content-Type: application/json" master02:8998/sessions

That should return:

{"id":0,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","log":[]}

We now have session "0". Now repeatedly run the following until the state changes to "idle":

curl localhost:8998/sessions/0

as in

{"id":0,"owner":null,"proxyUser":null,"state":"idle","kind":"pyspark","log":[]}

The Spark session is now ready to run our code, so issue:

curl master02:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'

That will return {"id":0,"state":"running","output":null}

Now let's wait for the result by running

curl master02:8998/sessions/0/statements/0

until you see

{"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"2"}}}

And to be tidy, let's delete the session:

curl master02:8998/sessions/0 -X DELETE

Let me know how that gets on, and can you also post /opt/livy/logs/log.out?
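For convenience, here is the same sequence as a small shell sketch that creates a session, polls until it settles, runs the test statement, and cleans up. It only wraps the curl calls above (it assumes Livy on master02:8998 and the 128M sizes from the first request):

# Sketch only: wraps the manual curl steps above into one pass.
SESSION=$(curl -s -X POST -H "Content-Type: application/json" \
  --data '{"kind": "pyspark", "executorMemory": "128M", "driverMemory": "128M"}' \
  master02:8998/sessions | grep -o '"id":[0-9]*' | head -1 | cut -d: -f2)

# Poll until the session leaves "starting" (it should become "idle", or "dead" on failure).
until curl -s master02:8998/sessions/$SESSION | grep -qE '"state":"(idle|dead)"'; do sleep 5; done

# Run the test statement, give it a moment, fetch the result, then delete the session.
curl -s -X POST -H "Content-Type: application/json" -d '{"code":"1 + 1"}' master02:8998/sessions/$SESSION/statements
sleep 30
curl -s master02:8998/sessions/$SESSION/statements/0
curl -s -X DELETE master02:8998/sessions/$SESSION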

mariustudor commented 7 years ago

Andy,

Thank you for the detailed troubleshooting procedure.

Executed the following on master02:

curl -X POST --data '{"kind": "pyspark", "executorMemory": "128M", "driverMemory": "128M"}' -H "Content-Type: application/json" master02:8998/sessions

It returned

{"id":7,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","log":[]}

Ran repeatedly

curl localhost:8998/sessions/7

It eventually ended up in a dead state:

{"id":7,"owner":null,"proxyUser":null,"state":"dead","kind":"pyspark","log":[]}

The application diagnostics at http://master01:8088/cluster/app/application_[x] indicated exit code 15:

Application application_1493736142133_0007.pdf

The exit code 15 can be seen in the master log http://master01:8088/logs/yarn-hduser-resourcemanager-master01.log as well:

yarn-hduser-resourcemanager-master01.log.txt

One of the job history logs showed that the connection was refused:

17/05/02 11:05:22 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused: master02/10.0.0.12:34393 java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused: master02/10.0.0.12:34393

master01_19888_jobhistory_logs_worker03_37905_container_1493736142133_0008_01_000001_container_1493736142133_0008_01_000001_hduser_stderr__start=0.pdf
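(Side note: with YARN log aggregation enabled, the same container output can also be pulled from the command line with the standard YARN CLI; the application ID below is the one from the diagnostics page above.)

# Fetch the aggregated container logs for the failed application (run as hduser on a master node).
yarn logs -applicationId application_1493736142133_0007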

Continued posting sessions with curl until Livy session #9 was successfully accepted. The rest of the steps went fine.

console_output.txt

However, when I went back to the Sample Notebook or tried to create a new one, I still had the problem of the Spark session timing out.

The file /opt/livy/logs/log.out:

log.out.txt

One interesting thing is that the required executor memory is above the maximum threshold in some of the failed sessions, while session 9 does not have that error. Sessions 10 and 11 correspond to actions done in the notebook.
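(For reference, the "required executor memory is above the max threshold" message refers to YARN's per-container ceiling, which can be compared against the requested sizes. The /opt/hadoop path below is an assumption about where this image installs Hadoop; adjust if yours differs.)

# Show YARN's per-container memory ceiling on the resource manager node (master01).
grep -A 1 "yarn.scheduler.maximum-allocation-mb" /opt/hadoop/etc/hadoop/yarn-site.xml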

andyburgin commented 7 years ago

From the log file it looks like Livy was already running, so let's stop and start the services on master02 by running the master02-stop.sh and then master02-startup.sh scripts. Also make sure all running jobs in YARN have finished on the cluster.
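Roughly like this, as a sketch (the script locations are whatever your install uses; yarn application -list is the standard way to see what is still running):

# On master02: stop and restart the node's services (script paths assumed).
./master02-stop.sh
./master02-startup.sh

# On either master: confirm nothing is still running in YARN before retrying the notebook.
yarn application -list -appStates RUNNING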

It looks like the driver and executor memory options aren't making it to Livy, so can you send me /opt/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py from master01 so I can check that the memory overrides were added?

The other thing to try is installing tcpdump on master02:

sudo apt-get install tcpdump

Then start a Spark session from the notebook context menu in Hue and look for the request to Livy arriving at master02:

sudo tcpdump -c 20 -s 0 -i wlan0 -A host master01 and tcp port 8998

Look for something like:

POST /sessions HTTP/1.1
Host: master02:8998
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Content-Type: application/json
Content-Length: 206

{"files": [], "pyFiles": [], "kind": "pyspark", "proxyUser": "hduser", "driverMemory": "256m", "queue": "default", "archives": [], "executorCores": 1, "driverCores": 1, "jars": [], "executorMemory": "256m"}

and post the result.
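(To double-check that Livy accepts that exact payload independently of Hue, the same request can be replayed by hand with curl. This just mirrors the JSON above; if Livy's impersonation settings reject the proxyUser field, drop it for the manual test.)

curl -X POST -H "Content-Type: application/json" \
  --data '{"files": [], "pyFiles": [], "kind": "pyspark", "proxyUser": "hduser", "driverMemory": "256m", "queue": "default", "archives": [], "executorCores": 1, "driverCores": 1, "jars": [], "executorMemory": "256m"}' \
  master02:8998/sessions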

mariustudor commented 7 years ago

It turned out that my Hue build did not include the correct values (256m) for the Spark driver and executor memory in the file /opt/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py.

For testing purposes, I updated the file and restarted Hadoop using the appropriate scripts on both master01 and master02. I'm able to start Scala and R sessions in the notebook now, although not all the sessions I tried were successful, due to the memory limitations of Hadoop on ARM. You mentioned that the best approach is to rebuild Hue, as the properties can't be easily overridden post-installation.
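(For anyone retracing this: a quick way to confirm the deployed connector carries the overrides is to grep the file mentioned above. The exact setting names inside spark_shell.py vary between Hue versions, so this is only an illustrative check.)

# Look for the driver/executor memory values in the deployed Hue notebook connector on master01.
grep -n -i "memory" /opt/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py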

Andy, thanks so much for helping with this! I will close the issue, unless you have extra comments.

andyburgin commented 7 years ago

Awesome, have fun and I'll close the issue