Hmm, let's try emulating what Hue does. On master02, restart Livy and run:
curl -X POST --data '{"kind": "pyspark", "executorMemory": "128M", "driverMemory": "128M"}' -H "Content-Type: application/json" master02:8998/sessions
That should return:
{"id":0,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","log":[]}
We now have session "0". Now repeatedly run the following until the state changes to "idle":
curl localhost:8998/sessions/0
as in
{"id":0,"owner":null,"proxyUser":null,"state":"idle","kind":"pyspark","log":[]}
The Spark session is now ready to run our code, so issue:
curl master02:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'
That will return:
{"id":0,"state":"running","output":null}
Now let's wait for the result by running
curl master02:8998/sessions/0/statements/0
until you see
{"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"2"}}}
And to be tidy, let's delete the session:
curl master02:8998/sessions/0 -X DELETE
Let me know how that goes, and can you also post /opt/livy/logs/log.out?
Andy,
Thank you for the detailed troubleshooting procedure.
Executed the following on master02:
curl -X POST --data '{"kind": "pyspark", "executorMemory": "128M", "driverMemory": "128M"}' -H "Content-Type: application/json" master02:8998/sessions
It returned
{"id":7,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","log":[]}
I ran the following repeatedly:
curl localhost:8998/sessions/7
It eventually ended up in a dead state:
{"id":7,"owner":null,"proxyUser":null,"state":"dead","kind":"pyspark","log":[]}
The application diagnostics at http://master01:8088/cluster/app/application_[x] indicated exit code 15:
Application application_1493736142133_0007.pdf
Exit code 15 can also be seen in the ResourceManager log at http://master01:8088/logs/yarn-hduser-resourcemanager-master01.log:
yarn-hduser-resourcemanager-master01.log.txt
One of the job history logs showed that the connection was refused:
17/05/02 11:05:22 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused: master02/10.0.0.12:34393 java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused: master02/10.0.0.12:34393
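For fuller diagnostics on that failed application, the aggregated container logs can also be pulled from the command line, assuming YARN log aggregation is enabled on the cluster:
yarn logs -applicationId application_1493736142133_0007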
I continued posting sessions with curl until Livy session #9 was successfully accepted. The rest of the steps went fine.
Now, when I go back to the Sample Notebook or try to create a new one, I still have the problem of the Spark session timing out.
The file /opt/livy/logs/log.out:
One interesting thing: the required executor memory is above the max threshold in some of the failed sessions, while session 9 does not have that error. Sessions 10 and 11 correspond to actions taken in the notebook.
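For reference, the "required executor memory is above the max threshold" message is Spark checking the request against YARN's yarn.scheduler.maximum-allocation-mb (and/or yarn.nodemanager.resource.memory-mb). A quick way to see the configured ceilings, assuming the standard config layout under $HADOOP_HOME:
grep -E -A 1 "maximum-allocation-mb|resource.memory-mb" $HADOOP_HOME/etc/hadoop/yarn-site.xml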
From the log file it looks like Livy was already running, so let's stop and start the services on master02 by running the master02-stop.sh and then master02-startup.sh scripts. Also make sure all running jobs in YARN have finished on the cluster.
It looks like the driver and executor memory options aren't making it to Livy, so can you send me /opt/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py from master01 so I can check that the memory overrides were added?
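A quick way to check on master01 whether the overrides are present (just a grep; nothing Hue-specific is assumed beyond the path above):
grep -n -i "memory" /opt/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py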
The other thing to try is to install tcpdump on master02:
sudo apt-get install tcpdump
Then start a Spark session from the notebook context menu in Hue and look for the request to Livy arriving at master02:
sudo tcpdump -c 20 -s 0 -i wlan0 -A host master01 and tcp port 8998
Look for something like:
POST /sessions HTTP/1.1
Host: master02:8998
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Content-Type: application/json
Content-Length: 206

{"files": [], "pyFiles": [], "kind": "pyspark", "proxyUser": "hduser", "driverMemory": "256m", "queue": "default", "archives": [], "executorCores": 1, "driverCores": 1, "jars": [], "executorMemory": "256m"}
and post the result.
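If master02 isn't actually on wlan0 (for example if the nodes are cabled), one variant worth trying is to capture on all interfaces instead:
sudo tcpdump -c 20 -s 0 -i any -A host master01 and tcp port 8998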
It turned out that my Hue build did not include the correct values (256mb) for the Spark driver and executor memory in the file
/opt/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py
For testing purposes, I updated the file and restarted Hadoop using the appropriate scripts on both master01 and master02. I'm able to start Scala and R sessions in the notebook now, although not all the sessions I tried were successful, due to the Hadoop limitations with the ARM memory model. You mentioned that the best approach is to rebuild Hue, as the properties can't be easily overridden post-installation.
Andy, thanks so much for helping with this! I will close the issue unless you have extra comments.
Awesome, have fun and I'll close the issue
On a Raspberry Pi cluster with 2 masters and 3 workers, after running the startup scripts, all the pages/features of Hue are working except the notebook, which cannot restore the Scala and PySpark sessions (new notebook or Sample Notebook scenarios). spark-shell and pyspark can be started from the command prompt and both work, but this does not fix the issue.
Screenshots:
Logs: hue-logs-1492758362.44.zip livy-hduser-server.zip