MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

Submission of Batch Jobs with Python Code #9373

Closed hsrasheed closed 6 years ago

hsrasheed commented 6 years ago

Is it possible to submit a Livy Spark batch job that references a Python file instead of a jar file? I have tried something like the following, but the job fails:

curl -k --user "user:pwd" -v -H "Content-Type: application/json" -X POST --data '{ "file": "wasb://test@mystorage.blob.core.windows.net/livybatchtest.py" }' https://mycluster.azurehdinsight.net/livy/batches | python -m json.tool
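For what it's worth, the underlying failure reason is usually easier to see through the Livy batch endpoints than from the POST response alone. A rough, untested sketch (credentials and cluster name are the same placeholders as above; the batch id 0 stands in for whatever "id" the POST response returns):

```
# Check the state of the submitted batch (0 is a placeholder for the id
# returned by the POST above).
curl -k --user "user:pwd" \
  https://mycluster.azurehdinsight.net/livy/batches/0/state | python -m json.tool

# Pull the driver log for the batch; this typically shows the actual exception.
curl -k --user "user:pwd" \
  https://mycluster.azurehdinsight.net/livy/batches/0/log | python -m json.tool
```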



jason-j-MSFT commented 6 years ago

@hr00 Thanks for the feedback! We are currently investigating and will update you shortly.

jason-j-MSFT commented 6 years ago

@hr00 What message does the job fail with?

jason-j-MSFT commented 6 years ago

@hr00 I see references to the Livy batches API and python files here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/livy-api-batch.html

@nitinme could you consider this a request to add an example of how to make a Livy API call to execute a python file?

hsrasheed commented 6 years ago

@jason-j-MSFT We have used that API to build the requests and have done so successfully for session requests and statements. But when running batch jobs, we see exceptions like the following in the application log:

Application application_1526496305873_0051 failed 5 times due to AM Container for appattempt_1526496305873_0051_000005 exited with exitCode: -1000
For more detailed output, check the application tracking page: http://[URL].cx.internal.cloudapp.net:8088/cluster/app/application_1526496305873_0051 Then click on links to logs of each attempt.
Diagnostics: File/Folder does not exist: /clusters/[clustername]/user/livy/.sparkStaging/application_1526496305873_0051/pyspark.zip
[0460e181-84d4-45a7-903e-31258cf7946d][2018-05-31T09:06:24.2370618-07:00] [ServerRequestId:0460e181-84d4-45a7-903e-31258cf7946d] java.io.FileNotFoundException: File/Folder does not exist: /clusters/[clustername]/user/livy/.sparkStaging/application_1526496305873_0051/pyspark.zip
[0460e181-84d4-45a7-903e-31258cf7946d][2018-05-31T09:06:24.2370618-07:00] [ServerRequestId:0460e181-84d4-45a7-903e-31258cf7946d] at sun.reflect.GeneratedConstructorAccessor81.newInstance(Unknown Source)

cassioiks commented 6 years ago

@hr00 you might want to check it here: https://issues.apache.org/jira/browse/SPARK-10795

JasonWHowell commented 6 years ago

@hr00 Please let us know if you need further assistance, or if the tips provided led to a solution. Thanks! Jason

JasonWHowell commented 6 years ago

please-close

We haven't heard back lately, and we're not sure whether this issue is still active. Let us know if we need to continue the discussion.

Azure support could also assist if you are stuck.

Summary: The doc recommends using the POST /batches endpoint and the pyFiles element in the request body:

pyFiles | Python files to be used in this session | list of strings
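A request along those lines would look roughly like the following untested sketch (cluster name, credentials, storage paths, and the dependency file name are all placeholders):

```
# helpers.py is a hypothetical dependency module; list whatever .py/.zip/.egg
# files the main script imports.
curl -k --user "user:pwd" -H "Content-Type: application/json" -X POST \
  --data '{
    "file": "wasb://test@mystorage.blob.core.windows.net/livybatchtest.py",
    "pyFiles": ["wasb://test@mystorage.blob.core.windows.net/helpers.py"]
  }' \
  https://mycluster.azurehdinsight.net/livy/batches | python -m json.tool
```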

Looks like there are a number of quirks with pyspark.zip, per the JIRA noted above.

Not sure whether this scenario is hitting a bug, a misconfiguration on the cluster, or a problem in the body of the REST call.