Azure / aztk

AZTK powered by Azure Batch: On-demand, Dockerized, Spark Jobs on Azure

Support Azure Blobs as application resources in Spark #663

Open shtratos opened 6 years ago

shtratos commented 6 years ago

Hello @jafreck @timotheeguerin

Right now, when aztk.spark.client.Client.submit() is called in the AZTK Spark SDK, it assumes that the ApplicationConfiguration contains paths to local files in its jars and files fields.

In our case, the Spark job resources are already uploaded to Azure Blob Storage, so we want to avoid downloading them and uploading them again.

From what I see, aztk.spark.client.Client.submit() calls generate_task, which uploads the files to blob storage, generates ResourceFiles for them, replaces the local paths in the application config with file names, and uploads that config to blob storage as an application.yml file.
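To make the steps above concrete, here is a minimal sketch of that flow. It is not the AZTK implementation: ResourceFile, AppConfig, upload, and generate_task_sketch are hypothetical stand-ins, blob storage is replaced by a plain dict, and the blob:// URLs are made up.

```python
import os
from dataclasses import dataclass, field


@dataclass
class ResourceFile:
    # Hypothetical stand-in: where the node fetches the file from,
    # and the name the file gets on the node.
    blob_source: str
    file_path: str


@dataclass
class AppConfig:
    # Hypothetical, trimmed-down stand-in for ApplicationConfiguration.
    name: str
    jars: list = field(default_factory=list)
    files: list = field(default_factory=list)


def upload(local_path, storage):
    """Pretend upload: record the file in a dict and return a ResourceFile."""
    name = os.path.basename(local_path)
    storage[name] = local_path
    return ResourceFile(blob_source="blob://container/" + name, file_path=name)


def generate_task_sketch(app, storage):
    """Mirror the four steps described in the issue, in order."""
    resource_files = []
    for attr in ("jars", "files"):
        names = []
        for path in getattr(app, attr):
            rf = upload(path, storage)       # 1. upload the local file
            resource_files.append(rf)        # 2. keep a resource file for it
            names.append(rf.file_path)
        setattr(app, attr, names)            # 3. local paths become file names
    # 4. store the rewritten config under the name application.yml
    storage["application.yml"] = app
    resource_files.append(
        ResourceFile("blob://container/application.yml", "application.yml")
    )
    return resource_files
```

In this sketch every entry in jars and files is unconditionally treated as a local path, which is the assumption the issue is about.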

I would like to have an option to provide resource_files directly to Client.submit() and thus skip uploading files.

Right now we use a workaround where we basically reimplement generate_task and generate the resource_files for our blobs ourselves. This seems brittle, as it is coupled to the AZTK SDK implementation and can break when AZTK changes in the future.
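As a rough illustration of the requested option, the sketch below accepts resource files that already exist in storage and passes them through untouched, while still uploading any local paths. The function name collect_resources, its parameters, and the dict shape used for a resource file are assumptions made for this example, not part of the AZTK SDK.

```python
def collect_resources(local_paths, upload, resource_files=None):
    """Build the list of resource files for a job (hypothetical helper).

    local_paths:    local files that still need to be uploaded
    upload:         callable that uploads one path and returns a resource file
    resource_files: resource files that already point at blobs in storage;
                    these are used as-is and never passed to upload
    """
    uploaded = [upload(path) for path in local_paths]
    existing = list(resource_files or [])
    return uploaded + existing
```

With an empty local_paths list, nothing is uploaded at all, which covers the case where every resource is already in Blob Storage.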

jafreck commented 6 years ago

I think this is a great feature. We should support both scenarios - local upload and referencing existing files in storage. Thanks for the feature request!