exacaster / lighter

REST API for Apache Spark on K8S or YARN
MIT License

batch feature #942

Closed cometta closed 7 months ago

cometta commented 7 months ago

On the Lighter UI, I can see a tab called "Batch". Can you share any documentation on how to use this feature? Is the job run through the Jupyter notebook interface or through terminal PySpark code?

pdambrauskas commented 7 months ago

You can pack your code into a Docker image (in the case of Kubernetes) or a zip (in the case of YARN) and execute it by issuing an HTTP call to the Lighter endpoint: https://github.com/exacaster/lighter/blob/master/docs/rest.md#batch

Jupyter does not participate in this in any way. You can start batch applications from any environment or orchestration tool that allows making HTTP requests, for example Apache Airflow.
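As a rough illustration, a submission might look like the sketch below (the host, image name, and script path are placeholders; see the linked rest.md for the authoritative field list):

```python
# Minimal sketch of submitting a batch to Lighter over HTTP.
# The host, image name, and script path below are placeholders.
import requests

LIGHTER_URL = "http://lighter.example.com/lighter/api/batches"

payload = {
    "name": "my-batch-app",              # display name for the batch
    "file": "local:///opt/app/main.py",  # entry point baked into the image
    "args": [],                          # optional application arguments
    "conf": {
        "spark.kubernetes.container.image": "registry.example.com/my-spark:latest",
    },
}

response = requests.post(LIGHTER_URL, json=payload)
response.raise_for_status()
print(response.json())  # the created batch, including its id
```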

cometta commented 7 months ago

I see the REST path /lighter/api/batches for Docker image + k8s. Is it enough to just define the image path in the `spark.*container*image` key? Are pyFiles and files optional, since I'm not using YARN?

Minutis commented 7 months ago

It all depends on the configuration you started Lighter with, but basically it should be enough to specify spark.kubernetes.container.image when submitting to k8s.
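For k8s that boils down to something like this payload fragment (a sketch; the image name and script path are placeholders):

```python
# Sketch: the k8s-relevant part of the batch payload.
payload = {
    "file": "local:///opt/app/main.py",  # script shipped inside the image
    "conf": {
        # The image containing Spark plus your application code:
        "spark.kubernetes.container.image": "registry.example.com/my-spark:latest",
    },
}
```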

cometta commented 5 months ago

I put my custom Python files in my new Spark Docker image. When I submit a job using batch, I get the error below:

java.lang.RuntimeException: Exception in thread "main" org.apache.spark.SparkException: Please specify spark.kubernetes.file.upload.path property.
... at java.base/java.lang.Thread.run(Unknown Source)

But I don't need to upload any files; all Python files are inside the spark.kubernetes.container.image. Can you advise?

Minutis commented 5 months ago

Please provide more information: the payload of the request to Lighter, or just all parameters that are passed to Lighter.

cometta commented 5 months ago

Issue solved with "file": "local://path/my.file.py".
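For anyone hitting the same error, the working shape is roughly the sketch below (paths and image name are placeholders). A local:// URI tells Spark the script is already inside the image, so nothing has to be uploaded and spark.kubernetes.file.upload.path is not needed:

```python
# Sketch: "file" with a local:// URI resolves inside the container image,
# so no upload (and no spark.kubernetes.file.upload.path) is required.
payload = {
    "file": "local:///opt/app/my_file.py",  # placeholder path inside the image
    "conf": {
        "spark.kubernetes.container.image": "registry.example.com/my-spark:latest",
    },
}
```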

cometta commented 5 months ago

Separately, I wanted to enquire whether there's a way for me to upload my Python file to an s3a bucket and submit a Spark batch that automatically pulls and runs the Python file, avoiding the need to rebuild the Docker image?

Minutis commented 5 months ago

It's possible, but not recommended. Anything you specify in archives will be downloaded to the nodes. If you specify the s3a:// prefix, it should work.
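If you do go that route, the submission might look roughly like this (a sketch; the bucket and key are placeholders, and it assumes the Spark image ships the hadoop-aws/S3A jars with credentials for the bucket configured on the cluster):

```python
# Sketch: running a script fetched from object storage instead of baking it
# into the image. Assumes the Spark image carries the S3A (hadoop-aws) jars
# and that credentials for the bucket are configured on the cluster.
payload = {
    "file": "s3a://my-bucket/jobs/my_file.py",  # placeholder bucket/key
    "conf": {
        "spark.kubernetes.container.image": "registry.example.com/my-spark:latest",
    },
}
```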