Currently `GpsBatchJobs._start_job()` (non-kube code path) works like:
- build a positional argument list with various batch job related settings, resources, ... and use that to call the bash script `submit_batch_job_spark3.sh`
- in `submit_batch_job_spark3.sh`: parse the arguments again (with some defaults here and there), set some env vars, do some additional calls (e.g. ipa requests) and call `spark-submit`, again with another long positional argument list that has to correspond with `batch_job.py`
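To make the fragility concrete, here is a minimal sketch of the current chain (all names, values and arguments are illustrative assumptions, not the actual code):

```python
import subprocess

# Sketch of GpsBatchJobs._start_job() (hypothetical values):
# everything is passed by position, so the order here must match
# whatever submit_batch_job_spark3.sh expects as $1, $2, $3, ...
args = [
    "job-abc123",          # job id
    "/data/jobs/abc123",   # job dir
    "user@example.com",    # user id
    "4G",                  # driver memory
    "8G",                  # executor memory
    # ... many more positional values; inserting one in the middle
    # silently shifts the meaning of everything after it
]
subprocess.run(["./submit_batch_job_spark3.sh", *args], check=True)
```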
Some problems of this approach:
- the positional argument lists are annoying and error-prone to maintain; it's easy to run into invisible off-by-one issues
- extra logic in bash is limited to basic stuff, but is sometimes pretty cryptic and error-prone if you're not used to it. In some cases (e.g. JSON parsing) we even go back to a Python subprocess again
- proper and maintainable error handling/reporting in bash is near impossible. If something goes wrong in bash now (e.g. an ipa request fails), it will be very obscure from the logs. Implementing proper retry logic around failing ipa requests is not something I'd like to do in bash (see the sketch after this list). Same with caching
- the bash script is full of hardcoded VITO-specific references and resources
- how do you properly test this brittle logic?
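For example, retry logic around a failing ipa request is a couple of readable lines in Python, while the bash equivalent quickly gets unmaintainable. A minimal sketch, assuming a hypothetical `do_ipa_request()` helper:

```python
import logging
import time

logger = logging.getLogger(__name__)

def with_retries(fn, attempts=3, delay=5):
    """Call fn(), retrying with proper logging on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            logger.exception("attempt %d of %d failed", attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay)

# usage (do_ipa_request is hypothetical):
# with_retries(lambda: do_ipa_request(user_id))
```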
I wonder if we can't eliminate `submit_batch_job_spark3.sh` and just go directly from `GpsBatchJobs._start_job()` to a `spark-submit` subprocess, doing all the argument massaging in Python. That would make things more maintainable and even testable.
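A rough sketch of what that could look like (the `spark-submit` options `--master`, `--deploy-mode`, `--driver-memory` and `--conf` are real, but the concrete values and the named application arguments for `batch_job.py` are assumptions for illustration):

```python
import subprocess

def start_job(job_id: str, job_dir: str, driver_memory: str = "4G"):
    # Build the full spark-submit command in Python: no intermediate bash,
    # and explicit option names instead of a brittle positional list.
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--driver-memory", driver_memory,
        "--conf", f"spark.yarn.tags=openeo-{job_id}",  # illustrative
        "batch_job.py",
        # application args: batch_job.py would parse these by name
        # (e.g. with argparse) instead of by position
        "--job-id", job_id,
        "--job-dir", job_dir,
    ]
    subprocess.run(cmd, check=True)
```

This would also make the logic unit-testable: a test can assert on the generated `cmd` list without actually launching Spark.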