aws-deadline / deadline-cloud

Multi-purpose library and command line tool that implements functionality to support applications using AWS Deadline Cloud.
Apache License 2.0

Bug: Deadline Config File Issue on Bundle Submission #386

Open pta200 opened 6 days ago

pta200 commented 6 days ago

Expected Behaviour

Execute Deadline Cloud's "bundle submit" in parallel to speed up the job submission process, without generating any CLI errors.

Current Behaviour

When submitting twenty job bundles in parallel batches of five, the deadline CLI starts throwing errors. It appears that after each job submission the deadline CLI writes the job ID to the .deadline/config file. When submitting jobs in parallel there is likely contention on that file, resulting in a corrupted config where the values for farm_id, queue_id, and storage_profile_id are all missing, so the next job submission fails. A workaround is to pass the submit command parameters, e.g. "--farm-id", etc., but storage_profile_id is not available as a parameter, so any job that needs to upload a file can't be automated, as it triggers a prompt.
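One possible way to sidestep the contention is to give each submission its own private copy of the config file. This is a hedged sketch: it assumes the deadline CLI honors a DEADLINE_CONFIG_FILE_PATH environment variable to override the config file location (verify this against your deadline-cloud version before relying on it), and submit_with_private_config is a hypothetical helper name:

```python
# Workaround sketch: each parallel submission writes to its own copy of the
# config, so concurrent job-id writes never touch the shared ~/.deadline/config.
# ASSUMPTION: the deadline CLI reads DEADLINE_CONFIG_FILE_PATH to locate its
# config file; confirm this in your installed deadline-cloud version.
import os
import shutil
import subprocess
import tempfile


def submit_with_private_config(bundle_dir: str) -> None:
    src = os.path.expanduser("~/.deadline/config")
    # Unique temp copy of the shared config for this submission only.
    fd, private_config = tempfile.mkstemp(suffix=".config")
    os.close(fd)
    shutil.copyfile(src, private_config)
    try:
        env = dict(os.environ, DEADLINE_CONFIG_FILE_PATH=private_config)
        subprocess.run(
            ["deadline", "bundle", "submit", "--yes", bundle_dir],
            env=env,
            check=True,
        )
    finally:
        os.remove(private_config)
```

With this shape, the shared config is read-only from the workers' point of view, so the corruption described above cannot occur regardless of batch size.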

Reproduction Steps

Ensure the .deadline/config is correctly set up with a profile, farm_id, queue_id, and storage_profile_id. Then use openjd.model to generate a job bundle and a ProcessPoolExecutor to submit those jobs in parallel in batches of 5, calling the deadline CLI from a Python subprocess, e.g. "deadline bundle submit --yes -p InFile=/tmp/test_script.py /tmp/tmpy2f5jdu8". Here the CLI uses the config to know which farm/queue to submit the bundle to.

Sample .deadline/config file:

[telemetry]
identifier = 6b09b2cf-d296-4355-a125-d73a4233067c

[deadline-cloud-monitor]
path = /opt/DeadlineCloudMonitor/deadline-cloud-monitor_1.1.2_amd64.AppImage

[defaults]
aws_profile_name = test-us-east-1

[profile-test-us-east-1 defaults]
farm_id = farm-XXXXX

[profile-test-us-east-1 farm-XXX defaults]
queue_id = queue-XXXX

[profile-test-us-east-1 farm-XXXXX settings]
storage_profile_id = sp-XXXXXXX

[profile-test-us-east-1 farm-XXXX queue-XXXXXX defaults]
job_id = job-d9093dc0ece34453a69e73246c9d8e43

Eventually you'll get some version of a CalledProcessError when the deadline CLI fails to submit a job, e.g.:

subprocess.CalledProcessError: Command 'deadline bundle submit --yes -p InFile=/tmp/test_script.py /tmp/tmpy2f5jdu8' returned non-zero exit status 1.

When you look at the config file, it now reads as follows, with all the other settings missing and only the job_id of the last successfully submitted job remaining. As a result, no further job submissions work unless you pass the options on the CLI or fix the config file.

[profile-(default)   defaults]
job_id = job-3d75277939134f4e82fff8669398196d

Code Snippet

from concurrent.futures import ALL_COMPLETED, ProcessPoolExecutor, wait

with ProcessPoolExecutor(max_workers=5) as executor:
    futures = set()
    for x in range(20):
        futures.add(executor.submit(submit_job))
        if (x + 1) % 5 == 0:
            done, futures = wait(futures, return_when=ALL_COMPLETED)
            logger.info("next batch....")
            futures.clear()
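For reference, a minimal submit_job along the lines of the reproduction steps might look like the following. This is a sketch of what such a function could be, not the reporter's actual code; the bundle path and InFile parameter are taken from the example command above:

```python
import subprocess


def submit_job(bundle_dir: str = "/tmp/tmpy2f5jdu8") -> str:
    # Invoke the deadline CLI exactly as in the reproduction steps.
    # check=True raises subprocess.CalledProcessError on a non-zero
    # exit status, matching the error reported above.
    result = subprocess.run(
        [
            "deadline", "bundle", "submit", "--yes",
            "-p", "InFile=/tmp/test_script.py",
            bundle_dir,
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Because check=True is set, each worker process surfaces the CLI failure as the CalledProcessError shown earlier once the shared config has been clobbered.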

epmog commented 5 days ago

Hey thanks for the bug report!

For a little more context, which code path is your submit_job function using? Perhaps it's strictly an example, but I'll assume it's the CLI based on the other examples/context.

There are a few spots where this can pop up, and some allow you to bypass the setting.

My hunch here is that if it's an interactive submission with defaults, then we should set the value to make it easier for users to inspect their job submissions. Otherwise, if we're doing batch/background operations, we should skip updating it.