### What was the problem/requirement? (What/Why)
If a user accidentally deletes a job attachment from the job attachments bucket (or empties the bucket completely), any job that runs and depends on a deleted attachment should fail at the `syncInputJobAttachments` stage, because the attachment can no longer be synced from the bucket to the worker.
We should test that this is what actually happens, and that the worker can still process other jobs after the failure.
### What was the solution? (How)
Add a test that:
1. Submits a job with input job attachments in SUSPENDED status. This ensures that the input job attachments are uploaded to the job attachments bucket but not yet synced to the worker, because the job has not started.
2. Uses the manifest to find one of the hashed input job attachment files in the job attachments bucket and deletes it.
3. Sets the job status to READY so that the worker starts working on the job.
4. Waits for the job to finish. It should end in FAILED status because the `syncInputJobAttachments` step failed.
5. Checks with `ListSessionActions` that the `syncInputJobAttachments` session action failed.
6. Submits another test job that only sleeps and confirms that it succeeds, verifying that the worker is still running as expected.
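
The manifest lookup, the wait, and the session action check from the steps above could be sketched roughly as follows. This is an illustration, not the test added in this PR. All function names are hypothetical, and the clients are passed in as parameters and assumed to be boto3 `s3` and `deadline` clients. The sketch also assumes that the manifest is JSON with `hashAlg` and `paths[].hash` fields, and that data files live under `<root prefix>/Data/<hash>.<hashAlg>` in the bucket. Pagination of the list calls is ignored for brevity.

```python
import json
import time

# Assumption: terminal values of the job's taskRunStatus that end the wait.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED"}


def attachment_object_keys(manifest_json, root_prefix):
    """Return the assumed S3 keys of every file listed in a manifest."""
    manifest = json.loads(manifest_json)
    hash_alg = manifest["hashAlg"]
    # Assumption: content-addressed layout "<root prefix>/Data/<hash>.<hashAlg>".
    return [
        f"{root_prefix}/Data/{entry['hash']}.{hash_alg}"
        for entry in manifest["paths"]
    ]


def delete_one_attachment(s3_client, bucket, manifest_json, root_prefix):
    """Delete the first attachment named in the manifest and return its key."""
    key = attachment_object_keys(manifest_json, root_prefix)[0]
    s3_client.delete_object(Bucket=bucket, Key=key)
    return key


def wait_for_terminal_status(deadline_client, job_ids, timeout_s=600,
                             poll_s=10, sleep=time.sleep):
    """Poll GetJob until the job reaches a terminal task run status."""
    give_up_at = time.monotonic() + timeout_s
    while True:
        status = deadline_client.get_job(**job_ids)["taskRunStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= give_up_at:
            raise TimeoutError(f"job still in {status} after {timeout_s}s")
        sleep(poll_s)


def sync_input_failed(deadline_client, job_ids):
    """Return True if any syncInputJobAttachments session action FAILED."""
    sessions = deadline_client.list_sessions(**job_ids)["sessions"]
    for session in sessions:
        actions = deadline_client.list_session_actions(
            sessionId=session["sessionId"], **job_ids
        )["sessionActions"]
        for action in actions:
            is_sync = "syncInputJobAttachments" in action["definition"]
            if is_sync and action["status"] == "FAILED":
                return True
    return False
```

Here `job_ids` is a dict such as `{"farmId": ..., "queueId": ..., "jobId": ...}`. Passing the clients in keeps the helpers easy to exercise with stubs, which is a design choice of this sketch rather than something the PR prescribes.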
### What is the impact of this change?
Better test coverage for the worker agent's failure handling when input job attachments are missing.
### How was this change tested?
# Linux
source .e2e_linux_infra.sh
hatch run e2e-test
# Windows
source .e2e_windows_infra.sh
hatch run e2e-test
### Was this change documented?
No
### Is this a breaking change?
No
----
*By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.*