There is actually an ERROR message just a little further down in one of the Ultra logs:
```
[ERROR] KeyError: 'job_definition'
Traceback (most recent call last):
  File "/var/task/SDSCode/batch_starter.py", line 399, in lambda_handler
    update_status_table(status_params)
  File "/var/task/SDSCode/database_handler.py", line 60, in update_status_table
    result.job_definition = status_params["job_definition"]
```
This is coming from within update_status_table:
https://github.com/IMAP-Science-Operations-Center/sds-data-manager/blob/d6f2d7c67e[…]108b2a/sds_data_manager/lambda_code/SDSCode/database_handler.py
We are actually hitting the first path and writing the record in the first log, then going down the other path in the second log. There we get a Python exception because the job_definition key doesn't exist, and that exception kills the lambda, so, somewhat luckily, the second batch job never gets submitted :slightly_smiling_face: That is why there weren't two batch jobs kicked off, even though theoretically there should have been.
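For context, here is a minimal sketch of the two paths, assuming status_params is a plain dict (the helper names are hypothetical, not the actual sds-data-manager code):

```python
def update_status_table(status_params: dict) -> None:
    result = query_status_record(status_params)  # hypothetical lookup helper
    if result is None:
        # First path: no record yet, so write a new one (what the first log shows).
        insert_status_record(status_params)  # hypothetical insert helper
    else:
        # Other path (second log): update the existing record. This line
        # raises KeyError and kills the lambda whenever the caller did not
        # put "job_definition" into status_params.
        result.job_definition = status_params["job_definition"]
```

A simple `"job_definition" in status_params` guard (or `.get()`) would stop the crash, but the changes below get at the root causes.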
All of this being said, I think we should make a few changes to address this:
- Reduce the number of DB connections we make, and pass the session around to our other functions: request a DB connection right away in the lambda handler and then pass it through. (Right now we are making one session in batch_starter and a second one in update_status_table, and that second connection added a pretty significant delay of ~0.5 s.) See the session-passing sketch after this list.
- With the unique constraint in place, switch to a try/except around the DB write. The write is guaranteed to fail if a record already exists, so we can catch that exception and know a job already exists. This avoids the time window between the query and the write. See the try/except sketch after this list.
- Move to the queue-based job starter system that @Maxine is working on.
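As a rough sketch of the session-passing idea in the first bullet (the engine setup and helper names are assumptions, not the real batch_starter code):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Placeholder URL; stands in for however the module actually builds its engine.
engine = create_engine("postgresql+psycopg2://user:pass@host/db")

def lambda_handler(event, context):
    # Open one session up front and hand it to every helper, instead of
    # batch_starter and update_status_table each opening their own
    # (~0.5 s per extra connection).
    with Session(engine) as session:
        status_params = build_status_params(event)  # hypothetical helper
        update_status_table(session, status_params)
        session.commit()

def update_status_table(session: Session, status_params: dict) -> None:
    # Reuses the caller's session; no second connection is created here.
    ...
```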
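And a sketch of the try/except pattern from the second bullet, assuming SQLAlchemy and a unique constraint on the status table (function and variable names are made up for illustration):

```python
from sqlalchemy.exc import IntegrityError

def insert_status_if_new(session, record) -> bool:
    """Try to insert the status record; return False if the job already exists."""
    try:
        session.add(record)
        session.commit()
        return True  # we won the race, safe to kick off the batch job
    except IntegrityError:
        # The unique constraint guarantees the write fails when a record
        # already exists, so there is no query-then-write time window.
        session.rollback()
        return False
```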
I think the first two bullets are relatively minor updates, and we should probably add them in regardless of the queue system, as a backup and best practice, unless the queue system removes the DB work entirely, @Maxine? The relevant files are database_handler.py and models.py in https://github.com/IMAP-Science-Operations-Center/sds-data-manager