ryanrdoherty closed this issue 1 year ago
Could I get the full log output for compute, or the input used so I can test locally?
To test, run a POST against https://edadata-dev.local.apidb.org:8443/computes/example with this JSON:
{
  "studyId": "BRAZIL0001-1",
  "filters": [],
  "derivedVariables": [],
  "config": {
    "outputEntityId": "PCO_0000024",
    "inputVariable": {
      "entityId": "PCO_0000024",
      "variableId": "ENVO_00000009"
    },
    "valueSuffix": "blah"
  }
}
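For anyone else reproducing this, here is one way to fire that request from Java; a minimal sketch, assuming the endpoint needs no auth headers and that the local JVM trusts the dev server's certificate (the URL and body are copied verbatim from above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ComputeSmokeTest {
  public static void main(String[] args) throws Exception {
    // Request body copied verbatim from the JSON above.
    String body = """
        {
          "studyId": "BRAZIL0001-1",
          "filters": [],
          "derivedVariables": [],
          "config": {
            "outputEntityId": "PCO_0000024",
            "inputVariable": {
              "entityId": "PCO_0000024",
              "variableId": "ENVO_00000009"
            },
            "valueSuffix": "blah"
          }
        }""";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://edadata-dev.local.apidb.org:8443/computes/example"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    // Assumes the dev server's certificate is in the JVM truststore;
    // a self-signed dev cert would need to be added first.
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());

    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}
```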
One observation that may or may not be related, but there is a race condition between two lines here: the status may change between the queue DB check on line 195 and the S3 check on line 199, which could cause this exact issue (sketched below).
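To make the window concrete, here is a sketch of the shape of the hazard; every name below is hypothetical, not the library's actual API:

```java
// Hazard sketch only; all names are hypothetical, not the library's real API.
interface QueueDb { String lookupStatus(String jobId); }     // the ~line 195 check
interface S3Store { boolean workspaceExists(String jobId); } // the ~line 199 check

class JobStatusReader {
  private final QueueDb queueDb;
  private final S3Store s3Store;

  JobStatusReader(QueueDb queueDb, S3Store s3Store) {
    this.queueDb = queueDb;
    this.s3Store = s3Store;
  }

  String getJob(String jobId) {
    // Read 1: the queue DB says the job is queued/in-progress.
    String dbStatus = queueDb.lookupStatus(jobId);

    // <-- a worker can complete the job and move its files right here -->

    // Read 2: S3 now reflects a newer state than read 1, so the caller is
    // acting on an inconsistent snapshot and may wrongly conclude the job
    // is broken (and, as seen below, delete it).
    boolean inS3 = s3Store.workspaceExists(jobId);

    return dbStatus + (inS3 ? " (workspace present)" : " (workspace missing)");
  }
}
```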
Maybe, but it's really consistent, which makes me think the timing is set up to hit the mismatch in a high percentage of cases?
If it's this race condition that's biting us, it would be because we are calling getJob immediately after submitting the job in submitJob. Maybe I should remove the getJob call and just return the queued status regardless of the actual status at time of return.
Never mind, that's an EDA-Compute-specific thing; maybe EDA Compute shouldn't be calling getJob so quickly. Or maybe the race condition should just be fixed by moving the status updates and checks behind a lock so that they are guaranteed to happen "at once", as sketched below.
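Here is what that lock version could look like, reusing the hypothetical names from the sketch above (again, not the library's actual API):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical interfaces again, now including the write paths a worker uses.
interface QueueDb {
  String lookupStatus(String jobId);
  void updateStatus(String jobId, String status);
}

interface S3Store {
  boolean workspaceExists(String jobId);
  void finalizeWorkspace(String jobId);
}

class LockedJobStatusReader {
  private final ReentrantLock lock = new ReentrantLock();
  private final QueueDb queueDb;
  private final S3Store s3Store;

  LockedJobStatusReader(QueueDb queueDb, S3Store s3Store) {
    this.queueDb = queueDb;
    this.s3Store = s3Store;
  }

  // Writers take the same lock, so no status change can land between the
  // reader's two checks.
  void markComplete(String jobId) {
    lock.lock();
    try {
      s3Store.finalizeWorkspace(jobId);
      queueDb.updateStatus(jobId, "complete");
    } finally {
      lock.unlock();
    }
  }

  String getJob(String jobId) {
    lock.lock();
    try {
      // Both checks now happen "at once" relative to any status update,
      // so the snapshot they form is internally consistent.
      String dbStatus = queueDb.lookupStatus(jobId);
      boolean inS3 = s3Store.workspaceExists(jobId);
      return dbStatus + (inS3 ? " (workspace present)" : " (workspace missing)");
    } finally {
      lock.unlock();
    }
  }
}
```

One caveat: a JVM lock only serializes callers within a single service instance; if more than one process touches the same queue DB and S3 bucket, the equivalent guarantee would need something like a DB transaction or advisory lock.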
When using 1.5.2 in eda-compute, I get the following flow with a clean MinIO + Postgres setup in a local run. The job seems to queue and kick off OK: the initial request gives an "in-progress" response as expected (though I might have expected "queued" first; it must be quick). But then, as the plugin is about to run the job, I see this in the logs (earlier lines included for context):
We definitely should not be deleting the job here. Then later, after all the data is pulled in from the merge service, the platform notices this issue and aborts.
A second request for the same job generates the same job ID and returns "in-progress" again (i.e., the job is "stuck"). The logs look like this: