VEuPathDB / lib-compute-platform

Async compute platform core.

Jobs never come back as complete in 1.5.1+ #48

Closed: ryanrdoherty closed this issue 1 year ago

ryanrdoherty commented 1 year ago

When using 1.5.2 in eda-compute, I get the following flow against a clean minio + postgres in a local run. The job seems to queue and kick off OK: the initial request gives an "in-progress" response as expected (though I might expect "queued" first; the transition must just be quick). But then, as the plugin is about to run the job, I see this in the logs (earlier lines included for context):

stack-eda-services-compute-1               | 2023-05-05 15:42:18.313 [rid:     ][jid:][p:] DEBUG S3:287 - marking workspace for job ff8d5d92957c31830df1eb0de693023f as in-progress in s3
stack-eda-services-compute-1               | 2023-05-05 15:42:18.313 [rid:     ][jid:][p:] DEBUG S3WorkspaceFactory:110 - Test path = ff8d5d92957c31830df1eb0de693023f/.workspace
stack-eda-services-compute-1               | 2023-05-05 15:42:18.313 [rid:     ][jid:][p:] DEBUG BucketObjectContainer:47 - Attempting to test whether Bucket{ name='dev-bucket' } contains object 'ff8d5d92957c31830df1eb0de693023f/.workspace'
stack-eda-services-compute-1               | 2023-05-05 15:42:18.328 [rid:     ][jid:][p:] DEBUG BucketObjectContainer:476 - Attempting to touch object 'ff8d5d92957c31830df1eb0de693023f/.in-progress' in Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.330 [rid:17Rf5][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.failed' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.333 [rid:17Rf5][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.expired' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.337 [rid:17Rf5][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.complete' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.339 [rid:17Rf5][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.in-progress' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.344 [rid:     ][jid:][p:] DEBUG JobExecutionHandler:37 - executing job ff8d5d92957c31830df1eb0de693023f
stack-eda-services-compute-1               | 2023-05-05 15:42:18.345 [rid:     ][jid:][p:] DEBUG ScratchSpaces:43 - creating local scratch workspace for job ff8d5d92957c31830df1eb0de693023f
stack-eda-services-compute-1               | 2023-05-05 15:42:18.347 [rid:     ][jid:][p:] DEBUG QueueDB:207 - looking up raw job ff8d5d92957c31830df1eb0de693023f
stack-eda-services-compute-1               | 2023-05-05 15:42:18.348 [rid:     ][jid:][p:] DEBUG QueueDB:189 - looking up public job ff8d5d92957c31830df1eb0de693023f
stack-eda-services-compute-1               | 2023-05-05 15:42:18.349 [rid:     ][jid:][p:] DEBUG S3:77 - Fetching workspace ff8d5d92957c31830df1eb0de693023f from S3
stack-eda-services-compute-1               | 2023-05-05 15:42:18.349 [rid:     ][jid:][p:] DEBUG S3WorkspaceFactory:110 - Test path = ff8d5d92957c31830df1eb0de693023f/.workspace
stack-eda-services-compute-1               | 2023-05-05 15:42:18.349 [rid:     ][jid:][p:] DEBUG BucketObjectContainer:47 - Attempting to test whether Bucket{ name='dev-bucket' } contains object 'ff8d5d92957c31830df1eb0de693023f/.workspace'
stack-eda-services-compute-1               | 2023-05-05 15:42:18.353 [rid:     ][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.failed' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.356 [rid:     ][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.expired' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.358 [rid:     ][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.complete' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.360 [rid:     ][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.in-progress' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 15:42:18.364 [rid:     ][jid:ff8d5d92957c31830df1eb0de693023f][p:] INFO  PluginExecutor:104 - Executing job ff8d5d92957c31830df1eb0de693023f
stack-eda-services-compute-1               | 2023-05-05 15:42:18.390 [rid:     ][jid:ff8d5d92957c31830df1eb0de693023f][p:] INFO  PluginRegistry:55 - Registering 4 plugins.
stack-eda-services-compute-1               | 2023-05-05 15:42:18.391 [rid:     ][jid:ff8d5d92957c31830df1eb0de693023f][p:example] DEBUG PluginExecutor:114 - loaded plugin provider
stack-eda-services-compute-1               | 2023-05-05 15:42:18.393 [rid:17Rf5][jid:][p:] DEBUG AsyncPlatform:204 - deleting job ff8d5d92957c31830df1eb0de693023f from queue db as it has a different status in S3 than the queue DB: s3.status = in-progress, db.status = queued
stack-eda-services-compute-1               | 2023-05-05 15:42:18.394 [rid:17Rf5][jid:][p:] DEBUG QueueDB:175 - Deleting job ff8d5d92957c31830df1eb0de693023f

We definitely should not be deleting the job here. Then later, after all the data is pulled in from the merge service, the platform notices this issue and aborts.

stack-eda-services-compute-1               | 2023-05-05 15:42:47.328 [rid:     ][jid:ff8d5d92957c31830df1eb0de693023f][p:example] DEBUG PluginExecutor:173 - running plugin
stack-eda-services-compute-1               | 2023-05-05 15:42:47.329 [rid:     ][jid:ff8d5d92957c31830df1eb0de693023f][p:example] INFO  ExamplePlugin:122 - Executing plugin example
stack-eda-services-compute-1               | 2023-05-05 15:42:47.348 [rid:     ][jid:ff8d5d92957c31830df1eb0de693023f][p:example] DEBUG QueueDB:189 - looking up public job ff8d5d92957c31830df1eb0de693023f
stack-eda-services-compute-1               | 2023-05-05 15:42:47.350 [rid:     ][jid:ff8d5d92957c31830df1eb0de693023f][p:example] INFO  JobExecutionHandler:95 - aborting job ff8d5d92957c31830df1eb0de693023f for no longer being in a runnable state (deleted or expired)
stack-eda-services-compute-1               | 2023-05-05 15:42:47.357 [rid:     ][jid:ff8d5d92957c31830df1eb0de693023f][p:example] DEBUG QueueWorker:84 - acknowledging job message 1
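For context on the deletion at AsyncPlatform:204: the log lines above suggest the platform derives an S3-side status from marker objects in the job workspace (.failed / .expired / .complete / .in-progress) and deletes the queue-DB record when that status disagrees with the DB's. A rough reconstruction of that check follows; every name in it is a hypothetical stand-in based only on the log messages, not the real implementation.

```kotlin
// Hypothetical reconstruction of the status comparison suggested by the logs above.
enum class JobStatus { Queued, InProgress, Complete, Failed, Expired }

// The S3-side status appears to be inferred from which marker object exists in
// the job's workspace (.failed / .expired / .complete / .in-progress).
fun s3StatusOf(markersPresent: Set<String>): JobStatus = when {
    ".failed"      in markersPresent -> JobStatus.Failed
    ".expired"     in markersPresent -> JobStatus.Expired
    ".complete"    in markersPresent -> JobStatus.Complete
    ".in-progress" in markersPresent -> JobStatus.InProgress
    else                             -> JobStatus.Queued
}

// The deletion at AsyncPlatform:204 fires when the two stores disagree. In the
// run above, the worker had already flipped S3 to in-progress while the queue DB
// still said queued, so the record was deleted out from under the running job.
fun shouldDeleteQueueRecord(dbStatus: JobStatus, s3Status: JobStatus): Boolean =
    dbStatus != s3Status
```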

A second request for the same job generates the same job ID and returns "in-progress" again (i.e. the job is "stuck"). Logs look like this:

stack-eda-services-compute-1               | 2023-05-05 16:21:01.893 [rid:1e93t][jid:][p:] DEBUG AsyncPlatform:191 - Looking up job ff8d5d92957c31830df1eb0de693023f from either the managed DB or S3
stack-eda-services-compute-1               | 2023-05-05 16:21:01.894 [rid:1e93t][jid:][p:] DEBUG QueueDB:189 - looking up public job ff8d5d92957c31830df1eb0de693023f
stack-eda-services-compute-1               | 2023-05-05 16:21:01.899 [rid:1e93t][jid:][p:] DEBUG S3:77 - Fetching workspace ff8d5d92957c31830df1eb0de693023f from S3
stack-eda-services-compute-1               | 2023-05-05 16:21:01.900 [rid:1e93t][jid:][p:] DEBUG S3WorkspaceFactory:110 - Test path = ff8d5d92957c31830df1eb0de693023f/.workspace
stack-eda-services-compute-1               | 2023-05-05 16:21:01.900 [rid:1e93t][jid:][p:] DEBUG BucketObjectContainer:47 - Attempting to test whether Bucket{ name='dev-bucket' } contains object 'ff8d5d92957c31830df1eb0de693023f/.workspace'
stack-eda-services-compute-1               | 2023-05-05 16:21:01.910 [rid:1e93t][jid:][p:] DEBUG AsyncPlatform:225 - job ff8d5d92957c31830df1eb0de693023f found in S3
stack-eda-services-compute-1               | 2023-05-05 16:21:01.910 [rid:1e93t][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.failed' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 16:21:01.916 [rid:1e93t][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.expired' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 16:21:01.922 [rid:1e93t][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.complete' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 16:21:01.928 [rid:1e93t][jid:][p:] DEBUG BucketObjectContainer:221 - Attempting to get object 'ff8d5d92957c31830df1eb0de693023f/.in-progress' from Bucket{ name='dev-bucket' }
stack-eda-services-compute-1               | 2023-05-05 16:21:01.935 [rid:1e93t][jid:][p:] DEBUG PrometheusFilter:136 - Request end: POST /computes/example 200
Foxcapades commented 1 year ago

Could I get the full log output for compute, or the input used so I can test locally?

ryanrdoherty commented 1 year ago

To test, run a POST against https://edadata-dev.local.apidb.org:8443/computes/example with this JSON:

{
    "studyId":"BRAZIL0001-1",
    "filters":[],
    "derivedVariables":[],
    "config":{
        "outputEntityId":"PCO_0000024",
        "inputVariable":{
            "entityId":"PCO_0000024",
            "variableId":"ENVO_00000009"
        },
        "valueSuffix":"blah"
    }
}
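If it helps, here is a minimal stand-alone way to issue that request from Kotlin. It assumes the dev endpoint above is reachable, its TLS certificate is trusted locally, and no auth headers are required; curl with the same body works just as well.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    // Same request body as the JSON above.
    val body = """
        {
          "studyId": "BRAZIL0001-1",
          "filters": [],
          "derivedVariables": [],
          "config": {
            "outputEntityId": "PCO_0000024",
            "inputVariable": { "entityId": "PCO_0000024", "variableId": "ENVO_00000009" },
            "valueSuffix": "blah"
          }
        }
    """.trimIndent()

    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://edadata-dev.local.apidb.org:8443/computes/example"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())

    // With the bug described above, repeated calls keep returning an "in-progress" status.
    println("${response.statusCode()} ${response.body()}")
}
```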
Foxcapades commented 1 year ago

One observation that may or may not be related: there is a race condition between these two lines:

https://github.com/VEuPathDB/lib-compute-platform/blob/dbeb289f979fc082888b9b1dc405acecc490570e/lib/src/main/kotlin/org/veupathdb/lib/compute/platform/AsyncPlatform.kt#L195

https://github.com/VEuPathDB/lib-compute-platform/blob/dbeb289f979fc082888b9b1dc405acecc490570e/lib/src/main/kotlin/org/veupathdb/lib/compute/platform/AsyncPlatform.kt#L199

The status may change between the queue DB check on line 195 and the S3 check on line 199, which could cause this exact issue.
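Here is a self-contained toy that mimics the suspected interleaving. None of the names below are the real platform API; it only models the two reads and the write with plain threads.

```kotlin
import java.util.concurrent.CountDownLatch
import kotlin.concurrent.thread

// Toy stand-ins for the two stores; not the real queue DB or S3 client.
object ToyQueueDb { @Volatile var status: String? = "queued" }
object ToyS3      { @Volatile var status: String  = "queued" }

fun main() {
    val dbRead = CountDownLatch(1)

    val requestThread = thread {
        val dbStatus = ToyQueueDb.status        // 1. getJob reads the queue DB: "queued"
        dbRead.countDown()
        Thread.sleep(10)                        //    crude pause so the worker can get in between the reads
        val s3Status = ToyS3.status             // 3. getJob reads S3: now "in-progress"
        if (dbStatus != s3Status) {
            println("mismatch: s3.status = $s3Status, db.status = $dbStatus -> deleting job")
            ToyQueueDb.status = null            //    the record the worker still needs is gone
        }
    }

    val workerThread = thread {
        dbRead.await()
        ToyS3.status = "in-progress"            // 2. worker marks the workspace in-progress
    }

    requestThread.join()
    workerThread.join()
    println("queue DB record after the race: ${ToyQueueDb.status}")  // null: job deleted
}
```

With that ordering the request thread sees db = queued but S3 = in-progress and deletes the record, which matches the AsyncPlatform:204 log line above.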

ryanrdoherty commented 1 year ago

Maybe, but the failure is really consistent. That makes me think the code is set up to hit the mismatch in some high percentage of cases?

Foxcapades commented 1 year ago

If it's this race condition that's biting us, it would be because we are calling getJob immediately after submitting the job in submitJob. Maybe I should remove that getJob call and just return the queued status regardless of the actual status at the time of return.

Never mind, that's an EDA-Compute-specific thing; maybe EDA-Compute shouldn't be calling getJob so quickly... Or maybe the race condition should just be fixed by moving the status updates and checks behind a lock so that they are guaranteed to happen "at once".
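A minimal sketch of that last idea, assuming a single-process deployment and using in-memory maps as stand-ins for the queue DB and the S3 markers (hypothetical names throughout, not the real AsyncPlatform internals): the status write and the db-vs-S3 comparison take the same lock, so the comparison can no longer observe a half-applied transition.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

// Hypothetical in-memory stand-ins for the queue DB and the S3 status markers.
private val queueDb = mutableMapOf<String, String>()
private val s3Flags = mutableMapOf<String, String>()
private val statusLock = ReentrantLock()

// Worker side: flip both stores as one unit under the lock.
fun markInProgress(jobId: String) = statusLock.withLock {
    queueDb[jobId] = "in-progress"
    s3Flags[jobId] = "in-progress"
}

// getJob side: compare and delete under the same lock, so a concurrent
// markInProgress can never be observed halfway through.
fun reconcile(jobId: String) = statusLock.withLock {
    val dbStatus = queueDb[jobId]
    val s3Status = s3Flags[jobId]
    if (dbStatus != null && dbStatus != s3Status)
        queueDb.remove(jobId)
}
```

A per-job lock would reduce contention, but the ordering guarantee is the same.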