Limit max_instances to 1 for the scheduler to prevent conflicts

djwooten commented 4 months ago

PR Checklist

[x] This comment contains a description of changes (with reason)
[x] Referenced issue is linked
[ ] If you've fixed a bug or added code that should be tested, add tests!
[ ] Documentation in docs is updated
[ ] docs/changelog.md is updated

Description of changes

Fixes #526

MegaQC ingests reports via an asynchronous scheduler job. If multiple scheduler jobs run concurrently, they can create conflicts. This change prevents that from happening be setting APScheduler's max_instances parameter to 1.

multimeric commented 4 months ago

Thanks for the issue and the PR. However, just limiting the number of workers doesn't seem like the ideal solution to me. I wonder if we can just edit the processing logic to use transactions and/or to just skip rows that have the TREATED status, so that this conflict doesn't happen, but retaining the multiple workers approach?

djwooten commented 4 months ago

I can see how setting max_instances=1 doesn't necessarily scale well, since if there are so many reports that they are not processed within the 30 second period, then having multiple workers process them could help speed things up.

I don't really know enough about databases in general, or flask-sqlalchemy in particular to know whether there could still be some edge cases slipping through if it switches to a transaction model, nor if you would still get the full benefit of the parallel workers.

Perhaps one option would be to set all rows' status to IN TREATMENT before processing any of the individual reports, since I think it is the report processing and uploading that is the slow part. Something like

# Mark all current rows as "IN TREATMENT" to prevent other workers from processing these rows
for row in queued_uploads:
    row.status = "IN TREATMENT"
    db.session.add(row)
db.session.commit()

# Upload the contents of each report to the database
for row in queued_uploads:
    # ... continue with the remainder of the original loop

I would expect that most realistic scenarios would be able to finish the first loop before another worker kicks off 30 seconds later, though it's not a strict guarantee.

MultiQC / MegaQC

Limit max_instances to 1 for the scheduler to prevent conflicts #529