MultiQC / MegaQC

Web application to collect and visualise data across multiple MultiQC runs.
http://megaqc.info/
GNU General Public License v3.0
95 stars 27 forks source link

Limit max_instances to 1 for the scheduler to prevent conflicts #529

Open djwooten opened 4 months ago

djwooten commented 4 months ago

PR Checklist

Description of changes

Fixes #526

MegaQC ingests reports via an asynchronous scheduler job. If multiple scheduler jobs run concurrently, they can create conflicts. This change prevents that from happening be setting APScheduler's max_instances parameter to 1.

multimeric commented 4 months ago

Thanks for the issue and the PR. However, just limiting the number of workers doesn't seem like the ideal solution to me. I wonder if we can just edit the processing logic to use transactions and/or to just skip rows that have the TREATED status, so that this conflict doesn't happen, but retaining the multiple workers approach?

djwooten commented 4 months ago

I can see how setting max_instances=1 doesn't necessarily scale well, since if there are so many reports that they are not processed within the 30 second period, then having multiple workers process them could help speed things up.

I don't really know enough about databases in general, or flask-sqlalchemy in particular to know whether there could still be some edge cases slipping through if it switches to a transaction model, nor if you would still get the full benefit of the parallel workers.

Perhaps one option would be to set all rows' status to IN TREATMENT before processing any of the individual reports, since I think it is the report processing and uploading that is the slow part. Something like

# Mark all current rows as "IN TREATMENT" to prevent other workers from processing these rows
for row in queued_uploads:
    row.status = "IN TREATMENT"
    db.session.add(row)
db.session.commit()

# Upload the contents of each report to the database
for row in queued_uploads:
    # ... continue with the remainder of the original loop

I would expect that most realistic scenarios would be able to finish the first loop before another worker kicks off 30 seconds later, though it's not a strict guarantee.