djwooten opened 4 months ago
Thanks for the issue and the PR. However, just limiting the number of workers doesn't seem like the ideal solution to me. I wonder if we could instead edit the processing logic to use transactions and/or to skip rows that already have the TREATED status, so that this conflict can't happen while still retaining the multiple-workers approach?
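To make that concrete, here's a rough sketch of the skip-the-claimed-rows idea. The `Upload` model and the "NOT TREATED" status string are my assumptions, not necessarily MegaQC's actual schema; I'm only assuming the flask-sqlalchemy `db` handle that's already in use. `with_for_update(skip_locked=True)` is standard SQLAlchemy: it asks the database to lock the selected rows, and lets a concurrent worker silently skip rows another worker has already claimed.

```python
# Rough sketch -- Upload and the status strings are assumptions, not
# MegaQC's actual model. Requires a backend that supports SKIP LOCKED
# (e.g. PostgreSQL); SQLite will not honor these locks.
queued_uploads = (
    db.session.query(Upload)
    .filter(Upload.status == "NOT TREATED")
    .with_for_update(skip_locked=True)  # concurrent workers skip locked rows
    .all()
)
for row in queued_uploads:
    row.status = "IN TREATMENT"
db.session.commit()  # releases the locks; these rows are now claimed
```

That would let several workers run concurrently without ever grabbing the same rows, though I haven't tested it against MegaQC's supported backends.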
I can see how setting `max_instances=1` doesn't necessarily scale well: if there are so many reports that they can't all be processed within the 30-second period, having multiple workers process them could speed things up.
I don't really know enough about databases in general, or flask-sqlalchemy in particular, to say whether some edge cases could still slip through under a transaction model, or whether you would keep the full benefit of the parallel workers.
Perhaps one option would be to set all rows' status to IN TREATMENT before processing any of the individual reports, since I think the report processing and uploading is the slow part. Something like:
```python
# Mark all current rows as "IN TREATMENT" to prevent other workers
# from processing these rows
for row in queued_uploads:
    row.status = "IN TREATMENT"
    db.session.add(row)
db.session.commit()

# Upload the contents of each report to the database
for row in queued_uploads:
    ...  # continue with the remainder of the original loop
```
I would expect that in most realistic scenarios the first loop would finish before another worker kicks off 30 seconds later, though that's not a strict guarantee.
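If that window ever becomes a problem, one way to make the claim strict would be a single atomic UPDATE instead of a read-modify-write loop. The sketch below is hypothetical: the `Upload` model is assumed as before, and the `claimed_by` column does not exist today, so this would need a schema migration.

```python
import uuid

claim_token = uuid.uuid4().hex  # unique per scheduler run

# One atomic UPDATE: the database serializes concurrent statements,
# so no two workers can ever claim the same row. claimed_by is a
# hypothetical new column, not part of MegaQC's current schema.
db.session.query(Upload).filter(Upload.status == "NOT TREATED").update(
    {"status": "IN TREATMENT", "claimed_by": claim_token},
    synchronize_session=False,
)
db.session.commit()

# Process only the rows this run actually claimed
for row in db.session.query(Upload).filter(Upload.claimed_by == claim_token):
    ...  # remainder of the original loop
```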
PR Checklist

- [ ] `docs` is updated
- [ ] `docs/changelog.md` is updated

Description of changes

Fixes #526
MegaQC ingests reports via an asynchronous scheduler job. If multiple scheduler jobs run concurrently, they can create conflicts. This change prevents that from happening by setting APScheduler's `max_instances` parameter to `1`.
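For reference, a minimal sketch of what the change amounts to, assuming a bare APScheduler `BackgroundScheduler` and a hypothetical `upload_reports` job function (MegaQC wires its scheduler through Flask, so the real code looks different):

```python
from apscheduler.schedulers.background import BackgroundScheduler

def upload_reports():
    """Hypothetical stand-in for MegaQC's report-ingestion job."""

scheduler = BackgroundScheduler()
scheduler.add_job(
    upload_reports,
    trigger="interval",
    seconds=30,
    max_instances=1,  # skip a run if the previous one is still in flight
)
scheduler.start()
```

With `max_instances=1`, APScheduler logs a warning and skips the new run rather than starting a second concurrent instance of the job.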