davidhwyllie / findNeighbour4

A server delivering large scale, incrementable, bacterial relatedness monitoring
MIT License

with gunicorn, if a web worker times out, the insert semaphore can be left set #89

Closed davidhwyllie closed 3 years ago

davidhwyllie commented 3 years ago

When running fn4 with gunicorn, multiple web workers are instantiated. Each has a timeout period (default 30 sec; currently increased to 90 sec).
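For reference, a minimal sketch of how these settings might look in a gunicorn configuration file. `workers` and `timeout` are standard gunicorn settings; the worker count shown is an assumption for illustration, not the value fn4 actually uses.

```python
# gunicorn.conf.py (illustrative sketch, not fn4's actual configuration)

# Number of web worker processes. The value 4 is a hypothetical example.
workers = 4

# Seconds a worker may stay silent before gunicorn kills and restarts it.
# gunicorn's default is 30; this issue describes raising it to 90.
timeout = 90
```

A file like this is passed to gunicorn with the `-c` / `--config` option.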

If a process, including a process inserting samples, crashes or does not respond, gunicorn will kill it and start another. If the process is inserting, this can result in the insert semaphore being left set.

davidhwyllie commented 3 years ago

The frequency with which this happens is very low - perhaps once in 400,000 inserts, using SARS-CoV-2 test data. However, a robust mechanism for auto-restarting and for ensuring the database is left in a consistent state is required.

davidhwyllie commented 3 years ago

The proposed changes are as follows:

  1. Add the guid of the sample being inserted to the FN4LOCK table when the lock is acquired. This will require very minor changes to the PERSIST APIs and the database schema.
  2. Write a new process which checks every 60 seconds whether a long-lasting lock is in place. If the lock has been in place for more than 90 seconds, take corrective actions:
    • ensure fn4 and catwalk are in sync
    • ensure database is up to date
    • release lock
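The two proposed changes could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite table rather than fn4's PERSIST layer, and the table layout, column names, and the functions `acquire_lock`, `stale_lock_guid`, `check_once` and `watch` are all hypothetical. The `resync` and `repair` callables stand in for "ensure fn4 and catwalk are in sync" and "ensure database is up to date", whose real implementations are not described here.

```python
import sqlite3
import time

MAX_LOCK_AGE = 90    # seconds a lock may be held before it is treated as stale
CHECK_INTERVAL = 60  # seconds between watchdog checks


def create_lock_table(conn):
    """Create a single-row lock table (hypothetical schema, not fn4's own)."""
    conn.execute(
        "CREATE TABLE fn4lock ("
        "lock_id INTEGER PRIMARY KEY CHECK (lock_id = 1), "
        "held INTEGER NOT NULL, guid TEXT, acquired_at REAL)"
    )
    conn.execute("INSERT INTO fn4lock VALUES (1, 0, NULL, NULL)")
    conn.commit()


def acquire_lock(conn, guid, now):
    """Take the lock, recording the guid being inserted. True on success."""
    cur = conn.execute(
        "UPDATE fn4lock SET held = 1, guid = ?, acquired_at = ? "
        "WHERE lock_id = 1 AND held = 0",
        (guid, now),
    )
    conn.commit()
    return cur.rowcount == 1


def release_lock(conn):
    """Clear the lock and the guid stored with it."""
    conn.execute(
        "UPDATE fn4lock SET held = 0, guid = NULL, acquired_at = NULL "
        "WHERE lock_id = 1"
    )
    conn.commit()


def stale_lock_guid(conn, now, max_age=MAX_LOCK_AGE):
    """Return the guid of a lock held longer than max_age, else None."""
    held, guid, acquired_at = conn.execute(
        "SELECT held, guid, acquired_at FROM fn4lock WHERE lock_id = 1"
    ).fetchone()
    if held and now - acquired_at > max_age:
        return guid
    return None


def check_once(conn, now, resync, repair):
    """One watchdog pass: if the lock is stale, recover and release it."""
    guid = stale_lock_guid(conn, now)
    if guid is None:
        return False
    resync(guid)  # placeholder: bring fn4 and catwalk back in sync
    repair(guid)  # placeholder: bring the database up to date
    release_lock(conn)
    return True


def watch(conn, resync, repair, max_checks=None):
    """Run check_once every CHECK_INTERVAL seconds (forever if max_checks is None)."""
    checks = 0
    while max_checks is None or checks < max_checks:
        check_once(conn, time.time(), resync, repair)
        checks += 1
        time.sleep(CHECK_INTERVAL)
```

Storing the guid alongside the lock is what lets the recovery step know which sample was mid-insert when the worker was killed. Passing `now` explicitly keeps the staleness check easy to exercise without waiting 90 seconds.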