At the moment we deploy only one instance of the service, but in the future we may want to start multiple instances to scale horizontally.
Before doing that, we should ensure that the service works correctly in that case.
In particular:
- alembic migration: it's executed when the container starts, before starting uvicorn. If multiple containers are started at the same time, there can be a race condition that could cause the migration to fail. Possible solutions:
  - Ensure that the container running the migration acquires a lock. The other containers wait, then skip the migration because the db is already updated.
  - Run the migration as a step of the CI pipeline that executes the deployment.
  - Use another mechanism for leader election.
  - Run the migration manually (I would avoid that if possible).
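The lock-based option could look like the following minimal sketch, assuming a PostgreSQL database and a SQLAlchemy engine. `MIGRATION_LOCK_KEY` and `run_migration` are hypothetical names; `run_migration` would typically call alembic's `command.upgrade`.

```python
# Sketch: serialize the alembic migration across containers with a
# PostgreSQL session-level advisory lock (assumes PostgreSQL and a
# SQLAlchemy engine; names below are illustrative, not from the service).
MIGRATION_LOCK_KEY = 772137  # arbitrary application-wide constant


def migrate_with_lock(engine, run_migration):
    """Run run_migration() while holding an advisory lock.

    Containers started concurrently block on pg_advisory_lock: the first
    one applies the migration; the others acquire the lock afterwards and
    find the db already updated, so their upgrade is a no-op.
    """
    with engine.connect() as conn:
        # Blocks until the lock is free; released on unlock or disconnect.
        conn.exec_driver_sql(
            "SELECT pg_advisory_lock(%(key)s)", {"key": MIGRATION_LOCK_KEY}
        )
        try:
            run_migration()  # e.g. alembic.command.upgrade(config, "head")
        finally:
            conn.exec_driver_sql(
                "SELECT pg_advisory_unlock(%(key)s)", {"key": MIGRATION_LOCK_KEY}
            )
```

Since the lock is session-level, it is released automatically if the container dies mid-migration, so a crashed deploy doesn't block the others forever.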
- tasks: the container runs some tasks:
  - queue_consumers: they consume messages from the oneshot, longrun, and storage queues.
  - job_chargers: they run periodically and charge the user for the running or finished uncharged oneshot, longrun, and storage jobs.

  Queue consumer tasks shouldn't need a lock, since the data are retrieved from the queues, and SQS ensures that messages are processed in order within each group (if appropriate message group IDs are used). From https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/interleaving-multiple-ordered-message-groups.html: "To interleave multiple ordered message groups within a single FIFO queue, use message group ID values (for example, session data for multiple users). In this scenario, multiple consumers can process the queue, but the session data of each user is processed in a FIFO manner. When messages that belong to a particular message group ID are invisible, no other consumer can process messages with the same message group ID."
  So the requirement for the queue consumer tasks is:
  - assign a proper message group ID when pushing messages to the queue (for example, using the virtual lab or the project should work)
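This requirement can be sketched as follows, assuming boto3 and a FIFO queue. The function name, the `queue_url` parameter, and the choice of the project id as group id are illustrative (the virtual lab id would work the same way):

```python
# Sketch: push a job message to an SQS FIFO queue with a MessageGroupId,
# so multiple service instances can consume in parallel while each
# project's messages stay strictly ordered. Names are illustrative.
import json
import uuid


def publish_job_message(sqs_client, queue_url, project_id, payload):
    """Send payload to the FIFO queue, grouping by project.

    Messages sharing a MessageGroupId are delivered in order and are never
    processed by two consumers at the same time; different projects can be
    consumed concurrently by different instances.
    """
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(payload),
        MessageGroupId=str(project_id),  # the ordering/locking unit
        # Required on FIFO queues unless content-based deduplication is on.
        MessageDeduplicationId=str(uuid.uuid4()),
    )
```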
(migrated from https://bbpteam.epfl.ch/project/issues/browse/NSETM-2332)