BlueBrain / accounting-service

Apache License 2.0

Ensure concurrency safety when running multiple instances #54

Open GianlucaFicarelli opened 5 days ago

GianlucaFicarelli commented 5 days ago

At the moment we deploy only one instance of the service, but in the future we may want to run multiple instances to scale horizontally.

Before doing that, we should ensure that the service still works correctly when several instances run concurrently.

In particular:

  1. Alembic migration: it runs when the container starts, before uvicorn is launched. If multiple containers start at the same time, they can race to apply the same migration, which could cause it to fail. Possible solutions:
    1. Ensure that the container running the migration acquires a lock. The other containers wait for the lock, then skip the migration because the database is already up to date.
    2. Run the migration as a step of the CI pipeline that performs the deployment.
    3. Use another mechanism for leader election.
    4. Run the migration manually (I would avoid that if possible).
  2. Tasks: each container runs the following tasks, which must remain correct when several instances run them concurrently:
    1. queue_consumers (they consume messages from the oneshot, longrun, and storage queues)
    2. job_chargers (they run periodically and charge the user for running or finished oneshot, longrun, and storage jobs that have not been charged yet)
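
Solution 1.1 (a lock around the migration) could be sketched as follows. This is only an illustration, assuming a PostgreSQL database and a DB-API connection with autocommit disabled and the `%s` parameter style (as in psycopg2). The names `advisory_lock_key`, `run_migration_exclusively`, and the lock name `"alembic-migration"` are hypothetical, and `migrate` stands for whatever callable applies the migration (for example a wrapper around `alembic upgrade head`).

```python
import hashlib
import struct
from typing import Callable


def advisory_lock_key(name: str) -> int:
    """Map a lock name to a stable signed 64-bit integer.

    PostgreSQL advisory locks are keyed by a bigint, so every container
    must derive the same number from the same name.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]


def run_migration_exclusively(
    conn,
    migrate: Callable[[], None],
    lock_name: str = "alembic-migration",  # hypothetical lock name
) -> None:
    """Run migrate() while holding a transaction-level advisory lock.

    Assumption: conn is a DB-API connection (e.g. psycopg2) with
    autocommit disabled, so the first execute() opens a transaction.
    pg_advisory_xact_lock blocks until the lock is free and releases it
    automatically when the transaction commits or rolls back.
    """
    cur = conn.cursor()
    try:
        cur.execute(
            "SELECT pg_advisory_xact_lock(%s)",
            (advisory_lock_key(lock_name),),
        )
        # Only one container at a time reaches this point. A container
        # that waited finds the database already migrated, so an
        # idempotent upgrade becomes a no-op.
        migrate()
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

The lock is released at the end of the transaction, so a container that crashes in the middle of the migration does not leave a stale lock behind.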

(migrated from https://bbpteam.epfl.ch/project/issues/browse/NSETM-2332)

GianlucaFicarelli commented 5 days ago

  1. Alembic migration, using an exclusive transaction-level advisory lock: https://github.com/BlueBrain/accounting-service/pull/50
  2. Job charger tasks, using a lock on the task row in the task_registry table: https://github.com/BlueBrain/accounting-service/pull/49
  3. Queue consumer tasks: a lock shouldn't be needed, because the data are retrieved from the queues and SQS FIFO queues ensure that the messages in each group are processed in order (provided that appropriate message group IDs are used). See:
    1. From https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagegroupid-property.html: _MessageGroupId is the tag that specifies that a message belongs to a specific message group. Messages that belong to the same message group are always processed one by one, in a strict order relative to the message group (however, messages that belong to different message groups might be processed out of order)._
    2. From https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/interleaving-multiple-ordered-message-groups.html: _To interleave multiple ordered message groups within a single FIFO queue, use message group ID values (for example, session data for multiple users). In this scenario, multiple consumers can process the queue, but the session data of each user is processed in a FIFO manner. When messages that belong to a particular message group ID are invisible, no other consumer can process messages with the same message group ID._
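
To make the quoted behaviour concrete, here is a toy in-memory model of a FIFO queue with message groups. It is not the SQS API and it ignores visibility timeouts and batching. The class `ToyFifoQueue` and the group IDs `job-a` and `job-b` are hypothetical, chosen only to show why two consumers cannot process messages of the same group out of order.

```python
class ToyFifoQueue:
    """Toy model of a FIFO queue with message groups (not the real SQS API)."""

    def __init__(self) -> None:
        self._messages: list[tuple[str, str]] = []  # (group_id, body) in send order
        self._in_flight: dict[str, str] = {}        # group_id -> body being processed

    def send(self, group_id: str, body: str) -> None:
        self._messages.append((group_id, body))

    def receive(self):
        """Return the oldest message whose group has nothing in flight, or None."""
        for index, (group_id, body) in enumerate(self._messages):
            if group_id not in self._in_flight:
                self._messages.pop(index)
                self._in_flight[group_id] = body
                return group_id, body
        return None

    def delete(self, group_id: str) -> None:
        """Acknowledge the in-flight message of a group, unblocking that group."""
        del self._in_flight[group_id]


queue = ToyFifoQueue()
queue.send("job-a", "started")
queue.send("job-a", "finished")
queue.send("job-b", "started")

first = queue.receive()   # consumer 1 gets ("job-a", "started")
second = queue.receive()  # consumer 2 gets ("job-b", "started"): job-a is blocked
queue.delete("job-a")     # consumer 1 acknowledges its message
third = queue.receive()   # only now is ("job-a", "finished") delivered
```

In this model the second `job-a` message is never handed out while the first one is still being processed, which is the property that makes an additional lock in the consumers unnecessary.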

So the requirements for 3 are: