Overview
We use bull for our job queues, which by default do not retry failed jobs automatically. This means that if an operational problem causes a job to fail (e.g., the database is in the middle of a restart), we just give up.
We should retry jobs to make the system more resilient. That's what this PR is about.
In addition, to help us with debugging, I've added matador to the /job-queues route of game so that we can see the status of the various job queues.
Data Model / DB Schema Changes
N/A
Environment / Configuration Changes
npm install is required to install matador
Notes
Different types of jobs may need different retry settings. For example, some jobs only do database-related work, while others invoke remote APIs. When configuring the jobs to retry, I've used the following "rules" to determine how quickly a job should retry:
if the job hits an external API, it will wait 60s before retrying
if the job hits an internal API (such as creating a channel on our echo server), it will wait 10s before retrying
if the job doesn't hit any APIs, it will wait 5s before retrying
By default, all jobs will retry 3 times before giving up.
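As a rough illustration of the rules above, here is a minimal sketch of a helper that maps a job's API usage to retry settings. The names (`RETRY_DELAYS_MS`, `retryOptionsFor`, and the category keys) are hypothetical and not taken from this PR. The returned shape is modeled on the `attempts` and `backoff` job options that bull accepts when adding a job, assuming a bull version that supports them; the actual wiring in this PR may differ.

```javascript
// Hypothetical helper; names and categories are illustrative, not from the PR.
const RETRY_DELAYS_MS = {
  externalAPI: 60 * 1000, // job hits an external (third-party) API
  internalAPI: 10 * 1000, // job hits one of our own services (e.g., echo)
  none: 5 * 1000, // job doesn't hit any APIs
}

// "3" is passed through as-is; whether it counts the first run
// depends on how the queue library interprets it (assumption).
const DEFAULT_ATTEMPTS = 3

function retryOptionsFor(apiUsage = 'none', attempts = DEFAULT_ATTEMPTS) {
  // Guard against inherited keys such as "toString".
  if (!Object.prototype.hasOwnProperty.call(RETRY_DELAYS_MS, apiUsage)) {
    throw new Error(`Unknown API usage category: ${apiUsage}`)
  }
  return {
    attempts,
    backoff: {type: 'fixed', delay: RETRY_DELAYS_MS[apiUsage]},
  }
}

// Example: settings for a job that calls an external API.
console.log(retryOptionsFor('externalAPI'))
```

Keeping the delays in one table means a new job only has to declare which kind of API it touches, and the wait time follows from that.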
This one is safe to retry. Since the id of the user being added must be unique, there's no chance of the user being added multiple times. The API calls to GitHub and HubSpot should also be idempotent.
Fixes #140.
Does not fix https://github.com/LearnersGuild/game/pull/558, which is a much bigger lift and will require significant refactoring.
Related: https://github.com/LearnersGuild/game/issues/510