OptimalBits / bull

Premium Queue package for handling distributed jobs and messages in NodeJS.

Queue with Rate Limit 10 requests per second #2777

Open bassamjarad opened 1 week ago

bassamjarad commented 1 week ago

Description

I'm implementing a dispatcher that receives many requests per second. It queues them using Bull and then sends them to another service that is limited to 10 requests per second. However, while testing I noticed that if I send 11 requests per second, all 11 requests are actually processed in less than 1 second (the limit is not followed exactly).

queue config:

const Queue = require('bull');

// REDIS_PORT, REDIS_HOST, and REDIS_DB come from the environment.
service.queue = new Queue(serviceName, {
  prefix : 'service',
  // Process at most 10 jobs per 1000 ms.
  limiter: {
    max     : 10,
    duration: 1000,
  },

  redis: {
    port: REDIS_PORT,
    host: REDIS_HOST,
    db  : REDIS_DB,
  },
})

After adding 11 requests to the queue, I logged the processor tasks, and the output looks like this:

[2024-10-09T19:59:01.111Z] PROCESS JOB(Service: service1, id: 1) STARTED
[2024-10-09T19:59:01.318Z] PROCESS JOB(Service: service1, id: 1) DONE
[2024-10-09T19:59:01.321Z] PROCESS JOB(Service: service1, id: 2) STARTED
[2024-10-09T19:59:01.330Z] PROCESS JOB(Service: service1, id: 2) DONE
[2024-10-09T19:59:01.332Z] PROCESS JOB(Service: service1, id: 3) STARTED
[2024-10-09T19:59:01.338Z] PROCESS JOB(Service: service1, id: 3) DONE
[2024-10-09T19:59:01.340Z] PROCESS JOB(Service: service1, id: 4) STARTED
[2024-10-09T19:59:01.347Z] PROCESS JOB(Service: service1, id: 4) DONE
[2024-10-09T19:59:01.349Z] PROCESS JOB(Service: service1, id: 5) STARTED
[2024-10-09T19:59:01.354Z] PROCESS JOB(Service: service1, id: 5) DONE
[2024-10-09T19:59:01.377Z] PROCESS JOB(Service: service1, id: 6) STARTED
[2024-10-09T19:59:01.382Z] PROCESS JOB(Service: service1, id: 6) DONE
[2024-10-09T19:59:01.438Z] PROCESS JOB(Service: service1, id: 7) STARTED
[2024-10-09T19:59:01.444Z] PROCESS JOB(Service: service1, id: 7) DONE
[2024-10-09T19:59:01.496Z] PROCESS JOB(Service: service1, id: 8) STARTED
[2024-10-09T19:59:01.503Z] PROCESS JOB(Service: service1, id: 8) DONE
[2024-10-09T19:59:01.558Z] PROCESS JOB(Service: service1, id: 9) STARTED
[2024-10-09T19:59:01.564Z] PROCESS JOB(Service: service1, id: 9) DONE
[2024-10-09T19:59:01.618Z] PROCESS JOB(Service: service1, id: 10) STARTED
[2024-10-09T19:59:01.625Z] PROCESS JOB(Service: service1, id: 10) DONE
[2024-10-09T19:59:02.102Z] PROCESS JOB(Service: service1, id: 11) STARTED
[2024-10-09T19:59:02.113Z] PROCESS JOB(Service: service1, id: 11) FAILED - **Request failed with status code 429**

First job (id 1) start timestamp: 2024-10-09T19:59:01.111Z

Last job (id 11) start timestamp: 2024-10-09T19:59:02.102Z

02.102 - 01.111 = 991 ms

So job#11 started 991 ms after job#1, earlier than the 1000 ms window should allow.

Bull version

4.16.3

roggervalf commented 1 week ago

Hi @bassamjarad, I recommend migrating to the bullmq package, where the rate-limiting logic is handled in a better way: https://docs.bullmq.io/guide/rate-limiting
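
For reference, a minimal sketch of the equivalent setup in BullMQ. Note that in BullMQ the limiter is configured on the Worker rather than the Queue; serviceName, processJob, and the Redis settings below are placeholders matching your original config:

const { Worker } = require('bullmq');

// Minimal sketch: in BullMQ the rate limiter moves to the Worker options.
// processJob is a placeholder for your job processor function.
const worker = new Worker(serviceName, processJob, {
  connection: { host: REDIS_HOST, port: REDIS_PORT, db: REDIS_DB },
  limiter: {
    max     : 10,   // at most 10 jobs...
    duration: 1000, // ...per 1000 ms window
  },
});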

bassamjarad commented 1 week ago

Thanks @roggervalf for the reply. I've moved to using BullMQ, but I'm afraid I'm still facing the issue. I've collected more info here:

17.514  WORKER PROCESSOR(Service: service1, id: 1) STARTED
17.599  WEBSERVER REQ#1 >> GET /dummy1/get?q=1&age=45 
17.606  WORKER PROCESSOR(Service: service1, id: 1) DONE
17.607  WORKER PROCESSOR(Service: service1, id: 2) STARTED
17.611  WEBSERVER REQ#2 >> GET /dummy1/get?q=1&age=45 
17.612  WORKER PROCESSOR(Service: service1, id: 2) DONE
17.613  WORKER PROCESSOR(Service: service1, id: 3) STARTED
17.615  WEBSERVER REQ#3 >> GET /dummy1/get?q=1&age=45 
17.616  WORKER PROCESSOR(Service: service1, id: 3) DONE
17.617  WORKER PROCESSOR(Service: service1, id: 4) STARTED
17.619  WEBSERVER REQ#4 >> GET /dummy1/get?q=1&age=45 
17.620  WORKER PROCESSOR(Service: service1, id: 4) DONE
17.621  WORKER PROCESSOR(Service: service1, id: 5) STARTED
17.623  WEBSERVER REQ#5 >> GET /dummy1/get?q=1&age=45 
17.624  WORKER PROCESSOR(Service: service1, id: 5) DONE
17.625  WORKER PROCESSOR(Service: service1, id: 6) STARTED
17.627  WEBSERVER REQ#6 >> GET /dummy1/get?q=1&age=45 
17.628  WORKER PROCESSOR(Service: service1, id: 6) DONE
17.629  WORKER PROCESSOR(Service: service1, id: 7) STARTED
17.631  WEBSERVER REQ#7 >> GET /dummy1/get?q=1&age=45 
17.632  WORKER PROCESSOR(Service: service1, id: 7) DONE
17.633  WORKER PROCESSOR(Service: service1, id: 8) STARTED
17.635  WEBSERVER REQ#8 >> GET /dummy1/get?q=1&age=45 
17.636  WORKER PROCESSOR(Service: service1, id: 8) DONE
17.637  WORKER PROCESSOR(Service: service1, id: 9) STARTED
17.639  WEBSERVER REQ#9 >> GET /dummy1/get?q=1&age=45 
17.640  WORKER PROCESSOR(Service: service1, id: 9) DONE
17.641  WORKER PROCESSOR(Service: service1, id: 10) STARTED
17.643  WEBSERVER REQ#10 >> GET /dummy1/get?q=1&age=45 
17.644  WORKER PROCESSOR(Service: service1, id: 10) DONE
18.514  WORKER PROCESSOR(Service: service1, id: 11) STARTED
18.518  WEBSERVER REQ#11 LIMIT HIT 
18.520  WORKER PROCESSOR(Service: service1, id: 11) FAILED - Request failed with status code 429

I've noticed that the "WORKER PROCESSOR" lines (generated by the worker's processor) show: job#1 started @17.514 and job#11 started @18.514.

So, basically, the worker processes job#11 after exactly 1 second, which is correct (18.514 - 17.514 = 1000 ms).

However, the webserver that receives and handles the requests ("WEBSERVER") received request#1 @17.599 and request#11 @18.518, less than 1 second apart, so it returned error 429 (too many requests).

So the webserver received 10 requests in the window between 17.599 and 17.643 and handled them all successfully. When it then receives the 11th request @18.518, that request fails, which also makes sense from the webserver's point of view (18.518 - 17.599 = 919 ms).

Any ideas on how to solve this problem?

roggervalf commented 1 week ago

Hey, it doesn't look like there is an issue on the bullmq side: the 10 jobs are picked up within the rate-limit window; the drift happens in the window where the API calls themselves are made, as you already mentioned. If you are receiving 429 errors even with a rate limit in place, you can consider using our manual rate limit: https://docs.bullmq.io/guide/rate-limiting#manual-rate-limit. These errors usually come with a response header indicating the remaining TTL (such as Retry-After) that you can use.
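
A minimal sketch of that manual approach, assuming the downstream service sends a Retry-After header (in seconds) on 429 responses; callService and the axios-style err.response shape are placeholders, not part of BullMQ:

const { Worker } = require('bullmq');

const worker = new Worker(serviceName, async (job) => {
  try {
    return await callService(job.data); // placeholder for the real API call
  } catch (err) {
    // Assumed axios-style error shape; adapt to your HTTP client.
    if (err.response && err.response.status === 429) {
      // Pause job processing for the TTL the server reports (fall back to 1 s).
      const ttlMs = (Number(err.response.headers['retry-after']) || 1) * 1000;
      await worker.rateLimit(ttlMs);
      // Throw the special error so BullMQ moves the job back to waiting
      // instead of marking it as failed.
      throw Worker.RateLimitError();
    }
    throw err;
  }
}, {
  connection: { host: REDIS_HOST, port: REDIS_PORT, db: REDIS_DB },
  limiter: { max: 10, duration: 1000 },
});

This way the worker's own limiter handles the steady-state throughput, and the manual rateLimit call resynchronizes the queue with the server's actual window whenever a 429 slips through.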