SGrondin / bottleneck

Job scheduler and rate limiter, supports Clustering
MIT License

Jobs not running #194

Open arashmortazavi2 opened 3 years ago

arashmortazavi2 commented 3 years ago

I'm using bottleneck (with clustering enabled) in TypeScript inside an AWS Lambda function. The library works fine for a while, but at some point jobs stop running: the await requestLimiter.schedule() call blocks until the Lambda times out. From then on, the same thing keeps happening to incoming jobs (they get queued but never run) for hours, until I manually flush the AWS Redis instance.

Here is how I create the bottleneck instance:

this.requestLimiter = new Bottleneck({
  id: `request-limiter`,
  datastore: "redis",
  clearDatastore: false,
  maxConcurrent: 1,
  minTime: 133,
  clientOptions: {
    host: REDIS_HOST,
    port: REDIS_PORT
  },
  trackDoneStatus: true
});
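For reference, a simplified sketch of how such a job might be submitted; doRequest is a hypothetical placeholder for the real task, and the job options mirror what the logs below show (an explicit id, default priority and weight, no expiration):

import Bottleneck from "bottleneck";

// Sketch only: doRequest is a hypothetical placeholder for the actual work.
async function doRequest(): Promise<string> {
  return "ok"; // stands in for the real request
}

async function runJob(requestLimiter: Bottleneck): Promise<string> {
  // schedule() resolves once the job has run. With no `expiration` job
  // option, a job that never settles keeps holding its maxConcurrent slot,
  // which is consistent with the blocking behaviour described above.
  return requestLimiter.schedule({ id: "job1" }, doRequest);
}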

This is what I see in the logs for jobs that get queued but don't execute:

28 Jul 2021 07:09:04.909{
  "pid": 9,
  "level": 30,
  "message": "Event triggered: received",
  "job": [
    {
      "args": [],
      "options": {
        "priority": 5,
        "weight": 1,
        "expiration": null,
        "id": "job1"
      }
    }
  ],
  "msg": "bottleneck debug event",
  "time": "2021-07-28T07:09:04.566Z",
  "v": 0,
  "_level": "info"
}
28 Jul 2021 07:09:04.909{
  "pid": 9,
  "level": 30,
  "message": "Calling Redis script: init.lua",
  "job": [
    1627456144598,
    "qbrslv11ujd",
    0,
    "2.19.5",
    "maxConcurrent",
    "1",
    "minTime",
    "113",
    "highWater",
    "",
    "strategy",
    "1",
    "penalty",
    "",
    "reservoir",
    "",
    "reservoirRefreshInterval",
    "",
    "reservoirRefreshAmount",
    "",
    "reservoirIncreaseInterval",
    "",
    "reservoirIncreaseAmount",
    "",
    "reservoirIncreaseMaximum",
    "",
    "id",
    "request-limiter",
    "version",
    "2.19.5",
    "groupTimeout",
    "",
    "clientTimeout",
    "10000"
  ],
  "msg": "bottleneck debug event",
  "time": "2021-07-28T07:09:04.598Z",
  "v": 0,
  "_level": "info"
}
28 Jul 2021 07:09:04.909{
  "pid": 9,
  "level": 30,
  "message": "Calling Redis script: register_client.lua",
  "job": [
    1627456144601,
    "qbrslv11ujd",
    0
  ],
  "msg": "bottleneck debug event",
  "time": "2021-07-28T07:09:04.601Z",
  "v": 0,
  "_level": "info"
}
28 Jul 2021 07:09:04.909{
  "pid": 9,
  "level": 30,
  "message": "Calling Redis script: submit.lua",
  "job": [
    1627456144602,
    "qbrslv11ujd",
    "0",
    "1"
  ],
  "msg": "bottleneck debug event",
  "time": "2021-07-28T07:09:04.602Z",
  "v": 0,
  "_level": "info"
}
28 Jul 2021 07:09:04.909{
  "pid": 9,
  "level": 30,
  "message": "Event triggered: queued",
  "job": [
    {
      "args": [],
      "options": {
        "priority": 5,
        "weight": 1,
        "expiration": null,
        "id": "job1"
      },
      "reachedHWM": false,
      "blocked": false
    }
  ],
  "msg": "bottleneck debug event",
  "time": "2021-07-28T07:09:04.603Z",
  "v": 0,
  "_level": "info"
}
28 Jul 2021 07:09:04.909{
  "pid": 9,
  "level": 30,
  "message": "Draining job1",
  "job": {
    "args": [],
    "options": {
      "priority": 5,
      "weight": 1,
      "expiration": null,
      "id": "job1"
    }
  },
  "msg": "bottleneck debug event",
  "time": "2021-07-28T07:09:04.604Z",
  "v": 0,
  "_level": "info"
}
28 Jul 2021 07:09:04.909{
  "pid": 9,
  "level": 30,
  "message": "Calling Redis script: register.lua",
  "job": [
    1627456144604,
    "qbrslv11ujd",
    "8998iaiwo2j",
    "1",
    ""
  ],
  "msg": "bottleneck debug event",
  "time": "2021-07-28T07:09:04.604Z",
  "v": 0,
  "_level": "info"
}
28 Jul 2021 07:09:04.909{
  "pid": 9,
  "level": 30,
  "message": "Drained job1",
  "job": {
    "success": false,
    "args": [],
    "options": {
      "priority": 5,
      "weight": 1,
      "expiration": null,
      "id": "job1"
    }
  },
  "msg": "bottleneck debug event",
  "time": "2021-07-28T07:09:04.605Z",
  "v": 0,
  "_level": "info"
}

No other logs for job1 afterward.

What could be the issue here? Any pointers or suggestions would be appreciated. Please let me know if you need any more info. Thanks!

Aravind-React commented 2 years ago

Hi @arashmortazavi2, I'm facing a very similar issue with bottleneck. I implemented a rate limiter for third-party calls, and whenever the TPS reaches its threshold, the queue gets blocked and stops processing any other jobs until I manually restart my instance. Did you fix this issue? Below is my configuration; note that I haven't set any expiration in the scheduler options.

{
  maxConcurrent: 225,
  minTime: 4,
  highWater: 450,
  strategy: 2,
  clearDatastore: true,
  datastore: 'redis',
  clientOptions: { host: '127.0.x.x', port: '6379' },
  id: 'my-vendor'
}

Thanks.
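For what it's worth, one mitigation to check (it may or may not address the root cause here) is setting an expiration on the jobs themselves, so a job that never settles is eventually failed and its capacity released instead of blocking the queue forever. A rough sketch, with callVendor as a hypothetical placeholder and illustrative limiter settings rather than the exact config above:

import Bottleneck from "bottleneck";

// Sketch only: callVendor is a hypothetical placeholder for the third-party call.
const limiter = new Bottleneck({ maxConcurrent: 225, minTime: 4 });

async function callVendor(): Promise<string> {
  return "ok"; // stands in for the real request
}

// With an `expiration` job option, bottleneck fails a job that has not
// completed within the given time (30s here) instead of letting it hold a
// concurrency slot indefinitely.
async function limitedCall(): Promise<string> {
  return limiter.schedule({ expiration: 30000 }, callVendor);
}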

vikram741 commented 1 year ago

I have encountered a similar bottleneck problem and figured out the root cause. In my case, I was using bottleneck to rate-limit calls to a third-party API, and my implementation had two problems. First, my Node.js application runs in multiple instances on the same device, each with its own bottleneck instance, so the instances can make calls simultaneously and exceed the rate limit (this can be fixed by making bottleneck distributed using Redis).

Second, when I got a rate-limit error, I retried by awaiting the same wrapped function. That creates a new job from inside a currently running job and pushes it onto the queue: the current job can't finish because it is waiting for the retry job, and the retry job can't start because the current job is still running. The result is a circular dependency and a deadlock. I resolved it by calling the function directly instead of the wrapped function, so the retry is not queued. Hope this helps someone in the future.
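A minimal sketch of the deadlock and the fix, assuming maxConcurrent: 1 for clarity (callApi and isRateLimitError are hypothetical placeholders, not bottleneck APIs):

import Bottleneck from "bottleneck";

const limiter = new Bottleneck({ maxConcurrent: 1, minTime: 133 });

// Hypothetical placeholder for the real third-party request.
async function callApi(): Promise<string> {
  return "ok";
}

// Hypothetical check; adapt to the third-party API's error shape.
function isRateLimitError(err: unknown): boolean {
  return (err as { status?: number })?.status === 429;
}

// DEADLOCK: the retry is scheduled through the same limiter from inside a
// running job. The running job awaits the retry, but the retry is queued
// behind the running job, so neither can ever finish.
const deadlockingCall = limiter.wrap(async (): Promise<string> => {
  try {
    return await callApi();
  } catch (err) {
    if (isRateLimitError(err)) {
      return deadlockingCall(); // re-enters the queue while holding the only slot
    }
    throw err;
  }
});

// FIX: retry by calling the underlying function directly, so the retry is
// not queued behind the job that is waiting on it.
const safeCall = limiter.wrap(async (): Promise<string> => {
  try {
    return await callApi();
  } catch (err) {
    if (isRateLimitError(err)) {
      return callApi(); // not re-scheduled through the limiter
    }
    throw err;
  }
});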

dkbhadeshiya commented 8 months ago

Hi all,

We are also facing the same issue with our bottleneck instance backed by Redis. It processes tasks for some time, then at some point it stops processing any tasks and gets stuck. We have to manually flush the Redis keys for bottleneck to get processing going again.

Does anyone have any idea how to fix this issue?

The response from @vikram741 doesn't help, as that is not the case in our implementation. We are doing everything by the documentation and this still happens. We are using Redis for a distributed limiter, and we are not even retrying failed tasks.