Trying to address the performance issues noted in https://github.com/davydovanton/sidekiq-statistic/issues/73 and https://github.com/davydovanton/sidekiq-statistic/issues/72.

Thinking about a solution for this, I tried to come up with something that:
- Solves the memory issue in a reasonable way without adding complexity.
- Doesn't require a ton of changes to the library.
- Doesn't cause complete flushes of usage stats at random times, resulting in invalid timing data when viewing your job stats.
- Doesn't introduce new performance issues.
The solution I came up with here is to simply select a threshold for the number of jobs we want to sample timing data for, and once we cross that threshold, safely remove the oldest (very arbitrary) 1/4 of the timing values in the list. Note that this does not affect any job status counts whatsoever, only the average, max, min, and total processing time displayed with each job.
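To make that concrete, here's a minimal sketch of the idea rather than the actual diff; the `MAX_TIMESLIST_LENGTH` constant and the `push_time` helper are hypothetical names, but the key follows the `"#{worker_key}:timeslist"` format the gem already writes to:

```ruby
require 'redis'

MAX_TIMESLIST_LENGTH = 250_000 # hypothetical default threshold

def push_time(redis, worker_key, time)
  key = "#{worker_key}:timeslist"
  redis.lpush(key, time)

  # LPUSH puts the newest value at the head, so the oldest values sit at
  # the tail. Once the list crosses the threshold, keep only the newest
  # three quarters and drop the oldest quarter (plus any overage).
  if redis.llen(key) > MAX_TIMESLIST_LENGTH
    redis.ltrim(key, 0, (MAX_TIMESLIST_LENGTH * 3 / 4) - 1)
  end
end
```

Trimming with `LTRIM` keeps this O(number of removed elements) and avoids walking the list, so the cleanup stays cheap even at the default threshold.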
If this sounds concerning, consider that:

- This type of roll-up already occurs when you simply view the Sidekiq statistics dashboard. Run a bunch of jobs, visit the dashboard, and note the `Time(sec)` value for your job; this is supposed to be the total run time of all instances of that job. Run a few more jobs and refresh, and you'll see these timing metrics basically start counting again from zero.
- The default threshold covers the 250,000 most recent jobs of any specific job type, is scoped at the job level, and is configurable. It's a fairly large default that shouldn't be hit for most people, and it will consume at most about 3.5 MB per job type per day, which IMO is pretty reasonable.
- Through some local and anecdotal testing, pushing 500,000 numbers into a Redis list via LPUSH, as we do with `redis.lpush "#{worker_key}:timeslist", status[:time]`, consumes about 7 MB. That doesn't seem like much, but for those having issues and pushing > 2M jobs per day, that's a minimum of 28 MB per day, and in only a few weeks you're over 1 GB of memory usage you won't get back. A rough reproduction script follows this list.
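If you want to sanity-check the memory numbers yourself, something like this is roughly how I measured it (assumes a local Redis and the redis gem; the key name here is a throwaway, not one of the gem's keys):

```ruby
require 'redis'

redis = Redis.new
key   = 'memory_test:timeslist' # throwaway key, not used by the gem
redis.del(key)

before = redis.info('memory')['used_memory'].to_i

# Push 500,000 float-like strings, batched to keep round trips reasonable.
500_000.times.each_slice(5_000) do |batch|
  redis.lpush(key, batch.map { rand.round(6).to_s })
end

after = redis.info('memory')['used_memory'].to_i
puts format('list length: %d, memory delta: %.1f MB',
            redis.llen(key), (after - before) / 1024.0 / 1024.0)

redis.del(key) # clean up
```

The delta you see will vary a bit with Redis version and encoding, but it should land in the same ballpark as the ~7 MB figure above.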
I'll add more test results below, but initial testing shows this does fix the issue.
Thanks @davydovanton for this gem, it's the best one I've found for viewing and tracking Sidekiq history.