sastels opened 3 months ago
The extra logging didn't log anything (but we still have crashes). This means the problem is likely in dao_get_jobs_older_than_data_retention
Running in prod:
select count(*), min(created_at), max(created_at)
from jobs
where created_at < now() - INTERVAL '7 DAY'
and archived = false
We see unarchived jobs going back a month:
| count | min | max |
|---|---|---|
| 482,234 | 2024-02-07 05:03:45 UTC | 2024-02-26 14:50:48 UTC |
Probably we have so many unarchived jobs that the query is overwhelming celery and OOMing it :/ This means that we're not deleting the csv files from s3 when we should. However, as mentioned above, we have a one-week expiry period on the bucket, so nothing will be more than a week old.
A possible fix would be to put a limit on the number of unarchived jobs returned by the query, then loop until the query returns zero jobs to be archived.
It looks like we've been at 20K jobs/day for a good while, but in the past month we've crept up to 35K or so per day, so a query limit of 20K should probably be fine.
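As a rough sketch of that loop (the limit parameter and the helper names here are assumptions for illustration, not the actual dao code):

```python
QUERY_LIMIT = 20_000  # roughly one day's worth of jobs at current volume

def remove_old_jobs():
    while True:
        # assumes dao_get_jobs_older_than_data_retention grows a limit parameter
        jobs = dao_get_jobs_older_than_data_retention(limit=QUERY_LIMIT)
        if not jobs:
            break  # nothing left to archive
        for job in jobs:
            remove_csv_from_s3(job)  # hypothetical S3 cleanup helper
            archive_job(job)         # hypothetical helper that sets archived = true
```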
@ben851 to review
Reviewed, looked ok, but others had questions. Holding back on approving until the others are satisfied. @jimleroyer can you please take a look?
merged to staging
Let's release to production and see if that gets past the issue. Also, the errors now happen at 5:01 with the recent time change.
ok, it works, but it's slow. We should be able to batch up the database and s3 operations.
WIP PR https://github.com/cds-snc/notification-api/pull/2139
batching up merged. We should now see
Archiving 100 jobs
logged many times. Basically, estimate throughput as 100 * (# of those lines per minute).
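For context, the batching looks roughly like this (a sketch; remove_csv_from_s3 and dao_archive_jobs are assumed names, with the batch size matching the log line above):

```python
from flask import current_app

BATCH_SIZE = 100

def archive_batched(jobs):
    for i in range(0, len(jobs), BATCH_SIZE):
        batch = jobs[i:i + BATCH_SIZE]
        current_app.logger.info(f"Archiving {len(batch)} jobs")
        for job in batch:
            remove_csv_from_s3(job)  # hypothetical per-job S3 delete
        dao_archive_jobs(batch)      # hypothetical single DB update per batch
```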
Unfortunately there was no release yesterday, so the old code logic ran and processed all jobs in production. The good news is that this issue should not surface for the time being. The bad news is that we haven't yet seen the new code run faster in production.
One thing to note: when I checked the number of pending jobs in the database, there were always 20 of them instead of 0. All scheduled jobs were flushed properly, and the remaining ones are scheduled to be sent in the future.
There were a few errors on staging last night - will investigate if they're related...
yeah... the task has been hitting a psycopg2.errors.InvalidRowCountInLimitClause
every night, exactly since my batching PR merged into staging :confused: I've reverted the PR and am investigating...
[2024-03-19 09:00:08,212: ERROR/ForkPoolWorker-2] Celery task: remove_sms_email_jobs failed
Traceback (most recent call last):
File "/app/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
self.dialect.do_execute(
File "/app/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
cursor.execute(statement, parameters)
File "/app/.venv/lib/python3.10/site-packages/newrelic/hooks/database_psycopg2.py", line 64, in execute
return super(CursorWrapper, self).execute(sql, parameters, *args,
File "/app/.venv/lib/python3.10/site-packages/newrelic/hooks/database_dbapi2.py", line 37, in execute
return self.__wrapped__.execute(sql, parameters,
psycopg2.errors.InvalidRowCountInLimitClause: LIMIT must not be negative
This is rather confusing and hard to test....
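For what it's worth, the underlying Postgres error is easy to reproduce in isolation (a minimal sketch; the connection string is a placeholder):

```python
import psycopg2

conn = psycopg2.connect("dbname=notify")  # placeholder connection string
cur = conn.cursor()
# Postgres rejects a negative LIMIT outright:
cur.execute("SELECT 1 LIMIT %s", (-1,))
# psycopg2.errors.InvalidRowCountInLimitClause: LIMIT must not be negative
```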
Plan:
ok, there are some bad calculations going on here. In particular, we have a loop where we could repeatedly add up to limit jobs, then further down we compute a new limit as limit - len(jobs), which could be negative if we've added too many jobs above.
I'll make a PR to fix these limits (see the sketch below). I believe the reason this wasn't an issue before is that the previous 20K limit was high enough that the jobs array never got too big. I would have thought it would happen on prod though? Possibly we don't have services with 3-day retention on prod sending too many jobs.
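A minimal sketch of the failure mode and the guard (names are illustrative, not the actual query code):

```python
def get_jobs_to_archive(service_retentions, limit=100):
    # query_old_jobs is a hypothetical stand-in for the real per-service query
    jobs = []
    for service_id, retention_days in service_retentions:
        remaining = limit - len(jobs)
        # In the buggy version each iteration could add up to `limit` jobs,
        # so this difference could go negative and Postgres would raise
        # InvalidRowCountInLimitClause ("LIMIT must not be negative").
        if remaining <= 0:
            break  # fix: never issue a query with a non-positive LIMIT
        jobs.extend(query_old_jobs(service_id, retention_days, limit=remaining))
    return jobs
```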
Just making a few more changes to the PR
Had an issue with the code deploying to staging, so the fix wasn't in last night. The code is properly deployed to staging now.
Did we have any 5:01 crashes? I'll verify.
Verified the logs in Slack and the dashboard in AWS; the last instance of this error was March 21.
Describe the bug
A core-primary or core-scalable pod in prod is crashing at 4:01 most nights. The celery pod is receiving a SIGKILL.
Bug Severity
SEV-3: doesn't seem to be impacting sending notifications.
This is causing a buildup of jobs that aren't being archived. Additionally, the CSVs are not being deleted, as they are tied to these jobs. This would normally raise the severity, but the S3 bucket has an auto-delete to ensure that we meet our privacy restrictions.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Celery should not crash on large queries
Impact
Possibly csv files in S3 (including ones with PII) aren't being cleaned up within the retention period. Note however that the csv bucket in S3 has a one-week lifecycle expiry, so we won't have PII older than one week.
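For reference, a one-week lifecycle expiry of the kind described can be configured with boto3 roughly like this (bucket name and rule ID are placeholders):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="notification-csv-upload",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-csvs-after-one-week",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```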
QA
Additional context
Getting SIGKILLs (possibly because of this task OOMing over a bad job).
Slack discussion here.