cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Prod core celery pod crashes most nights at 4:01 / 5:01 #307

Open sastels opened 3 months ago

sastels commented 3 months ago

Describe the bug

A core-primary or core-scalable pod in prod is crashing at 4:01 most nights. The celery pod is receiving a SIGKILL.

Bug Severity

SEV-3: it doesn't seem to be impacting the sending of notifications.

This is causing a buildup of jobs that aren't being archived. Additionally, the CSVs are not being deleted, as they are tied to these jobs. This would normally raise the severity, but the S3 bucket has an auto-delete rule to ensure that we meet our privacy restrictions.

To Reproduce

Steps to reproduce the behavior:

  1. Create 35k jobs in the jobs table to see if we can get celery to crash
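For reference, a hypothetical sketch of how step 1 could be done with SQLAlchemy; the engine URL and the jobs table columns shown (id, created_at, archived) are assumptions and would need to match the real schema.

# Hypothetical sketch: bulk-insert ~35k old, unarchived jobs into a test database.
# Table and column names are assumptions; adjust to the real schema before use.
import uuid
from datetime import datetime, timedelta

from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("postgresql:///notification_api_test")  # test DB only
metadata = MetaData()
jobs = Table("jobs", metadata, autoload_with=engine)  # reflect the existing table

rows = [
    {
        "id": uuid.uuid4(),
        "created_at": datetime.utcnow() - timedelta(days=8),  # older than retention
        "archived": False,
    }
    for _ in range(35_000)
]

with engine.begin() as conn:
    conn.execute(jobs.insert(), rows)  # executemany-style bulk insert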

Expected behavior

Celery should not crash on large queries

Impact

Possibly CSV files in S3 (including ones with PII) aren't being cleaned up within the retention period. Note, however, that the CSV bucket in S3 has a one-week lifecycle expiry, so we won't have PII older than one week.
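For context, a one-week expiry like this is typically configured as an S3 lifecycle rule. A minimal boto3 sketch (the bucket name and rule ID below are placeholders; in practice this is more likely managed in infrastructure code):

# Sketch only: apply a 7-day expiration lifecycle rule to the CSV bucket.
# The bucket name and rule ID are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="notification-csv-upload",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-csv-after-7-days",
                "Filter": {"Prefix": ""},  # apply to all objects in the bucket
                "Status": "Enabled",
                "Expiration": {"Days": 7},  # objects deleted ~7 days after creation
            }
        ]
    },
)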

QA

Additional context

Getting SIGKILLs (possibly because this task is OOMing on a bad job).

Slack discussion here.

sastels commented 3 months ago

Extra logging added here

sastels commented 3 months ago

The extra logging didn't log anything (but we still have crashes). This suggests the problem is in dao_get_jobs_older_than_data_retention.

sastels commented 3 months ago

Running in prod:

select count(*), min(created_at), max(created_at)
from jobs
where created_at < now() - INTERVAL '7 DAY'
and archived = false

We see unarchived jobs going back a month:

count   min                     max
482,234 2024-02-07 05:03:45 UTC 2024-02-26 14:50:48 UTC

We probably have so many unarchived jobs that the query is overwhelming celery and OOMing the pod :/ This means that we're not deleting the CSV files from S3 when we should. However, as mentioned above, we have a one-week expiry on the bucket, so nothing will be more than a week old.

A possible fix would be to put a limit on the number of unarchived jobs returned by the query, then loop until the query returns zero jobs to be archived.
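A rough sketch of that limit-and-loop approach, reusing the query above with a LIMIT (the engine URL and the archive/delete helpers are placeholders, not the actual notification-api code):

# Illustrative sketch of the proposed fix: archive in bounded batches until
# no unarchived jobs older than the retention period remain.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///notification_api")  # placeholder URL

BATCH_LIMIT = 20_000

query = text(
    """
    select id from jobs
    where created_at < now() - interval '7 day'
    and archived = false
    limit :limit
    """
)

def archive_old_jobs():
    while True:
        with engine.begin() as conn:
            job_ids = [row.id for row in conn.execute(query, {"limit": BATCH_LIMIT})]
        if not job_ids:
            break  # nothing left to archive
        for job_id in job_ids:
            delete_csv_from_s3(job_id)  # placeholder: remove the uploaded CSV
            archive_job(job_id)         # placeholder: mark the job as archived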

sastels commented 3 months ago

It looks like we were at 20K jobs/day for a good while, but in the past month we've crept up to 35K or so per day. So a query limit of 20K per batch should be fine.

sastels commented 3 months ago

PR to fix https://github.com/cds-snc/notification-api/pull/2127

ben851 commented 3 months ago

@ben851 to review

ben851 commented 3 months ago

Reviewed, looked ok, but others had questions. Holding back on approving until the others are satisfied. @jimleroyer can you please take a look?

sastels commented 3 months ago

merged to staging

jimleroyer commented 3 months ago

Let's release to production and see if that gets us past the issue. Also, the errors now happen at 5:01 with the recent time change.

sastels commented 3 months ago

OK, it works, but it's slow. We should be able to batch up the database and S3 operations.

WIP PR https://github.com/cds-snc/notification-api/pull/2139
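For illustration, batching could look roughly like this (this is not the PR code; the session, Job model, bucket name, and key layout are all assumptions): one delete_objects call per batch of S3 keys and one bulk UPDATE per batch of rows, instead of per-job round trips.

# Sketch of batching the S3 and database operations. `session`, `Job`,
# the bucket name, and the key layout are assumptions.
import boto3

s3 = boto3.client("s3")

def archive_job_batch(session, jobs, bucket="notification-csv-upload"):
    # S3 accepts at most 1000 keys per delete_objects request.
    keys = [{"Key": f"service-{job.service_id}-notify/{job.id}.csv"} for job in jobs]
    for i in range(0, len(keys), 1000):
        s3.delete_objects(Bucket=bucket, Delete={"Objects": keys[i : i + 1000]})

    # One UPDATE for the whole batch rather than one per job.
    session.query(Job).filter(Job.id.in_([job.id for job in jobs])).update(
        {"archived": True}, synchronize_session=False
    )
    session.commit()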

sastels commented 3 months ago

Batching merged. Should:

jimleroyer commented 3 months ago

Unfortunately there was no release yesterday, so the old code logic kept running and processed all of the jobs in production. The good news is that this issue should not surface for the time being. The bad news is that the new code did not get to run in production, so we couldn't see whether it is faster.

jimleroyer commented 3 months ago

One thing to note: when I checked the number of pending jobs in the database, there were always 20 of them instead of 0. All scheduled jobs were flushed properly, and the remaining ones are scheduled to be sent in the future.

sastels commented 3 months ago

There were a few errors on staging last night - will investigate if they're related...

sastels commented 3 months ago

Yeah... the task has been hitting a psycopg2.errors.InvalidRowCountInLimitClause every night since exactly when my batching PR merged into staging :confused: I've reverted the PR and am investigating...


[2024-03-19 09:00:08,212: ERROR/ForkPoolWorker-2] Celery task: remove_sms_email_jobs failed
Traceback (most recent call last):
  File "/app/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/app/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
  File "/app/.venv/lib/python3.10/site-packages/newrelic/hooks/database_psycopg2.py", line 64, in execute
    return super(CursorWrapper, self).execute(sql, parameters, *args,
  File "/app/.venv/lib/python3.10/site-packages/newrelic/hooks/database_dbapi2.py", line 37, in execute
    return self.__wrapped__.execute(sql, parameters,
psycopg2.errors.InvalidRowCountInLimitClause: LIMIT must not be negative

sastels commented 3 months ago

This is rather confusing and hard to test....

Plan:

sastels commented 3 months ago

OK, there are some bad calculations going on here. In particular, we have a loop where we can repeatedly add up to limit jobs, and then further down we compute a new limit as limit - len(jobs), which can be negative if we've added too many jobs above.

I'll make a PR to fix these limits (a sketch of the fix is below). I believe the reason this wasn't an issue before is that the previous 20K limit was high enough that the jobs array didn't get too big. I would have thought it would on prod, though? Possibly we don't have services with 3-day retention on prod sending that many jobs.
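A minimal illustration of the bug and the guard (names here are illustrative, not the actual code):

# Illustrative only: the remaining limit can go negative once the accumulated
# jobs list grows past the overall limit, and Postgres then rejects the query
# with "LIMIT must not be negative". Clamping at zero (and stopping early)
# avoids it.
limit = 20_000
jobs = []

for service, retention_days in services_with_data_retention:  # hypothetical iterable
    remaining = max(limit - len(jobs), 0)  # never pass a negative LIMIT
    if remaining == 0:
        break
    jobs.extend(get_jobs_older_than(service, retention_days, limit=remaining))  # placeholder helper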

sastels commented 3 months ago

Fix https://github.com/cds-snc/notification-api/pull/2143

sastels commented 3 months ago

Just making a few more changes to the PR

sastels commented 3 months ago

Had an issue with the code deploying to staging, so the fix wasn't in last night. The code is properly deployed to staging now.

P0NDER0SA commented 3 months ago

Did we have any 5:01 crashes? I'll verify.

P0NDER0SA commented 3 months ago

Verified the logs in Slack and the dashboard in AWS; the last instance of this error was March 21.