Talent-Catalog / talentcatalog

https://tctalent.org
GNU Affero General Public License v3.0

Investigate the recent 100% cpu spike on AWS #794

Closed sadatmalik closed 5 months ago

samschlicht commented 5 months ago

Here are the big spikes from last week. I'm focusing on the 100% one but thought I should highlight them all, just in case they have anything in common.

Screenshot 2024-03-28 at 1.53.58 pm.png Screenshot 2024-03-28 at 1.54.38 pm.png Screenshot 2024-03-28 at 1.54.27 pm.png

samschlicht commented 5 months ago

Just having a look in CloudWatch, I see that we have an option to switch on log anomaly detection. It looks like it's free and would add some vigilance for unusual log patterns.
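
For reference, the same option can be switched on from a script instead of the console. Below is a minimal sketch using boto3's CloudWatch Logs client. The log group ARN and detector name are placeholders rather than values from our account, and the 15-minute evaluation frequency is only an assumed starting point.

```python
from typing import Any

# Evaluation frequencies accepted by the CloudWatch Logs API.
ALLOWED_FREQUENCIES = {
    "ONE_MIN", "FIVE_MIN", "TEN_MIN", "FIFTEEN_MIN", "THIRTY_MIN", "ONE_HOUR",
}


def build_detector_request(
    log_group_arn: str,
    detector_name: str,
    evaluation_frequency: str = "FIFTEEN_MIN",  # assumed default, tune as needed
) -> dict[str, Any]:
    """Build the keyword arguments for create_log_anomaly_detector."""
    if evaluation_frequency not in ALLOWED_FREQUENCIES:
        raise ValueError(f"Unsupported evaluation frequency: {evaluation_frequency}")
    return {
        "logGroupArnList": [log_group_arn],
        "detectorName": detector_name,
        "evaluationFrequency": evaluation_frequency,
    }


def enable_log_anomaly_detection(log_group_arn: str, detector_name: str) -> str:
    """Create the detector and return its ARN (needs AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3

    client = boto3.client("logs")
    response = client.create_log_anomaly_detector(
        **build_detector_request(log_group_arn, detector_name)
    )
    return response["anomalyDetectorArn"]
```

Calling `enable_log_anomaly_detection` with the real log group ARN would create the detector. Keeping the request builder separate makes it easy to check the parameters before touching AWS.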

samschlicht commented 5 months ago

This is the instance that went down: \b33e59f50b4f4f049f10a6cc7198f14c

But most of the activity and errors at that time come from the other instance (\989ce365fac742bd9fd48f542a6047cf) and seem to involve file-upload/form-data (the times shown here allow for the Melbourne time difference from UTC):

Screenshot 2024-03-28 at 2.42.11 pm.png Screenshot 2024-03-28 at 2.46.28 pm.png Screenshot 2024-03-28 at 2.47.27 pm.png
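
Since the screenshots mix Melbourne local time and UTC, here is a small sketch for converting between the two when lining up log entries. It assumes timestamps are ISO 8601 strings and treats any timestamp without an offset as UTC. The function names are made up for illustration. Melbourne was still on daylight saving time (UTC+11) in late March 2024.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

MELBOURNE = ZoneInfo("Australia/Melbourne")


def utc_to_melbourne(utc_timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp (assumed UTC if naive) into Melbourne time."""
    parsed = datetime.fromisoformat(utc_timestamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(MELBOURNE)


def melbourne_to_utc(local_time: datetime) -> datetime:
    """Interpret a naive datetime as Melbourne wall-clock time and return UTC."""
    return local_time.replace(tzinfo=MELBOURNE).astimezone(timezone.utc)
```

For example, `melbourne_to_utc(datetime(2024, 3, 28, 14, 42))` gives the UTC time to search for in the logs for something seen at 2:42 pm Melbourne time.
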
samschlicht commented 5 months ago

Candidate #229186 Isamar Tabares was registering at that moment. She appears to have been kicked out of her account, because shortly after receiving her thank-you email she requested a password-reset link.

Screenshot 2024-03-28 at 2.53.30 pm.png

Interestingly, she did successfully upload an attachment, but only many hours later, and it's a Word .docx containing images that hasn't fared too well (attached). Perhaps it's just a coincidence, but I'll have a play with some candidate portal upload scenarios.

HOJA DE VIDA ISAMAR CENTER (2).docx

samschlicht commented 5 months ago

This is the last registered activity on the instance that went down, showing the same file-upload/form-data issue:

Screenshot 2024-03-28 at 3.13.04 pm.png

Here are the principal errors in text form:

samschlicht commented 5 months ago

There's not much admin activity at that time. It looks like an issue with a candidate form submission that included a file upload, but I can't glean much more from what's at hand.

I checked the other two spikes but I don't see similar errors; they appear to relate to admin searches.

samschlicht commented 5 months ago

Just for reference, I was able to reproduce the error in question by uploading a file on the candidate portal and then closing my browser window. But there was no CPU spike, so perhaps that's a red herring!

Screenshot 2024-03-28 at 4.14.31 pm.png

It's just curious that it happened so many times around the CPU spike.
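
One way to put a number on "so many times around the spike" is to bucket the matching log events by minute and see what share falls near the spike. This is only a rough sketch: it assumes the events have already been exported as (timestamp, message) pairs, and the marker string, spike time and window are whatever we choose to pass in, not values taken from our logs.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable


def errors_per_minute(
    events: Iterable[tuple[datetime, str]], marker: str
) -> Counter:
    """Count events whose message contains `marker`, bucketed by minute."""
    counts: Counter = Counter()
    for timestamp, message in events:
        if marker in message:
            counts[timestamp.replace(second=0, microsecond=0)] += 1
    return counts


def share_near_spike(
    counts: Counter, spike_time: datetime, window: timedelta
) -> float:
    """Fraction of counted events whose minute lies within `window` of the spike."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    near = sum(
        count
        for minute, count in counts.items()
        if abs(minute - spike_time) <= window
    )
    return near / total
```

A high share would only show that the errors cluster around the spike, not that they caused it, which fits with the reproduction above not triggering any CPU spike.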

samschlicht commented 5 months ago

@sadatmalik I've reviewed this and I think the reference in your ongoing profiling issue is sufficient. At this stage, there's no reason to think that the file-upload connection is anything more than a one-time coincidence. Closing this one per your suggestion.