Rungutan / sentry-fargate-cf-stack

AWS CloudFormation template to launch a highly-available Sentry 20 stack through ECS Fargate at the minimum cost possible
Apache License 2.0
60 stars 16 forks

Ingest gives only 503 in newest version 1.5.0 #11

Closed · nodomain closed 3 years ago

nodomain commented 3 years ago

Hi,

I completely set everything up again from scratch, but the latest version now only returns a 503 on the ingest endpoint, so no events are captured.

Thanks for any ideas.

mariusmitrofan commented 3 years ago

Stack finished successfully but ingestion throws 503?

If that's the case, then setting all images to tag 1.4.0 might be a good workaround for now...
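
Rolling the tags back is just a plain stack update. A rough boto3 sketch (the parameter key below is hypothetical - use whatever image-tag parameters the template actually exposes, and keep `UsePreviousValue` for everything else):

```python
import boto3

cf = boto3.client("cloudformation")

cf.update_stack(
    StackName="sentry",  # whatever name you gave the stack at creation time
    UsePreviousTemplate=True,
    Parameters=[
        # Hypothetical parameter key - check the template for the real one(s).
        {"ParameterKey": "SentryImageTag", "ParameterValue": "1.4.0"},
        # Every other parameter should keep its current value, e.g.:
        # {"ParameterKey": "VpcId", "UsePreviousValue": True},
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)
```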

If not, you're going to have to check the ingestion logs, specifically the CloudWatch logs from the ECS containers for "relay" in the web ECS cluster.
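
Something along these lines pulls the recent relay errors (a sketch; the log group name is a guess - list the log groups the stack created and pick the relay one):

```python
import time

import boto3

logs = boto3.client("logs")

# Hypothetical log group name - use the group that belongs to the relay containers.
response = logs.filter_log_events(
    logGroupName="/ecs/sentry-web-relay",
    startTime=int((time.time() - 3600) * 1000),  # last hour, in milliseconds
    filterPattern="error",
)

for event in response["events"]:
    print(event["timestamp"], event["message"])
```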

I'll start a stack myself in the meantime from scratch to double check.

PS: you might want to check those logs before updating the images anyway...

nodomain commented 3 years ago

Yes, stack finished but 503. Checking the 1.4.0 workaround right now - I'll leave Clickhouse at v1.5.0, right?

nodomain commented 3 years ago

Ingest back up and running.

mariusmitrofan commented 3 years ago

Awesome. Will check compatibility again with the latest release from Sentry.

Cheers

mariusmitrofan commented 3 years ago

Release 1.6.0 fixed the ingest error @nodomain !

nodomain commented 3 years ago

Cool, will have a look. I set up a staging stack now so as not to break production. Meanwhile the prod environment just stopped processing new events. I'll check with 1.6.0 then.

nodomain commented 3 years ago

And as a side note: perhaps you could use squash commits when merging into main; fewer commits would make the history easier to follow.

mariusmitrofan commented 3 years ago

> Cool, will have a look. I set up a staging stack now so as not to break production. Meanwhile the prod environment just stopped processing new events. I'll check with 1.6.0 then.

Might want to check your monitoring data.

You might be DDoS-ing yourself. If that's the case, either modify the sample traces or increase the instance sizes for Redis/Postgres.
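
A quick sketch for eyeballing Redis memory pressure from CloudWatch (the cache cluster id is a placeholder - use the ElastiCache cluster the stack created):

```python
from datetime import datetime, timedelta, timezone

import boto3

cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# DatabaseMemoryUsagePercentage is the ElastiCache-for-Redis memory metric;
# the CacheClusterId below is hypothetical - use the one from your stack.
stats = cw.get_metric_statistics(
    Namespace="AWS/ElastiCache",
    MetricName="DatabaseMemoryUsagePercentage",
    Dimensions=[{"Name": "CacheClusterId", "Value": "sentry-redis-001"}],
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Maximum"], 1), "%")
```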

nodomain commented 3 years ago

Thanks for the hint. I now resized Redis and RDS accordingly; let's see if that brings it back to life :-)

nodomain commented 3 years ago

Redis memory was the culprit: 100% full and swapping. I resized it and am now waiting for new events to come in.

nodomain commented 3 years ago

Nothing changed. I also stopped all tasks in the workers group in ECS to have them restarted. Any other tips from your experience? No real errors in the logs :-/
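
(For reference, stopping all worker tasks so the ECS scheduler starts fresh ones can be scripted roughly like this - the cluster name below is a guess:)

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical cluster name - use the worker cluster created by the stack.
cluster = "sentry-workers"

tasks = ecs.list_tasks(cluster=cluster)["taskArns"]

# Stopping the tasks lets the ECS service scheduler start fresh replacements.
for task_arn in tasks:
    ecs.stop_task(cluster=cluster, task=task_arn, reason="manual restart")
```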

mariusmitrofan commented 3 years ago

When that happened to me it was due to a poor definition of the client DSN.

Check the "installation instructions" for a random project and try to submit an exception.
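
A quick smoke test with the Python SDK would look roughly like this (the DSN is a placeholder - copy the real one from the project's installation instructions):

```python
import sentry_sdk

# Placeholder DSN - use the one shown in the project's installation
# instructions in your Sentry UI.
sentry_sdk.init(dsn="https://<key>@<your-sentry-host>/<project-id>")

try:
    1 / 0
except ZeroDivisionError as exc:
    sentry_sdk.capture_exception(exc)

# Flush pending events before the script exits.
sentry_sdk.flush(timeout=5)
```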

And go from there...

PS: remember that SSL problem with the ingest record mentioned in the README

nodomain commented 3 years ago

After resizing MSK as well, things seem to be coming back to life. Good learning experience :) Using Sentry as a CSP report endpoint for a high-traffic site does not co-exist well with small t* instances... Furthermore, the 1.6.0 upgrade went smoothly.

Thanks for your great efforts!