Backend sometimes down since yesterday, then almost everything is down

MaartenLMEM commented 3 years ago

Since yesterday we have had periods of a few minutes where nothing works: we cannot access the backend, the profile pages on the website, the bubbles in the extension...

It looks like it's related to a backend issue 52X

gregoirelacoste commented 3 years ago

@MaartenLMEM do you remember at what time you observed this issue yesterday?

gregoirelacoste commented 3 years ago

Some issues:
[fastcgi:error] [pid 1260:tid 140242090718784] [client 46.252.181.103:54306] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"

There were a lot of:
ool php: DIGEST-MD5 common mech free

I don't see anything more.

MaartenLMEM commented 3 years ago

Yesterday, Wednesday March 24th, it happened at least from 3:51 PM (see the Slack message from Johan) and for some minutes after (I was going to report the issue when I saw his message)
Today, Thursday March 25th, between 11:48 and 11:57 AM @gregoirelacoste

gregoirelacoste commented 3 years ago

@lutangar @MaartenLMEM I ran:

clever accesslogs --after 2021-03-24T15:00:00 --before 2021-03-24T16:00:00 -F json | jq .sT | sort | uniq -c | sort -nr > export.json

The first column is the number of occurrences.

It gives:

304 "No+Content"
249 "Gateway+Timeout"
202 "OK"
50 "Moved+Permanently"
46 "Unprocessable+Entity"
40 "Request+Timeout"
34 "Internal+Server+Error"
1 "Not+Modified"

And:

clever accesslogs --after 2021-03-23T09:00:00 --before 2021-03-23T12:00:00 -F json | jq .sT | sort | uniq -c | sort -nr > export.json

1944 "No+Content"
919 "OK"
394 "Unprocessable+Entity"
33 "Internal+Server+Error"
17 "Not+Modified"
7 "Not+Found"
6 "Found"
2 "Moved+Permanently"

I don't see anything in particular
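Since the outages only last a few minutes, totals over a whole hour may hide them. A minimal sketch, in Python, of bucketing the same JSON export per minute so that a burst of "Gateway+Timeout" stands out. The sT field comes from the jq filter above; the t timestamp field and its ISO 8601 format are assumptions, and the sample data is made up.

```python
import json
from collections import Counter, defaultdict


def statuses_per_minute(lines, time_field="t", status_field="sT"):
    """Count status texts per minute from JSON-lines access logs.

    `status_field` ("sT") is the field used in the jq filter above;
    `time_field` ("t") holding an ISO 8601 timestamp is an assumption.
    """
    buckets = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Keep "YYYY-MM-DDTHH:MM" so that each bucket is one minute wide.
        minute = entry[time_field][:16]
        buckets[minute][entry[status_field]] += 1
    return buckets


def suspicious_minutes(buckets, status="Gateway+Timeout", threshold=5):
    """Return the minutes where `status` appears at least `threshold` times."""
    return sorted(m for m, counts in buckets.items() if counts[status] >= threshold)


# Made-up sample data, only to show the shape of the result.
sample = [
    json.dumps({"t": "2021-03-24T15:51:%02dZ" % s, "sT": "Gateway+Timeout"})
    for s in range(6)
] + [json.dumps({"t": "2021-03-24T15:20:00Z", "sT": "OK"})]

buckets = statuses_per_minute(sample)
print(suspicious_minutes(buckets))
```

With a real export, the lines would come from the file produced by clever accesslogs -F json (without the jq filter), and the threshold would need tuning to the normal traffic.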

gregoirelacoste commented 3 years ago

Since @MaartenLMEM notified us a few minutes ago, I am looking at the logs from the downtime. At every access attempt to notices.bulles.fr I see:

2021-03-29T10:36:48.803Z: [fastcgi:error] [pid 1261:tid 139797451097664] [client 185.42.117.109:38258] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"
2021-03-29T10:36:48.803Z: [fastcgi:error] [pid 1261:tid 139797451097664] (104)Connection reset by peer: [client 185.42.117.109:38258] FastCGI: comm with server "/var/www/php5-fpm/php5.external" aborted: read failed
2021-03-29T10:38:20.965Z: [fastcgi:error] [pid 1261:tid 139797351884352] (104)Connection reset by peer: [client 185.42.117.108:39410] FastCGI: comm with server "/var/www/php5-fpm/php5.external" aborted: read failed
2021-03-29T10:38:20.965Z: [fastcgi:error] [pid 1261:tid 139797351884352] [client 185.42.117.108:39410] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"

Do you have any idea what could explain this? @lutangar @JalilArfaoui
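To see how these errors cluster in time, the lines can be summarized per minute and per error kind. A rough sketch in Python, assuming the log lines keep exactly the format pasted above; the regular expression is written for these four lines only and may need adjusting for other Apache error formats.

```python
import re
from collections import Counter

# Matches the timestamp (to the minute), the client IP and the FastCGI message.
LINE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+Z: "
    r"\[fastcgi:error\].*?\[client (?P<ip>[\d.]+):\d+\] "
    r"FastCGI: (?P<msg>.*)$"
)


def summarize(lines):
    """Return (errors per minute, errors per kind) for FastCGI error lines."""
    per_minute = Counter()
    per_kind = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a FastCGI error line in the expected format
        per_minute[match["minute"]] += 1
        msg = match["msg"]
        if msg.startswith("incomplete headers"):
            per_kind["incomplete headers"] += 1
        elif msg.endswith("read failed"):
            per_kind["read failed"] += 1
        else:
            per_kind["other"] += 1
    return per_minute, per_kind


# The four lines quoted above.
log_lines = [
    '2021-03-29T10:36:48.803Z: [fastcgi:error] [pid 1261:tid 139797451097664] [client 185.42.117.109:38258] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"',
    '2021-03-29T10:36:48.803Z: [fastcgi:error] [pid 1261:tid 139797451097664] (104)Connection reset by peer: [client 185.42.117.109:38258] FastCGI: comm with server "/var/www/php5-fpm/php5.external" aborted: read failed',
    '2021-03-29T10:38:20.965Z: [fastcgi:error] [pid 1261:tid 139797351884352] (104)Connection reset by peer: [client 185.42.117.108:39410] FastCGI: comm with server "/var/www/php5-fpm/php5.external" aborted: read failed',
    '2021-03-29T10:38:20.965Z: [fastcgi:error] [pid 1261:tid 139797351884352] [client 185.42.117.108:39410] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"',
]

per_minute, per_kind = summarize(log_lines)
print(dict(per_minute))
print(dict(per_kind))
```

Each failed request produces one "incomplete headers" line and one "read failed" line, so the request count is roughly half the line count.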

gregoirelacoste commented 3 years ago

@lutangar there are /var/log/php-fpm.log and /var/log/apache2 on the Clever Cloud instance (via SSH), but I get permission denied. Maybe it works for you?

MaartenLMEM commented 3 years ago

Today, 31/03/2021, the back office stopped working at 12:28. As of 12:35 it is still blocked.

Profile pages are blocked too (12:50).

JalilArfaoui commented 3 years ago

Sorry for the delay !

I'm taking a look!

gregoirelacoste commented 3 years ago

Before the rebuild there are at least 100 of these:
[fastcgi:error] [pid 1266:tid 139846165173824] [client 185.42.117.108:58674] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"

gregoirelacoste commented 3 years ago

@JalilArfaoui if you find more detailed logs, please tell me!

JalilArfaoui commented 3 years ago

@lutangar If I understand correctly, the issue appeared, strangely, just after you raised memory_limit from 512 MB to 2 GB.

It seems that you changed that setting only on production … is there any reason it’s not on master ?

Maybe we allowed PHP to consume more memory than available ? I’m digging

JalilArfaoui commented 3 years ago

Backend being unreachable does not seem to be related to missing memory :

JalilArfaoui commented 3 years ago

It seems to have started with commit 441bda58

JalilArfaoui commented 3 years ago

I’ve rolled back to bf8d2162 while we’re still investigating …

JalilArfaoui commented 3 years ago

@gregoirelacoste @christpet @MaartenLMEM please tell me if you experience any new downtime … I’m monitoring and digging on my side

JalilArfaoui commented 3 years ago

It’s been fine for 34 hours, so I’ve pushed commit 453ea38bf02e8ac99c758e2db49ebe4a0ecb0acc (12/01/2021) to production.

Please tell me if you experience any new downtime

JalilArfaoui commented 3 years ago

It started breaking again (10 times in less than 2 hours, outside of office hours) …

I’m going back to 9beb2a34052f3010d5fd56669358af1491d08d4a

JalilArfaoui commented 3 years ago

My bad, 453ea38bf02e8ac99c758e2db49ebe4a0ecb0acc was the last commit on master, whereas I wanted to push 1f38e6ceae9edd552e65b88579b77ee54a4b3097 …

It is now 1f38e6ceae9edd552e65b88579b77ee54a4b3097 in production

Let’s wait and see …

JalilArfaoui commented 3 years ago

1f38e6c is broken too …

I’ve just gone back to 98ae85fdce4e7a7dc37359dde673ecabad523a29

JalilArfaoui commented 3 years ago

Sentry confirms that it’s an OutOfMemoryError on https://sentry.io/organizations/lmem/issues/?project=1404898&query=is%3Aunresolved%20url%3A%22http%3A%2F%2Fnotices.bulles.fr%2Fapi%2Fv3%2Fcontributors%22

https://sentry.io/organizations/lmem/issues/2312489126/

JalilArfaoui commented 3 years ago

We have 94 contributors on production, 56 on staging

JalilArfaoui commented 3 years ago

I’m narrowing my suspicions to commit bcd773c97fbaf10b154989f4960c14dac90f1701

It’s sending a lot more data in contribution.example for each contributor …

before

after

@lutangar : Was it intended? It seems that we only use the url field
@lutangar @gregoirelacoste Why are we still using the deprecated example field instead of its replacement pinnedNotices in the extension (just in fetchContributorFeaturedNoticeSaga, it seems)?
@lutangar @gregoirelacoste I think it’s time to think about including only notice ids when fetching contributors, and letting the extension fetch notice info when needed … or we go GraphQL (or Vulcain) 😁

I’m waiting to see if it’s stable now and to get confirmation of my suspicion …

Then, as a hotfix, I will try pushing master + a revert of bcd773c97fbaf10b154989f4960c14dac90f1701 to production
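To give an idea of what the "ids only" proposal would change, here is a small Python sketch comparing the serialized size of the two payload shapes. Only the contributor count (94) and the pinnedNotices name come from this thread; the notice fields, the number of notices per contributor and the pinnedNoticeIds name are hypothetical, and this is not the backend's actual serialization.

```python
import json

CONTRIBUTORS = 94             # production count quoted in this thread
NOTICES_PER_CONTRIBUTOR = 5   # hypothetical, for illustration only


def make_notice(notice_id):
    # Hypothetical notice shape; the real serialized fields may differ.
    return {
        "id": notice_id,
        "url": f"https://example.org/notices/{notice_id}",
        "message": "x" * 500,
        "created": "2021-03-01T00:00:00Z",
    }


def contributors_embedded():
    # Every contributor carries its full notices.
    return [
        {
            "id": c,
            "pinnedNotices": [
                make_notice(c * 100 + n) for n in range(NOTICES_PER_CONTRIBUTOR)
            ],
        }
        for c in range(CONTRIBUTORS)
    ]


def contributors_ids_only():
    # Every contributor carries only notice ids; details are fetched on demand.
    return [
        {
            "id": c,
            "pinnedNoticeIds": [c * 100 + n for n in range(NOTICES_PER_CONTRIBUTOR)],
        }
        for c in range(CONTRIBUTORS)
    ]


embedded_size = len(json.dumps(contributors_embedded()))
ids_only_size = len(json.dumps(contributors_ids_only()))
print(f"embedded: {embedded_size} bytes, ids only: {ids_only_size} bytes")
```

The trade-off is extra round trips: the extension would need a second request whenever it actually displays a notice.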

gregoirelacoste commented 3 years ago

@JalilArfaoui thx! I don't know about the other fields, but we need at least created in pinnedNotices.

GraphQL :+1:

lutangar commented 3 years ago

GraphQL (or Vulcain)

Both! https://github.com/dunglas/vulcain/blob/main/docs/graphql.md

JalilArfaoui commented 3 years ago

@gregoirelacoste the reason you saw nothing in the logs is that Sentry was enabled but its quota was depleted, so all our errors were swallowed :-/
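One way to avoid losing errors like this is to always write them to a local log before handing them to the remote service. A minimal Python sketch of that idea, not the backend's PHP setup: report_error and quota_depleted are hypothetical names, and a real Sentry client would more likely drop events silently than raise, which is why the local log comes first either way.

```python
import io
import logging


def report_error(exc, remote_send, logger):
    """Log the error locally first, then try the remote reporter.

    Returns True if the remote reporter accepted the error.
    """
    logger.error("unhandled error: %r", exc)
    try:
        remote_send(exc)
    except Exception as remote_exc:
        # The remote service is down or refusing events; keep a trace of that too.
        logger.warning("remote error reporting failed: %r", remote_exc)
        return False
    return True


def quota_depleted(exc):
    # Hypothetical stand-in for a remote reporter that refuses new events.
    raise RuntimeError("quota depleted")


# Capture the log output in memory for the demonstration.
stream = io.StringIO()
logger = logging.getLogger("backend-sketch")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

delivered = report_error(MemoryError("out of memory"), quota_depleted, logger)
print(delivered)
print(stream.getvalue())
```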

lutangar commented 3 years ago

the reason you saw nothing in the logs is that Sentry was enabled but its quota was depleted, so all our errors were swallowed :-/

omg again :facepalm:

JalilArfaoui commented 3 years ago

@lutangar we’re just back on paid subscription, for the record

lutangar commented 3 years ago

@lutangar : Was it intended? It seems that we only use the url field

Not really, no, but from my point of view these fields aren't the source of the issue. I'll look for a recursion somewhere instead, or at least a longer series of nested objects.

@lutangar @gregoirelacoste Why are we still using the deprecated example field instead of its replacement pinnedNotices in the extension (just in fetchContributorFeaturedNoticeSaga, it seems)?

The example field was in fact a somewhat different use case from pinnedNotices. But yes, we can decide to drop it and adapt the behavior accordingly.

@lutangar @gregoirelacoste I think it’s time to think about including only notice ids when fetching contributors, and letting the extension fetch notice info when needed …

I'd say we need both, depending on the context. The new backend API must support this kind of feature, yes.

My suspicion is on the NoticeNormalizer, there is another fix higher up in the tree for this. I stumbled on it a few times and I remember some strange side effects.

[NormalizerOptions::INCLUDE_CONTRIBUTORS_DETAILS => false]),
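The idea behind such an option can be sketched as follows: when a notice is normalized as part of a contributor, the nested contributor is reduced to its id so the output cannot keep nesting contributor → notice → contributor. This Python version is only an illustration with made-up data; the function names and data shapes are hypothetical and it is not the project's NoticeNormalizer.

```python
INCLUDE_CONTRIBUTORS_DETAILS = "include_contributors_details"


def normalize_notice(notice, options=None):
    """Normalize a notice; embed the full contributor only if the option allows it."""
    options = options or {}
    contributor = notice["contributor"]
    if options.get(INCLUDE_CONTRIBUTORS_DETAILS, True):
        contributor_data = normalize_contributor(contributor)
    else:
        # Break the cycle: reference the contributor by id only.
        contributor_data = {"id": contributor["id"]}
    return {"id": notice["id"], "url": notice["url"], "contributor": contributor_data}


def normalize_contributor(contributor):
    """Normalize a contributor; its notices never embed contributor details."""
    return {
        "id": contributor["id"],
        "name": contributor["name"],
        "pinnedNotices": [
            normalize_notice(n, {INCLUDE_CONTRIBUTORS_DETAILS: False})
            for n in contributor["pinnedNotices"]
        ],
    }


# Made-up data with a cycle: the contributor pins a notice that points back to it.
alice = {"id": 1, "name": "alice", "pinnedNotices": []}
notice = {"id": 10, "url": "https://example.org/n/10", "contributor": alice}
alice["pinnedNotices"].append(notice)

result = normalize_contributor(alice)
print(result)
```

Without the flag, normalizing alice would recurse until the interpreter gives up; in PHP the equivalent symptom would be memory growing until the limit is hit.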

JalilArfaoui commented 3 years ago

I'll look for a recursion somewhere instead

I thought that too, except the same code is working OK on staging … But yes, maybe …

I’ll finish with this test first to see how it behaves, and then dig deeper …

JalilArfaoui commented 3 years ago

My suspicion is on the NoticeNormalizer, there is another fix higher up in the tree for this. I

I’ll check that too

lutangar commented 3 years ago

I have second thoughts about this: https://github.com/dis-moi/backend/blob/441bda58690b4fdd49cdc629d0ddaf90a96d50e2/src/Serializer/NoticeNormalizer.php#L75

christpet commented 3 years ago

@lutangar are we waiting for @JalilArfaoui to come back or is there something you could do in the meantime?

JalilArfaoui commented 3 years ago

@christpet I'm just waiting a few hours to determine if 98ae85fd is stable, and will then move forward

christpet commented 3 years ago

@JalilArfaoui ok super thanks for the update!

JalilArfaoui commented 3 years ago

https://github.com/dis-moi/backend/commit/98ae85fdce4e7a7dc37359dde673ecabad523a29 is OK

I’ve just pushed master + revert of bcd773c to production

JalilArfaoui commented 3 years ago

One week later: only 2 restarts …

So I confirm what I suspected in https://github.com/dis-moi/backend/issues/335#issuecomment-812367812

I opened a PR for the revert #341

I’ll open an issue with possible optimizations

dis-moi / backend

Backend sometimes down since yesterday, then almost everything is down #335