@MaartenLMEM do you remember at what time you observed this issue yesterday?
Some errors: [fastcgi:error] [pid 1260:tid 140242090718784] [client 46.252.181.103:54306] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"
There were a lot of: ool php: DIGEST-MD5 common mech free
I don't see anything else.
@lutangar @MaartenLMEM I ran: clever accesslogs --after 2021-03-24T15:00:00 --before 2021-03-24T16:00:00 -F json | jq .sT | sort | uniq -c | sort -nr > export.json
The first column is the number of occurrences. It gives:
304 "No+Content" 249 "Gateway+Timeout" 202 "OK" 50 "Moved+Permanently" 46 "Unprocessable+Entity" 40 "Request+Timeout" 34 "Internal+Server+Error" 1 "Not+Modified"
and: clever accesslogs --after 2021-03-23T09:00:00 --before 2021-03-23T12:00:00 -F json | jq .sT | sort | uniq -c | sort -nr > export.json gives:
1944 "No+Content"
919 "OK"
394 "Unprocessable+Entity"
33 "Internal+Server+Error"
17 "Not+Modified"
7 "Not+Found"
6 "Found"
2 "Moved+Permanently"
I don't see anything in particular
Since @MaartenLMEM notified us a few minutes ago, I am looking at the logs during the downtime. On every access attempt to notices.bulles.fr I see:
2021-03-29T10:36:48.803Z: [fastcgi:error] [pid 1261:tid 139797451097664] [client 185.42.117.109:38258] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"
2021-03-29T10:36:48.803Z: [fastcgi:error] [pid 1261:tid 139797451097664] (104)Connection reset by peer: [client 185.42.117.109:38258] FastCGI: comm with server "/var/www/php5-fpm/php5.external" aborted: read failed
2021-03-29T10:38:20.965Z: [fastcgi:error] [pid 1261:tid 139797351884352] (104)Connection reset by peer: [client 185.42.117.108:39410] FastCGI: comm with server "/var/www/php5-fpm/php5.external" aborted: read failed
2021-03-29T10:38:20.965Z: [fastcgi:error] [pid 1261:tid 139797351884352] [client 185.42.117.108:39410] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"
Do you have any ideas that could explain this? @lutangar @JalilArfaoui
@lutangar there are /var/log/php-fpm.log and /var/log/apache2 on the Clever Cloud SSH host, but permission is denied for me; maybe it works for you?
Today, 31/03/2021 at 12:28, the backoffice stopped working. As of 12:35 it is still blocked.
It also blocks profile pages (12:50).
Sorry for the delay!
I’m taking a look!
Before the rebuild there were at least 100 of these: [fastcgi:error] [pid 1266:tid 139846165173824] [client 185.42.117.108:58674] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5-fpm/php5.external"
@JalilArfaoui if you find more detailed logs, please tell me!
@lutangar If I understand correctly, the issue happened just after you raised memory_limit from 512 MB to 2 GB, strangely.
It seems that you changed that setting only on production … is there any reason it’s not on master ?
Maybe we allowed PHP to consume more memory than available ? I’m digging
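For the record, one quick way to check that (a minimal sketch, not the code that actually runs in production) would be to log the peak usage of each request against the configured limit:

<?php
// Minimal sketch, not the real app code: log how close each request gets
// to memory_limit, to see whether the errors line up with memory exhaustion.
register_shutdown_function(static function (): void {
    $limit = ini_get('memory_limit');       // e.g. "512M" before the change, "2G" after
    $peak  = memory_get_peak_usage(true);   // bytes actually reserved by PHP
    error_log(sprintf(
        'peak=%.1fMB memory_limit=%s uri=%s',
        $peak / 1048576,
        $limit,
        $_SERVER['REQUEST_URI'] ?? 'cli'
    ));
});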
The backend being unreachable does not seem to be related to missing memory.
It seems to have started with commit 441bda58
I’ve rolled back to bf8d2162 while we’re still investigating …
@gregoirelacoste @christpet @MaartenLMEM please tell if you experience any new unavailability … I’m monitoring and digging on my side
It’s been fine for 34 hours, so I’ve pushed commit 453ea38bf02e8ac99c758e2db49ebe4a0ecb0acc (12/01/2021) to production.
Please tell if you experience any new unavailability
It started breaking again (10 times in less than 2 hours, outside of office hours) …
I’m going back to 9beb2a34052f3010d5fd56669358af1491d08d4a
My bad, 453ea38bf02e8ac99c758e2db49ebe4a0ecb0acc was the last commit on master, whereas I wanted to push 1f38e6ceae9edd552e65b88579b77ee54a4b3097 …
It is now 1f38e6ceae9edd552e65b88579b77ee54a4b3097 in production
Let’s wait and see …
1f38e6c is broken too …
I’ve just got back to 98ae85fdce4e7a7dc37359dde673ecabad523a29
We have 94 contributors on production, 56 on staging
I’m narrowing my suspicions to commit bcd773c97fbaf10b154989f4960c14dac90f1701
It’s sending a lot more data in contribution.example for each contributor … It seems that we only use the url field, so why are we still using the deprecated example field instead of its replacement pinnedNotices in the extension (just in fetchContributorFeaturedNoticeSaga it seems)?
I’m waiting to see if it’s stable now and to get confirmation of my suspicion …
Then, as a hotfix, I will try pushing master + a revert of bcd773c97fbaf10b154989f4960c14dac90f1701 to production.
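To illustrate the kind of blow-up I suspect (a hypothetical sketch with made-up array shapes, not the actual dis-moi entities): embedding the full notice in contribution.example multiplies the payload by the number of contributors and by everything each notice drags along, instead of only the url the extension reads.

<?php
// Hypothetical illustration only (made-up shapes, not the real entities):
// the difference between embedding the full featured notice per contributor
// and exposing only the field the extension actually uses.
function buildContributorsPayload(array $contributors, bool $embedFullNotice): array
{
    $payload = [];
    foreach ($contributors as $contributor) {      // ~94 on production, 56 on staging
        $example = $contributor['example'];         // the featured notice
        $payload[] = [
            'name'    => $contributor['name'],
            'example' => $embedFullNotice
                ? $example                          // whole notice: message, ratings, relations …
                : ['url' => $example['url']],       // only what the extension reads today
        ];
    }
    return $payload;
}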
@JalilArfaoui thx !
I don't know about the other fields, but we need at least created in pinnedNotices.
GraphQL :+1:
GraphQL (or Vulcain)
Both! https://github.com/dunglas/vulcain/blob/main/docs/graphql.md
@gregoirelacoste the reason you saw nothing in the logs is that Sentry was enabled but the quota was depleted, so all our errors were swallowed :-/
the reason you saw nothing in the logs is that Sentry was enabled but the quota was depleted, so all our errors were swallowed :-/
omg again :facepalm:
@lutangar we’re just back on a paid subscription, for the record
@lutangar: Was that intentional? It seems that we only use the url field.
Not really, no, but from my point of view these fields aren't the source of the issue; I'll look for a recursion somewhere instead, or at least a long chain of nested objects.
@lutangar @gregoirelacoste Why are we still using the deprecated example field instead of its replacement pinnedNotices in the extension (just in fetchContributorFeaturedNoticeSaga it seems)?
The example field was in fact kind of a different use case than pinnedNotices. But yes, we can decide to drop it and adapt the behavior accordingly.
@lutangar @gregoirelacoste I think it’s time to think about including only notice ids when fetching contributors, and letting the extension fetch notice info when needed …
I'd say we need both depending on the context; the new backend API must support these kinds of features, yes.
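Roughly something like this (hypothetical field names, not the current API): the contributor payload only carries ids, and the extension resolves notices lazily.

<?php
// Rough sketch with hypothetical field names, not the current API:
// the contributors endpoint returns only ids …
$contributorResponse = [
    'id'              => 42,
    'name'            => 'Example contributor',
    'pinnedNoticeIds' => [101, 102, 103],
];
// … and the extension fetches each notice (title, url, created, …)
// from a hypothetical /notices/{id} endpoint only when it actually needs it.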
My suspicion is on the NoticeNormalizer; there is another fix higher up in the tree for this. I stumbled on it a few times and I remember some strange side effects.
[NormalizerOptions::INCLUDE_CONTRIBUTORS_DETAILS => false]),
I'll look for a recursion somewhere instead
I thought that too, except the same code is working OK on staging … But yes, maybe …
I’ll finish with this test first to see how it behaves, and then dig deeper …
My suspicion is on the NoticeNormalizer; there is another fix higher up in the tree for this.
I’ll check that too
I have second thoughts about this: https://github.com/dis-moi/backend/blob/441bda58690b4fdd49cdc629d0ddaf90a96d50e2/src/Serializer/NoticeNormalizer.php#L75
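To make the recursion I'm worried about concrete, here is a simplified sketch in the style of a Symfony normalizer (not the actual NoticeNormalizer, just the pattern): if the flag that strips contributor details isn't forced off for the nested contributor, each notice pulls its contributor, which pulls its pinned notices, which pull their contributor again, until PHP-FPM gives up.

<?php
// Simplified sketch of the loop, not the real NoticeNormalizer.
final class NoticeNormalizerSketch
{
    public const INCLUDE_CONTRIBUTORS_DETAILS = 'include_contributors_details';

    public function normalize(array $notice, array $context = []): array
    {
        $data = ['id' => $notice['id'], 'message' => $notice['message']];

        if ($context[self::INCLUDE_CONTRIBUTORS_DETAILS] ?? true) {
            // The guard: force the flag off for the nested contributor,
            // otherwise the contributor would re-embed its notices.
            $data['contributor'] = $this->normalizeContributor(
                $notice['contributor'],
                [self::INCLUDE_CONTRIBUTORS_DETAILS => false]
            );
        }

        return $data;
    }

    private function normalizeContributor(array $contributor, array $context): array
    {
        $data = ['id' => $contributor['id'], 'name' => $contributor['name']];

        if ($context[self::INCLUDE_CONTRIBUTORS_DETAILS] ?? true) {
            // Without the guard above, this branch would recurse back into
            // normalize() for every pinned notice.
            $data['pinnedNotices'] = array_map(
                fn (array $notice): array => $this->normalize($notice, $context),
                $contributor['pinnedNotices']
            );
        }

        return $data;
    }
}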
@lutangar are we waiting for @JalilArfaoui to come back or is there something you could do in the meantime?
@christpet I'm just waiting a few hours to determine if 98ae85fd is stable, and then move forward
@JalilArfaoui ok super thanks for the update!
https://github.com/dis-moi/backend/commit/98ae85fdce4e7a7dc37359dde673ecabad523a29 is OK
I’ve just pushed master + revert of bcd773c to production
One week later: only 2 restarts …
So I confirm what I suspected in https://github.com/dis-moi/backend/issues/335#issuecomment-812367812
I opened a PR for the revert #341
I’ll open an issue with possible optimizations
Since yesterday we have had some periods (a few minutes) where nothing works: cannot access the backend, profile pages on the website, or bubbles in the extension …
It looks like it's related to backend 52X errors.