brefphp / bref

Serverless PHP on AWS Lambda
https://bref.sh
MIT License

Just add a retry before really throwing a FastCgiCommunicationFailed exception. #1758

Closed wysow closed 3 months ago

wysow commented 6 months ago

Hello there!

This PR is only meant to start a discussion about finding a solution to this kind of random error:

Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.

I'm pretty sure something smarter can be achieved, but I have been manually testing this for a few minutes now and only got 200 responses, with no more 500s...

wysow commented 6 months ago

Here is something interesting about this: https://github.com/hollodotme/fast-cgi-client/issues/68#issuecomment-1207839792

mnapoli commented 6 months ago

What if the request is updating something in the database, or sending emails for example. That could run the same request/action twice, which might not be a good thing 🤔

How often do you get these errors?

wysow commented 6 months ago

@mnapoli I can't really tell how often it appears in the long run, but when I do some manual testing I sometimes get a 50% error rate... So that's a lot...

I know that retrying is a bit crappy, but the fact is that in all the errors I see in the logs, our code is never executed at all: everything happens on the Bref side... So retrying is no problem for us, and with it we get a 0% error rate in manual testing.

GrahamCampbell commented 6 months ago

50% error rate smells like something else is borked. Is it always failing after the first invoke?

wysow commented 6 months ago

That's only my own feeling but yes I'm pretty sure it's always after the first invoke...

wysow commented 6 months ago

And to add more context, we did NOT see this behavior on workers or console lambdas (working with Symfony)

wysow commented 6 months ago

I just looked at the numbers for the last few days in the Bref Dashboard, and I can see a 6-7% error rate over entire days.

mnapoli commented 6 months ago

That is really weird, something else must be at play here. A 6-7% error rate would be affecting all Bref users if that was a global Bref problem.

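I'd start looking at ways to pinpoint the problem:

* extra PHP extensions?
* out of memory?
* spawning sub-processes from PHP?
* timing out?
* try to see if it happens on a specific HTTP route?
* trying to reproduce with an empty project?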

wysow commented 6 months ago

Here is a first list of answers; I will keep you posted with the other answers when I get them:

* extra PHP extensions?

-> Yes, on this project we have redis and mongodb

* out of memory?

-> I'm pretty sure this is not the case, as the error occurs very early, within a few milliseconds of the execution start. We are using a 2048 MB Lambda memory size on this project.

* spawning sub-processes from PHP?

-> This project is an API using Bref's PHP 8.2 FPM layer and Symfony, so this is not the case for me here.

* timing out?

-> Like I said, the error occurs very early at the execution start, so this is not the case either...

* try to see if it happens on a specific HTTP route?

-> will do more testing but I saw it on every HTTP route (GET mainly)

* trying to reproduce with an empty project?

-> Will try and keep you posted.

mnapoli commented 6 months ago

Would be interesting to see too if this happens on cold starts. If not, is the request before successful? Times out? Could fill the memory? (or any other reason it could leave the environment in a broken state)

Also nothing specific/exotic, like using Symfony Runtime, setting a non-standard handler, etc.

wysow commented 6 months ago

> Would be interesting to see too if this happens on cold starts. If not, is the request before successful? Times out? Could fill the memory? (or any other reason it could leave the environment in a broken state)
>
> Also nothing specific/exotic, like using Symfony Runtime, setting a non-standard handler, etc.

As far as I could test manually, this is not happening on cold starts, and the previous request is always successful. No timeout, no full memory... Nothing visible at least...

wysow commented 6 months ago

This kind of problem is only happening in API mode, so nothing fancy outside of classic Symfony, Symfony Runtime is NOT used in this project.

wysow commented 6 months ago

Custom php.ini file in our project, with this content:

extension=intl

wysow commented 6 months ago

Here is the raw log we get when this problem occurs:

Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.

WARNING: [pool default] child 19 exited on signal 11 (SIGSEGV) after 424.432414 seconds from start

The 424 seconds here is really weird, as the response seen from an API client comes back really fast...
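For context on the figure above: in PHP-FPM's log format, "seconds from start" counts from when the child process was started, not from the start of the request, so a long-lived worker can report hundreds of seconds even if the crashing request failed within milliseconds. To tally such crashes across logs, the WARNING line can be parsed mechanically. Below is a minimal sketch in Python (chosen only for illustration; it is not part of Bref), assuming log lines shaped like the one quoted above. The function name `parse_fpm_child_exit` is hypothetical.

```python
import re
from typing import Optional

# Matches PHP-FPM warnings shaped like the line quoted in this thread:
# WARNING: [pool default] child 19 exited on signal 11 (SIGSEGV) after 424.432414 seconds from start
# The "[^)]*" tolerates extra text inside the parentheses (assumption: e.g. a core-dump note).
CHILD_EXIT = re.compile(
    r"\[pool (?P<pool>[^\]]+)\] child (?P<pid>\d+) exited on signal "
    r"(?P<signo>\d+) \((?P<signame>[A-Z0-9]+)[^)]*\) "
    r"after (?P<uptime>[\d.]+) seconds from start"
)


def parse_fpm_child_exit(line: str) -> Optional[dict]:
    """Return the crash details from one log line, or None if it does not match."""
    match = CHILD_EXIT.search(line)
    if match is None:
        return None
    return {
        "pool": match.group("pool"),
        "pid": int(match.group("pid")),
        "signal": int(match.group("signo")),
        "signal_name": match.group("signame"),
        # Lifetime of the worker process, not the duration of the failed request.
        "worker_uptime_seconds": float(match.group("uptime")),
    }
```

Grouping the parsed results by `signal_name` would show whether every failure is a segmentation fault, which would point at a crashing extension rather than at the FastCGI client.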

wysow commented 6 months ago

> Custom php.ini file in our project, with this content:
>
> extension=intl

@mnapoli sorry, that was not the right php.ini file... Here is the correct one:

extension=mongodb
extension=redis
opcache.enable_cli=0

So I'm trying to delete the opcache.enable_cli=0 line right now, will keep you posted.

mnapoli commented 3 months ago

Closing for now as we know that's unfortunately not a solution we can use since it may retry valid requests (non-idempotent) on occasion.
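The objection above can be made concrete. A blanket retry re-runs any request, including ones with side effects, whereas a retry limited to HTTP methods that RFC 9110 defines as idempotent would at least skip POST and PATCH. The following is a minimal sketch of that narrower idea in Python, for illustration only: it is not Bref's code and not what this PR implemented. `CommunicationFailed`, `send_with_retry`, `send` and `restart_worker` are hypothetical names. It also assumes the application honors HTTP semantics; a GET route that sends an email would still run twice.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Methods defined as idempotent by RFC 9110: the safe methods plus PUT and DELETE.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


class CommunicationFailed(Exception):
    """Hypothetical stand-in for a FastCGI communication failure."""


def send_with_retry(
    method: str,
    send: Callable[[], T],
    restart_worker: Callable[[], None],
    max_retries: int = 1,
) -> T:
    """Call send(); on failure restart the worker, and retry only idempotent methods."""
    attempts = 0
    while True:
        try:
            return send()
        except CommunicationFailed:
            # The worker is restarted in every case, as the error message in this thread describes.
            restart_worker()
            retryable = method.upper() in IDEMPOTENT_METHODS
            if not retryable or attempts >= max_retries:
                raise
            attempts += 1
```

Even this narrower form cannot tell whether the failed request had already started executing application code, which is the uncertainty behind the decision to close.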