Closed: wysow closed this 5 months ago
Here is something interesting about this: https://github.com/hollodotme/fast-cgi-client/issues/68#issuecomment-1207839792
What if the request is updating something in the database, or sending emails, for example? Retrying could run the same request/action twice, which might not be a good thing 🤔
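For illustration only, here is a minimal sketch (hypothetical, not Bref's actual code) of how a retry could be restricted to idempotent HTTP methods, so a database write or an email send is never run twice by a blind retry:

```php
<?php
// Hypothetical sketch, not Bref's actual code: wrap the PHP-FPM round-trip
// and only retry when the HTTP method is idempotent by contract.

use hollodotme\FastCGI\Exceptions\ReadFailedException;

function sendWithOptionalRetry(callable $send, string $httpMethod)
{
    try {
        return $send();
    } catch (ReadFailedException $e) {
        if (!in_array($httpMethod, ['GET', 'HEAD', 'OPTIONS'], true)) {
            throw $e; // never blindly replay POST/PUT/DELETE requests
        }
        return $send(); // single retry; a second failure still bubbles up
    }
}
```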
How often do you get these errors?
@mnapoli Can't really tell how often it appears in the long run, but when I do some manual testing I sometimes get a 50% error rate... So that's a lot...
I know that retrying is a bit crappy, but the fact is that in all the errors I see in the logs our code is never executed at all; everything happens on the Bref side... So for us a retry is no problem, and with it we get a 0% error rate in manual testing.
A 50% error rate smells like something else is borked. Is it always failing after the first invoke?
That's only my own feeling but yes I'm pretty sure it's always after the first invoke...
And to add more context, we did NOT see this behavior on workers or console lambdas (working with Symfony)
Just looked at the numbers for the last few days with the Bref Dashboard and I can see a 6-7% error rate over entire days.
That is really weird, something else must be at play here. A 6-7% error rate would be affecting all Bref users if that was a global Bref problem.
I'd start looking at ways to pinpoint the problem:
Here is a first list of answers; I will keep you posted with more answers as I get them:
* extra PHP extensions?
-> Yes, on this project we have redis and mongodb
* out of memory?
-> I'm pretty sure this is not the case, as the error happens really fast at the start of the execution (a few milliseconds). We are using a 2048 MB Lambda memory size on this project.
* spawning sub-processes from PHP?
-> This project is an API using the Bref PHP 8.2 FPM layer and Symfony, so this is not the case for me here.
* timing out?
-> Like I said, the error happens really fast at the start of the execution, so that's not the case either...
* try to see if it happens on a specific HTTP route?
-> Will do more testing, but I saw it on every HTTP route (mainly GET)
* trying to reproduce with an empty project?
-> Will try and keep you posted.
It would also be interesting to see if this happens on cold starts. If not, is the previous request successful? Does it time out? Could it fill the memory? (or any other reason it could leave the environment in a broken state)
Also, nothing specific/exotic, like using Symfony Runtime, setting a non-standard handler, etc.?
As far as I have manually tested, this is not happening on cold starts, and the previous request is always successful. No timeout, no full memory... Nothing visible at least...
This kind of problem only happens in API mode, so nothing fancy outside of classic Symfony; Symfony Runtime is NOT used in this project.
We have a custom php.ini file in our project with this content:
extension=intl
Here is the raw log we got when this problem occurs:
Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.
WARNING: [pool default] child 19 exited on signal 11 (SIGSEGV) after 424.432414 seconds from start
The 424 seconds here is really weird, as from the API client's point of view the failure is really fast...
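For context, this is roughly where that exception comes from: Bref talks to PHP-FPM through hollodotme/fast-cgi-client, and when the FPM child handling the request dies (the SIGSEGV above), the read on the FastCGI socket fails. A rough sketch based on the client's documented v3 API; the socket address and script path below are illustrative assumptions, not Bref's actual values:

```php
<?php
// Illustrative only; the API shape follows the fast-cgi-client README,
// the socket address and script path are made-up assumptions.

use hollodotme\FastCGI\Client;
use hollodotme\FastCGI\SocketConnections\NetworkSocket;
use hollodotme\FastCGI\Requests\GetRequest;
use hollodotme\FastCGI\Exceptions\ReadFailedException;

$client     = new Client();
$connection = new NetworkSocket('127.0.0.1', 9000);
$request    = new GetRequest('/var/task/public/index.php', '');

try {
    $response = $client->sendRequest($connection, $request);
    echo $response->getBody();
} catch (ReadFailedException $e) {
    // If the FPM worker crashes mid-request (e.g. SIGSEGV), the socket read
    // fails and this exception is what surfaces in the layer above.
    echo 'No response from PHP-FPM: ' . $e->getMessage();
}
```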
@mnapoli sorry, that was not the right php.ini file... Here is the correct one:
extension=mongodb
extension=redis
opcache.enable_cli=0
So I'm removing the opcache.enable_cli=0 line right now, will keep you posted.
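As a quick sanity check (a hypothetical debug snippet, not part of the project), something like this dropped into an endpoint would confirm from inside the Lambda which extensions and opcache settings are actually in effect:

```php
<?php
// Hypothetical debug snippet: dump the extension/ini state actually seen
// by the FPM worker, to confirm the custom php.ini is being picked up.
var_dump(
    extension_loaded('mongodb'),
    extension_loaded('redis'),
    ini_get('opcache.enable'),
    ini_get('opcache.enable_cli')
);
```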
Closing for now, as we know this is unfortunately not a solution we can use, since it may occasionally retry valid (non-idempotent) requests.
Hello there!
This PR is only meant to start a discussion and try to find a solution to this kind of random error:
I'm pretty sure something smarter can be achieved, but I've been manually testing this for a few minutes now and I only get 200 responses, no more 500s...