Closed italo1983 closed 5 years ago
Could you try with the latest runtimes: https://runtimes.bref.sh/ (watch out: the whole name has changed, not just the version)
Once you have tried: do you have any other logs before that?
Could you also try to run locally to see if you have other results? (see https://bref.sh/docs/local-development.html)
I've solved the problem. It seems that Slim does not terminate the request when the database does not respond, so the FPM socket stays busy and the function returns "Internal server error" until it restarts. After applying the right permissions to the Lambda group, the function started to work perfectly! Tested on Bref 0.3 and Slim 2 with php-fpm72:4. Thank you for the great work on this great project!
Oh that's interesting.
Maybe what we could try is to set the PHP timeout (max_execution_time) to 29 seconds while the lambda stays at 30 seconds.
That way we are certain the PHP process will always be shut down before the lambda shuts down.
I'll keep this issue open to remember to implement that.
@italo1983 : Can you describe what permissions you added to your lambda?
Moving my comment here from https://github.com/mnapoli/bref/issues/214#issuecomment-460558637
OK thank you, that makes sense then. We should modify the PHP max execution time to 29 seconds then.
Any other idea to address this (e.g. if users change the API Gateway timeout below 30 seconds) is welcome. Maybe we could use the context variable (not available in Bref in the current version) to determine how much time is left and update the PHP max execution time on the fly.
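A rough sketch of the "time left" idea, written in Python for brevity rather than in Bref's PHP. It assumes the deadline comes from the Lambda Runtime API's Lambda-Runtime-Deadline-Ms response header (epoch milliseconds). The function name php_time_budget and the one-second safety margin are hypothetical, not part of Bref.

```python
import time


def php_time_budget(deadline_ms, now_ms=None, margin_s=1.0):
    """Return the whole seconds PHP may run before the invocation deadline.

    deadline_ms: invocation deadline in epoch milliseconds (assumed to be
                 taken from the Lambda-Runtime-Deadline-Ms header).
    now_ms:      current time in epoch milliseconds (defaults to the clock).
    margin_s:    hypothetical safety margin so PHP stops before Lambda does.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    remaining_s = (deadline_ms - now_ms) / 1000.0
    # Never return less than 1 second, even when the deadline is very close.
    return max(1, int(remaining_s - margin_s))


# A 30-second Lambda timeout leaves a 29-second budget for PHP.
budget = php_time_budget(deadline_ms=30_000, now_ms=0)  # → 29
```

Because the budget is derived from the actual deadline, it would also follow a Lambda timeout configured below 30 seconds. Whether the resulting value can be applied to PHP or PHP-FPM at runtime is a separate question that this sketch does not answer.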
Setting max_execution_time to 29 seconds did not work for me. As long as this lambda is warm, I'm not able to get it to work again, even for the actions with no database connection, which worked fine before this error.
OK this is good to know @j-tec. Just to be sure did you set 29 seconds in php.ini (not template.yaml)?
Yes, and I can confirm that the php.ini is loaded with a max_execution_time of 29 seconds.
I'm getting this one at the moment.
Fatal error: Uncaught Hoa\Socket\Exception\Exception: Client returns an error (number 111): Connection refused while trying to join unix:///tmp/.bref/php-fpm.sock. in /var/task/vendor/hoa/socket/Client.php:191
22:43:01 Stack trace:
22:43:01 #0 /var/task/vendor/hoa/stream/Stream.php(219): Hoa\Socket\Client->_open('unix:///tmp/.br...', NULL)
22:43:01 #1 /var/task/vendor/hoa/stream/Stream.php(297): Hoa\Stream\Stream::_getStream('unix:///tmp/.br...', Object(Hoa\Socket\Client), NULL)
22:43:01 #2 /var/task/vendor/hoa/stream/Stream.php(178): Hoa\Stream\Stream->open()
22:43:01 #3 /var/task/vendor/hoa/socket/Connection/Connection.php(197): Hoa\Stream\Stream->__construct('unix:///tmp/.br...', NULL)
22:43:01 #4 /var/task/vendor/hoa/fastcgi/Responder.php(183): Hoa\Socket\Connection\Connection->connect()
22:43:01 #5 /var/task/vendor/mnapoli/bref/src/Runtime/PhpFpm.php(102): Hoa\Fastcgi\Responder->send(Array, '')
22:43:01 #6 /opt/bootstrap(26): Bref\Runtime\PhpFpm->proxy(Array)
22:43:01 #7 /var/task/vendor/mnapoli/bref/src/Runtime/LambdaRuntime.php(58): {closure}(Array)
22:43:01 #8 /opt/bootstrap(27): Bref\Runtime\LambdaRuntime->processNextEvent(Object(Closure))
Turns out the problem does seem to be related to some process hanging longer than the maximum timeout. For me it was because of MySQL access. Since PDO takes longer than 30 seconds to timeout and the Lambda was outside the VPC (without access to Aurora), the fpm process would just hang in the first run and never work. By deploying a new lambda inside the VPC, the problem went away because the MySQL timeout went away.
@deleugpn I'm facing the same issue. I've added the RDS VPC to the Lambda config and redeployed the function, but the issue is still there. Did you have to do anything else? When I added the VPC I got this message:
When you enable a VPC, your Lambda function loses default internet access. If you require external internet access for your function, make sure that your security group allows outbound connections and that your VPC has a NAT gateway.
Did you enable outbound connection?
(My function doesn't need it)
What Policies does your lambda Role have?
Here's my lambda:
    Api:
      Type: AWS::Serverless::Function
      Properties:
        Handler: public/index.php
        Runtime: provided
        Layers:
          - arn:aws:lambda:eu-west-1:209497400698:layer:php-72-fpm:2
        CodeUri: ./
        MemorySize: 1024
        Timeout: 30
        Role: !ImportValue LambdaExecutionRoleArn
        Policies:
          - AWSLambdaFullAccess
        VpcConfig:
          SecurityGroupIds: [!ImportValue AllowAllAddressesContainerSecurityGroup]
          SubnetIds: !Split [',', !ImportValue PrivateSubnets]
        Events:
          HttpRoot:
            Type: Api
            Properties:
              Path: /
              Method: ANY
          HttpSubPaths:
            Type: Api
            Properties:
              Path: /{proxy+}
              Method: ANY
Here's my Lambda Role. Note that if your lambda doesn't interact with S3 or SQS, you only need the EC2 for VPC ENI provisioning:
    LambdaExecutionRole:
      Type: AWS::IAM::Role
      Properties:
        AssumeRolePolicyDocument:
          Version: '2012-10-17'
          Statement:
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
        Policies:
          - PolicyName: allowLambdaLogs
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
                - Effect: Allow
                  Action:
                    - logs:*
                  Resource: arn:aws:logs:*:*:*
          - PolicyName: allowSqs
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
                - Effect: Allow
                  Action:
                    - sqs:ReceiveMessage
                    - sqs:DeleteMessage
                    - sqs:GetQueueAttributes
                    - sqs:ChangeMessageVisibility
                    - s3:Put*
                    - ec2:CreateNetworkInterface
                    - ec2:DescribeNetworkInterfaces
                    - ec2:DeleteNetworkInterface
                  Resource:
                    - '*'
Here's the Lambda Security Group. Note that Egress is what the message is talking about: allowing the lambda to go out to the internet. If you don't need inbound, feel free to erase the Ingress.
    ContainerSecurityGroup:
      Type: AWS::EC2::SecurityGroup
      Properties:
        GroupDescription: Allow HTTP from Load Balancer
        SecurityGroupEgress:
          - CidrIp: 0.0.0.0/0
            FromPort: '-1'
            IpProtocol: '-1'
            ToPort: '-1'
        SecurityGroupIngress:
          - FromPort: '80'
            IpProtocol: tcp
            ToPort: '80'
            SourceSecurityGroupId: !Ref GeneralPurposeLoadBalancerSecurityGroup
        VpcId: !ImportValue Vpc
Thanks, fixing the VPC did it for me.
I have the same issue with a Laravel app. I have a dead simple app that uses no database or caching. Same behaviour: the first request times out, then I get:
Client returns an error (number 111): Connection refused while trying to join unix:///tmp/.bref/php-fpm.sock.
I'm running within a VPC but with correct permissions.
I'm running Bref 0.3.4.
A summary of what I've tried.
First of all, max_execution_time doesn't seem to change anything because of this bit:
The set_time_limit() function and the configuration directive max_execution_time only affect the execution time of the script itself. Any time spent on activity that happens outside the execution of the script such as system calls using system(), stream operations, database queries, etc. is not included when determining the maximum time that the script has been running. This is not true on Windows where the measured time is real.
Since we are waiting here on a MySQL connection, that time doesn't count toward the limit. So setting max_execution_time=29 (to be under the API Gateway and Lambda timeout) doesn't work.
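The gap between time spent executing and time spent waiting is easy to observe outside PHP as well. Below is a small Python illustration. It is only an analogy for the behaviour quoted above, not a reproduction of PHP's timer, and the function name measure_wait is made up for this example. A blocking wait consumes wall-clock time but almost no CPU time, so a limit that only counts execution time never fires while the script is stuck on a connection.

```python
import time


def measure_wait(seconds):
    """Return (wall_clock_seconds, cpu_seconds) spent during a blocking wait."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    time.sleep(seconds)  # stands in for a database connect that never answers
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


wall, cpu = measure_wait(0.2)
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

The wall-clock figure is roughly the length of the wait, while the CPU figure stays close to zero. Lambda and API Gateway enforce their timeouts on wall-clock time.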
PHP-FPM's request_terminate_timeout may be a solid alternative to this. The downside is that it can't be set via php.ini, so it can't be customized (at the moment) in the lambda (in case someone wants less than 30s).
I've given it a quick try but every subsequent request (after a FPM timeout) fails:
Hoa\Socket\Exception\BrokenPipe: Pipe is broken, cannot write data. in /var/task/vendor/hoa/socket/Connection/Connection.php:853
Stack trace:
#0 /var/task/vendor/hoa/socket/Connection/Connection.php(961): Hoa\Socket\Connection\Connection->write('\x01\x01\x00\x01\x00\x08\x00\x00\x00\x01\x00\x00\x00\x00\x00...', 1352)
#1 /var/task/vendor/hoa/fastcgi/Responder.php(228): Hoa\Socket\Connection\Connection->writeAll('\x01\x01\x00\x01\x00\x08\x00\x00\x00\x01\x00\x00\x00\x00\x00...')
#2 /var/task/src/Runtime/PhpFpm.php(109): Hoa\Fastcgi\Responder->send(Array, '')
...
Could it be that the FPM connection is left in an inconsistent state? Or that Hoa/FastCGI cannot handle this correctly?
Regardless of all this, I wanted to change the behavior of the lambda so that if the FPM connection is broken the lambda stops and a new one starts.
The goal would be to avoid having a broken lambda in the pool.
I tried this, but even if the bootstrap script dies (exit(1)) Lambda will still keep the instance and try to reboot bootstrap.
I even tried to signal an initialization failure to the runtime HTTP API and that doesn't change anything: Lambda still keeps the same lambda instance and tries to start bootstrap again.
I'm wondering how we can recover from this. Maybe:
… /tmp/.bref/php-fpm.sock socket?
> PHP-FPM's request_terminate_timeout may be a solid alternative to this. The downside is that it can't be set via php.ini, so it can't be customized (at the moment) in the lambda (in case someone wants less than 30s).
Are we able to get the configured Lambda timeout setting from the environment and configure request_terminate_timeout in the bootstrap?
Oh, apparently not: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
> I tried this, but even if the bootstrap script dies (exit(1)) Lambda will still keep the instance and try to reboot bootstrap.
I think this should be ok? Wouldn't it return the application to a fresh state before where it died (i.e., FPM gets restarted)? Or do subsequent requests still die afterwards?
Hoa\Socket\Exception\BrokenPipe: can we catch this and restart the php-fpm process? Seems to me like that would be the simplest / quickest thing to be able to try first?
> Are we able to get the configured Lambda timeout setting from the environment and configure request_terminate_timeout in the bootstrap?
Even better, we can get the time left for the lambda to execute (in the context), so that should do the trick. But request_terminate_timeout is configured in php-fpm.conf, and I don't know if those flags can be set at runtime on the fly…
> I tried this, but even if the bootstrap script dies (exit(1)) Lambda will still keep the instance and try to reboot bootstrap.
> I think this should be ok? Wouldn't it return the application to a fresh state before where it died (ie, FPM gets restarted)? Or do subsequent requests still die afterwards?
Requests still die, php-fpm.sock still exists, and it's highly likely that the previous php-fpm process is still running with its worker. We could probably kill it and restart php-fpm as you suggest, but I'm not 100% certain how reliable it is (will the child worker correctly be terminated as well?).
I did some tests:
… index.php that stalls (sleep 30 or tries to connect to a MySQL host that doesn't exist)
… SIGINT)
That means that we can indeed restart PHP-FPM properly. To be continued…
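As a rough illustration of the kill-and-restart approach discussed above (in Python, not Bref's actual bootstrap code): a supervisor keeps a handle on the worker process, and when a request is known to have failed it signals the worker, waits for it to exit, and starts a fresh one. The Worker class and the sleeping child command are hypothetical stand-ins. The sketch sends SIGTERM via terminate() rather than SIGINT, and a real runtime would also have to confirm that php-fpm's child workers are gone.

```python
import subprocess
import sys


class Worker:
    """Hypothetical supervisor around one long-running child process."""

    def __init__(self, command):
        self.command = command
        self.process = None

    def start(self):
        self.process = subprocess.Popen(self.command)

    def stop(self, grace_s=2.0):
        """Ask the child to exit, then force-kill it if it ignores the request."""
        if self.process is None or self.process.poll() is not None:
            return  # never started, or already exited
        self.process.terminate()
        try:
            self.process.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            self.process.kill()
            self.process.wait()

    def restart(self):
        self.stop()
        self.start()


# Stand-in for a stalled worker: a child that just sleeps.
worker = Worker([sys.executable, "-c", "import time; time.sleep(60)"])
worker.start()
stalled = worker.process
worker.restart()        # the stalled child is gone, a fresh one is running
fresh = worker.process
worker.stop()           # clean up the example
```

Waiting for the old process to exit before starting the new one matters here: otherwise the new worker could collide with a socket that the old one still holds.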
Interesting news @mnapoli
That sounds promising!
Has anyone successfully added a VPC to their Bref-based app? No matter what I do, it just times out. I'm not trying to connect to any AWS services yet.
Check that VPC allows all external connections, I had an issue when I reused a VPC that had traffic restricted from some IPs
I did that. It accepts port 80 and 443
I have this successfully running from within a VPC and with access to Aurora. Not sure how I could help, though.
so @deleugpn you have a VPC, with 3 public facing subnets and a security group that has port 80 and 443, right? can you screenshot your SAM yaml ?
@viezel could we keep this issue on the original topic? It's already a massive thread, and it's a very complex issue that's getting hard to follow ;)
I also ran into the same error while experimenting with Aurora and the Lambda running in the same VPC. For testing purposes I created a MySQL RDS instance which is publicly accessible (I've verified that this works by connecting to it from my local machine). My Lambda is now running without a VPC and it should be able to access the RDS. Still, I keep getting this error...
Trying this with Drupal 8.6.10 (uses Symfony HttpKernel) and the PHP-FPM 7.3 layer. How can I help?
I've already fixed this problem using RDS as service and Amazon S3. What you need:
Remember to check the security rules, enable all outbound connections, and add the right policies to the lambda user; otherwise it can't connect to RDS/S3.
@italo1983 I experience this issue without using a VPC, so a NAT gateway won't solve anything.
For others experiencing the same issue as me: when creating a public RDS the security group only makes it accessible from your own IP. To fix this go to the SG of the RDS and add the IP range you need. Warning: using a public RDS is usually a very bad idea, use it at your own risk and for testing purposes only.
The root of this issue is a timeout. Those timeouts sometimes leave PHP-FPM in a broken state.
You can get a timeout by trying to access a VPC without the correct permissions, by trying to connect to any API with a long timeout, or even (probably) with a sleep(60).
The goal is to fix the root of the issue which is: better handle timeouts like this with a clear error message.
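One possible shape for "handle the timeout and say so clearly", sketched in Python rather than in Bref's PHP runtime. The request runs in a child process under a wall-clock deadline, and if the deadline passes the child is killed and an explicit error is raised instead of leaving a half-dead worker behind. RequestTimeout, run_with_deadline and the message text are all hypothetical, not what Bref prints.

```python
import subprocess
import sys


class RequestTimeout(RuntimeError):
    """Raised when the child does not answer before the deadline."""


def run_with_deadline(command, timeout_s):
    """Run command and return its stdout, or raise RequestTimeout."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # on expiry, run() kills the child and re-raises
            check=True,
        )
    except subprocess.TimeoutExpired:
        raise RequestTimeout(
            f"The request did not finish within {timeout_s} seconds. "
            "Check for slow or unreachable external services "
            "(database, API) before suspecting the runtime."
        ) from None
    return result.stdout


# A fast child finishes normally; a stalled one would raise RequestTimeout.
output = run_with_deadline([sys.executable, "-c", "print('ok')"], timeout_s=10)
```

The point of the dedicated exception is that the log line names the timeout itself, so the reader is not left with a "Connection refused" on the next request as the only symptom.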
I have created #256: "Document using RDS (MySQL/PostgreSQL) via a VPC". Feel free to discuss topics related to VPC and MySQL in there.
Please let's keep this issue related to PHP-FPM and move the rest of the discussion to #256. That will help moving Bref forward on all fronts :)
Related pull request to try to move this thing forward: #257
Still problems along the way 😞
I think I've found a solution to deal with timeouts properly:
That will not solve the RDS connection problems mentioned here, but the goal is to help all of us identify what went wrong.
Here is the error message that should appear in CloudWatch now, hopefully it will be much more helpful to understand what's wrong:
I'm using the Slim framework with Bref v0.3 and the mysql extension loaded to query Aurora Serverless.
Layer: arn:aws:lambda:eu-central-1:416566615250:layer:php-72-fpm:4
Some endpoints that run MySQL queries are returning 502 with this error:
Fatal error: Uncaught Hoa\Socket\Exception\Exception: Client returns an error (number 111): Connection refused while trying to join unix:///tmp/.bref/php-fpm.sock. in /var/task/vendor/hoa/socket/Client.php:191
How can I solve this error?