brefphp / bref

Serverless PHP on AWS Lambda
https://bref.sh
MIT License

Connection refused while trying to join unix:///tmp/.bref/php-fpm.sock #220

Closed · italo1983 closed this issue 5 years ago

italo1983 commented 5 years ago

I'm using the Slim framework with Bref v0.3 and the mysql extension loaded, to query Aurora Serverless. Layer: arn:aws:lambda:eu-central-1:416566615250:layer:php-72-fpm:4. Some endpoints that run MySQL queries are returning 502 with this error: Fatal error: Uncaught Hoa\Socket\Exception\Exception: Client returns an error (number 111): Connection refused while trying to join unix:///tmp/.bref/php-fpm.sock. in /var/task/vendor/hoa/socket/Client.php:191 How can I solve this error?

mnapoli commented 5 years ago

Could you try with the latest runtimes? https://runtimes.bref.sh/ (watch out: the whole name has changed, not just the version)

Once you have tried that: do you see any other logs before that error?

Could you also try running it locally to see if you get different results? (see https://bref.sh/docs/local-development.html)

italo1983 commented 5 years ago

I've solved the problem. It seems that Slim does not terminate the request when the database does not respond, so the FPM socket stays busy and keeps returning Internal Server Error until the function restarts. After applying the right permissions to the Lambda group, the function started to work perfectly! Tested on Bref 0.3 and Slim 2 with php-72-fpm:4. Thank you for the great work on this great project!

mnapoli commented 5 years ago

Oh that's interesting.

Maybe what we could try is to set the PHP timeout (max_execution_time) to 29 seconds, while the lambda timeout stays at 30 seconds.

That way we are certain the PHP process will always be shut down before the lambda shuts down.

I'll keep this issue open to remember to implement that.

jt-technologies commented 5 years ago

I've solved the problem. It seems that Slim does not terminate the request when the database does not respond, so the FPM socket stays busy and keeps returning Internal Server Error until the function restarts. After applying the right permissions to the Lambda group, the function started to work perfectly! Tested on Bref 0.3 and Slim 2 with php-72-fpm:4. Thank you for the great work on this great project!

@italo1983 : Can you describe what permissions you added to your lambda?

mnapoli commented 5 years ago

Moving my comment here from https://github.com/mnapoli/bref/issues/214#issuecomment-460558637

OK, thank you, that makes sense. We should set the PHP max execution time to 29 seconds then.

Any other ideas to address this (e.g. if users change the API Gateway timeout to below 30 seconds) are welcome. Maybe we could use the context variable (not available in Bref in the current version) to determine how much time is left and update the PHP max execution time on the fly.
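
For illustration, a rough sketch of that idea (assuming a custom bootstrap that talks to the Lambda runtime API; the $headers variable and the 1-second margin are made up for the example, this is not current Bref code):

    <?php
    // The runtime API's "next invocation" response carries a
    // Lambda-Runtime-Deadline-Ms header (epoch milliseconds).
    $deadlineMs = (int) ($headers['lambda-runtime-deadline-ms'] ?? 0);

    if ($deadlineMs > 0) {
        $remainingSeconds = (int) floor($deadlineMs / 1000 - microtime(true));
        // Keep a small margin so PHP stops before Lambda kills the sandbox.
        set_time_limit(max(1, $remainingSeconds - 1));
    }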

jt-technologies commented 5 years ago

Setting max_execution_time to 29 seconds did not work for me. As long as the lambda is warm, I'm not able to get it to work again, even for the actions with no database connection, which worked fine before this error.

mnapoli commented 5 years ago

OK, this is good to know @j-tec. Just to be sure: did you set 29 seconds in php.ini (not template.yaml)?

jt-technologies commented 5 years ago

Yes, and I can confirm that the php.ini is loaded with max_execution_time set to 29 seconds.

deleugpn commented 5 years ago

I'm getting this one at the moment.

Fatal error: Uncaught Hoa\Socket\Exception\Exception: Client returns an error (number 111): Connection refused while trying to join unix:///tmp/.bref/php-fpm.sock. in /var/task/vendor/hoa/socket/Client.php:191
22:43:01 Stack trace:
22:43:01 #0 /var/task/vendor/hoa/stream/Stream.php(219): Hoa\Socket\Client->_open('unix:///tmp/.br...', NULL)
22:43:01 #1 /var/task/vendor/hoa/stream/Stream.php(297): Hoa\Stream\Stream::_getStream('unix:///tmp/.br...', Object(Hoa\Socket\Client), NULL)
22:43:01 #2 /var/task/vendor/hoa/stream/Stream.php(178): Hoa\Stream\Stream->open()
22:43:01 #3 /var/task/vendor/hoa/socket/Connection/Connection.php(197): Hoa\Stream\Stream->__construct('unix:///tmp/.br...', NULL)
22:43:01 #4 /var/task/vendor/hoa/fastcgi/Responder.php(183): Hoa\Socket\Connection\Connection->connect()
22:43:01 #5 /var/task/vendor/mnapoli/bref/src/Runtime/PhpFpm.php(102): Hoa\Fastcgi\Responder->send(Array, '')
22:43:01 #6 /opt/bootstrap(26): Bref\Runtime\PhpFpm->proxy(Array)
22:43:01 #7 /var/task/vendor/mnapoli/bref/src/Runtime/LambdaRuntime.php(58): {closure}(Array)
22:43:01 #8 /opt/bootstrap(27): Bref\Runtime\LambdaRuntime->processNextEvent(Object(Closure))

deleugpn commented 5 years ago

Turns out the problem does seem to be related to some process hanging longer than the maximum timeout. For me it was MySQL access: since PDO takes longer than 30 seconds to time out and the Lambda was outside the VPC (without access to Aurora), the FPM process would just hang on the first run and never work again. By deploying a new lambda inside the VPC, the problem went away because the MySQL timeout went away.
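
As a side note, one way to avoid that kind of hang (a suggested sketch, not something Bref does for you; the endpoint below is a placeholder) is to give PDO a short connection timeout so the FPM worker fails fast instead of blocking past the 30-second Lambda timeout:

    <?php
    // PDO_MySQL honours PDO::ATTR_TIMEOUT as a connection timeout (seconds),
    // so an unreachable Aurora endpoint fails after ~5s instead of hanging.
    $pdo = new PDO(
        'mysql:host=my-cluster.cluster-example.eu-west-1.rds.amazonaws.com;dbname=app',
        'user',
        'secret',
        [PDO::ATTR_TIMEOUT => 5]
    );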

gonzalovilaseca commented 5 years ago

@deleugpn I'm facing the same issue. I've added the RDS VPC to the Lambda config and redeployed the function, but the issue is still there. Did you have to do anything else? When I added the VPC I got this message:

When you enable a VPC, your Lambda function loses default internet access. If you require external internet access for your function, make sure that your security group allows outbound connections and that your VPC has a NAT gateway.

Did you enable outbound connections?

(My function doesn't need it)

What policies does your Lambda role have?

deleugpn commented 5 years ago

Here's my lambda:

  Api:
    Type: AWS::Serverless::Function
    Properties:
      Handler: public/index.php
      Runtime: provided
      Layers:
        - arn:aws:lambda:eu-west-1:209497400698:layer:php-72-fpm:2
      CodeUri: ./
      MemorySize: 1024
      Timeout: 30
      Role: !ImportValue LambdaExecutionRoleArn
      Policies:
        - AWSLambdaFullAccess
      VpcConfig:
        SecurityGroupIds: [!ImportValue AllowAllAddressesContainerSecurityGroup]
        SubnetIds: !Split [',', !ImportValue PrivateSubnets]
      Events:
        HttpRoot:
          Type: Api
          Properties:
            Path: /
            Method: ANY
        HttpSubPaths:
          Type: Api
          Properties:
            Path: /{proxy+}
            Method: ANY

Here's my Lambda role. Note that if your lambda doesn't interact with S3 or SQS, you only need the EC2 permissions for VPC ENI provisioning:

  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
      Policies:
      - PolicyName: allowLambdaLogs
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          - Effect: Allow
            Action:
            - logs:*
            Resource: arn:aws:logs:*:*:*
      - PolicyName: allowSqs
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          - Effect: Allow
            Action:
            - sqs:ReceiveMessage
            - sqs:DeleteMessage
            - sqs:GetQueueAttributes
            - sqs:ChangeMessageVisibility
            - s3:Put*
            - ec2:CreateNetworkInterface
            - ec2:DescribeNetworkInterfaces
            - ec2:DeleteNetworkInterface
            Resource:
            - '*'

Here's the Lambda security group. Note that Egress is what the message is talking about: allowing the lambda to go out to the internet. If you don't need inbound traffic, feel free to remove the Ingress rule.

  ContainerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP from Load Balancer
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          FromPort: '-1'
          IpProtocol: '-1'
          ToPort: '-1'
      SecurityGroupIngress:
        - FromPort: '80'
          IpProtocol: tcp
          ToPort: '80'
          SourceSecurityGroupId: !Ref GeneralPurposeLoadBalancerSecurityGroup
      VpcId: !ImportValue Vpc

gonzalovilaseca commented 5 years ago

Thanks, fixing the VPC did the trick for me.

viezel commented 5 years ago

I have the same issue with a Laravel app. It's a dead simple app with no database or caching. Same behaviour: the first request times out, then I get:

Client returns an error (number 111): Connection refused while trying to join unix:///tmp/.bref/php-fpm.sock.

I'm running within a VPC but with the correct permissions. I'm running Bref 0.3.4.

mnapoli commented 5 years ago

A summary of what I've tried.


First of all, max_execution_time doesn't seem to change anything because of this bit from the PHP manual:

The set_time_limit() function and the configuration directive max_execution_time only affect the execution time of the script itself. Any time spent on activity that happens outside the execution of the script such as system calls using system(), stream operations, database queries, etc. is not included when determining the maximum time that the script has been running. This is not true on Windows where the measured time is real.

Since we are waiting here on a MySQL connection, that wait doesn't count toward the time limit. So setting max_execution_time=29 (to be under the API Gateway and Lambda timeouts) doesn't work.
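
A quick way to see this behaviour (a minimal sketch, assuming Linux as on Lambda; it is only here to illustrate the quoted manual note):

    <?php
    // Time spent inside system calls such as sleep() is not counted against
    // max_execution_time on Linux, so this script is NOT killed after 2 seconds
    // even though it runs for about 10.
    ini_set('max_execution_time', '2');
    sleep(10);
    echo "still alive\n"; // reached despite the 2-second limit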


PHP-FPM's request_terminate_timeout may be a solid alternative to this. The downside is that it can't be set via php.ini, so it can't be customized (at the moment) in the lambda (in case someone wants less than 30s).

I've given it a quick try but every subsequent request (after a FPM timeout) fails:

Hoa\Socket\Exception\BrokenPipe: Pipe is broken, cannot write data. in /var/task/vendor/hoa/socket/Connection/Connection.php:853
Stack trace:
#0 /var/task/vendor/hoa/socket/Connection/Connection.php(961): Hoa\Socket\Connection\Connection->write('\x01\x01\x00\x01\x00\x08\x00\x00\x00\x01\x00\x00\x00\x00\x00...', 1352)
#1 /var/task/vendor/hoa/fastcgi/Responder.php(228): Hoa\Socket\Connection\Connection->writeAll('\x01\x01\x00\x01\x00\x08\x00\x00\x00\x01\x00\x00\x00\x00\x00...')
#2 /var/task/src/Runtime/PhpFpm.php(109): Hoa\Fastcgi\Responder->send(Array, '')
...

Could it be that the FPM connection is left in an inconsistent state? Or that Hoa/FastCGI cannot handle this correctly?


Regardless of all this, I wanted to change the behavior of the lambda so that if the FPM connection is broken, the lambda stops and a new one starts.

The goal would be to avoid having a broken lambda in the pool.

I tried this, but even if the bootstrap script dies (exit(1)) Lambda will still keep the instance and try to reboot bootstrap.

I even tried to signal an initialization failure to the runtime HTTP API and that doesn't change anything: Lambda still keeps the same lambda instance and tries to start bootstrap again.

I'm wondering how we can recover from this.

nealio82 commented 5 years ago

PHP-FPM's request_terminate_timeout may be a solid alternative to this. The downside is that it can't be set via php.ini, so it can't be customized (at the moment) in the lambda (in case someone wants less than 30s).

Are we able to get the configured Lambda timeout setting from the environment and configure request_terminate_timeout in the bootstrap?

Oh, apparently not: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

nealio82 commented 5 years ago

I tried this, but even if the bootstrap script dies (exit(1)) Lambda will still keep the instance and try to reboot bootstrap.

I think this should be ok? Wouldn't it return the application to a fresh state from before it died (i.e. FPM gets restarted)? Or do subsequent requests still die afterwards?

nealio82 commented 5 years ago

Hoa\Socket\Exception\BrokenPipe: can we catch this and restart the php-fpm process? That seems to me like the simplest / quickest thing to try first.
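
Roughly something like this (just a sketch of the idea; $phpFpm and $event come from the bootstrap, and the stop() / start() method names are assumptions, not Bref's actual API):

    <?php
    use Hoa\Socket\Exception\BrokenPipe;
    use Hoa\Socket\Exception\Exception as SocketException;

    try {
        $response = $phpFpm->proxy($event);
    } catch (BrokenPipe | SocketException $e) {
        // Restart PHP-FPM so the next invocation gets a working socket,
        // then still report the failed invocation to Lambda.
        $phpFpm->stop();
        $phpFpm->start();
        throw $e;
    }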

mnapoli commented 5 years ago

Are we able to get the configured Lambda timeout setting from the environment and configure request_terminate_timeout in the bootstrap?

Even better, we can get the time left for the lambda execution (from the context), so that should do the trick. But request_terminate_timeout is configured in php-fpm.conf, and I don't know if those flags can be set on the fly at runtime…

I tried this, but even if the bootstrap script dies (exit(1)) Lambda will still keep the instance and try to reboot bootstrap.

I think this should be ok? Wouldn't it return the application to a fresh state from before it died (i.e. FPM gets restarted)? Or do subsequent requests still die afterwards?

Requests still die, php-fpm.sock still exists, and it's highly likely that the previous php-fpm process is still running with its worker. We could probably kill it and restart php-fpm as you suggest, but I'm not 100% certain how reliable it is (will the child worker correctly be terminated as well?).
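
For reference, the "kill and restart" path could look roughly like this (a sketch only; the master PID variable, the availability of posix_kill in the layer, and the socket cleanup step are assumptions):

    <?php
    // Signal the FPM master; FPM propagates termination to its workers.
    posix_kill($fpmMasterPid, 15 /* SIGTERM */);
    usleep(300000); // give it a moment to shut down cleanly

    if (posix_kill($fpmMasterPid, 0)) { // still alive? force it
        posix_kill($fpmMasterPid, 9 /* SIGKILL */);
    }

    @unlink('/tmp/.bref/php-fpm.sock'); // remove the stale socket before restarting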

mnapoli commented 5 years ago

I did some tests. The result: we can indeed restart PHP-FPM properly. To be continued…

viezel commented 5 years ago

Interesting news @mnapoli

nealio82 commented 5 years ago

That sounds promising!

viezel commented 5 years ago

Has anyone successfully added a VPC to their Bref-based app? No matter what I do, it just times out. I'm not trying to connect to any AWS services yet.

gonzalovilaseca commented 5 years ago

Check that the VPC allows all external connections; I had an issue when I reused a VPC that restricted traffic from some IPs.

viezel commented 5 years ago

I did that. It accepts ports 80 and 443.

deleugpn commented 5 years ago

I have this successfully running from within a VPC and with access to Aurora. Not sure how I could help, though.

viezel commented 5 years ago

So @deleugpn, you have a VPC with 3 public-facing subnets and a security group that allows ports 80 and 443, right? Can you screenshot your SAM yaml?

mnapoli commented 5 years ago

@viezel could we keep this issue on the original topic? It's already a massive thread about a very complex issue, and it's getting hard to follow ;)

slootjes commented 5 years ago

I also ran into the same error while experimenting with Aurora and the Lambda running in the same VPC. For testing purposes I created a MySQL RDS instance which is publicly accessible (I've verified that this works by connecting to it from my local machine). My Lambda is now running without a VPC and it should be able to access the RDS instance. Still, I keep getting this error…

Trying this with Drupal 8.6.10 (uses Symfony HttpKernel) and the PHP-FPM 7.3 layer. How can I help?

italo1983 commented 5 years ago

I've already fixed this problem using RDS as a service and Amazon S3.

Remember to check the security rules and enable all outbound connections, and also add the right policies to the Lambda user, otherwise it can't connect to RDS/S3.

slootjes commented 5 years ago

@italo1983 I experience this issue without using a VPC, so a NAT gateway won't solve anything in my case.

For others experiencing the same issue as me: when creating a public RDS instance, the security group only makes it accessible from your own IP. To fix this, go to the SG of the RDS instance and add the IP range you need. Warning: using a public RDS instance is usually a very bad idea; use it at your own risk and for testing purposes only.

mnapoli commented 5 years ago

The root of this issue is a timeout. Those timeouts sometimes leave PHP-FPM in a broken state.

You can get a timeout by trying to access a VPC without the correct permissions, by connecting to any API with a long timeout, or even (probably) with a simple sleep(60).
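
To make that last case concrete, a handler like this should be enough to reproduce the broken-socket symptom (an untested sketch based on this thread, not a confirmed test case):

    <?php
    // Anything in the handler that outlives the 30s Lambda / API Gateway timeout
    // leaves the PHP-FPM worker busy; the next invocation then gets
    // "Connection refused" on /tmp/.bref/php-fpm.sock.
    sleep(60);
    echo 'never reached within the original invocation';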

The goal is to address the root of the issue: handle timeouts like this better, with a clear error message.

I have created #256: "Document using RDS (MySQL/PostgreSQL) via a VPC". Feel free to discuss topics related to VPC and MySQL in there.

Please, let's keep this issue focused on PHP-FPM and move the rest of the discussion to #256. That will help move Bref forward on all fronts :)

mnapoli commented 5 years ago

Related pull request to try to move this thing forward: #257

Still problems along the way 😞

mnapoli commented 5 years ago

I think I've found a solution to deal with timeouts properly.

That will not solve the RDS connection problems mentioned here, but the goal is to help all of us identify what went wrong.

Here is the error message that should appear in CloudWatch now; hopefully it will make it much easier to understand what's wrong:

[screenshot: the new timeout error message as it appears in CloudWatch]