aws / efs-utils

Utilities for Amazon Elastic File System (EFS)
MIT License

Occasional "Connection reset by peer" errors when mounting #32

Open jonfryd opened 5 years ago

jonfryd commented 5 years ago

Hi guys,

We are using efs-utils from within Docker containers spawned from AWS Batch. It works great, but occasionally we receive this error about 26 seconds after attempting to mount EFS over TLS:

mount.nfs4: Connection reset by peer

We are using the recommended mount command:

mount -t efs -o tls [EFS file system ID]:/ /mnt

This happens in ~0.2% of all mount attempts from all our VPCs. It's a particularly nasty issue because it seems to prevent the mount process from being killed cleanly. Since Apache Commons Exec 1.3's basic ExecuteWatchdog is not able to destroy it, the only remedy I have found is to terminate the EC2 instance.
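One way to approach the "cannot be killed cleanly" part is to start the mount in its own process group and send SIGKILL to the whole group on timeout, so that children remaining in that group are signalled too. The following is a minimal sketch, not something taken from efs-utils; `run_with_timeout` is a hypothetical helper name. A process blocked in an uninterruptible kernel call may not react even to SIGKILL, so this may not help in the hardest cases.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_sec):
    """Run cmd in its own process group; kill the whole group on timeout.

    Hypothetical helper. Returns (returncode, timed_out); returncode is
    None if the process could not be reaped after SIGKILL.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_sec), False
    except subprocess.TimeoutExpired:
        try:
            # Signal every process still in the child's group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        try:
            return proc.wait(timeout=5), True
        except subprocess.TimeoutExpired:
            # Still not reaped: probably blocked inside the kernel.
            return None, True
```

It could be called as `run_with_timeout(["mount", "-t", "efs", "-o", "tls", "[EFS file system ID]:/", "/mnt"], 60)`, where the 60 second limit is an arbitrary assumption. A `None` return code is the signal that the instance may still need replacing.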

Any ideas or insights would be greatly appreciated.

Thanks!

Cheers, -Jon

okeeffes commented 5 years ago

Thanks for the feedback. We’ll have someone take a look and post a more detailed response once we better understand the issue.

kunupat commented 5 years ago

We are facing the same issue with AWS Batch and EFS. Any updates/workaround would be greatly appreciated. @okeeffes @jonfryd Thanks- Kunal

jonfryd commented 5 years ago

This is still an issue for sure.

It has occurred three times today on production. We are using the latest version 1.12 of efs-utils.

medinadato commented 4 years ago

I had the same issue. I kept getting "mount.nfs4: Connection reset by peer". After making sure my EFS instances had the same security group as the EC2 ones, it worked fine.

andresdelgadillo commented 4 years ago

Hi there, is there any update so far? I see the same issue from time to time. I am using the latest version, amazon-efs-utils 1.27.1.

vibhor13 commented 2 years ago

Any updates, please? It's affecting critical production systems.

gsfraley commented 2 years ago

Seem to have run into the same issue -- it looks like sometimes it's not able to resolve the DNS name of the EFS endpoint at boot time (Amazon Linux 2 here). The issue is that it never appears to recover automatically, even when DNS is up later in the boot -- though manually running the mount command works successfully.

Output in the /var/log/amazon/efs/mount.log:

INFO - Starting TLS tunnel: "/usr/bin/stunnel /var/run/efs/stunnel-config.<REDACTED>"
INFO - Started TLS tunnel, pid: 2532
INFO - Starting TLS tunnel: "/usr/bin/stunnel /var/run/efs/stunnel-config.<REDACTED>"
INFO - Starting TLS tunnel: "/usr/bin/stunnel /var/run/efs/stunnel-config.<REDACTED>"
INFO - Started TLS tunnel, pid: 2537
INFO - Started TLS tunnel, pid: 2538
INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:<REDACTED> <REDACTED> -o rw,_netdev,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,port=20112"
INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:<REDACTED> <REDACTED> -o rw,_netdev,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,port=20390"
INFO - Executing: "/sbin/mount.nfs4 127.0.0.1:<REDACTED> <REDACTED> -o rw,_netdev,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,port=20319"
ERROR - Failed to mount <REDACTED>.efs.us-east-1.amazonaws.com at <REDACTED>: returncode=32, stderr="b'mount.nfs4: Connection reset by peer'"
ERROR - Failed to mount <REDACTED>.efs.us-east-1.amazonaws.com at <REDACTED>: returncode=32, stderr="b'mount.nfs4: Connection reset by peer'"
ERROR - Failed to mount <REDACTED>.efs.us-east-1.amazonaws.com at <REDACTED>: returncode=32, stderr="b'mount.nfs4: Connection reset by peer'"

Output from stunnel in the secure log around the same time:

Service [efs] accepted connection from 127.0.0.1:32956
Error resolving '<REDACTED>.efs.us-east-1.amazonaws.com': Neither nodename nor servname known (EAI_NONAME)
No host resolved
Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Service [efs] accepted connection from 127.0.0.1:34048
Service [efs] accepted connection from 127.0.0.1:39250
Error resolving '<REDACTED>.efs.us-east-1.amazonaws.com': Neither nodename nor servname known (EAI_NONAME)
Error resolving '<REDACTED>.efs.us-east-1.amazonaws.com': Neither nodename nor servname known (EAI_NONAME)
No host resolved
Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
No host resolved
Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket

RyanStan commented 1 year ago

For others hitting this issue: first, check your security group configuration. See "Creating and managing security groups".

@gsfraley Are you using fstab to automatically mount at runtime? If so, make sure you're specifying the _netdev option as well. See "Using the EFS mount helper to automatically re-mount EFS file systems" for some examples of fstab entries. This makes sure that the filesystem isn't mounted before the network is up. However, if this isn't the case for you or anyone else running into this issue, please follow up here.
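If the network is up but DNS is not yet answering, as the stunnel log above suggests (EAI_NONAME), one possible workaround is to wait until the EFS hostname resolves before attempting the mount. This is a minimal sketch, not part of efs-utils; `wait_for_dns` and its defaults are hypothetical, and the `resolve` parameter exists only so the function can be exercised without real DNS.

```python
import socket
import time


def wait_for_dns(hostname, attempts=30, delay_sec=2.0,
                 resolve=socket.getaddrinfo):
    """Return True once hostname resolves, False if every attempt fails.

    Hypothetical helper; attempts and delay_sec are arbitrary defaults.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(hostname, 2049)  # 2049 is the standard NFS port
            return True
        except socket.gaierror:
            # Name not resolvable yet; pause unless this was the last try.
            if attempt < attempts:
                time.sleep(delay_sec)
    return False
```

A boot script could call this with the file system's DNS name (the `<REDACTED>.efs.us-east-1.amazonaws.com` form shown in the log) and run the mount command only when it returns True.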

stewartcampbell commented 1 year ago

We see this very occasionally when launching an ECS task on EC2, with two separate EFS mount points for the same EFS volume.

From CloudTrail: Error response from daemon: create xxxxxxxxxxx: VolumeDriver.Create: mounting volume failed: Mount attempt 1/3 failed due to b'mount.nfs4: Connection reset by peer\n', wait 1 sec before next attempt. Mount attempt 2/3 failed due to b'mount.nfs4: Connection reset by peer\n', wait 1 sec before next attempt. b'mount.nfs4: Connection reset by peer'

We are using the latest ECS optimized Amazon Linux 2 AMI. The only customization to efs-utils we have made is to add region = eu-west-1 to the config, as we were occasionally seeing issues where the region could not be found, e.g. Error response from daemon: create xxxxxxxxxxx: VolumeDriver.Create: mounting volume failed: Error retrieving region. Please set the \"region\" parameter in the efs-utils configuration file.

Adding the region to the config file solved all those error messages, leaving only the connection reset message above. This happens less often than the region error did.

stewartcampbell commented 1 year ago

This config looks promising. I'll play with this:

https://github.com/aws/efs-utils/blob/b6151f30684eaacd79f592f5a4b9cf9b9a852ca2/dist/efs-utils.conf#L54-L55

stewartcampbell commented 1 year ago

I forgot to feed back on this. Setting retry_nfs_mount_command_count = 15 in /etc/amazon/efs/efs-utils.conf resolved the issue for us.
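For anyone who wants to apply that setting from a provisioning script, here is a minimal sketch using Python's `configparser`. It assumes the option belongs in the `[mount]` section of efs-utils.conf; check your own file before relying on that. `set_mount_retry_count` is a hypothetical helper, and it works on the file's text so the caller decides how to read and write `/etc/amazon/efs/efs-utils.conf`.

```python
import configparser
import io


def set_mount_retry_count(conf_text, count):
    """Return conf_text with retry_nfs_mount_command_count set to count.

    Hypothetical helper. Assumes the option lives in the [mount] section.
    Note: configparser drops comments when it rewrites the file.
    """
    # interpolation=None keeps any literal '%' in existing values intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("mount"):
        parser.add_section("mount")
    parser.set("mount", "retry_nfs_mount_command_count", str(count))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Because comments in the original file are lost on rewrite, editing the single line by hand or with a templated config may be preferable when the comments matter.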

We now have an error that occurs even less often, but it's still kicking off alerts every week or so, which is annoying: Error response from daemon: create xxx-xxx: VolumeDriver.Create: mounting volume failed: Unsuccessful retrieval of AWS security credentials at http://169.254.170.2/v2/credentials/xxx.

No clue what's causing this one.