awslabs / mountpoint-s3

A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.
Apache License 2.0

ls: cannot open directory '...': Transport endpoint is not connected #630

Open tchaton opened 7 months ago

tchaton commented 7 months ago

Mountpoint for Amazon S3 version

1.1.1 with caching

AWS Region

us-east-1

Describe the running environment

Running on Amazon EC2

What happened?

This is happening quite frequently for us (roughly 7 out of 10 runs) in our filesystem tests.

ls: cannot open directory `....`: Transport endpoint is not connected

Relevant log output

The only log line I can see is the following.

2023-11-24T13:46:44.170754913Z 2023-11-24T13:46:44.170582Z  WARN lookup{req=44 ino=1 name="Uploads"}: mountpoint_s3::fuse: lookup failed: inode error: file does not exist
2023-11-24T13:46:44.458244689Z 2023-11-24T13:46:44.458094Z  WARN lookup{req=46 ino=2 name="01hg0s363ta4kkvwyhcgvk83zc"}: mountpoint_s3::fuse: lookup failed: inode error: file does not exist
2023-11-24T13:46:51.310283712Z 2023-11-24T13:46:51.310113Z  WARN readdirplus{req=52 ino=1 fh=2 offset=1}: mountpoint_s3::fuse: readdirplus failed: out-of-order readdir, expected=4, actual=1

cc @dannycjones @passaro

tchaton commented 7 months ago

Additionally, we are observing a CPU spike every minute with --enable-metadata-caching --metadata-cache-ttl 60. I was hoping the listing would be lazy, i.e. if users don't list or otherwise interact with the mount, no listing is done.

passaro commented 7 months ago

Hi @tchaton, thanks for raising the issue. I see you were using a custom build of 1.1.1 with caching. Have you since upgraded to 1.2.0? Note that the flags to configure caching are different from the pre-release version. Once you upgrade, could you report if you are still observing the issue on 1.2.0?

Are you able to share more details on the workload you ran before seeing the error on ls? Do you get similar errors when running other commands? Or just ls? Is the mount-s3 process still running when the error occurs?

EDIT: for help with the new configuration flags, see this section in the docs.
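For illustration, a minimal example of mounting with caching on 1.2.0, based on my reading of the linked docs (bucket name, mount point, cache directory, and TTL below are placeholders, so please double-check the flag names against the docs):

mkdir -p /tmp/mountpoint-cache
mount-s3 my-bucket /mnt/my-bucket --cache /tmp/mountpoint-cache --metadata-ttl 60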

passaro commented 7 months ago

About the CPU spikes: Mountpoint does not proactively refresh metadata when it expires. So it should behave just as you were expecting. I suspect that the activity you are observing is due to applications accessing the filesystem and the kernel in turn requesting updated metadata from Mountpoint.
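One way to check that (a sketch; the flag names come from my reading of the Mountpoint docs, so treat them as an assumption) is to mount with debug logging enabled and correlate the log timestamps with the CPU spikes:

mount-s3 my-bucket /mnt/my-bucket --cache /tmp/mountpoint-cache --metadata-ttl 60 --debug --log-directory /tmp/mountpoint-logs
grep -E 'lookup|readdir' /tmp/mountpoint-logs/*

If the spikes line up with bursts of lookup/readdirplus entries, the load is being driven by applications (or the kernel on their behalf) rather than by a background refresh.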

tchaton commented 7 months ago

Hey @passaro, let me update and give you more feedback.

tchaton commented 7 months ago

@passaro But if you want to see some failures, you can do something like this.

Create a bucket with 1M files of random sizes ranging from 100 KB to 10 GB.
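For illustration, a rough sketch of one way to generate a dataset like that (bucket name and file count are placeholders; in practice I'd parallelize this, since dd from /dev/urandom at these sizes is slow):

# create N objects with sizes drawn uniformly between 100 KB and 10 GB
for i in $(seq 1 1000000); do
  size_kb=$(shuf -i 100-10485760 -n 1)                   # size in KB
  dd if=/dev/urandom of=/tmp/file_$i bs=1K count=$size_kb status=none
  aws s3 cp /tmp/file_$i s3://bucket_1/file_$i
  rm /tmp/file_$i
done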

Then copy all the files from the mount to another bucket while trying to push the machine's CPU usage to 100% (I am using a machine with 32 or 64 CPU cores).

docker run --rm -v ~/.aws:/root/.aws -v /{mount_to_bucket_1}/:/data/ peakcom/s5cmd --numworkers {2 * cpu_cores} cp /data/ s3://bucket_2

This always fails for me. However, other open-source solutions are more reliable under the same stress.

passaro commented 7 months ago

@tchaton, unfortunately, I was not able to reproduce the issue with the command you suggested. It may depend on specific factors like the content of your bucket or the load on your instance.

However, my (unconfirmed) suspicion is that you are seeing the result of an out-of-memory issue, similar to the one reported in #502. Would you be able to verify whether your syslog contains lines similar to these (once you reproduce the "Transport endpoint is not connected" error):

kernel: Out of memory: Killed process 2684 (mount-s3)
systemd[1]: session-1.scope: A process of this unit has been killed by the OOM killer. 
systemd[1]: session-1.scope: Killing process 3172 (docker) with signal SIGKILL.

tchaton commented 6 months ago

Hey @passaro, I will try again. For the syslog, what do you mean exactly? How can I check it?

passaro commented 6 months ago

You can probably use journalctl. For example, the lines I copied above were extracted from the output of this command:

journalctl -t systemd -t kernel

journalctl should be available on most modern Linux distributions, including Amazon Linux. On other systems, syslog entries are likely written to a file such as /var/log/syslog.
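If it helps, here is a rough sketch of how you could confirm the out-of-memory hypothesis while reproducing the failure (it assumes the process is named mount-s3, as in the log lines above, and that journalctl is available):

# watch the resident memory of the mount-s3 process while the copy is running
while pgrep -x mount-s3 > /dev/null; do
  ps -o pid=,rss= -p "$(pgrep -x mount-s3)"
  sleep 5
done

# after the error appears, check for an OOM kill in the kernel log
journalctl -t kernel | grep -i "out of memory"

If the resident set size climbs steadily and the process disappears around the time ls starts failing, that would point to the same OOM behaviour as #502.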

nguyenminhdungpg commented 2 months ago

I also encountered this error, first when using s3fs and now with mountpoint-s3.

I am applying a solution that I described in this comment: https://github.com/s3fs-fuse/s3fs-fuse/issues/2356#issuecomment-1791770501