Open tchaton opened 7 months ago
Additionally, we are observing a CPU spike every minute with `--enable-metadata-caching --metadata-cache-ttl 60`. I was hoping the listing would be lazy, e.g. if the users don't list or interact with the mount, no listing is done.
Hi @tchaton, thanks for raising the issue. I see you were using a custom build of 1.1.1 with caching. Have you since upgraded to 1.2.0? Note that the flags to configure caching are different from the pre-release version. Once you upgrade, could you report if you are still observing the issue on 1.2.0?
Are you able to share more details on the workload you ran before seeing the error on `ls`? Do you get similar errors when running other commands, or just `ls`? Is the `mount-s3` process still running when the error occurs?
EDIT: for help with the new configuration flags, see this section in the docs.
About the CPU spikes: Mountpoint does not proactively refresh metadata when it expires. So it should behave just as you were expecting. I suspect that the activity you are observing is due to applications accessing the filesystem and the kernel in turn requesting updated metadata from Mountpoint.
Hey @passaro Let me update and give you more feedback.
@passaro But if you want to see some failures, you can do something like this.
Create 1 bucket with 1M files of random sizes ranging from 100 KB to 10 GB.
Then copy all the files from the mount to another bucket while trying to drive the machine's CPU usage to 100% (I am using a machine with 32 or 64 CPU cores).
```
docker run --rm -v ~/.aws:/root/.aws -v /{mount_to_bucket_1}/:/data/ peakcom/s5cmd --numworkers {2 * cpu_cores} cp /data/ s3://bucket_2
```
This always fails for me. However, other open source solutions are more reliable under that same stress.
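For anyone wanting a smaller-scale dry run of the setup above, a script along these lines can generate files of random sizes before uploading them to the source bucket. This is only a sketch: the directory name, defaults, and the 100 KB–10 MB size range are placeholders scaled down from the 1M-file / 10 GB case described in the repro.

```shell
#!/bin/bash
# Sketch: generate test files of random sizes, scaled down from the repro
# above (which uses 1M files of 100 KB to 10 GB). Names/defaults are made up.
set -eu
outdir="${1:-testdata}"
count="${2:-10}"          # raise this toward 1000000 for a real reproduction
mkdir -p "$outdir"
for i in $(seq 1 "$count"); do
  # Random size between 100 KB and ~10 MB here; the original goes up to 10 GB.
  size_kb=$(( (RANDOM % 10141) + 100 ))
  dd if=/dev/urandom of="$outdir/file_$i" bs=1024 count="$size_kb" 2>/dev/null
done
echo "created $count files in $outdir"
```

The generated files can then be uploaded to the source bucket with any S3 client (e.g. `s5cmd cp "testdata/*" s3://bucket_1/`) before mounting it and running the copy under load.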
@tchaton, unfortunately, I was not able to reproduce the issue with the command you suggested. It may depend on specific factors like the content of your bucket or the load on your instance.
However, my (unconfirmed) suspicion is that you are seeing the result of an out of memory issue, similar to that reported in #502.
Would you be able to verify if your syslog contains lines similar to these (once you reproduce the `Transport endpoint is not connected` error):
```
kernel: Out of memory: Killed process 2684 (mount-s3)
systemd[1]: session-1.scope: A process of this unit has been killed by the OOM killer.
systemd[1]: session-1.scope: Killing process 3172 (docker) with signal SIGKILL.
```
Hey @passaro I will try again. For the syslog, what do you mean exactly? How can I check them?
You can probably use `journalctl`. For example, the lines I copied above were extracted from the output of this command:

```
journalctl -t systemd -t kernel
```
`journalctl` should be available on most modern Linux distributions, including Amazon Linux. On other systems, syslog entries are likely written to a file such as `/var/log/syslog`.
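To scan that output for OOM-killer activity, a small filter like the one below can be piped from `journalctl` or from the syslog file. The grep pattern is an assumption inferred from the sample lines in this thread and may need adjusting for other distributions.

```shell
#!/bin/bash
# Sketch: filter syslog-style lines for OOM-killer activity. The pattern is
# a guess based on the sample lines above; tune it for your distribution.
oom_lines() {
  grep -Ei 'out of memory|oom killer|killed process'
}

# Typical usage (assumes journalctl is present):
#   journalctl -t systemd -t kernel | oom_lines
# Self-contained demo against the sample lines from this thread:
oom_lines <<'EOF'
kernel: Out of memory: Killed process 2684 (mount-s3)
systemd[1]: session-1.scope: A process of this unit has been killed by the OOM killer.
systemd[1]: Started Session 2 of User ec2-user.
EOF
```

The demo prints only the first two lines; an empty result after reproducing the error would suggest the crash is not OOM-related.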
I also encountered this error when using s3fs, and now with mountpoint-s3.
I am applying a solution that I described in this comment: https://github.com/s3fs-fuse/s3fs-fuse/issues/2356#issuecomment-1791770501
Mountpoint for Amazon S3 version
1.1.1 with caching
AWS Region
us-east-1
Describe the running environment
Running on Amazon EC2
What happened?
This is happening quite frequently, ~7/10 times in our filesystem tests.
Relevant log output
The only log line I can see is the following.
cc @dannycjones @passaro