GoogleCloudPlatform / gcsfuse

A user-space file system for interacting with Google Cloud Storage
https://cloud.google.com/storage/docs/gcs-fuse
Apache License 2.0

gcsfuse mount hangs in Ubuntu #715

Closed apolishchuk-clgx closed 9 months ago

apolishchuk-clgx commented 2 years ago

Hello,

We have Ubuntu 18.04.6 LTS with gcsfuse version 0.41.4. Mounting any bucket that used to work now hangs. Any disk command, such as ls or df, run on the mount point or its parent hangs.

We've removed snap and lxcfs packages.

The mount command is the following, but changing the options does not help.

/usr/bin/gcsfuse --foreground -o rw --implicit-dirs -o allow_other --uid 30000 --gid 30000 --max-retry-sleep 0 --log-file /root/fuse.log --log-format text --debug_fuse --debug_fs --debug_gcs --debug_mutex test-bucket /gcs_test

Debug shows the following output:

I0712 14:54:50.732140 Start gcsfuse/0.41.4 (Go version go1.17.6) for app "" using mount point: /gcs_test
N0712 14:54:50.732354 Opening GCS connection...
I0712 14:54:50.735417 Creating a mount at "/gcs_test"
I0712 14:54:50.736375 Creating a new server...
I0712 14:54:50.736414 Set up root directory for bucket test-bucket
I0712 14:54:50.736441 OpenBucket("test-bucket", "")
D0712 14:54:50.736470 gcs: Req 0x0: <- ListObjects("")
D0712 14:54:50.847442 gcs: Req 0x0: -> ListObjects("") (110.959485ms): OK
D0712 14:54:50.847503 gcs: Req 0x1: <- ListObjects("")
D0712 14:54:50.920346 gcs: Req 0x1: -> ListObjects("") (72.831465ms): OK
N0712 14:54:50.920521 Mounting file system "test-bucket"...

Output from strace shows futex(ADDRESS, FUTEX_WAIT_PRIVATE, 0, NULL) over and over.

Running "gsutil ls gs://test-bucket" works from the same VM.

Downgrading gcsfuse version to 0.39.2 or 0.41.0 did not help either.

avidullu commented 2 years ago

Can you remove the "--log-file /root/fuse.log --log-format text" options and try?

apolishchuk-clgx commented 2 years ago

Still the same problem.

/usr/bin/gcsfuse --foreground -o rw --implicit-dirs -o allow_other --uid 30000 --gid 30000 --max-retry-sleep 0 --debug_fuse --debug_fs --debug_gcs --debug_mutex test-bucket /gcs_test

2022/07/13 08:30:27.976920 Start gcsfuse/0.41.4 (Go version go1.17.6) for app "" using mount point: /gcs_test
2022/07/13 08:30:27.977279 Opening GCS connection...
2022/07/13 08:30:27.979593 Creating a mount at "/gcs_test"
2022/07/13 08:30:27.980207 Creating a new server...
2022/07/13 08:30:27.980231 Set up root directory for bucket test-bucket
2022/07/13 08:30:27.980240 OpenBucket("test-bucket", "")
gcs: 2022/07/13 08:30:27.980251 Req 0x0: <- ListObjects("")
gcs: 2022/07/13 08:30:28.112257 Req 0x0: -> ListObjects("") (131.994418ms): OK
gcs: 2022/07/13 08:30:28.112418 Req 0x1: <- ListObjects("")
gcs: 2022/07/13 08:30:28.180699 Req 0x1: -> ListObjects("") (68.275715ms): OK
2022/07/13 08:30:28.180942 Mounting file system "test-bucket"...

avidullu commented 2 years ago

Can you tell us how you determined that it is stuck? I mean, do you open a separate terminal and find that you are not able to go into the directory?

apolishchuk-clgx commented 2 years ago

That's right. I open a separate terminal and am not able to list that directory or the root directory "/". Running "df -h" also hangs.
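For anyone checking the same symptom, here is a minimal sketch of probing a path with a timeout so that the checking terminal does not hang too. The function name `probe_path` and its parameters are made up for illustration and are not part of gcsfuse. It runs `stat` in a child process and polls it instead of waiting indefinitely.

```python
import subprocess
import time


def probe_path(path, timeout_s=5.0, poll_s=0.05, command=("stat", "--")):
    """Run `command path` in a child process and report how it ended.

    Returns "ok" (exit code 0), "error" (non-zero exit code),
    or "timeout" (the child did not finish within timeout_s).
    """
    proc = subprocess.Popen(
        [*command, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        code = proc.poll()
        if code is not None:
            return "ok" if code == 0 else "error"
        time.sleep(poll_s)
    # Send SIGKILL but do not wait for the child: a process blocked inside
    # a file system call may take a long time to actually exit.
    proc.kill()
    return "timeout"
```

Calling `probe_path("/gcs_test")` and `probe_path("/")` from a second terminal would show whether only the mount point or also its parent is affected. A "timeout" result only says that `stat` did not return in time; it does not say why.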

apolishchuk-clgx commented 2 years ago

I had to uninstall the lxcfs package because it was also hanging, independently of gcsfuse. Any attempt to run "ls /var/lib/lxd" or "df -h" hung. Just in case, I also uninstalled the snap package, as it had caused problems in the past too. It seems to me that some system-wide setting or library changed in the latest build, causing all this.

avidullu commented 2 years ago

I just tried to repro this issue on a freshly made GCP VM with Ubuntu 18.04.6, and I was able to access a bucket's contents very easily using the same command you mentioned.

I would recommend the following:

a. You seem to be using "root" privileges, which is not a good mode to operate in, and your system seems to have a lot of other complex libraries installed. We do not test gcsfuse with lxcfs or other such libraries. It would be good to know whether this behavior also occurs on some other distro or machine. Also, is this a plain Linux distro and not a Docker/GKE/containerized deployment?

b. You mentioned that you tried older versions (0.39.2 and 0.41.0) and they didn't work either. Can you go back further and see whether any version works for you? If one does, we can definitely invest in understanding whether there has been a regression.

Is this a Docker container by any chance or a linux VM/machine?

At my end, I'll try to check if we can add some more logging to understand better what is happening but unfortunately I don't have any immediate mitigation for this.

apolishchuk-clgx commented 2 years ago

This is a Linux VM created from one of Google images, such as ubuntu-1804-bionic-v20220712 from ubuntu-1804-lts family.

The images from the beginning of the year, such as ubuntu-1804-bionic-v20220111 seem to work with fuse 0.39.

N214 commented 2 years ago

Had the same issue on a RHEL VM. I had to reboot the VM. The issue is not easily reproducible.

dcosta-clgx commented 2 years ago

This seems like the same issue I'm having. Tried to build new GCP VMs, running CentOS 7.9, using existing Ansible scripts, which include installation of gcsfuse and mounting a bucket as a non-root user. The same version of the scripts worked successfully about 2 weeks ago. Now, it seems the attempt to mount the bucket hangs.

The bizarre part is that other VMs, which were built with the same base image and same version of gcsfuse, are working fine. I've tried rebooting and mounting as both root and as non-root user.

sethiay commented 1 year ago

Given that we have added more debugging logs to the mounting process, we request you to mount gcsfuse with --debug_fuse --debug_fs --debug_gcs --debug_http --foreground flags and share the logs with us if you are still facing the issue.

nicklasring commented 1 year ago

I have been having this issue since 6th June: the mount hangs, but I can still list files with gsutil.

gcsfuse version 0.42.1 (Go version go1.19.5)

Jun 15 16:03:03 hostname kernel: INFO: task df:29444 blocked for more than 120 seconds.
Jun 15 16:03:03 hostname kernel: Not tainted 5.4.0-1073-gcp #78~18.04.1-Ubuntu
Jun 15 16:03:03 hostname kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 15 16:03:03 hostname kernel: df D 0 29444 28995 0x00004004
Jun 15 16:03:03 hostname kernel: Call Trace:
Jun 15 16:03:03 hostname kernel: __schedule+0x292/0x720
Jun 15 16:03:03 hostname kernel: schedule+0x33/0xa0
Jun 15 16:03:03 hostname kernel: request_wait_answer+0x12e/0x200
Jun 15 16:03:03 hostname kernel: ? __wake_up_pollfree+0x40/0x40
Jun 15 16:03:03 hostname kernel: fuse_simple_request+0x17b/0x290
Jun 15 16:03:03 hostname kernel: fuse_do_getattr+0xdc/0x320
Jun 15 16:03:03 hostname kernel: fuse_getattr+0xcf/0xf0
Jun 15 16:03:03 hostname kernel: vfs_getattr_nosec+0x98/0xb0
Jun 15 16:03:03 hostname kernel: vfs_getattr+0x36/0x40
Jun 15 16:03:03 hostname kernel: vfs_statx+0x8d/0xe0
Jun 15 16:03:03 hostname kernel: __do_sys_newstat+0x3d/0x70
Jun 15 16:03:03 hostname kernel: __x64_sys_newstat+0x16/0x20
Jun 15 16:03:03 hostname kernel: do_syscall_64+0x57/0x190
Jun 15 16:03:03 hostname kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 15 16:03:03 hostname kernel: RIP: 0033:0x7f81dcbce725
Jun 15 16:03:03 hostname kernel: Code: Bad RIP value.
Jun 15 16:03:03 hostname kernel: RSP: 002b:00007ffd8a4e7128 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
Jun 15 16:03:03 hostname kernel: RAX: ffffffffffffffda RBX: 00005590f88cd0c0 RCX: 00007f81dcbce725
Jun 15 16:03:03 hostname kernel: RDX: 00007ffd8a4e71d0 RSI: 00007ffd8a4e71d0 RDI: 00005590f88cbf80
Jun 15 16:03:03 hostname kernel: RBP: 0000000000000000 R08: 00005590f88cc960 R09: 0000000000000000
Jun 15 16:03:03 hostname kernel: R10: 00005590f88c7010 R11: 0000000000000246 R12: 00005590f88cbfc0
Jun 15 16:03:03 hostname kernel: R13: 00005590f88cbf20 R14: 0000000000000000 R15: 00007ffd8a4e71b0
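Reports like the one above can be summarized with a short script. This is a rough sketch that assumes the syslog line layout shown in this comment; the helper names `summarize_hung_tasks` and `blocked_in_fuse` are hypothetical. It collects the blocked task and its call-trace frames, skipping frames the kernel marks with "?" as unreliable.

```python
import re

# Header line of a kernel hung-task report.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# A reliable stack frame, e.g. "kernel: fuse_getattr+0xcf/0xf0".
# Frames printed as "? name+0x..." do not match and are skipped.
FRAME = re.compile(r"kernel: (?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_hung_tasks(lines):
    """Return one dict per hung-task report: task, pid, seconds, frames."""
    reports = []
    current = None
    for line in lines:
        header = HUNG_TASK.search(line)
        if header:
            current = {
                "task": header["name"],
                "pid": int(header["pid"]),
                "seconds": int(header["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME.search(line)
        if frame and current is not None:
            current["frames"].append(frame["func"])
    return reports


def blocked_in_fuse(report):
    """True if any collected frame is a kernel fuse_* function."""
    return any(func.startswith("fuse_") for func in report["frames"])


# A few lines copied from the report above, used as sample input.
SAMPLE = [
    "Jun 15 16:03:03 hostname kernel: INFO: task df:29444 blocked for more than 120 seconds.",
    "Jun 15 16:03:03 hostname kernel: Call Trace:",
    "Jun 15 16:03:03 hostname kernel: request_wait_answer+0x12e/0x200",
    "Jun 15 16:03:03 hostname kernel: ? __wake_up_pollfree+0x40/0x40",
    "Jun 15 16:03:03 hostname kernel: fuse_simple_request+0x17b/0x290",
    "Jun 15 16:03:03 hostname kernel: fuse_getattr+0xcf/0xf0",
]
```

In this trace the df process is sitting in request_wait_answer under fuse_getattr, which is the kernel side of FUSE waiting for the user-space daemon to answer a getattr request. That matches the symptom but does not by itself explain why the daemon is not answering.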

vadlakondaswetha commented 1 year ago

As stated in the above comment, please share the gcsfuse logs.

avnit commented 1 year ago

Having the same issue.

Step #2: #16 [12/29] RUN /var/www/html/gcsfuse_run.sh
Step #2: #16 sha256:1c11aaee9b94bbdaa2460ba5d0112b6cd5b94270da98902260fa3fca743c6f34
Step #2: #16 0.427 Mounting GCS Fuse.
Step #2: #16 0.448 {"name":"root","levelname":"INFO","severity":"INFO","message":"Start gcsfuse/1.0.1 (Go version go1.20.5) for app \"\" using mount point: /var/www/html/wp-content\n","timestampSeconds":1692570956,"timestampNanos":43246265}
Step #2: #16 0.449 {"name":"root","levelname":"INFO","severity":"INFO","message":"Opening GCS connection...\n","timestampSeconds":1692570956,"timestampNanos":43620641}
Step #2: #16 28.83 {"name":"root","levelname":"INFO","severity":"INFO","message":"Creating a mount at \"/var/www/html/wp-content\"\n","timestampSeconds":1692570984,"timestampNanos":427879535}
Step #2: #16 28.83 {"name":"root","levelname":"INFO","severity":"INFO","message":"Creating a new server...\n","timestampSeconds":1692570984,"timestampNanos":428021285}
Step #2: #16 28.83 {"name":"root","levelname":"INFO","severity":"INFO","message":"Set up root directory for bucket stateless-wordpress-gcloud-run-wp-demo\n","timestampSeconds":1692570984,"timestampNanos":428064241}
Step #2: #16 28.83 {"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"gcs: Req 0x0: \u003c- ListObjects(\"\")\n","timestampSeconds":1692570984,"timestampNanos":428133023}

The list object is not working with PHP docker image.
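To see where the time goes in JSON-format logs like these, the timestamps can be compared entry by entry. The following is a small sketch, assuming each line carries the `timestampSeconds`, `timestampNanos` and `message` fields seen above; the function names are illustrative only. Any build-log prefix before the first "{" is dropped.

```python
import json

NANOS_PER_SECOND = 10**9


def entry_nanos(entry):
    """Combine the two timestamp fields into integer nanoseconds."""
    return entry["timestampSeconds"] * NANOS_PER_SECOND + entry["timestampNanos"]


def largest_gap(lines):
    """Return (gap_seconds, message_before, message_after) for the largest
    gap between consecutive JSON log entries, or None if fewer than two."""
    # Keep only lines that contain a JSON object; strip any prefix before it.
    entries = [json.loads(line[line.index("{"):]) for line in lines if "{" in line]
    best = None
    for prev, cur in zip(entries, entries[1:]):
        gap = (entry_nanos(cur) - entry_nanos(prev)) / NANOS_PER_SECOND
        if best is None or gap > best[0]:
            best = (gap, prev["message"].strip(), cur["message"].strip())
    return best


# Three lines copied from the build log above, used as sample input.
SAMPLE = [
    r'Step #2: #16 0.449 {"name":"root","levelname":"INFO","severity":"INFO","message":"Opening GCS connection...\n","timestampSeconds":1692570956,"timestampNanos":43620641}',
    r'Step #2: #16 28.83 {"name":"root","levelname":"INFO","severity":"INFO","message":"Creating a mount at \"/var/www/html/wp-content\"\n","timestampSeconds":1692570984,"timestampNanos":427879535}',
    r'Step #2: #16 28.83 {"name":"root","levelname":"INFO","severity":"INFO","message":"Creating a new server...\n","timestampSeconds":1692570984,"timestampNanos":428021285}',
]
```

On the sample lines, the largest gap falls between "Opening GCS connection..." and "Creating a mount at ...", which lines up with the jump from 0.449 to 28.83 in the build-step column.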

raj-prince commented 1 year ago

Hi Avnit,

Thank you for reaching out.

As the gcsfuse logs indicate, the issue is occurring when listing objects in the GCS. There could be two possible causes:

If you are able to reproduce this issue every time, please provide concrete steps (decoupled from your system) to reproduce it. This will allow us to debug the issue and determine the root cause.

Thank you for your time and cooperation.

Best regards, Prince Kumar.

yelinaung commented 1 year ago

Hello @raj-prince and gcsfuse team, piggy-backing on this issue as I try to figure out something similar (Please let me know if I should open a new one instead)

I am running the following gcsfuse version and the spec

$ gcsfuse --version
gcsfuse version 1.1.0 (Go version go1.20.5)

kernel version

uname -r
5.10.176+

The command that I use to run gcsfuse:

gcsfuse --debug_gcs --debug_fuse --log-format text --log-file log.txt <bucket name> /raw_data

This is all happening in a GKE Pod, where I run gcsfuse as a process at the start of the Pod. After I rolled out the version with gcsfuse integrated, I found that the Pods were stuck. Upon investigation, I found the following in the log.txt file:

$ cat log.txt
I0918 16:26:15.837974 Start gcsfuse/1.1.0 (Go version go1.20.5) for app "" using mount point: /raw_data
I0918 16:26:15.838011 Creating Storage handle...
I0918 16:26:15.838814 Creating a mount at "/raw_data"
I0918 16:26:15.838838 Creating a new server...
I0918 16:26:15.838846 Set up root directory for bucket <bucket_name>
D0918 16:26:15.838858 gcs: Req              0x0: <- ListObjects("")

I am not running ls or any other command. I tried it in a dev environment and there was no issue (i.e. no ListObjects() operation). So is there a way to work around or disable this?

Tulsishah commented 1 year ago

Hi @yelinaung , We are looking into this issue and will get back to you soon. Adding @songjiaxun to the thread as the issue is occurring on the GKE pod.

Thanks, Tulsi Shah

yelinaung commented 1 year ago

Hello @Tulsishah and @songjiaxun, here is some more context and an update. I was running gcsfuse with Kubernetes container lifecycle hooks, i.e. the gcsfuse command ran at the PostStart stage and the bucket was unmounted at PreStop. For reasons I am not sure of, the gcsfuse process got stuck at ListObjects(""). Yesterday, as a workaround, I removed gcsfuse from PostStart and moved it to the container ENTRYPOINT script:

# mount the bucket
gcsfuse ....
sleep 3

# app starts
python app.py

After that, I no longer face the stuck ListObjects("") issue! The PreStop step remains the same. So maybe I was using the lifecycle hooks the wrong way?
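As a possible alternative to the fixed `sleep 3` in the script above, the entrypoint could poll the mount table until the mount point shows up. This is only a sketch: the helpers `is_mounted` and `wait_for_mount` are hypothetical, it assumes a Linux `/proc/self/mounts`, and it does not handle mount points containing spaces (which that file escapes). It reads the mount table rather than touching the mount point, so the check itself should not block on a stuck mount.

```python
import time


def is_mounted(mount_point, mounts_text):
    """Check whether mount_point is a mount target in /proc/mounts-style text."""
    target = mount_point.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: device mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[1] == target:
            return True
    return False


def read_proc_mounts():
    """Read the current mount table of this process (Linux only)."""
    with open("/proc/self/mounts") as f:
        return f.read()


def wait_for_mount(mount_point, timeout_s=30.0, poll_s=0.2, read_mounts=read_proc_mounts):
    """Poll the mount table until mount_point appears.

    Returns True once it is listed, or False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_mounted(mount_point, read_mounts()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, the entrypoint could call `wait_for_mount("/raw_data")` after starting gcsfuse and exit with an error if it returns False. An entry in the mount table only shows that the mount was registered; it does not prove that gcsfuse is answering requests.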

songjiaxun commented 1 year ago

Hi @yelinaung , could you provide the following information so that we can try to reproduce?

Thank you!

github-actions[bot] commented 9 months ago

Closing this issue as we haven't received any response in 30 days. Please reopen if you are still experiencing this issue.