Maybe try running gcsfuse under strace to see if you can figure out the source of the interrupt?
futex(0xabaeb0, FUTEX_WAIT, 0, NULLgcs: Req 0x13: -> CreateObject("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/events.out.tfevents.1532684103.dl-instance2") (270.688859ms): OK
fuse_debug: Op 0x0000002b connection.go:479] -> OK ()
) = 0
gcs: Req 0x15: <- Read("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/events.out.tfevents.1532684103.dl-instance2", <nil>)
write(1, "fuse_debug: Op 0x0000002f "..., 85fuse_debug: Op 0x0000002f connection.go:396] <- interrupt (fuseid 0x0000002e)
) = 85
read(10, gcs: Req 0x14: -> CreateObject("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/last.pth.tar") (196.018391ms): Put https://www.googleapis.com/upload/storage/v1/b/tryout_fuse2/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UpCfQtZa2TElSOD2HZP9gPpgiqtbvr2Q35kYpVxt2Z9SGwxoJE1tsZfIYIrVsX7THbzXFmNT0sKRdJXb-5nqQsgbqisik8moAfbb1z4I87qolM3i_A: net/http: request canceled
This is what I get when I run strace gcsfuse --foreground --debug_gcs --debug_fuse <bucket name> <directory name>
However, this tells me nothing about the error. I still have no idea where it is coming from.
Can you run with strace but not the gcsfuse debugging flags, in order to get a trace that’s not all mucked up with intermingled logs?
Is this more helpful?
fuse: 2018/07/29 17:06:39.363295 *fuseops.CreateFileOp error: CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/tryout-fuse/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UrVGRw3Kr2L4mnAAICtSx1ysEf8f3PdAFxDjtco2qUoMqBXjuHxz0B59nOKuVRaXZ4Ie4A7eovcDhZegh5FInrAcinaGptZQYffOwaVgECYZ7bhAuY: net/http: request canceled
futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = 0
futex(0xab9f00, FUTEX_WAKE, 1) = 1
futex(0xc42011e490, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=805024024}) = 0
futex(0xc42002f290, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999897465}) = 0
futex(0xaba498, FUTEX_WAKE, 1) = 1
futex(0xc42002f290, FUTEX_WAKE, 1) = 1
futex(0xc42011e490, FUTEX_WAKE, 1) = 1
futex(0xc42011e490, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999020506}) = 0
futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4, [], 128, 0) = 0
futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = 0
epoll_wait(4, [{EPOLLOUT, {u32=1111547376, u64=140669880425968}}], 128, 0) = 1
futex(0xc42002f290, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999896692}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=77, tv_nsec=879846836}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=65, tv_nsec=651538671}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=30, tv_nsec=114080962}) = -1 ETIMEDOUT (Connection timed out)
futex(0xaba498, FUTEX_WAKE, 1) = 1
futex(0xaba3d0, FUTEX_WAKE, 1) = 1
futex(0xc42002ef10, FUTEX_WAKE, 1) = 1
futex(0xc42011e490, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=442, tv_nsec=287940936}
I get this log when running strace gcsfuse --foreground <bucket_name> <directory_name>
You probably need to use the -f argument to strace, in order to trace all gcsfuse threads. We’re looking for information on the interrupt.
Now that I think harder about it though, the interrupt might be in the caller of the file system, not gcsfuse. Perhaps you should strace it instead.
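For example, assuming the caller is a Python training script (the name train.py here is just a placeholder for whatever process is writing to the mount):
strace -f -e trace=signal -p $(pgrep -f train.py)
-f follows all threads and child processes, and -e trace=signal limits the output to signal deliveries and signal-related syscalls, which should reveal who is sending the interrupt.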
Hi! Do you have an update to this problem?
I have the same issue. I gave my instance full storage access and, just in case, made sure my service account has full access as well. I still get an interrupt in the middle of training, usually after one or two epochs. It seems like the permissions change randomly over time and my VM instances get locked out, so I'm unable to read images from the bucket. I'm not sure why.
I "intermittently but frequently" get the same issue (i/o error despite having full GCP permissions) when accessing data mounted by gcsfuse. I've opened an issue on the gcsfuse project, but it's not clear to me whether gcsfuse is the problem or whether Google Storage is erroring at a much higher rate than anticipated.
Hi, I have exactly the same issue now. When I save a model trained on my instance, the Input/Output Error often occurs, apparently at random. Is this still unresolved??
Is there any way to ignore these interrupt signals? We have a use case where we need to ensure the file is uploaded to the bucket regardless of any interrupts.
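One possibility, if your gcsfuse version supports it, is to mount with the ignore-interrupts option so the file system ignores interrupt requests instead of canceling the in-flight GCS call. A sketch (verify the flag against your version's gcsfuse --help, since older releases may not have it):
gcsfuse --ignore-interrupts <bucket_name> <directory_name>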
Hi there. I have exactly the same issue.
I'm using Go v1.17.0 to build code that copies files from disk to a Google Storage bucket mounted by gcsfuse.
The "operation canceled" error occurs often and randomly (seemingly for no reason?!). The process keeps running and is not killed.
Machine: GCP instance without GPU; Ubuntu 18.04;
kernel version:
5.4.0-1040-gcp
context logs:
{"level":"ERROR","ts":"2021-08-25T01:26:57.370+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"} {"level":"ERROR","ts":"2021-08-25T01:27:08.225+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}
Also seeing a lot of operation canceled IO errors:
>Op 0x0000008f connection.go:416] <- interrupt (fuseid 0x0000008e)
>Not retrying Read(".atlantis/atlantis.db", 1640797111622149) after error of type *url.Error ("Get \"https://www.googleapis.com:443/download/storage/v1/b/<my-bucket>/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149\": net/http: request canceled"): &url.Error{Op:"Get", URL:"https://www.googleapis.com:443/download/storage/v1/b/runatlantis-playground/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149", Err:(*errors.errorString)(0xc0000433b0)}"
>ReadFile: operation canceled, fh.reader.ReadAt: readFull: Get "https://www.googleapis.com:443/download/storage/v1/b/<my-bucket>/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149": net/http: request canceled
go1.16 with Google Cloud Run Container
I also see this while writing TensorBoard logs on GCS while training on Cloud AI Platform:
Exception in thread Thread-10:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 166, in _write
    f.write(compatify(file_content))
OSError: [Errno 5] Input/output error
This is the same as https://github.com/GoogleCloudPlatform/gcsfuse/issues/1016.
Update: We've addressed this issue in GCSfuse v2.3 (released June 28, 2024) by changing the default behavior to ignore interrupts during file system operations. This should prevent the problem described here. Please upgrade to the latest version and let us know if you continue to experience any difficulties.
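For anyone pinned to an older release that already has the option, the same behavior can reportedly be enabled explicitly through the config file; a sketch, assuming your version documents the file-system: ignore-interrupts key (please verify against your version's docs):

file-system:
  ignore-interrupts: true

mounted with gcsfuse --config-file /path/to/config.yaml <bucket_name> <directory_name>.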
I have spun up a new GPU instance on Google Cloud that I am using to train my model. The instance reads the data from a persistent disk and logs the checkpoints and log files to a Google Cloud Storage bucket mounted with gcsfuse. Unfortunately, I keep getting an IO error (see the logs above).
After looking at the output from
gcsfuse --foreground --debug_gcs --debug_fuse
it seems like the problem is an interrupt coming from somewhere. TBH I have no idea where this interrupt might come from. If I run the training script on a local directory (not a bucket mounted with gcsfuse), everything works fine. Any help would be greatly appreciated :)