GoogleCloudPlatform / gcsfuse

A user-space file system for interacting with Google Cloud Storage
https://cloud.google.com/storage/docs/gcs-fuse
Apache License 2.0

IO Error: interrupt out of nowhere? #288

Closed · gorjanradevski closed this issue 1 month ago

gorjanradevski commented 6 years ago

I have spun up a new GPU instance on Google Cloud that I am using to train my model. The instance reads the data from the persistent disk and writes the checkpoints and log files to a Google Cloud Storage bucket mounted with gcsfuse. Unfortunately, I keep getting an IO error (see below):

INFO:training.training:Found new best! Metric: 0.39190895331317616
INFO:training.training:Saving checkpoint at global step 176
Traceback (most recent call last):
  File "src/train_script.py", line 118, in <module>
    main()
  File "src/train_script.py", line 112, in main
    args.train_path, args.val_path, logdir, args.checkpoint, hparams
  File "/home/gorjanradevski/prosiebensat1/src/training/training.py", line 228, in train_and_evaluate
    checkpoint=logdir,
  File "/home/gorjanradevski/prosiebensat1/src/training/utils.py", line 51, in save_checkpoint
    torch.save(state, filepath)
  File "/home/gorjanradevski/.local/share/virtualenvs/prosiebensat1-YKpXIpQK/lib/python3.6/site-packages/torch/serialization.py", line 161, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/gorjanradevski/.local/share/virtualenvs/prosiebensat1-YKpXIpQK/lib/python3.6/site-packages/torch/serialization.py", line 116, in _with_file_like
    f = open(f, mode)
OSError: [Errno 5] Input/output error: 'log2/logging4/last.pth.tar'

After looking at the output from gcsfuse --foreground --debug_gcs --debug_fuse, it seems like the problem is an interrupt coming from somewhere, though to be honest I have no idea where this interrupt might come from. If I run the train script on a local directory (not a bucket mounted with gcsfuse), everything works fine.

gcs: Req             0x2e: <- StatObject("logging4/last.pth.tar/")
gcs: Req             0x2f: <- StatObject("logging4/last.pth.tar")
gcs: Req             0x2d: -> Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>) (84.454985ms): OK
fuse_debug: Op 0x0000005b        connection.go:479] -> OK ()
fuse_debug: Op 0x0000005c        connection.go:479] -> OK ()
fuse_debug: Op 0x00000060        connection.go:396] <- FlushFile (inode 7)
fuse_debug: Op 0x00000061        connection.go:396] <- WriteFile (inode 7, handle 3, offset 8192, 3701 bytes)
fuse_debug: Op 0x00000062        connection.go:396] <- SetInodeAttributes (inode 7, mtime 2018-07-23 14:41:01.192127484 +0000 UTC)
gcs: Req             0x30: <- CreateObject("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs")
gcs: Req             0x2f: -> StatObject("logging4/last.pth.tar") (182.243631ms): gcs.NotFoundError: googleapi: Error 404: No such object: luminovo-prosiebensat1-logs/logging4/last.pth.tar, notFound
gcs: Req             0x2e: -> StatObject("logging4/last.pth.tar/") (182.952289ms): gcs.NotFoundError: googleapi: Error 404: No such object: luminovo-prosiebensat1-logs/logging4/last.pth.tar/, notFound
fuse_debug: Op 0x0000005f        connection.go:481] -> Error: "no such file or directory"
fuse_debug: Op 0x00000063        connection.go:396] <- CreateFile (parent 5, name "last.pth.tar")
gcs: Req             0x31: <- CreateObject("logging4/last.pth.tar")
gcs: Req             0x30: -> CreateObject("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs") (265.8374ms): OK
fuse_debug: Op 0x00000060        connection.go:479] -> OK ()
gcs: Req             0x32: <- Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>)
fuse_debug: Op 0x00000064        connection.go:396] <- interrupt (fuseid 0x00000063)
gcs: Req             0x31: -> CreateObject("logging4/last.pth.tar") (158.469322ms): Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled
fuse_debug: Op 0x00000063        connection.go:481] -> Error: "CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled"
fuse: 2018/07/23 14:41:01.543811 *fuseops.CreateFileOp error: CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/luminovo-prosiebensat1-logs/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UplIquptC6zgoZrfSBzQNAwsBZZSu8fOXIK0wR9ZN_U4eyRgSRORQLQivVhuk1PGiyUaVfW4HiF2mvWpg0Tg1h1zn2swh8ukbuwNfDKKZjj0ywqksk: net/http: request canceled
gcs: Req             0x32: -> Read("logging4/events.out.tfevents.1532356845.gorjan-k80-30gbs", <nil>) (95.750685ms): OK
fuse_debug: Op 0x00000061        connection.go:479] -> OK ()
fuse_debug: Op 0x00000062        connection.go:479] -> OK ()
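
For completeness, the failing call itself is nothing exotic; a stripped-down sketch of the write path looks roughly like this (the checkpoint contents and the path are illustrative, not the exact training code):

import torch

# Illustrative sketch: torch.save simply opens the target file under the
# gcsfuse-mounted directory and writes the pickled checkpoint, so the EIO
# above surfaces from a plain open()/write() on the mount.
state = {"global_step": 176, "model_state_dict": {}}
checkpoint_path = "log2/logging4/last.pth.tar"  # path under the gcsfuse mount
torch.save(state, checkpoint_path)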

Any help would be greatly appreciated :)

jacobsa commented 6 years ago

Maybe try running gcsfuse under strace to see if you can figure out the source of the interrupt?

gorjanradevski commented 6 years ago

futex(0xabaeb0, FUTEX_WAIT, 0, NULLgcs: Req 0x13: -> CreateObject("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/events.out.tfevents.1532684103.dl-instance2") (270.688859ms): OK fuse_debug: Op 0x0000002b connection.go:479] -> OK () ) = 0 gcs: Req 0x15: <- Read("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/events.out.tfevents.1532684103.dl-instance2", <nil>) write(1, "fuse_debug: Op 0x0000002f "..., 85fuse_debug: Op 0x0000002f connection.go:396] <- interrupt (fuseid 0x0000002e) ) = 85 read(10, gcs: Req 0x14: -> CreateObject("debugtrain2-model:resnet18-pretrained:False--lr:0.01-CyclicalLR-wd:0.0001-F0E/last.pth.tar") (196.018391ms): Put https://www.googleapis.com/upload/storage/v1/b/tryout_fuse2/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UpCfQtZa2TElSOD2HZP9gPpgiqtbvr2Q35kYpVxt2Z9SGwxoJE1tsZfIYIrVsX7THbzXFmNT0sKRdJXb-5nqQsgbqisik8moAfbb1z4I87qolM3i_A: net/http: request canceled

This is what I get when I run strace gcsfuse --foreground --debug_gcs --debug_fuse <bucket name> <directory name>

However, this tells me nothing about the error. I still have no idea where it is coming from.

jacobsa commented 6 years ago

Can you run with strace but not the gcsfuse debugging flags, in order to get a trace that’s not all mucked up with intermingled logs?

gorjanradevski commented 6 years ago

Is this more helpful?

fuse: 2018/07/29 17:06:39.363295 *fuseops.CreateFileOp error: CreateChildFile: Put https://www.googleapis.com/upload/storage/v1/b/tryout-fuse/o?ifGenerationMatch=0&projection=full&uploadType=resumable&upload_id=AEnB2UrVGRw3Kr2L4mnAAICtSx1ysEf8f3PdAFxDjtco2qUoMqBXjuHxz0B59nOKuVRaXZ4Ie4A7eovcDhZegh5FInrAcinaGptZQYffOwaVgECYZ7bhAuY: net/http: request canceled
futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = 0
futex(0xab9f00, FUTEX_WAKE, 1) = 1
futex(0xc42011e490, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=805024024}) = 0
futex(0xc42002f290, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999897465}) = 0
futex(0xaba498, FUTEX_WAKE, 1) = 1
futex(0xc42002f290, FUTEX_WAKE, 1) = 1
futex(0xc42011e490, FUTEX_WAKE, 1) = 1
futex(0xc42011e490, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999020506}) = 0
futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4, [], 128, 0) = 0
futex(0xabaeb0, FUTEX_WAIT, 0, NULL) = 0
epoll_wait(4, [{EPOLLOUT, {u32=1111547376, u64=140669880425968}}], 128, 0) = 1
futex(0xc42002f290, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=89, tv_nsec=999896692}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=77, tv_nsec=879846836}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=65, tv_nsec=651538671}) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=30, tv_nsec=114080962}) = -1 ETIMEDOUT (Connection timed out)
futex(0xaba498, FUTEX_WAKE, 1) = 1
futex(0xaba3d0, FUTEX_WAKE, 1) = 1
futex(0xc42002ef10, FUTEX_WAKE, 1) = 1
futex(0xc42011e490, FUTEX_WAKE, 1) = 1
futex(0xab9f00, FUTEX_WAIT, 0, {tv_sec=442, tv_nsec=287940936}

I get this log when running strace gcsfuse --foreground <bucket_name> <directory_name>

jacobsa commented 6 years ago

You probably need to use the -f argument to strace, in order to trace all gcsfuse threads. We’re looking for information on the interrupt.

Now that I think harder about it though, the interrupt might be in the caller of the file system, not gcsfuse. Perhaps you should strace it instead.

ykim104 commented 5 years ago

Hi! Do you have an update to this problem?

I have the same issue. I gave my instance full storage access and made sure to give my service account full access just in case as well. I still get an interrupt in the middle of training, usually after one or two epochs. It seems like the permissions change randomly over time and my VM instances get locked out, so I'm unable to read images from the bucket. I'm not sure why.

carbocation commented 4 years ago

I "intermittently but frequently" get the same issue (i/o error despite having full GCP permissions) when accessing data mounted by gcsfuse. I've opened an issue on the gcsfuse project, but it's not clear to me whether gcsfuse is the problem or whether Google Storage is erroring at a much higher rate than anticipated.

fregataa commented 3 years ago

Hi, I have the exact same issue now. When I save a model trained on my instance, the Input/Output error often occurs at random. Is this still unresolved?

nsht commented 3 years ago

Is there any way to ignore these interrupt signals? We have a use case where we need to ensure the file is uploaded to the bucket regardless of any interrupts.
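
In the meantime, a possible mitigation on our side (just a sketch, assuming the failure surfaces as a transient EIO on the write rather than a persistent permission error) would be to retry the whole save when the interrupted operation fails:

import errno
import time

import torch

def save_checkpoint_with_retry(state, filepath, attempts=5, delay_s=2.0):
    # Hypothetical helper, not part of gcsfuse or torch: retry torch.save when
    # the write on the gcsfuse mount fails with EIO (the errno produced when
    # the underlying FUSE op is interrupted). This does not stop the interrupt;
    # it just repeats the write until it succeeds or the attempts run out.
    for attempt in range(1, attempts + 1):
        try:
            torch.save(state, filepath)
            return
        except OSError as exc:
            if exc.errno != errno.EIO or attempt == attempts:
                raise
            time.sleep(delay_s)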

kunsonx commented 3 years ago

Hi there. I have the exact same issue as well.

I am using Go v1.17.0 to build a program that copies files from disk to a Google Cloud Storage bucket mounted with gcsfuse.

The "operation canceled" error occurs often and seemingly at random (for no apparent reason). The process keeps running and is not killed.

Machine: GCP instance without GPU; Ubuntu 18.04

kernel version:

5.4.0-1040-gcp

context logs:

{"level":"ERROR","ts":"2021-08-25T01:26:57.370+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}
{"level":"ERROR","ts":"2021-08-25T01:27:08.225+0800","logger":"migrator","caller":"migrator.go:65","msg":"transfer files failed","pid":1,"err":"open xxx: operation canceled"}
dwilliams782 commented 2 years ago

Also seeing a lot of operation canceled IO errors.


>Op 0x0000008f connection.go:416] <- interrupt (fuseid 0x0000008e)

>Not retrying Read(".atlantis/atlantis.db", 1640797111622149) after error of type *url.Error ("Get \"https://www.googleapis.com:443/download/storage/v1/b/<my-bucket>/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149\": net/http: request canceled"): &url.Error{Op:"Get", URL:"https://www.googleapis.com:443/download/storage/v1/b/runatlantis-playground/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149", Err:(*errors.errorString)(0xc0000433b0)}" 

>ReadFile: operation canceled, fh.reader.ReadAt: readFull: Get "https://www.googleapis.com:443/download/storage/v1/b/<my-bucket>/o/.atlantis%2Fatlantis.db?alt=media&generation=1640797111622149": net/http: request canceled
flufy3d commented 2 years ago

Also seeing a lot of operation canceled IO errors.

go1.16 with Google Cloud Run Container

Darktex commented 2 years ago

I also see this while writing TensorBoard logs on GCS while training on Cloud AI Platform:

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/opt/conda/lib/python3.7/site-packages/tensorboard-2.9.1-py3.7.egg/tensorboard/compat/tensorflow_stub/io/gfile.py", line 166, in _write
    f.write(compatify(file_content))
OSError: [Errno 5] Input/output error
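
One possible workaround for the TensorBoard case (a sketch only; it assumes TensorFlow is installed, since TensorBoard's gfile layer relies on it for gs:// support, and the bucket path is a placeholder) is to bypass the gcsfuse mount and point the writer directly at the bucket:

from torch.utils.tensorboard import SummaryWriter

# Sketch: write event files straight to a gs:// path instead of a
# gcsfuse-mounted directory, so interrupted FUSE ops are no longer involved.
# Requires TensorFlow's GCS filesystem support and valid GCP credentials;
# the bucket and run name below are placeholders.
writer = SummaryWriter(log_dir="gs://my-bucket/tensorboard/run-1")
writer.add_scalar("loss/train", 0.5, global_step=1)
writer.flush()
writer.close()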
sethiay commented 1 year ago

This is the same as https://github.com/GoogleCloudPlatform/gcsfuse/issues/1016.

ashmeenkaur commented 1 month ago

Update: We've addressed this issue in GCSfuse v2.3 (released June 28, 2024) by changing the default behavior to ignore interrupts during file system operations. This should prevent the problem described here. Please upgrade to the latest version and let us know if you continue to experience any difficulties.