tar: Periodic failures when extracting a tarball on a SMB share

ppettina commented 3 years ago

Describe the bug

Running tar -xf /path/to/tarball.gz -C /some/cifs/mount fails sporadically. I believe this is an occurrence of https://access.redhat.com/solutions/5493691, which is relatively fresh. That in turn looks like https://jira.whamcloud.com/browse/LU-305, although that one is not directly related.

Reproduction steps

Steps to reproduce the behavior:

Run tar -xf /path/to/tarball.tar.gz -C /some/cifs/mount

Expected behavior

Command succeeds, with the contents of the archive extracted into the correct folder.

Actual behavior

Command sporadically fails with:

tar: path/to/file/in/archive: Cannot utime: Interrupted system call
tar: Exiting with failure status due to previous errors

System details

VMWare deployment
Fedora CoreOS version 32.20201104.3.0
kernel 5.8.17-200.fc32.x86_64

Ignition config

Probably not relevant?

Additional information

Not sure there's much we can do here; if RHEL fixes the issue, how long do we expect it to take to propagate down to FCOS?

ppettina commented 3 years ago

strace -f output with relevant filtering (-e signal=SIGCHLD -e trace=utimensat):

...
[pid 464743] utimensat(8, NULL, [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
[pid 464743] utimensat(8, NULL, [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
[pid 464743] utimensat(8, NULL, [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
[pid 464743] utimensat(8, NULL, [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
[pid 464743] utimensat(8, NULL, [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0 <unfinished ...>
[pid 464744] +++ exited with 0 +++
<... utimensat resumed>)                = -1 EINTR (Interrupted system call)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=464744, si_uid=1000, si_status=0, si_utime=26, si_stime=3} ---
tar: ffmpeg/CA5B7F464AEA83D5018DE264A411CBDA0/ffmpeg.sym: Cannot utime: Interrupted system call
utimensat(7, "ffmpeg/CA5B7F464AEA83D5018DE264A411CBDA0", [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
utimensat(7, "ffmpeg", [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
utimensat(8, NULL, [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
utimensat(7, "nice/3647C7556D3C635621CA0395E129A0560", [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
utimensat(7, "nice", [UTIME_OMIT, {tv_sec=1605788182, tv_nsec=0} /* 2020-11-19T12:16:22+0000 */], 0) = 0
tar: Exiting with failure status due to previous errors
+++ exited with 2 +++

which is consistent with SIGCHLD interrupting the utimensat syscall.
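For anyone wanting to check a saved trace for the same pattern, here is a minimal sketch. It assumes the trace was written to a file with strace's -o option, e.g. strace -f -o trace.log -e signal=SIGCHLD -e trace=utimensat tar -xf /path/to/tarball.tar.gz -C /some/cifs/mount. The helper name count_eintr_utimensat and the file name trace.log are made up for illustration.

```shell
# Hypothetical helper: count utimensat calls that ended in EINTR in a strace log.
# $1 = path to a log written by "strace -o".
count_eintr_utimensat() {
    # Matches both a plain "utimensat(...) = -1 EINTR" line and the
    # "<... utimensat resumed>" form that strace -f prints for an interrupted call.
    grep -c 'utimensat.*EINTR' "$1"
}
```

Calling count_eintr_utimensat trace.log prints the number of matching lines. A non-zero count on a failing run, next to a SIGCHLD line in the same log, would match the trace above.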

ppettina commented 3 years ago

Obvious workaround (for those coming here for a solution) is splitting the call:

gzip -dc /path/to/tarball.gz | tar -x -C /some/cifs/mount
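As far as I understand it, this helps because with -xf on a compressed archive GNU tar starts the decompressor as its own child, so that child's exit delivers SIGCHLD to tar (presumably pid 464744 in the trace above). In the pipeline, gzip is the shell's child instead. A minimal sketch wrapping the workaround in a function follows. The name extract_split is hypothetical, and it passes -f - explicitly on the assumption that tar's default archive may not be stdin on every build.

```shell
# Hypothetical helper: extract a gzip-compressed tarball without letting tar
# spawn the decompressor itself.
# $1 = archive, $2 = destination directory
extract_split() {
    # Test the archive first: the exit status of the pipeline is tar's, so a
    # gzip failure on its own would not be reported by the pipeline.
    gzip -t "$1" && gzip -dc "$1" | tar -xf - -C "$2"
}
```

With the paths from this report that would be extract_split /path/to/tarball.gz /some/cifs/mount. The extra gzip -t pass reads the archive twice, which is the price for catching a corrupt file before anything is written.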

lucab commented 3 years ago

Thanks for the report. How reproducible is the issue you are observing? Can you also share some more details about the server that provides this SMB share, and how the CIFS mount is provisioned on the FCOS node?

Indeed it looks like you are hitting https://bugzilla.redhat.com/show_bug.cgi?id=1848178 (private, investigation ongoing). I don't have any timing insights to share at this point, but once it gets fixed upstream we can track it and make sure it quickly reaches FCOS too.

ppettina commented 3 years ago

Thanks @lucab .

Issue happens about 1 in 5 times. Note that I'm rerunning the same command over and over, thus overwriting the files - not sure if it makes a difference.
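To put a number on the failure rate, rerunning the command in a loop can be scripted. This is only a sketch of how such a measurement could be done, not necessarily how the figure above was obtained; the function name count_failures is hypothetical.

```shell
# Hypothetical helper: run a command repeatedly and report how often it fails.
# Usage: count_failures RUNS COMMAND [ARG...]
count_failures() (
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only its exit status is counted.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
)
```

For example, count_failures 20 tar -xf /path/to/tarball.tar.gz -C /some/cifs/mount reruns the extraction 20 times over the same files and prints a single summary line.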

Server is running Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-145-generic x86_64), CIFS mount is in /etc/fstab:

//server/path /var/mnt/path/to/mount cifs rw,exec,uid=1000,gid=1000,credentials=/etc/creds_file,vers=1.0 0 0

We use vers=1.0 because we were having issues writing to the mount. Can't remember the details off the top of my head though.

AFAICT this looks exactly like https://jira.whamcloud.com/browse/LU-305, interestingly on RH. That points to a bug in libc and/or something that can be worked around in tar.

coreos / fedora-coreos-tracker

tar: Periodic failures when extracting a tarball on a SMB share #673