docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/
753 stars 85 forks source link

Very specific set of circumstances leads to zero-byte (empty) file being created #1015

Closed pedantic-git closed 1 year ago

pedantic-git commented 4 years ago

The story

A few weeks ago, a bunch of tests in Rails code I have started failing in my Docker development environment. My dev environment is Arch so the actual set of libraries and things installed is very much a moving target.

Lots of digging around led me to create a repo that can reproduce the issue using only Ruby (no gems at all), at the lowest level possible. And it's really really weird. More information below.

Expected behavior

When copying a file in Ruby using the IO.copy_stream method, the file should be an identical copy.

Actual behavior

Under a very specific set of circumstances, the resulting file is 0 bytes.

I'm reasonably certain this is a Docker issue, because one of the conditions that needs to be true is that the source file is located on a mounted volume.

The conditions that need to be true for the IO.copy_stream operation to fail are:

  1. The source file is located on a Docker mounted volume and not the Docker container's own filesystem.
  2. The file is first copied into a Tempfile using the low-level Tempfile library.
  3. The tempfile is not touched (e.g. by utime or chmod) after the copy happens.
  4. That tempfile is then copied to another location using IO.copy_stream. Copying it using File.write and tempfile.read does not cause the issue.

Steps to reproduce the behavior

Please check out this repo: https://github.com/fishpercolator/reproduce-copystream and run the command in the README.

The output of this command on my env is:

open, local, File.write: 6744
open, local, IO.copy_stream: 6744

open, mounted, File.write: 6744
open, mounted, IO.copy_stream: 6744

tempfile + copy_file, local, File.write: 6744
tempfile + copy_file, local, IO.copy_stream: 6744

tempfile + copy_file, mounted, File.write: 6744
tempfile + copy_file, mounted, IO.copy_stream: 0

tempfile + copy_file + noop utime, mounted, File.write: 6744
tempfile + copy_file + noop utime, mounted, IO.copy_stream: 6744

tempfile + copy_file + noop chmod, mounted, File.write: 6744
tempfile + copy_file + noop chmod, mounted, IO.copy_stream: 6744

As you can see, the file is copied successfully if it is:

  1. copied directly from the mounted filesystem (not via a tempfile)
  2. copied via a tempfile from the container's own filesystem
  3. copied via a tempfile that has been touched with utime or chmod after the contents were inserted
  4. I do this same operation on an Ubuntu machine running Docker 19.03.9 ... i.e. it seems to be specific to something about my development machine

If any of the 4 statements above are true, the file copies fine. If all 4 are false, the file is created but it has a 0-byte size.

I've checked to see if the source filesystem of the mounted volume makes any difference but I get the same effect with eCryptfs, ext4 and tmpfs mountpoints.

Likewise I've tried different versions of Ruby to see if it was a regression in Ruby, but I can still reproduce the issue with the official Docker images of much older versions where I was certain it worked.

Does anyone have any ideas? Can anyone else reproduce this using my repo?

Output of docker version:

Client:
 Version:           19.03.9-ce
 API version:       1.40
 Go version:        go1.14.3
 Git commit:        9d988398e7
 Built:             Tue May 19 22:11:18 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.9-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.14.3
  Git commit:       9d988398e7
  Built:            Tue May 19 22:11:01 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.4.m
  GitCommit:        d76c121f76a5fc8a462dc64594aea72fe18e1178.m
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 317
  Running: 0
  Paused: 0
  Stopped: 317
 Images: 415
 Server Version: 19.03.9-ce
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d76c121f76a5fc8a462dc64594aea72fe18e1178.m
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.6.13-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.53GiB
 Name: hulat
 ID: G7VK:VZOA:T4LI:ATNZ:G55B:TDEA:23J2:YM2A:MOLK:U5QY:5MVT:CU7P
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: pedanticgit
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.)

It's a physical Dell XPS-13 9360 running Arch Linux. Arch doesn't have version numbers but all packages are up to date. Docker was installed from the Arch package with is built using this PKGBUILD script from the official Docker sources.

sarna commented 4 years ago

I can reproduce this on current Manjaro (5.6.12-1-MANJARO), getting the exact same output. docker version output:

Client:
 Version:           19.03.8-ce
 API version:       1.40
 Go version:        go1.14.1
 Git commit:        afacb8b7f0
 Built:             Thu Apr  2 00:04:36 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.8-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.14.1
  Git commit:       afacb8b7f0
  Built:            Thu Apr  2 00:04:16 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.4.m
  GitCommit:        d76c121f76a5fc8a462dc64594aea72fe18e1178.m
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

docker info output:

Client:
 Debug Mode: false

Server:
 Containers: 6
  Running: 0
  Paused: 0
  Stopped: 6
 Images: 37
 Server Version: 19.03.8-ce
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d76c121f76a5fc8a462dc64594aea72fe18e1178.m
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.6.12-1-MANJARO
 Operating System: Manjaro Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.53GiB
 Name: mainframe
 ID: 2Y5C:436L:YDBV:YUH6:AW2M:EJ56:TIUI:I4ZO:ACZF:M64N:OQAG:LQBQ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
thaJeztah commented 4 years ago

Both Arch linux and Manjaro are distributions that are not officially supported. Packages distributed by those distributions are maintained by those distributions and use differences in both the build process, build-time dependencies, as well as different versions of (e.g.) containerd and runc.

I'd recommend opening a ticket in the corresponding distribution's issue tracker for that reason.

I'm closing this issue because of the above, but feel free to continue the conversation

kthibodeaux commented 4 years ago

@pedantic-git have you resolved this? did you create an issue elsewhere that I can follow? I'm seeing the same issue with the official ruby 2.6 docker image on an up to date manjaro install (updating an existing install doesnt seem to trigger the error)

pedantic-git commented 4 years ago

@kthibodeaux I didn't create an issue anywhere else because I didn't have time to test it with the official build of Docker - have you been able to do that? It seems like even though it's happening to everyone on Arch/Manjaro it won't be supported here if it's only the Arch build of Docker.

ttrmw commented 3 years ago

Did either of you get anywhere with this @pedantic-git @kthibodeaux? Got another example here, also on Arch.

pedantic-git commented 3 years ago

Sadly not. I never got past the "Arch isn't supported" message above. I guess taking it up with the maintainers of the Arch package is the way to go.

Looks like that's Morten Linderud foxboron@archlinux.org if you'd like to raise a case and copy me in (quinn at fishpercolator.co.uk) but I'm not really even sure how one goes about doing that.

ahwatts commented 3 years ago

I'm able to reproduce this on Fedora 33 using the package moby-engine-19.03.13-1.ce.git4484c46.fc33.x86_64.

I get the exact same output as @pedantic-git

My docker info:

awatts@ironic-mullet reproduce-copystream [master=] $ docker info
Client:
 Debug Mode: false

Server:
 Containers: 23
  Running: 21
  Paused: 0
  Stopped: 2
 Images: 64
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: btrfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: /usr/libexec/docker/docker-init
 containerd version: 
 runc version: c9a9ce0286785bef3f3c3c87cd1232e535a03e15
 init version: 
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.8-200.fc33.x86_64
 Operating System: Fedora 33 (Workstation Edition)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.09GiB
 Name: ironic-mullet.eminorinc.com
 ID: SHFE:UOOW:MKUA:56MY:UHUL:QTQP:YZJX:C43V:XR4R:67RF:BWKS:22EM
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true
pedantic-git commented 3 years ago

@thaJeztah Now that this issue can be reproduced in an official build (albeit of Moby Engine rather than CE) is it possible this issue could warrant reopening?

AlexVKO commented 3 years ago

@thaJeztah any chance to reopen this issue?

Same issue here(on ubunto 20.04), running images Ubuntu:18:04, Ubuntu:20.04, and Ruby:3.0.0.

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.0-docker)

Server:
 Containers: 40
  Running: 6
  Paused: 0
  Stopped: 34
 Images: 485
 Server Version: 20.10.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.8.0-43-generic
 Operating System: Ubuntu 20.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 125.7GiB
 Name: alexvko
 ID: N5L3:NVEW:VSYS:OSK6:JWAO:T6ZE:APR5:5XVM:OTMQ:BXSZ:DNBV:STUU
 Docker Root Dir: /mnt/md0/docker
 Debug Mode: false
 Username: alexvko
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio weight support
WARNING: No blkio weight_device support
bmhughes commented 3 years ago

Same here on Fedora 33

makmic commented 3 years ago

I just encountered the same issue and hope that it will be reopened and addressed here as it was far from easy to debug.

In the meantime I'll have to add this (Ruby) monkey patch to my docker setup:

module FileUtils
  class Entry_
    def copy_file(dest)
      Open3.capture2('cp', path(), dest)
    end
  end
end
pedantic-git commented 3 years ago

@makmic There's actually no need to shell out to a separate and heavyweight cp binary to work around this issue. Even doing a no-op chmod or utime on the file before it is copied as in my example prevents this issue from occurring. Why, I have no idea, but it reliably does!

ngoan98tv commented 3 years ago

I'm facing to the same issue on CentOS.

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 24
  Running: 22
  Paused: 0
  Stopped: 2
 Images: 51
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.18.0-240.15.1.el8_3.x86_64
 Operating System: CentOS Linux 8
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.65GiB
 Name: LMS-Server
 ID: DRNO:6ONC:XO6P:6ZAK:W6NI:R764:GT5Q:YTZD:QOEB:Y63W:JD6N:TW4Y
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
dwarburt commented 3 years ago

@makmic There's actually no need to shell out to a separate and heavyweight cp binary to work around this issue. Even doing a no-op chmod or utime on the file before it is copied as in my example prevents this issue from occurring. Why, I have no idea, but it reliably does!

@pedantic-git I don't suppose you have an example no-op chmod or utime that's effective? and do you think they'd be faster than my current workaround?

module FileUtilsPatch
  def copy_file(dest)
    FileUtils.touch(path())
    super
  end
end

module FileUtils
  class Entry_
    prepend FileUtilsPatch
  end
end

Could this be an interaction with LUKS Full Disk Encryption? That's the main disk related difference on the affected system I have, when the same project is fine on all others.

pedantic-git commented 3 years ago

@dwarburt Have you seen the example repo in the original post? The last two tests are the no-op ones. They're basically the same as your touch example apart from that they set the times/mode back to their original values rather than updating them to new values. Hope that makes sense! https://www.github.com/fishpercolator/reproduce-copystream/tree/master/repro.rb

pimpin commented 3 years ago

I tried the reproduct-copystream in my old laptop and could not reproduce the bug. But I ran the same in my new laptop and then bug appears.

Working config :

docker info
Client:
 Debug Mode: false
Server:
 Containers: 23
  Running: 6
  Paused: 0
  Stopped: 17
 Images: 400
 Server Version: 19.03.1
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 693
  Dirperm1 Supported: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.15.0-126-generic
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.676GiB
 Name: tim-XPS-13
 ID: 5MUF:2FRA:A7CV:Z4QN:5TYH:PB3S:U4S7:SHP6:ARHK:UXPL:HBQ6:HJM7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: tzmachine
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
WARNING: the aufs storage-driver is deprecated, and will be removed in a future release.

Failing config :

docker info
Client:
 Debug Mode: false

Server:
 Containers: 7
  Running: 7
  Paused: 0
  Stopped: 0
 Images: 20
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: 
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.8.0-50-generic
 Operating System: Ubuntu Core 18
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 15.27GiB
 Name: pierrem-XPS-15-9500
 ID: 6CXC:YEKH:QWH2:NJ7Z:A566:IVCN:HSEQ:4KTV:MSCN:2NJU:U2GU:ZWDX
 Docker Root Dir: /var/snap/docker/common/var-lib-docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

This was also failing with Docker version 20.10.6, build 370c289 in my new laptop.

pimpin commented 3 years ago

So I thought it might be a regression appeared after 19.13.1, but a team mate made the test and does not have the bug with 20.10.6. So it's not only related to a docker version.

working config :

reproduce-copystream ➜   docker info                                                                                                                                      (master)
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
Server:
 Containers: 15
  Running: 0
  Paused: 0
  Stopped: 15
 Images: 47
 Server Version: 20.10.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.15.0-29-generic
 Operating System: Linux Mint 18.3
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.18GiB
 Name: T480
 ID: U56P:745X:PPSO:XKPL:W5SN:T3EC:FZVJ:JQX6:DFEQ:OO5M:4LTG:A7CC
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
WARNING: No swap limit support
mkleint commented 3 years ago

Several our Bitbucket Pipelines customers are facing the same issue after our FlatCarOS upgrade 2605.12.0 to 2765.2.2 (kernel move from 5.4.92 directly to 5.10.25) it mostly appears as ActiveStorage::IntegrityError while running rspec tests

docker info
Client:
 Debug Mode: false
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Authorization: pipelines
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
WARNING: API is accessible on http://0.0.0.0:2375 without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  userns
 Kernel Version: 5.10.25-flatcar
 Operating System: Alpine Linux v3.13 (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.91GiB
 Name: 44212645-d3ba-4afa-b2d9-8a1f81e7f1ee-wrmh2
 ID: KPMN:TGRY:QL5G:YYZT:4ZHE:XLQD:BCPI:UEUZ:6KD7:TGB4:GUVN:KAIB
 Docker Root Dir: /var/lib/docker/165536.165536
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  http://10.203.197.49:5000/
 Live Restore Enabled: false
 Product License: Community Engine
Joanna-Zhang commented 3 years ago

I ran into this same problem with ruby's FileUtil.cp on MacOS AMD64 using ruby:2.5.3 as the base image for the container.

touching the file after it was copied worked to get around it.

I was also able to work around it without changing source or spec code by mounting the /tmp directory (the destination of the file copied from a mounted directory) as a tmpfs upon container startup (could also be set through the tmpfs key in a docker-compose file):

--mount type=tmpfs,destination=/tmp

stephen-turner commented 3 years ago

Reopening as this is being seen in supported configurations.

bdwyertech commented 3 years ago

Seeing this on docker for Mac as well. Very strange. I can reproduce on one image, but not another.

ZimbiX commented 3 years ago

I first came across this issue on Arch on 2020-05-28 in a failing RSpec test using ActiveStorage, returning an ActiveStorage::IntegrityError.

@maddymarkovitz and I did much investigation on this in the following months. With strace, we discovered it was related to the copy_file_range syscall, as recently noted by @amanfredi in the Docker for Mac issue. By testing many different kernel versions in Qemu, we determined it to be a bug in the Linux kernel that was introduced in Linux 5.6.0-arch1-1.

The bug probably resides in the overlay2 subsystem, as it was changed in that release, and from my results of trying out different Docker storage drivers:

Enabling/disabling overlay2's 'Native Overlay Diff' didn't change the outcome.

We put together a minimal C file reproducing the bug by executing the copy_file_range syscall, and a test script that compiles and runs it, copying from a mounted file (resulting in failed copying due to the bug) and copying from a previously copied-in file (resulting in successful copying).

Source:

#include <unistd.h>
#include <sys/syscall.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
  int input = open(argv[1], O_RDONLY);
  int temp1 = open("/tmp/copy_file_range_test", O_WRONLY|O_CREAT|O_TRUNC, 0100644);

  syscall(SYS_copy_file_range, input, NULL, temp1, NULL, 6, 0);

  close(input);
  close(temp1);

  int temp2 = open("/tmp/copy_file_range_test", O_RDONLY);
  int dest = open("destination", O_WRONLY|O_CREAT|O_TRUNC, 0100644);

  syscall(SYS_copy_file_range, temp2, NULL, dest, NULL, 6, 0);

  close(temp2);
  close(dest);
}

Test script:

#!/bin/bash

set -Eeuo pipefail

source=${1-Gemfile}

check() {
  if wc -c destination | grep -E '^0 ' > /dev/null; then
    echo 'Copy failed'
  else
    echo OK
  fi
  echo
}

printf "FROM debian:10.8-slim\nRUN apt update && apt install -y gcc strace" > /tmp/strace_Dockerfile
docker build -f /tmp/strace_Dockerfile -t strace .
echo

echo Local:
(
  # set -x
  gcc test.c
  strace ./a.out "$source" &> test-strace
)
check

echo Docker - mounted:
docker run --rm -v "$(pwd):/work" -w /work strace bash -c "
  #set -x &&
  gcc test.c &&
  strace ./a.out '$source' &> test-strace-docker-mounted
"
check

echo Docker - copied:
docker run --rm -v "$(pwd):/work" -w /work strace bash -c "
  #set -x &&
  cp '$source' /x &&
  gcc test.c &&
  strace ./a.out /x &> test-strace-docker-copied
"
check

Bug output:

➜ ./ctest.sh                                 
[+] Building 2.3s (6/6) FINISHED                                                                                                                                                                                                                                                  
 => [internal] load build definition from strace_Dockerfile                                                                                                                                                                                                                  0.0s
 => => transferring dockerfile: 115B                                                                                                                                                                                                                                         0.0s
 => [internal] load .dockerignore                                                                                                                                                                                                                                            0.0s
 => => transferring context: 2B                                                                                                                                                                                                                                              0.0s
 => [internal] load metadata for docker.io/library/debian:10.8-slim                                                                                                                                                                                                          2.2s
 => [1/2] FROM docker.io/library/debian:10.8-slim@sha256:13f0764262a064b2dd9f8a828bbaab29bdb1a1a0ac6adc8610a0a5f37e514955                                                                                                                                                    0.0s
 => CACHED [2/2] RUN apt update && apt install -y gcc strace                                                                                                                                                                                                                 0.0s
 => exporting to image                                                                                                                                                                                                                                                       0.0s
 => => exporting layers                                                                                                                                                                                                                                                      0.0s
 => => writing image sha256:93c8b4395b5e2b3d5e0e7510d18201c4d94afab419415b5541b8852781ed0442                                                                                                                                                                                 0.0s
 => => naming to docker.io/library/strace                                                                                                                                                                                                                                    0.0s

Local:
OK

Docker - mounted:
Copy failed

Docker - copied:
OK

We had meant to report the bug to the Linux project, but never quite got around to it. Has anyone else done so already?

Running the script today while writing this (on 5.12.2-arch1-1), I am surprised to see that it actually passes! Maybe Linux fixed the bug recently...

Local:
OK

Docker - mounted:
OK

Docker - copied:
OK

Our colleagues using Mac were not seeing this issue at the time. Now however, Mac people would be seeing this bug too, due to Docker for Mac updating the Linux kernel version used in the VM (to a version above Linux 5.6.0). In Docker for Mac 3.0.0 the Linux kernel sees a massive upgrade, to 5.10.25.

leomao10 commented 3 years ago

Our team also found a comment in Rust stdlib about the issue of copy_file_range returns 0 and they mentioned overlay.

This is the related PR: https://github.com/rust-lang/rust/pull/75428

This is the related issue: https://github.com/rust-lang/rust/issues/75446

Maybe it would help to identify the impact of the issue

esoterick commented 3 years ago

Our team using both Mac and Linux have been seeing this issue while using test-kitchen inside docker containers to test our chef cookbooks. As @ZimbiX mentioned the bug seems to have been introduced in 5.6 kernel, it however did not seem to be present in kernel versions >= 5.11. Users originally saw the following errors which I am including below incase anyone else hits them as this issue was pretty rough to track down.

Chef::Exceptions::JSON::ParseError:
Top level JSON object must be a Hash or Array. (actual: NilClass)
[2021-05-05T19:35:36+00:00] INFO: HTTP Request Returned 404 Not Found: Object not found:
resolving cookbooks for run list: []
[2021-05-05T19:35:37+00:00] ERROR: #<TypeError: no implicit conversion of nil into Hash>
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/chef_data/cookbook_data.rb:76:in `merge!'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/chef_data/cookbook_data.rb:76:in `from_json'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/chef_data/cookbook_data.rb:34:in `metadata_from'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.21.14/lib/chef/chef_fs/chef_fs_data_store.rb:828:in `get_single_cookbook_version'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-12.21.14/lib/chef/chef_fs/chef_fs_data_store.rb:550:in `list'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/data_store/v1_to_v2_adapter.rb:83:in `block in list'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/data_store/v1_to_v2_adapter.rb:113:in `fix_exceptions'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/data_store/v1_to_v2_adapter.rb:82:in `list'
/opt/chef/embedded/lib/ruby/gems/2.3.0/gems/chef-zero-5.3.2/lib/chef_zero/rest_base.rb:100:in `list_data'

While test-kitchen was staging the files to be sent to the remote vm, it created a bunch of 0-byte cookbook files and json files. Depending on what we we're testing we hit the above different errors.

We ended up using the following monkeypatch in our testing environment's docker container to workaround the issue.

# frozen_string_literal: true

require 'fileutils'

module FileUtils
  class Entry_
    def copy_file(dest)
      File.open(path()) do |s|
        File.open(dest, 'wb', s.stat.mode) do |f|
          IO.copy_stream(s, f)
          f.chmod f.lstat.mode
        end
      end
    end
  end
end

We also put guards around it so it would only be loaded on the problematic kernels, so in the future in case we forgot about it... it wouldn't bite us...

amichal commented 3 years ago

Perhaps a typo above? i couldn't get the above patch to work but was able to by doing the no-op chmod on the source s IO before the copy stream. So like this:

          s.chmod s.lstat.mode
          IO.copy_stream(s, f)

The fixed it for both our Github Action based test running in a ruby:2.7.3-alpine on their unbuntu:latest runner as well as in the https://github.com/fishpercolator/reproduce-copystream checkout on MacOS 11.3.1 intel MBP running Docker Desktop 3.3.3. Interestingly, I wasn't previously able to reproduce this bug running our apps tests inside a container on that machine. Maybe due to us mounting a tmpfs from docker-compose. We dont do that in GH actions

rietta commented 3 years ago

My team reports the same issue on Ubuntu Linux and Mac OS hosts, with the official Docker release.

esoterick commented 3 years ago

@amichal No, that was not a typo it did not seem to resolve the issue when I used your snippet.

This is on OSX 11.4 inside a ruby:2.6.0 container which is based off of buildpack-deps:buster.

erikaxel commented 2 years ago

To future people reading this: We had this same issue on Ubuntu 20.04 with kernel 5.4.0-89, downgrading to 5.4.0-88 helped.

gpakosz commented 2 years ago

I'm hitting this with the ruby:bullseye Docker image. From time to time FileUtils.cp(from, to) (this is Ruby) creates an empty destination file when copying from /tmp to a mounted volume.

I'm hitting this on macOS but the https://github.com/docker/for-mac/issues/5570 issue is closed πŸ€·β€β™‚οΈ

I'm using Docker Desktop 4.1.1 (69879).

Within the container,

# uname -a
Linux 81347a0746bd 5.10.47-linuxkit #1 SMP Sat Jul 3 21:51:47 UTC 2021 x86_64 GNU/Linux

Mounting /tmp with tmpfs works around the problem for me

stanhu commented 2 years ago

I first validated that Linux v5.10 did NOT pass the reproduction test in https://github.com/docker/for-linux/issues/1015#issuecomment-841915668:

root@stanhu-kernel-test2:/tmp$ uname -a
Linux stanhu-kernel-test2 5.10.0 #3 SMP Wed Dec 1 16:42:21 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

root@stanhu-kernel-test2:/tmp# bash test.sh
Sending build context to Docker daemon  14.34kB
Step 1/2 : FROM debian:10.8-slim
 ---> 115566c891d1
Step 2/2 : RUN apt update && apt install -y gcc strace
 ---> Using cache
 ---> d562d525045d
Successfully built d562d525045d
Successfully tagged strace:latest

Local:
OK

Docker - mounted:
Copy failed

Docker - copied:
OK

Then I applied the following kernel patch, and it now passes:

https://lore.kernel.org/linux-fsdevel/20201207163255.564116-6-mszeredi@redhat.com/#t

root@stanhu-kernel-test2:/tmp# uname -a
Linux stanhu-kernel-test2 5.10.0+ #5 SMP Wed Dec 1 20:58:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

root@stanhu-kernel-test2:/tmp# bash test.sh
Sending build context to Docker daemon  14.34kB
Step 1/2 : FROM debian:10.8-slim
 ---> 115566c891d1
Step 2/2 : RUN apt update && apt install -y gcc strace
 ---> Using cache
 ---> d562d525045d
Successfully built d562d525045d
Successfully tagged strace:latest

Local:
OK

Docker - mounted:
OK

Docker - copied:
OK

This patch landed in Linux v5.11 via this merge commit: https://github.com/torvalds/linux/commit/92dbc9dedccb9759c7f9f2f0ae6242396376988f (commit: https://github.com/torvalds/linux/commit/82a763e61e2b601309d696d4fa514c77d64ee1be)

I've reported the issue on the mailing list and asked whether this patch could be backported: https://marc.info/?l=linux-fsdevel&m=163839424404401&w=2.

stanhu commented 2 years ago

Sent a request for backporting the fixes into Linux 5.10: https://lore.kernel.org/stable/CAMBWrQ=1MKxnMT_6Jnqp_xxr7psVywPBJc6p1qCy9ENY8RF2Qw@mail.gmail.com/T/#u

I've validated that torvalds/linux@1a980b8cbf0059a5308eea61522f232fd03002e2 introduced this regression in Linux v5.6.x.

thaJeztah commented 2 years ago

Thank you, @stanhu !

stanhu commented 2 years ago

Per the latest update on the kernel stable mailing list, the kernel backport fix for 5.10 has been queued for review and should land in the stable-5.10.y branch soon.

To avoid this bug, I'd suggest avoid Linux v5.6.0 - v5.10.83 for now, unless you backport the two patches below:

  1. torvalds/linux@82a763e61e2b601309d696d4fa514c77d64ee1be
  2. torvalds/linux@9b91b6b019fda817eb52f728eb9c79b3579760bc

I've confirmed that 5.5.19 does NOT have this bug, and it was introduced in 5.6.0 via torvalds/linux@1a980b8cbf0059a5308eea61522f232fd03002e2.

I looked into submitting backports for the 5.6 to 5.9 branches, but it appears they are no longer maintained.

UPDATE: Linux 5.10.84 has been tagged with the backported fixes. Other distributions that use older kernel versions are also still affected. For example:

  1. Ubuntu 18.04
  2. RedHat Enterprise Linux 8 / CentOS 8
stanhu commented 2 years ago

It seems some distributions that run a 5.4 kernel, such as Ubuntu 18.04.6, are also affected because they have pulled in https://github.com/torvalds/linux/commit/1a980b8cbf0059a5308eea61522f232fd03002e2:

stanhu@stanhu-ubuntu-18:/tmp$ grep ovl_splice linux-gcp-5.4-5.4.0/fs/overlayfs/file.c
static ssize_t ovl_splice_read(struct file *in, loff_t *ppos,
ovl_splice_write(struct pipe_inode_info *pipe, struct file *out,
    .splice_read    = ovl_splice_read,
    .splice_write   = ovl_splice_write,

I've reported this to Ubuntu in https://bugs.launchpad.net/ubuntu/+source/linux-base/+bug/1953199. I'd suggest others run the reproduction step in https://github.com/docker/for-linux/issues/1015#issuecomment-841915668 and report the bugs to the distribution maintainers.

stanhu commented 2 years ago

CentOS 8 also has this bug even though it's shipping with a 4.18 kernel. I reported it here: https://bugs.centos.org/view.php?id=18370

RedHat Enterprise Linux 8.3 also has this problem: https://bugzilla.redhat.com/show_bug.cgi?id=2028998

stanhu commented 2 years ago

Linux v5.10.84 has now been tagged with the fixes for the overlay filesystem:

  1. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=dca4f9a581967a81cf4420bdd26162112d546031
  2. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=bcce010f92308368ac27248342e6ccd058e8c596
thaJeztah commented 2 years ago

Thanks for the update, @stanhu - we added an internal ticket to either patch the kernel version as used by Docker Desktop, or to wait for the upstream kernel to include the fix; having it merged in upstream definitely makes it easier πŸ‘

/cc @fredericdalleau @djs55 FYI

smartygus commented 2 years ago

Hi @thaJeztah,

I see that an internal ticket was opened to fix this kernel issue. I just wanted to ask if this has been planned and if there's an ETA? I've been holding off on updating Docker for Mac for a long time now (due to this issue, which is caused by the underlying kernel problem), and would love to be able to update without having to find a workaround.

Thanks heaps for your help!

emanuelhfarias commented 2 years ago

@smartygus Kernel 5.10.105 has been released fixing this bug! Docker for Mac released 4.6.0 version that includes Kernel 5.10.104. I confirm this version solves the problem in Docker for Mac!

smartygus commented 2 years ago

@emanuelhfarias this is great news!! Thanks for the heads up! :)

stanhu commented 2 years ago

Yes, I think we can close this issue. I see that upgrading Docker for Mac to v4.6.0 upgraded the kernel to 5.10.104-linuxkit from 5.10.76-linuxkit.

stanhu commented 2 years ago

Other related distribution notes:

stanhu commented 2 years ago

According to https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8-beta/html/8.6_release_notes/new-features#enhancement_kernel, RedHat 8.6 will be released on May 22, 2022 with kernel-4.18.0-369, so this should fix it. RedHat 8.5 still has this issue.