ehough / docker-nfs-server

A lightweight, robust, flexible, and containerized NFS server.
https://hub.docker.com/r/erichough/nfs-server/
GNU General Public License v3.0

Problem running on ext4+overlay2 docker data directory #37

Closed andyneff closed 4 years ago

andyneff commented 4 years ago

I've run into a strange problem with this container, and I've yet to figure out what is going on. For some strange reason I can no longer use it successfully, even though it worked several kernels ago and I'm still using the same docker-compose file I had ~a year and a half ago.

I'm out of ideas; I cannot figure out how the same image (the SHAs match) can work with one docker data directory and not the other. What am I missing to make this container work?

Observations

Not working

When I run it in my normal docker data folder I get:

Using /var/lib/docker:

# docker-compose down
Removing nfs_nfs_1 ... done
Removing network nfs_default
# docker-compose up
... (everything starts up the same)
# ls /mnt/data

# mount -vvv XservernameX:/nfs /mnt/data
mount.nfs: timeout set for Mon Feb 10 12:30:58 2020
mount.nfs: trying text-based options 'vers=4.2,addr=10.XX.XX.59,clientaddr=10.XX.XX.59'
mount.nfs: mount(2): No such file or directory
mount.nfs: trying text-based options 'addr=10.XX.XX.59'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query retrying: RPC: Program not registered
mount.nfs: prog 100003, trying vers=3, prot=17
mount.nfs: portmap query failed: RPC: Program not registered
mount.nfs: mounting XservernameX:/nfs failed, reason given by server: No such file or directory

Working (clean docker dir)

However, I accidentally discovered that if I change the docker data dir and restart the daemon, it suddenly works 😮

Using /var/lib/docker2:

# docker-compose down
Removing nfs_nfs_1 ... done
Removing network nfs_default
# docker-compose up
... (everything starts up the same)
# ls /mnt/data

# mount -vvv XservernameX:/nfs /mnt/data
mount.nfs: timeout set for Mon Feb 10 12:24:51 2020
mount.nfs: trying text-based options 'vers=4.2,addr=10.XX.XX.59,clientaddr=10.XX.XX.59'
# ls /mnt/data
foobar
# umount -vvv /mnt/data
/mnt/data: nfs4 mount point detected
/mnt/data: umounted

Here is the docker-compose file I'm using:

version: '2.3'
services:
  nfs:
    image: erichough/nfs-server
    ports:
      - '2049:2049'
    cap_add:
      - SYS_ADMIN
      - SYS_MODULE
    volumes:
      - type: bind
        source: /opt/nfs_test
        target: /nfs
        read_only: false
      - type: bind
        source: /lib/modules
        target: /lib/modules
        read_only: true
    environment:
      - NFS_EXPORT_0=/nfs *(rw,insecure,no_subtree_check,fsid=1,no_root_squash,async)
      - NFS_DISABLE_VERSION_3=1

Other things I tried that did not work

A workaround

Now this is surprising, but I discovered that I could:

  1. Start the nfs container with /var/lib/docker2
  2. Mount /mnt/data. Success
  3. Leave the dir mounted, and stop the container
  4. Switch the docker daemon back to /var/lib/docker
  5. Start the nfs container with /var/lib/docker
  6. ls /mnt/data now works using /var/lib/docker.

This is not a great workaround, but it seems to me to raise even more questions than answers.
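(For reference, "switch the docker daemon" in the steps above just means repointing its data directory and restarting it. Roughly, and assuming systemd plus an otherwise-empty /etc/docker/daemon.json:)

# point the daemon at the alternate data directory (clobbers any existing daemon.json)
echo '{ "data-root": "/var/lib/docker2" }' | sudo tee /etc/docker/daemon.json

# restart so the daemon starts using the new directory
sudo systemctl restart docker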

Other notes

andyneff commented 4 years ago

Update

Everything is the same.

ehough commented 4 years ago

It sounds like there's some leftover state, somewhere, that's interfering. I'll test on a 5.4 kernel to see if I can reproduce (I'm running 4.19). Other things I would recommend you test:

  1. Stop erichough/nfs-server, if it's running, then check for any lingering NFS-related processes, e.g. ps aux | grep -iE "rpc|nfs". Does anything interesting show up?
  2. docker system prune -a --volumes. Does that change the behavior?
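In other words, with the container stopped, something along these lines:

# look for leftover NFS/RPC processes on the host after the container is stopped
ps aux | grep -iE "rpc|nfs"

# clear out all unused docker state (images, containers, networks, volumes)
docker system prune -a --volumes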

mount.nfs: trying text-based options 'vers=4.2,addr=10.XX.XX.59,clientaddr=10.XX.XX.59'
mount.nfs: mount(2): No such file or directory

That sounds like the container can't locate/access the /nfs directory. You could probably snoop around in your host's /var/lib/docker directory to see if anything in there looks suspect. Or maybe compare it to /var/lib/docker2 to see if anything stands out?
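One quick comparison to start with (just a sketch; there are plenty of other ways to diff the two trees) is to check what each directory actually sits on:

# show the filesystem type backing each docker data directory
df -T /var/lib/docker /var/lib/docker2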

andyneff commented 4 years ago

It does "appear" as if the /var/lib/docker version cannot access my /nfs; however, the workaround shows that if I mount with the container from /var/lib/docker2, stop that container, and then start the container from /var/lib/docker, it can suddenly access it.

I stop the container and rm the exited container to make sure I remove what I would normally consider state.

I've tried pruning everything, with no success. I went further and deleted 100% of all images, containers, networks (except the default three: bridge, host, and none), and the build cache, re-pulled the image, and it STILL does not work. Even though this nfs container is the only problem I have, I'm beginning to suspect /var/lib/docker is corrupt somehow. Now that I've deleted everything, I can compare it to a clean one and see what is different.

I'll let you know if/when I find anything else out

andyneff commented 4 years ago

@ehough I think I found out what is different

Wow... It turns out we had a very similar conversation a year ago (#2), and I totally forgot about it (and probably messed up my docker-compose file back then...)

/var/lib/docker is ext4
/var/lib/docker2 is btrfs

There was nothing corrupt in /var/lib/docker, it's just that different filesystems have different rules?

So I guess, in a way, we learned a little more about this issue.

If my docker daemon data dir is btrfs, fsid=1 and fsid=0 work

If my docker daemon data dir is ext4+overlay2, only fsid=0 works
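So with the docker-compose file above, the export that should mount in both cases (going by the behavior above; the only change is fsid=0 instead of fsid=1) would presumably be:

    environment:
      - NFS_EXPORT_0=/nfs *(rw,insecure,no_subtree_check,fsid=0,no_root_squash,async)
      - NFS_DISABLE_VERSION_3=1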

On the host, this can be checked via docker info -f '{{json .Driver}}' (in the ext4 case this is overlay2) and docker info -f '{{index .DriverStatus 0 1}}' (which says extfs):

 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true

In the container...

mount | sed -En 's:^.* on / (type |\()([^, ]*).*:\2:p'

Will return btrfs for the btrfs case, and overlay for the ext4+overlay2 case:

/dev/sdc3 on / type btrfs (rw,seclabel,relatime,ssd,space_cache,subvolid=6485,subvol=/root/var/lib/docker2/btrfs/subvolumes/f905ea7d87 ...

vs

overlay on / type overlay (rw,seclabel,relatime,lowerdir=/var/lib/docker/overlay2/l/BHFRM7TBEU ...

So... assuming someone can replicate this result, and if you are so inclined, you could add a check to init_exports that detects an overlay root filesystem and prints a warning if fsid=0 is not set?

Keep in mind, I have no idea what any of the other storage drivers (aufs, etc.) need.
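Roughly, I'm picturing something like this (just a bash sketch; the exports array is made up, since I don't know what init_exports actually builds internally):

# detect an overlay root filesystem inside the container
root_fs="$(mount | sed -En 's:^.* on / (type |\()([^, ]*).*:\2:p')"
if [ "$root_fs" = 'overlay' ]; then
  # hypothetical: iterate over whatever list of export lines init_exports has assembled
  for line in "${exports[@]}"; do
    if ! grep -q 'fsid=0' <<< "$line"; then
      echo "WARNING: the root filesystem is overlay; exports without fsid=0 may fail to mount" >&2
    fi
  done
fi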

ehough commented 4 years ago

Good find! I also completely forgot our convo from last year. That tricky fsid parameter strikes again.

So... assuming someone can replicate this result, and if you are so inclined, you could add a check to init_exports that detects an overlay root filesystem and prints a warning if fsid=0 is not set?

That's a great idea. I'll open a new ticket to track this.

Thanks again for your investigative work, and please don't hesitate to reach out again if you hit other obstacles.