elasticluster / elasticluster

Create clusters of VMs on the cloud and configure them with Ansible.
http://elasticluster.readthedocs.io/
GNU General Public License v3.0

NFS client mounts fail #357

Open mbookman opened 7 years ago

mbookman commented 7 years ago

This is a recurrence of Issue #253.

My gridengine cluster on Debian 8 had failed to initialize - I noted this in my comment on commit 806eef7 ("Job type reload is not applicable for unit netfilter-persistent.service").

I opted to ignore that issue and proceed, so I then ran:

$ elasticluster -v -v -v setup gridengine

which failed with:

TASK [nfs-client : add to /etc/fstab] ******************************************
task path: /usr/local/google/home/mbookman/.python-eggs/elasticluster-1.3.dev0-py2.7.egg-tmp/elasticluster/share/playbooks/roles/roles/nfs-client/tasks/nfsmount.yml:8
fatal: [compute002]: FAILED! => {"changed": false, "failed": true, "msg": "Error mounting /home: mount.nfs: Connection timed out\n"}
fatal: [compute001]: FAILED! => {"changed": false, "failed": true, "msg": "Error mounting /home: mount.nfs: Connection timed out\n"}
    to retry, use: --limit @/usr/local/google/home/mbookman/.python-eggs/elasticluster-1.3.dev0-py2.7.egg-tmp/elasticluster/share/playbooks/site.retry

I connected to the frontend instance and could see that /etc/exports seemed to be properly updated:

mbookman@frontend001:~$ tail -n 2 /etc/exports 
/home compute001(rw,no_root_squash,async) compute002(rw,no_root_squash,async)
/usr/share/gridengine/default/common compute001(rw,no_root_squash) compute002(rw,no_root_squash) frontend001(rw,no_root_squash)

But it doesn't appear that the NFS server startup completed or picked up the exports:

mbookman@frontend001:~$ sudo showmount -e localhost
clnt_create: RPC: Program not registered

mbookman@frontend001:~$ sudo /etc/init.d/nfs-kernel-server status
● nfs-kernel-server.service - LSB: Kernel NFS server support
   Loaded: loaded (/etc/init.d/nfs-kernel-server)
   Active: active (exited) since Thu 2016-12-01 00:22:13 UTC; 23min ago

Dec 01 00:22:13 frontend001 nfs-kernel-server[21774]: Not starting NFS kernel...
Dec 01 00:22:13 frontend001 systemd[1]: Started LSB: Kernel NFS server support.
Hint: Some lines were ellipsized, use -l to show in full.

I could see that nfs-server/tasks/main.yml tries to reload the NFS exports:

- name: Reload NFS exports file
  shell:
    exportfs -r

But this had generated a number of errors, which are clearer when running it directly:

mbookman@frontend001:~$ sudo exportfs -r
exportfs: /etc/exports [1]: Neither 'subtree_check' or 'no_subtree_check' specified for export "compute001:/home".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exportfs: /etc/exports [1]: Neither 'subtree_check' or 'no_subtree_check' specified for export "compute002:/home".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export "compute001:/usr/share/gridengine/default/common".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export "compute002:/usr/share/gridengine/default/common".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export "frontend001:/usr/share/gridengine/default/common".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exportfs: gridengine-compute001.c.project.internal:/usr/share/gridengine/default/common: Function not implemented
exportfs: gridengine-compute002.c.project.internal:/usr/share/gridengine/default/common: Function not implemented
exportfs: gridengine-frontend001.c.project.internal:/usr/share/gridengine/default/common: Function not implemented
exportfs: gridengine-compute001.c.project.internal:/home: Function not implemented
exportfs: gridengine-compute002.c.project.internal:/home: Function not implemented

I manually restarted the nfs-kernel-server:

mbookman@frontend001:~$ sudo /etc/init.d/nfs-kernel-server restart
[ ok ] Restarting nfs-kernel-server (via systemctl): nfs-kernel-server.service.

and the exports look good.

mbookman@frontend001:~$ sudo showmount -e localhost
Export list for localhost:
/usr/share/gridengine/default/common gridengine-frontend001.c.project.internal,gridengine-compute002.c.project.internal,gridengine-compute001.c.project.internal
/home                                gridengine-compute002.c.project.internal,gridengine-compute001.c.project.internal

When I then re-run elasticluster setup gridengine, the cluster comes up.

riccardomurri commented 7 years ago

I think it's a slightly different issue, not a regression.

Apparently, the "close" reason for NFS server not starting is this, i.e., a non-existing cell directory::

exportfs: ...:/usr/share/gridengine/default/common: Function not implemented
exportfs: ...:/usr/share/gridengine/default/common: Function not implemented
exportfs: ...:/usr/share/gridengine/default/common: Function not implemented

However, the root problem seems to me that, in the transition from SGE 6.2 to SoGE 8.1.x, some paths were changed. Notably the (default) cell directory is now located in /var/lib/gridengine/default (on Debian; Ubuntu seems fine with the old defaults; CentOS/RHEL may follow yet another scheme).

Can you please try changing /usr/share/gridengine/default/common to /var/lib/gridengine/default/common in elasticluster/share/playbooks/roles/gridengine.yml and see if that fixes it for you on Debian 8?
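For reference, that substitution can be scripted. The sketch below operates on a scratch copy (the sample variable name `sge_cell_dir` is made up for illustration), not on the real playbook file:

```shell
# Demo of the path swap on a scratch file; in practice you would edit
# elasticluster/share/playbooks/roles/gridengine.yml in place.
printf 'sge_cell_dir: /usr/share/gridengine/default/common\n' > /tmp/gridengine.demo.yml
sed -i 's|/usr/share/gridengine/default|/var/lib/gridengine/default|g' /tmp/gridengine.demo.yml
cat /tmp/gridengine.demo.yml
```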

mbookman commented 7 years ago

I agree this is not a regression - sorry, I didn't mean to imply that, only that the end result looks very similar.

I did try updating gridengine.yml, but still see the same problem:

TASK [nfs-server : Reload NFS exports file]

is failing in "exportfs -r" (though processing continues). Running exportfs manually on the frontend node:

mbookman@frontend001:~$ sudo exportfs -r
exportfs: /etc/exports [1]: Neither 'subtree_check' or 'no_subtree_check' specified for export "compute001:/home".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export "compute001:/var/lib/gridengine/default/common".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export "frontend001:/var/lib/gridengine/default/common".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exportfs: gridengine-compute001.c.project.internal:/var/lib/gridengine/default/common: Function not implemented
exportfs: gridengine-frontend001.c.project.internal:/var/lib/gridengine/default/common: Function not implemented
exportfs: gridengine-compute001.c.project.internal:/home: Function not implemented

I can see that the NFS server is not running:

mbookman@frontend001:~$ ps -ef | grep nfs
root     21236     2  0 21:36 ?        00:00:00 [nfsiod]
mbookman 25947 25926  0 22:02 pts/0    00:00:00 grep nfs

mbookman@frontend001:~$ sudo /etc/init.d/nfs-kernel-server restart
[ ok ] Restarting nfs-kernel-server (via systemctl): nfs-kernel-server.service.

mbookman@frontend001:~$ ps -ef | grep nfs
root     21236     2  0 21:36 ?        00:00:00 [nfsiod]
root     25983     2  0 22:02 ?        00:00:00 [nfsd4]
root     25984     2  0 22:02 ?        00:00:00 [nfsd4_callbacks]
root     25988     2  0 22:02 ?        00:00:00 [nfsd]
root     25989     2  0 22:02 ?        00:00:00 [nfsd]
root     25990     2  0 22:02 ?        00:00:00 [nfsd]
root     25991     2  0 22:02 ?        00:00:00 [nfsd]
root     25992     2  0 22:02 ?        00:00:00 [nfsd]
root     25993     2  0 22:02 ?        00:00:00 [nfsd]
root     25994     2  0 22:02 ?        00:00:00 [nfsd]
root     25995     2  0 22:02 ?        00:00:00 [nfsd]
mbookman 26024 25926  0 22:02 pts/0    00:00:00 grep nfs

What is strange: looking through the ElastiCluster Ansible output, I see:

TASK [nfs-server : ensure NFS server is running (Debian/Ubuntu)] ***************
task path: <snip>/elasticluster-1.3.dev0-py2.7.egg-tmp/elasticluster/share/playbooks/roles/nfs-server/tasks/main.yml:43
ok: [frontend001] => (item=nfs-kernel-server) => {"changed": false, "enabled": true, "item":
 "nfs-kernel-server", "name": "nfs-kernel-server", "state": "started", "status": {"ActiveEnterTimestamp": "Thu 2016-12-01 21:36:08 UTC",
 "ActiveEnterTimestampMonotonic": "218388084", "ActiveExitTimestampMonotonic": "0", "ActiveState": "active", "After": "remote-fs.target nfs-common.service rpcbind.target time-sync.target nss-lookup.target systemd-journald.socket basic.target system.slice",
 "AllowIsolate": "no", "Before": "multi-user.target graphical.target shutdown.target", "BlockIOAccounting": "no", "BlockIOWeight": "18446744073709551615",
 "CPUAccounting": "no", "CPUQuotaPerSecUSec": "(null)", "CPUSchedulingPolicy": "0", "CPUSchedulingPriority": "0", "CPUSchedulingResetOnFork": "no", 
"CPUShares": "18446744073709551615", "CanIsolate": "no", "CanReload": "yes",
 "CanStart": "yes", "CanStop": "yes", "CapabilityBoundingSet": "18446744073709551615", "ConditionResult": "yes", "ConditionTimestamp": "Thu 2016-12-01 21:36:08 UTC", "ConditionTimestampMonotonic": "218371213", "Conflicts": "shutdown.target",
 "ControlPID": "0", "DefaultDependencies": "yes",
 "Description": "LSB: Kernel NFS server support", "DevicePolicy": "auto",
 "ExecMainCode": "0", "ExecMainExitTimestampMonotonic": "0", "ExecMainPID": "0", "ExecMainStartTimestampMonotonic": "0", "ExecMainStatus": "0",
 "ExecReload": "{ path=/etc/init.d/nfs-kernel-server ; argv[]=/etc/init.d/nfs-kernel-server reload ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }",
 "ExecStart": "{ path=/etc/init.d/nfs-kernel-server ; argv[]=/etc/init.d/nfs-kernel-server start ; ignore_errors=no ;
 <etc>

But I see no output from this operation. It isn't clear to me whether /etc/init.d/nfs-kernel-server start actually ran. Either it did not run, or it exited early, because I can see this in /etc/init.d/nfs-kernel-server:

                $PREFIX/sbin/exportfs -r

So at the very least, I would have expected to see the very same output we see from the Ansible step "TASK [nfs-server : Reload NFS exports file]".

riccardomurri commented 7 years ago

Hi Matt,

(mbookman, Mon, Dec 05, 2016 at 09:13:35AM -0800:)

Riccardo - do you have suggestions on how to pause Ansible, as orchestrated by Elasticluster, prior to a specific step?

Are these Ansible options what you are looking for?

--skip-tags=SKIP_TAGS
                      only run plays and tasks whose tags do not match these
                      values

--start-at-task=START_AT_TASK
                      start the playbook at the task matching this
                      name

--step                one-step-at-a-time: confirm each task before running

You can use them (like any other Ansible command-line option) by appending them to the elasticluster setup command line::

elasticluster setup gridengine -- --step

See more examples at: http://elasticluster.readthedocs.io/en/latest/usage.html#the-setup-command

riccardomurri commented 7 years ago

I see this in the logs during the Ansible playbook runs, or when I try to manually start the nfs-kernel-server on Debian 8.6::

systemd[1]: Started LSB: RPC portmapper replacement.
systemd[1]: Starting RPC Port Mapper.
systemd[1]: Reached target RPC Port Mapper.
systemd[1]: Reloading.
systemd[1]: Reloading.
systemd[1]: Starting LSB: NFS support files common to client and server...
rpc.statd[21110]: Version 1.2.8 starting
sm-notify[21111]: Version 1.2.8 starting
rpc.statd[21110]: Failed to read /var/lib/nfs/state: Success
rpc.statd[21110]: Initializing NSM state
kernel: [  335.591870] RPC: Registered named UNIX socket transport module.
kernel: [  335.591872] RPC: Registered udp transport module.
kernel: [  335.591873] RPC: Registered tcp transport module.
kernel: [  335.591874] RPC: Registered tcp NFSv4.1 backchannel transport module.
kernel: [  335.594197] FS-Cache: Loaded
kernel: [  335.597063] FS-Cache: Netfs 'nfs' registered for caching
kernel: [  335.600159] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
nfs-common[21105]: Starting NFS common utilities: statd idmapd.
systemd[1]: Started LSB: NFS support files common to client and server.
systemd[1]: Reloading.
systemd[1]: Reloading.
systemd[1]: Starting LSB: Kernel NFS server support...
nfs-kernel-server[21329]: Not starting NFS kernel daemon: no exports. ... (warning).
systemd[1]: Started LSB: Kernel NFS server support.

After that, service nfs-kernel-server status reports that the NFS server has exited, and rpcinfo -p localhost shows that no nfsd port has been registered with the portmapper.
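That portmapper check can be scripted; the awk filter below runs on a captured sample line (illustrative data only) so the parsing is visible without a live rpcinfo call:

```shell
# On a healthy server, `rpcinfo -p localhost` lists an "nfs" program;
# field 5 is the program name, field 4 the port. The sample line below
# stands in for live output; no output at all means nfsd never registered.
sample='   100003    3   tcp   2049  nfs'
echo "$sample" | awk '$5 == "nfs" {print "nfsd registered on port " $4}'
```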

So I guess the root problem is that "no exports" part: the NFS kernel server exits because it cannot load any exports from the /etc/exports file.

Now, loading the export definitions should be the task of exportfs; if I run it, I get this error message::

$ sudo /usr/sbin/exportfs -r -v
exportfs: /etc/exports [1]: Neither 'subtree_check' or 'no_subtree_check' specified for export "worker001:/home".
  Assuming default behaviour ('no_subtree_check').
  NOTE: this default has changed since nfs-utils version 1.0.x

exporting gridengine-worker001.novalocal:/home
exporting gridengine-worker001.novalocal:/home to kernel
exportfs: gridengine-worker001.novalocal:/home: Function not implemented

Running it under strace shows that the "Function not implemented" error comes from an nfsservctl system call, which is completely new to me::

21637 nfsservctl(0x1, 0x7ffc0c4f7360, 0) = -1 ENOSYS (Function not implemented)

Given that Debian 8.6 is a stable release and NFS is pretty common, I think we can exclude a bug in the base set of packages (e.g., mismatch between exportfs and the NFS kernel server). My guess is that some kernel module needs to be loaded before NFS exports can work, i.e., the nfsservctl system call is implemented in some .ko file not in the main Debian kernel.

pgrosu commented 7 years ago

Riccardo/Matt: What happens if you add subtree_check to your /etc/exports file? I don't have your specific config info, but it should look similar to this:

/your-location 128.11.11.1/24(rw,sync,no_root_squash,subtree_check)
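If it helps, appending the option to every export's option list can be scripted; this demo runs on a scratch file containing a simplified line modeled on the exports shown earlier in the thread:

```shell
# Demo on a scratch exports file; in practice you would edit /etc/exports.
printf '/home compute001(rw,no_root_squash,async)\n' > /tmp/exports.demo
sed -i 's/)/,subtree_check)/g' /tmp/exports.demo
cat /tmp/exports.demo
```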

~p

riccardomurri commented 7 years ago

(Paul Grosu, Mon, Dec 05, 2016 at 02:52:54PM -0800:)

Riccardo/Matt: What happens if you add subtree_check to your /etc/exports file? I don't have your specific config info, but it should look similar to this:


/your-location 128.11.11.1/24(rw,sync,no_root_squash,subtree_check)

Nothing changes here: same error from exportfs, NFS kernel server does not start.

riccardomurri commented 7 years ago

Running it under strace shows that the "Function not implemented" error comes from an nfsservctl system call, which is completely new to me::

21637 nfsservctl(0x1, 0x7ffc0c4f7360, 0) = -1 ENOSYS (Function not implemented)

Given that Debian 8.6 is a stable release and NFS is pretty common, I think we can exclude a bug in the base set of packages (e.g., mismatch between exportfs and the NFS kernel server). My guess is that some kernel module needs to be loaded before NFS exports can work, i.e., the nfsservctl system call is implemented in some .ko file not in the main Debian kernel.

I stand corrected: apparently the nfsservctl syscall existed in Linux some time ago, but has since been removed. So it seems that exportfs is using the wrong interface to the kernel, for whatever reason.

riccardomurri commented 7 years ago

Well, it looks like it's indeed a kernel / userspace version mismatch.

The Debian image I am using comes with kernel 3.16.0-4 installed::

debian@master001:/usr/src/linux-source-3.16$ uname -r
3.16.0-4-amd64

Now, one of the first things ElastiCluster's "common" playbook does is to upgrade all packages to the latest available version; this includes the kernel and the "nfs-utils".

However, now the installed kernel packages show a higher version number::

$ sudo apt-cache policy linux-image-3.16.0-4-amd64
linux-image-3.16.0-4-amd64:
  Installed: 3.16.36-1+deb8u2
  Candidate: 3.16.36-1+deb8u2
  Version table:
 *** 3.16.36-1+deb8u2 0
        500 http://security.debian.org/ jessie/updates/main amd64 Packages
        100 /var/lib/dpkg/status
     3.16.36-1+deb8u1 0
        500 http://http.debian.net/debian/ jessie/main amd64 Packages
     3.16.7-ckt25-2 0
        500 http://http.debian.net/debian/ jessie-updates/main amd64 Packages

After a reboot, the NFS kernel server magically started to work and I no longer get any error from exportfs.

So, possibly, exportfs from package nfs-common=1:1.2.8-9 (latest version as I'm writing this) is not compatible with kernel 3.16.0-4 but needs a higher kernel version...

riccardomurri commented 7 years ago

Nope. Same error without updating packages. A reboot fixes the problem, anyway. (And the change in the kernel version is very minor.)

riccardomurri commented 7 years ago

I managed to get the NFS server up and running without rebooting, by doing two things:

  1. Mount the /proc/fs/nfsd filesystem::

    sudo mount -t nfsd nfsd /proc/fs/nfsd

  2. Use the start script in /etc/init.d instead of the systemd service file::

    sudo env _SYSTEMCTL_SKIP_REDIRECT=true sh -x /etc/init.d/nfs-kernel-server start

I am not sure which one of the two did actually solve it.

riccardomurri commented 7 years ago

I have added the following two stanzas to roles/nfs-server/tasks/main.yml, between "install NFS server software" and "Export directories"::

    - name: Mount /proc/fs/nfsd
      mount:
        name='/proc/fs/nfsd'
        state=mounted
        fstype='nfsd'
        src='nfsd'
      when: is_debian_8

    - name: Compatibility symlink
      file:
        path='/usr/share/gridengine/default'
        src='/var/lib/gridengine/default'
        state=link
      when: is_debian_8

With this, Ansible reports the NFS server as running and the exportfs -r command does not print any error string either::

    TASK [nfs-server : ensure NFS server is running (Debian/Ubuntu)] ***************
    ok: [master001] => (item=nfs-kernel-server) => {"changed": false, "enabled": true, "item": "nfs-kernel-server", "name": "nfs-kernel-server", "state": "started"}

    ...

    TASK [nfs-server : Reload NFS exports file] ************************************
    changed: [master001] => {"changed": true, "cmd": "exportfs -r", "delta": "0:00:00.005729", "end": "2016-12-06 00:37:53.261844", "rc": 0, "start": "2016-12-06 00:37:53.256115", "stderr": "exportfs: /etc/exports [1]: Neither 'subtree_check' or 'no_subtree_check' specified for export \"worker001:/home\".\n  Assuming default behaviour ('no_subtree_check').\n  NOTE: this default has changed since nfs-utils version 1.0.x\n\nexportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export \"worker001:/usr/share/gridengine/default/common\".\n  Assuming default behaviour ('no_subtree_check').\n  NOTE: this default has changed since nfs-utils version 1.0.x\n\nexportfs: /etc/exports [2]: Neither 'subtree_check' or 'no_subtree_check' specified for export \"master001:/usr/share/gridengine/default/common\".\n  Assuming default behaviour ('no_subtree_check').\n  NOTE: this default has changed since nfs-utils version 1.0.x", "stdout": "", "stdout_lines": [], "warnings": []}

However, the NFS server on the master node is not running and the playbook still stops when mounting /home on the worker nodes.

Manually doing a systemctl restart nfs-kernel-server fixes it.

riccardomurri commented 7 years ago

I think I have a working playbook in branch issues/#357 on my fork: https://github.com/riccardomurri/elasticluster

Can you please try it and tell me whether it also works for you?

The root cause of the issue seems to be a malfunction in Debian's nfs-kernel-server systemd service: it is marked as "active" even though the server has not started, and everything just snowballs from there. Forcing the server to restart apparently fixes the issue. Some more investigation will be needed to understand whether there is a better/minimal fix.

riccardomurri commented 7 years ago

After a couple more tries, it seems that the critical part is restarting the NFS server, i.e., use state=restarted in task "ensure NFS server is running (Debian/Ubuntu)".
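A sketch of that change, based on the task name and item visible in the Ansible logs above (the exact module arguments beyond name/state are assumptions on my part):

```yaml
# Sketch only: the "ensure NFS server is running" task with
# state switched from "started" to "restarted".
- name: ensure NFS server is running (Debian/Ubuntu)
  service:
    name: '{{ item }}'
    state: restarted
  with_items:
    - nfs-kernel-server
```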

I'll try to commit a fix to "master" later today.

riccardomurri commented 7 years ago

I have reported (what I think is) the root cause of the issue to Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=847204

pgrosu commented 7 years ago

Riccardo, thank you for fixing this and for the detailed troubleshooting!

~p

riccardomurri commented 7 years ago

Hi Matt and Paul,

I have just pushed my fixes for this issue to "master".

It works on my Debian 8.6 image on the local cloud; please test and let me know.

mbookman commented 7 years ago

Thanks Riccardo!

I pulled the latest and started a new cluster. The cluster creation completed successfully ("Your cluster is ready!").

I have verified that the home and the gridengine common directories from the frontend node are mounted on the compute node:

mbookman@compute001:~$ df -k | grep frontend
frontend001:/home                                 10188288 1454592   8193024  16% /home
frontend001:/usr/share/gridengine/default/common  10188288 1454592   8193024  16% /usr/share/gridengine/default/common

This has revealed a different problem. I cannot SSH from one host to another:

mbookman@frontend001:~$ ssh compute001
no matching hostkey found for key ED25519 <snip>
ssh_keysign: no reply
key_sign failed
Permission denied (publickey,hostbased).

and similarly:

mbookman@compute001:~$ ssh frontend001
no matching hostkey found for key ED25519 <snip>
ssh_keysign: no reply
key_sign failed
Permission denied (publickey,hostbased).

I'll see what I can find in the output and file a new issue.

riccardomurri commented 7 years ago

Thanks for checking! I'll close this issue then -- please open a new one for the SSH problem.

riccardomurri commented 7 years ago

The issue seems to be back, as of 2017-02-15::

TASK [nfs-server : Reload NFS exports file] ************************************
changed: [master001] => {"changed": true, "cmd": "exportfs -r", ... "stderr": "exportfs: worker001:/var/lib/gridengine/default/common: Function not implemented\nexportfs: master001:/var/lib/gridengine/default/common: Function not implemented\nexportfs: worker001:/home: Function not implemented", "stdout": "", "stdout_lines": [], "warnings": []}

riccardomurri commented 7 years ago

I cannot reproduce the issue with an up-to-date Debian 8 "jessie" VM image.

With an image that is not up to date, what happens seems to be the following:

A reboot of the server VM, followed by another run of elasticluster setup solves the problem -- as does using an up-to-date Debian 8 image.

riccardomurri commented 7 years ago

I'm marking this as "fix available" -- there's really not much to fix, but we need to add an entry in the release notes or troubleshooting section.