clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

spack build on NFS filesystem fails with "Error: Timed out waiting for lock." #6

Closed christopheredsall closed 4 years ago

christopheredsall commented 5 years ago

On a freshly built cluster (commit: ACRC/oci-cluster-terraform@55c148ff3cf9c6c44b40f0a41e3bfb1f59feb769), building a package with spack in a home directory mounted from the NFS server fails with:

[ce16990@mgmt ~]$ spack install hdf5
==> Error: Timed out waiting for lock.

Doing the same in /tmp succeeds:

[ce16990@mgmt tmp]$ spack install hdf5
==> Installing libsigsegv
==> Searching for binary cache of libsigsegv
[ ... ]
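
The difference between the NFS home directory and /tmp can be checked without spack by using a small lock probe. This is only a sketch, not Spack's code: it assumes the failure comes from POSIX fcntl byte-range locks (which, as far as I know, Spack's locking is built on) not being granted on the NFS mount. The name can_lock and its timeout and poll parameters are invented for illustration, and on a mount with an unresponsive lock manager the probe may itself hang rather than return.

```python
import errno
import fcntl
import os
import tempfile
import time


def can_lock(directory, timeout=5.0, poll=0.1):
    """Return True if an exclusive fcntl lock can be taken on a scratch
    file in `directory` within `timeout` seconds, else False.

    Hypothetical helper for illustration; not part of Spack.
    """
    fd, path = tempfile.mkstemp(prefix="lockprobe-", dir=directory)
    try:
        deadline = time.monotonic() + timeout
        while True:
            try:
                # Non-blocking exclusive lock covering the whole file.
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                if exc.errno == errno.ENOLCK:
                    # The filesystem reports that no locks are available.
                    return False
                if exc.errno not in (errno.EACCES, errno.EAGAIN):
                    raise
                if time.monotonic() >= deadline:
                    return False
                time.sleep(poll)
            else:
                fcntl.lockf(fd, fcntl.LOCK_UN)
                return True
    finally:
        # Always remove the scratch file, whatever the outcome.
        os.close(fd)
        os.unlink(path)
```

Comparing can_lock(os.path.expanduser("~")) with can_lock("/tmp") on the management node would show whether the two filesystems behave differently for this kind of lock.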

Running nfstest_lock from the NFStest suite shows test failures when a second process locks the same or an overlapping range:

[opc@mgmt ~]$ sudo /usr/bin/nfstest_lock --server fileserver --export=/shared --nfsversion=3
[ ... ]
*** Locking same range from a second process
    TEST: Running test 'optest01'
[ ... ]
    FAIL: Timeout waiting for blocked lock to be granted (0 passed, 12 failed)
[ ... ]
*** Locking overlapping range from a second process where start2 < start1
    TEST: Running test 'optest02'
[ ... ]
    FAIL: Timeout waiting for blocked lock to be granted (0 passed, 12 failed)
[ ... ]
*** Locking overlapping range from a second process where start2 < start1 and start2 == 0
[ ... ]
    FAIL: Timeout waiting for blocked lock to be granted (0 passed, 12 failed)
[ ... ]
christopheredsall commented 5 years ago

The Oracle Overview of File Storage documentation says:

The File Storage service supports the Network File System version 3.0 (NFSv3) protocol. The service supports the Network Lock Manager (NLM) protocol for file locking functionality.

christopheredsall commented 5 years ago

Workaround

cat > ~/.spack/config.yaml << EOT
---
config:
  locks: false
EOT

N.B. the caveat in the documentation about not running several spack commands at the same time:

 # When set to true, concurrent instances of Spack will use locks to
 # avoid modifying the install tree, database file, etc. If false, Spack
 # will disable all locking, but you must NOT run concurrent instances
 # of Spack.  For filesystems that don't support locking, you should set
 # this to false and run one Spack at a time, but otherwise we recommend
 # enabling locks.
 locks: true
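
With locks disabled, keeping to "one Spack at a time" is up to the user. One way to reduce the risk is a wrapper that serializes invocations through a lock file on a local filesystem, where locking worked in the test above. This is a sketch under stated assumptions: the lock file name spack-serial.lock and the function run_serialized are hypothetical, and the wrapper only protects against concurrent runs on the same host, not runs started from different nodes sharing the NFS home directory.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Hypothetical lock file on a local filesystem (not on the NFS mount).
LOCK_PATH = os.path.join(tempfile.gettempdir(), "spack-serial.lock")


def run_serialized(command, lock_path=LOCK_PATH):
    """Run `command` while holding an exclusive lock on `lock_path`.

    Returns the exit code of the command. Only serializes runs that
    use this wrapper on the same host.
    """
    with open(lock_path, "a") as lock_file:
        try:
            # Try to take the lock without waiting first.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("waiting for another run to finish", file=sys.stderr)
            # Block until the other holder releases the lock.
            fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return subprocess.call(command)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

For example, run_serialized(["spack", "install", "hdf5"]) would wait for any other wrapped spack command on the same node to finish before starting.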
christopheredsall commented 4 years ago

Fixed

It seems that Oracle have made changes to their NFS service.

The tests now produce better results for locking, although the blocked-lock checks still report some failures:

[ ... ]
*** Locking same range from a second process
    TEST: Running test 'optest01'
    PASS: Locking byte range (72 passed, 0 failed)
    PASS: Locking with overlapping range on second process (48 passed, 0 failed)
    PASS: Unlocking full file should be granted (36 passed, 0 failed)
    PASS: Unlocking full file on second process should be granted (32 passed, 0 failed)
    PASS: Locking byte range on second process (12 passed, 0 failed)
    PASS: Unlocking full file after delay should be granted (12 passed, 0 failed)
    FAIL: Blocked lock is granted after conflicting lock is released (9 passed, 3 failed)
    FAIL: Timeout waiting for blocked lock to be granted (0 passed, 3 failed)
    TIME: 1m54.248020s
[ ... ]
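
To compare runs like the two above, the PASS/FAIL lines can be totalled with a short script. This is a sketch that assumes the result lines keep the "(N passed, M failed)" shape shown in the excerpt; the names RESULT_RE and summarize are made up here and are not part of NFStest.

```python
import re

# Matches lines such as "    PASS: Locking byte range (72 passed, 0 failed)".
RESULT_RE = re.compile(
    r"^\s*(PASS|FAIL): (.*) \((\d+) passed, (\d+) failed\)\s*$"
)

# Result lines copied from the optest01 excerpt above.
SAMPLE = """\
    TEST: Running test 'optest01'
    PASS: Locking byte range (72 passed, 0 failed)
    PASS: Locking with overlapping range on second process (48 passed, 0 failed)
    PASS: Unlocking full file should be granted (36 passed, 0 failed)
    PASS: Unlocking full file on second process should be granted (32 passed, 0 failed)
    PASS: Locking byte range on second process (12 passed, 0 failed)
    PASS: Unlocking full file after delay should be granted (12 passed, 0 failed)
    FAIL: Blocked lock is granted after conflicting lock is released (9 passed, 3 failed)
    FAIL: Timeout waiting for blocked lock to be granted (0 passed, 3 failed)
    TIME: 1m54.248020s
"""


def summarize(output):
    """Return (total_passed, total_failed, failing_descriptions)."""
    passed = failed = 0
    failing = []
    for line in output.splitlines():
        match = RESULT_RE.match(line)
        if match is None:
            continue  # skip TEST:, TIME: and elided lines
        _status, description, n_pass, n_fail = match.groups()
        passed += int(n_pass)
        failed += int(n_fail)
        if int(n_fail) > 0:
            failing.append(description)
    return passed, failed, failing


if __name__ == "__main__":
    total_passed, total_failed, failing = summarize(SAMPLE)
    print(total_passed, "passed,", total_failed, "failed")
    for description in failing:
        print("still failing:", description)
```

For the excerpt above this totals 221 passed and 6 failed checks, with the remaining failures confined to the two blocked-lock lines.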

spack now also works from the NFS home directory without any workarounds:

[chris@mgmt ~]$ pwd
/mnt/shared/home/chris
[chris@mgmt ~]$ scl enable devtoolset-8 bash
[chris@mgmt ~]$ git clone https://github.com/spack/spack.git
Cloning into 'spack'...
[ ... ]
[chris@mgmt ~]$ . spack/share/spack/setup-env.sh
[chris@mgmt ~]$ spack install hdf5
==> Installing libsigsegv
==> Searching for binary cache of libsigsegv
[ ... ]
==> Successfully installed hdf5
  Fetch: 5.39s.  Build: 7m 49.51s.  Total: 7m 54.90s.
[+] /mnt/shared/home/chris/spack/opt/spack/linux-ol7-skylake_avx512/gcc-8.3.1/hdf5-1.10.5-jbiixqh4sdijblewehlafmfspt3ndjlm