ganga-devs / ganga

Ganga is an easy-to-use frontend for job definition and management
GNU General Public License v3.0

Ganga hanging on RAL T1 cluster - any clues? #463

Closed twhyntie closed 8 years ago

twhyntie commented 8 years ago

Hello,

I'm trying to run Ganga on the RAL Tier-1 cluster. I've tried both running from CVMFS and installing via the install script ganga-install and both hang. Using the --debug flag it gets this far:

$ . /cvmfs/ganga.cern.ch/runGanga.sh
[...]
2016-05-11 10:15:25,010 MainThread __init__            ::startUpQueues       :18 DEBUG   : Starting Queues
2016-05-11 10:15:25,160 MainThread GangaPlugin         ::add                 :98 DEBUG   : adding plugin Batch (category "backends") 
2016-05-11 10:15:25,210 MainThread Schema              ::_check_type         :544 DEBUG   : valType: <type 'dict'> defValueType: <type 'dict'> name: JobTree.folders
2016-05-11 10:15:25,222 MainThread GangaPlugin         ::add                 :98 DEBUG   : adding plugin JobTree (category "jobtrees") 
2016-05-11 10:15:25,256 MainThread Repository_runtime  ::checkDiskQuota      :92 DEBUG   : Error checking disk partition: invalid literal for int() with base 10: 'Filesystem                                1024-blocks     Used  Available Capacity Mounted on\nnfs1.gridpp.rl.ac.uk:/home/tier1/twhyntie  1441830080 64705208 1362476728       5'
2016-05-11 10:15:25,276 MainThread Repository_runtime  ::checkDiskQuota      :92 DEBUG   : Error checking disk partition: invalid literal for int() with base 10: 'Filesystem                                1024-blocks     Used  Available Capacity Mounted on\nnfs1.gridpp.rl.ac.uk:/home/tier1/twhyntie  1441830080 64705208 1362476728       5'
2016-05-11 10:15:25,277 MainThread Repository_runtime  ::bootstrap           :132 DEBUG   : Registry: prep
2016-05-11 10:15:25,289 MainThread Repository_runtime  ::bootstrap           :133 DEBUG   : Loc: /home/tier1/twhyntie/gangadir/repository/twhyntie/LocalXML
2016-05-11 10:15:25,302 MainThread Registry            ::startup             :930 DEBUG   : metadata startup
2016-05-11 10:15:25,314 MainThread Registry            ::startup             :935 DEBUG   : repo startup

I've tried running from CVMFS on two other clusters (QMUL and RALPP) with success, so I'm guessing there's something I need to do on the cluster side - but is there anything obvious that jumps out at you from these messages?

Many thanks in advance for any help/advice, @twhyntie

egede commented 8 years ago

@drmarkwslater Can you help Tom?

@twhyntie One obvious question that comes to mind is whether the file system you are using is "unusual" in some sense.

twhyntie commented 8 years ago

@egede Thanks - this is a good question :-) What counts as "unusual" for Ganga?

@alahiff knows about the cluster (I don't, I've just been given access as a test "GridPP new user").

alahiff commented 8 years ago

The home directory is mounted from an NFS server. @twhyntie what happens if you use /scratch/twhyntie instead? (this is on the local disk)

milliams commented 8 years ago

At least that Repository_runtime ::checkDiskQuota message is suspicious. I don't see that it should cause a hang though. It's broken on my local system in the same way but I don't see a hang.
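
The message itself suggests that the whole `df` output is being handed to `int()` rather than a single field. As a rough sketch of parsing that copes with this kind of output (this is not Ganga's actual code; the function name `parse_df_capacity` is made up for illustration, and the trailing `% /home/tier1/twhyntie` in the sample is an assumption, since the logged string stops at `5`):

```python
def parse_df_capacity(df_output):
    """Return the 'Capacity' column of `df -P <path>` output as an integer percentage.

    Hypothetical helper for illustration only; assumes POSIX-format output with
    a header line followed by a single data line, and no spaces in the
    filesystem name.
    """
    lines = [line for line in df_output.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and at least one data line")

    # POSIX layout: Filesystem 1024-blocks Used Available Capacity Mounted-on
    fields = lines[-1].split()
    if len(fields) < 6:
        raise ValueError("unexpected df output: %r" % lines[-1])

    # The capacity field looks like '5%'; strip the sign before converting.
    return int(fields[4].rstrip("%"))


# Sample modelled on the log message above (the tail after '5' is assumed).
sample = (
    "Filesystem                                1024-blocks     Used  Available Capacity Mounted on\n"
    "nfs1.gridpp.rl.ac.uk:/home/tier1/twhyntie  1441830080 64705208 1362476728       5% /home/tier1/twhyntie\n"
)
print(parse_df_capacity(sample))
```

Splitting the data line into fields first means the header text never reaches `int()`, which is what the logged error is complaining about.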

@twhyntie Could you try starting up Ganga, letting it hang for 10-15 seconds and then looking in /home/tier1/twhyntie/gangadir to see if there's a file in there called thread_trace.html? That will help to show us what's causing the hang.

twhyntie commented 8 years ago

@milliams Thanks - thread_trace.html is there as described. Here are the last two <h2> entries:

<h2>MainThread</h2>
<pre>  File "/home/tier1/twhyntie/Ganga/install/6.1.19/bin/ganga", line 65, in <module>
    Ganga.Runtime._prog.bootstrap(Ganga.Runtime._prog.interactive)
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Runtime/bootstrap.py", line 940, in bootstrap
    startUpRegistries()
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Runtime/Repository_runtime.py", line 254, in startUpRegistries
    for n, k, d in bootstrap():
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Runtime/Repository_runtime.py", line 134, in bootstrap
    registry.startup()
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/GPIDev/Lib/Registry/PrepRegistry.py", line 28, in startup
    super(PrepRegistry, self).startup()
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/Registry.py", line 168, in decorated
    return f(self, *args, **kwargs)
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/Registry.py", line 931, in startup
    self.metadata.startup()
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/Registry.py", line 168, in decorated
    return f(self, *args, **kwargs)
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/Registry.py", line 937, in startup
    self.repository.startup()
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/GangaRepositoryXML.py", line 215, in startup
    self.sessionlock.startup()
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/SessionLock.py", line 304, in decorated
    return f(self, *args, **kwargs)
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/SessionLock.py", line 391, in startup
    self.global_lock_acquire()
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/SessionLock.py", line 563, in global_lock_acquire
    self.delay_lock_mod(self.lockfd, fcntl.LOCK_EX)
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaRepository/SessionLock.py", line 519, in delay_lock_mod
    fcntl.lockf(lockfd, lock_mod)
</pre>
<h2>GANGA_Update_Thread_Ganga_Worker_0</h2>
<pre>  File "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/tier1/twhyntie/Ganga/install/6.1.19/python/Ganga/Core/GangaThread/WorkerThreads/WorkerThreadPool.py", line 78, in __worker_thread
    item = self.__queue.get(True, 0.05)
  File "/usr/lib64/python2.6/Queue.py", line 177, in get
    self.not_empty.wait(remaining)
  File "/usr/lib64/python2.6/threading.py", line 258, in wait
    _sleep(delay)
</pre>
<small>2016-05-11T09:34:49.911054</small>

twhyntie commented 8 years ago

@alahiff As in set $HOME to /scratch/twhyntie/ and run the Ganga install script in there?

alahiff commented 8 years ago

@twhyntie In your first message I saw /home/tier1/twhyntie/gangadir/repository/twhyntie/LocalXML so I'd assumed you'd done something which would cause this path to be used. Does Ganga always use the home directory by default?

The only 'strange' thing about /home/tier1/twhyntie is NFS, so that's why I thought it might be worthwhile trying /scratch/twhyntie

twhyntie commented 8 years ago

@alahiff Yup, that worked for both the local install and CVMFS :-)

$ export HOME=/scratch/twhyntie
$ . /cvmfs/ganga.cern.ch/runGanga.sh
[...]
[10:45:37]
Ganga In [1]:

Ganga seems to use $HOME as the default install directory (right?). So running from an NFS-mounted $HOME directory apparently causes a problem for Ganga - is this likely to be a wider issue for other Ganga users/clusters?

Anyway, happy to close this now - and thanks everyone for the help! Cheers, @twhyntie

milliams commented 8 years ago

Ganga does indeed use the home directory by default (though it can be overridden in the config). It should work fine on NFS as that is how I often run it but it's worth testing other options. Ganga seemed to be struggling with file locks which were causing the problems. Is this something you've seen before @alahiff?

drmarkwslater commented 8 years ago

Ganga should be absolutely fine on an NFS system (we haven't had a report like this before anyway!). The trace does seem to show that it's hanging when attempting to lock a file using Python's 'fcntl' module.

As a test, can you try from within plain python:

import os, fcntl
open("/home/tier1/twhyntie/gangadir/mylock", "w").close()
fd = os.open("/home/tier1/twhyntie/gangadir/mylock", os.O_RDWR)
fcntl.lockf(fd, fcntl.LOCK_EX)
fcntl.lockf(fd, fcntl.LOCK_UN)

This is basically all Ganga is doing at this point and so maybe there's an issue with file locking on this NFS share.
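
A slightly more defensive variant of the same test, as a sketch only: it works on a throwaway file, passes `LOCK_NB` so the call returns instead of waiting for another lock holder, and reports "No locks available" (`ENOLCK`) as a plain `False`. The function name `supports_fcntl_locks` is made up for illustration, the code is written for Python 3 (where `IOError` is an alias of `OSError`), and `LOCK_NB` is not guaranteed to avoid a hang if the NFS lock service itself is unresponsive.

```python
import errno
import fcntl
import os
import tempfile


def supports_fcntl_locks(directory):
    """Return True if an exclusive fcntl lock can be taken on a file in `directory`.

    Hypothetical helper for illustration; not part of Ganga.
    """
    # Create a throwaway file in the directory under test (opened read/write).
    fd, path = tempfile.mkstemp(prefix="locktest_", dir=directory)
    try:
        try:
            # LOCK_NB: fail immediately instead of waiting for another holder.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as err:
            # ENOLCK is "[Errno 37] No locks available" on Linux.
            if err.errno in (errno.ENOLCK, errno.EOPNOTSUPP):
                return False
            raise
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always clean up the descriptor and the temporary file.
        os.close(fd)
        os.remove(path)
```

Calling `supports_fcntl_locks("/home/tier1/twhyntie/gangadir")` and then the same for a directory on local disk would show whether only the NFS share is affected.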

rob-c commented 8 years ago

@milliams Am I correct in thinking there was a PR addressing multithreading issues in SessionLock in the past? I think this is thread-safe. Could there be some stale file lock not being cleaned up on NFS? (Apologies, I don't know how NFS handles this off the top of my head.)

twhyntie commented 8 years ago

@drmarkwslater Thanks - have tested (with $HOME=/home/tier1/twhyntie) and get the following error:

Traceback (most recent call last):
  File "testformark.py", line 8, in <module>
    fcntl.lockf(fd, fcntl.LOCK_EX)
IOError: [Errno 37] No locks available

So, yes, looks like a non-Ganga issue (and not a show-stopper for running Ganga on the RAL T1 cluster with @alahiff's workaround).

egede commented 8 years ago

@twhyntie I would recommend changing the location of the gangadir in the .gangarc configuration instead of changing $HOME (which could cause other problems). Just uncomment the gangadir entry in the file and set it to whatever other value you like.
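
As a sketch, the edited entry might look something like this, assuming it sits in the [Configuration] section of .gangarc and using the scratch directory mentioned earlier purely as an example path (the surrounding comments in a generated file may differ):

```ini
[Configuration]
# Uncommented and pointed at local disk instead of the NFS-mounted home directory.
gangadir = /scratch/twhyntie/gangadir
```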

drmarkwslater commented 8 years ago

Right - that's the problem then 😄 Now having said that, allowing Ganga to run on a filesystem without file locks seems like a sensible idea to me. I've created an issue for it: #465

twhyntie commented 8 years ago

@egede Sure, but is there an option to set gangadir when running /cvmfs/ganga.cern.ch/runGanga.sh or ganga-install? Apologies if I've missed them...

drmarkwslater commented 8 years ago

@twhyntie The runGanga.sh script will pick up everything in your .gangarc, so that's definitely the best option, as @egede suggests (the script actually does very little other than run Ganga 😄 ). You can also specify it on the command line:

/cvmfs/ganga.cern.ch/runGanga.sh -o[Configuration]gangadir=/scratch/twhyntie/gangadir

twhyntie commented 8 years ago

@drmarkwslater OK, thanks - as you don't get .gangarc until you do the install,

. /cvmfs/ganga.cern.ch/runGanga.sh -o[Configuration]gangadir=/scratch/twhyntie/gangadir

is what is needed, right?

drmarkwslater commented 8 years ago

This is a good point that I hadn't thought about: Ganga would have to start up to even generate the .gangarc. I'll add this to the tutorial on readthedocs! But yes, what you have there should work without issue.