ganga-devs / ganga

Ganga is an easy-to-use frontend for job definition and management
GNU General Public License v3.0

Problems with running slc6 based GaudiExec #2320

Closed mesmith75 closed 7 months ago

mesmith75 commented 7 months ago

For some very old applications (things requiring slc6) you need to use apptainer as they are not functional with el9.

Should be a very short addition to the run line command.

heistera commented 7 months ago

Looks like the standard ganga virtualisation, e.g. using something like:

j.virtualization = Apptainer("/cvmfs/cernvm-prod.cern.ch/cvm4")

or

j.virtualization = Apptainer("docker://gitlab-registry.cern.ch/lhcb-core/lbdocker/slc6-build:latest")

does not work for GaudiExec?

egede commented 7 months ago

While it might be possible to get that to work, it will be better to just implement it in a transparent way for the GaudiExec application.

heistera commented 7 months ago

As a user a timely solution would be great. Let me know if and how I could help.

egede commented 7 months ago

I was investigating this a bit further. So there are two issues at play here. I consider a job of the type

j = Job(application = prepareGaudiExec('DaVinci','v39r1p6', myPath='.', platform='x86_64-slc6-gcc49-opt', options=['empty.py']))

where empty.py is a python file with just the line pass in it.
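
To reproduce the job above, the options file has to exist before prepareGaudiExec is called. A minimal sketch of creating it follows; the helper name write_empty_options is made up for illustration and is not part of Ganga.

```python
from pathlib import Path


def write_empty_options(directory="."):
    """Create an options file that contains only a `pass` statement.

    Hypothetical helper for illustration; Ganga does not provide it.
    """
    path = Path(directory) / "empty.py"
    path.write_text("pass\n")
    return path
```

Calling write_empty_options() in the working directory gives the file referenced by options=['empty.py'].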

Configuration

If you are on an el9 machine, then j.prepare() will fail for this job as the cmake command fails. If starting Ganga inside a centos7 apptainer, then the j.prepare() step works. Clearly there is an issue that should be fixed there.

Running

Having prepared the job inside a centos7 apptainer, the job can then be submitted (from a standard session running on el9). The job then runs on the Dirac backend just fine. This is compatible with observations from others that jobs start but then crash later. In the JDL for the job, when looking at the Dirac monitoring, I see Platform = "x86_64-slc6", which is correct, and in the stderr of the job I also see WARNING:lb-run:Decided best container to use is apptainer, which indicates that the job already runs inside an apptainer. I confirmed this by running the job with the Local backend. So for runtime errors, it looks like a problem with how lb-run works and not a Ganga problem.
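
The platform strings in this comment carry the operating system as their second dash-separated field, which is what makes a mismatch with the host visible. The sketch below only illustrates that comparison; the function names are hypothetical, and the logic lb-run actually uses to pick a container is not shown in this thread.

```python
def platform_os(platform):
    """Return the OS field of a platform string such as 'x86_64-slc6-gcc49-opt'."""
    parts = platform.split("-")
    if len(parts) < 2:
        raise ValueError(f"unexpected platform string: {platform!r}")
    return parts[1]


def needs_container(platform, host_os):
    """Assumption for illustration: a container is needed when the OS fields differ."""
    return platform_os(platform) != host_os
```

For the job discussed here, needs_container('x86_64-slc6-gcc49-opt', 'el9') is true, which matches the observation that the job ends up inside an apptainer.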

egede commented 7 months ago

The command

exec lb-run   --siteroot=${MYSITEROOT:-/cvmfs/lhcb.cern.ch/lib} -c x86_64-slc6-gcc49-opt --path-to-project ${base_dir}/DaVinciDev_v39r1p6 bash

indeed sees you ending up in an slc6 environment

[DaVinciDev v39r1p6] DaVinciDev_v39r1p6 $ cat /etc/redhat-release
Scientific Linux release 6.9 (Carbon)

however, you can't run the configuration step inside that apptainer

[DaVinciDev v39r1p6] DaVinciDev_v39r1p6 $ cmake --build /home/egede/DaVinciDev_v39r1p6/build.x86_64-slc6-gcc49-opt --target ganga-input-sandbox
cmake: symbol lookup error: /cvmfs/lhcb.cern.ch/lib/var/lib/LbEnv/3114/stable/linux-64/bin/../lib/libuv.so.1: undefined symbol: sendmmsg

So we can't run the cmake command inside an slc6 environment (for a DaVinci version that requires slc6), but it works on centos7. Not very helpful.
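
The lookup error says that the libuv used by this cmake expects the C library to export sendmmsg. As far as is known, that function was added in glibc 2.14, while Scientific Linux 6 ships glibc 2.12, which would explain why the same binary works on centos7 but not on slc6. A small check, assuming a POSIX system where ctypes.CDLL(None) gives access to the symbols already loaded into the Python process (has_symbol is a hypothetical helper name):

```python
import ctypes


def has_symbol(name):
    """Return True if `name` resolves among the libraries loaded into this process."""
    try:
        # Attribute access on a CDLL performs the symbol lookup and
        # raises AttributeError when the symbol is missing.
        getattr(ctypes.CDLL(None), name)
    except AttributeError:
        return False
    return True
```

Running has_symbol("sendmmsg") inside the slc6 environment and again on the centos7 host would show whether the C library is the difference.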

mesmith75 commented 7 months ago

Yes we can run the command inside the apptainer. I'll open an MR later today.

mesmith75 commented 7 months ago

As far as I can tell it is just the make that needs adjusting. The jobs seem to run automatically with apptainer on the WN.
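
One way to picture "adjusting the make" is to prefix the build command with apptainer exec. The sketch below is not the change from the MR, which is not shown in this thread. The function name is hypothetical, the image path is borrowed from the virtualization example earlier in the thread purely as a placeholder, and the build directory is written relative for illustration.

```python
import shlex


def wrap_in_apptainer(command, image, binds=("/cvmfs",)):
    """Return an argv list that runs `command` inside `image` via `apptainer exec`."""
    argv = ["apptainer", "exec"]
    for path in binds:
        # Make host directories such as /cvmfs visible inside the container.
        argv += ["--bind", path]
    argv.append(image)
    # Hand the whole build command to a shell inside the container.
    argv += ["bash", "-c", command]
    return argv


build = "cmake --build build.x86_64-slc6-gcc49-opt --target ganga-input-sandbox"
argv = wrap_in_apptainer(build, image="/cvmfs/cernvm-prod.cern.ch/cvm4")
print(shlex.join(argv))
```

The returned list can be passed directly to subprocess.run; building an argv list rather than one long string avoids quoting problems when the build path contains spaces.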

egede commented 7 months ago

In that case the runtime errors reported are completely unrelated. Let's see.

mesmith75 commented 7 months ago

Getting the build fixed at least is useful though. I ran an example job fine - the log showed it ran inside a container.