ganga-devs / ganga

Ganga is an easy-to-use frontend for job definition and management
GNU General Public License v3.0

GaudiExec applications not running in container #2328

Closed by egede 4 months ago

egede commented 4 months ago

If a GaudiExec application is created with a release that requires an old platform like slc6, the job is in fact not running inside the container. The problem affects both the Dirac and the Local backend. After debugging, it turns out that the run script written by the GaudiExec runtime handler is missing the argument --container apptainer. With that in place the job runs.

egede commented 4 months ago

@heistera A fix will be implemented for this soon.

laf070810 commented 4 months ago

I am also bitten by this bug.

ERROR:lb-run:current host does not support platform x86_64_v2-centos7-gcc11-opt (dirac_platform: broadwell-el9, required: x86_64_v2-centos7, os_id: almalinux9)

Glad to see the work on-going.

egede commented 4 months ago

> I am also bitten by this bug.
>
> ERROR:lb-run:current host does not support platform x86_64_v2-centos7-gcc11-opt (dirac_platform: broadwell-el9, required: x86_64_v2-centos7, os_id: almalinux9)
>
> Glad to see the work on-going.

Not obvious to me that this is the same problem. In the other cases we have seen a runtime error, whereas in the case you report here, the Gaudi job doesn't even start. The cure may be the same though.

heistera commented 4 months ago

I agree with Ulrik. To me it looked like the jobs which crashed for me had an environment. Not sure if it was the correct one, though ...

laf070810 commented 4 months ago

Indeed, my jobs didn't even start... but the cure may be the same. If I run lb-run manually, it automatically chooses apptainer and runs normally. Running it from Ganga, however, yields the above error.

mesmith75 commented 4 months ago

I find this strange - my test jobs said they were running in apptainer. I guess there is nothing wrong with being explicit about it, though.

egede commented 4 months ago

So far my investigation covers only the Local backend (where the apptainer message does not appear), so we might not be all the way there. However, it turns out that the run file containing the lb-run command is written by the make step.

build.x86_64-slc6-gcc49-opt/ganga/run:exec lb-run   --siteroot=${MYSITEROOT:-/cvmfs/lhcb.cern.ch/lib} -c x86_64-slc6-gcc49-opt --path-to-project ${base_dir}/DaVinciDev_v39r1p6 "$@"

The file should instead have

exec lb-run   --siteroot=${MYSITEROOT:-/cvmfs/lhcb.cern.ch/lib} -c x86_64-slc6-gcc49-opt --container apptainer --path-to-project ${base_dir}/DaVinciDev_v39r1p6 "$@"

for the Local backend to work.
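As an illustration of the change described above, here is a minimal Python sketch that inserts `--container apptainer` right after the `-c <platform>` argument of an lb-run line. The function name `add_container_flag` is hypothetical and not part of Ganga; the actual fix may well be made where the run file is generated rather than by patching it afterwards.

```python
import re


def add_container_flag(run_script_text, container="apptainer"):
    """Return the run script text with '--container <container>' added.

    Hypothetical helper for illustration only. The flag is inserted after
    the first '-c <platform>' argument on every line that calls lb-run and
    does not already carry a --container option.
    """
    patched = []
    for line in run_script_text.splitlines():
        if "lb-run" in line and "--container" not in line:
            line = re.sub(
                r"(\s-c\s+\S+)",
                lambda m: f"{m.group(1)} --container {container}",
                line,
                count=1,
            )
        patched.append(line)
    return "\n".join(patched)
```

Applied to the `exec lb-run ...` line quoted above, this should give the corrected line, and applying it a second time leaves the text unchanged.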

egede commented 4 months ago

And there is in fact a further problem. When a job is submitted with the Local backend, it inherits the environment of the Ganga session. This (among other things) means that a different (and older) version of lb-run is used which doesn't understand the --container option.
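One way to detect this situation from within a session is to check whether the lb-run found on the current PATH mentions the `--container` option at all. The sketch below is only an assumption-laden illustration: it assumes that `lb-run --help` lists the supported options, and the function names are hypothetical, not Ganga or LbEnv API.

```python
import shutil
import subprocess


def help_mentions_container(help_text):
    """Return True if the given help text lists a --container option."""
    return "--container" in help_text


def lb_run_supports_container(executable="lb-run"):
    """Check the lb-run visible in the current environment.

    Assumption: 'lb-run --help' prints the supported options.
    Returns False if the executable is not found on PATH.
    """
    path = shutil.which(executable)
    if path is None:
        return False
    result = subprocess.run(
        [path, "--help"], capture_output=True, text=True, check=False
    )
    # Some tools print help to stderr, so look at both streams.
    return help_mentions_container(result.stdout + result.stderr)
```

Because the check uses whatever PATH is active, running it inside the Ganga session and in a fresh login shell would show whether the two environments pick up different lb-run versions.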

mesmith75 commented 4 months ago

I have successfully reproduced the original issue on the grid. I guess it is not doing what I thought after all