cms-tau-pog / TauFW

Analysis framework for tau analysis at CMS using NanoAOD

HTCondor configuration & singularity for SLC7/CentOS7 compatibility #66

Open IzaakWN opened 7 months ago

IzaakWN commented 7 months ago

Issue: Environment not set

Since March, HTCondor jobs on lxplus no longer have the CMSSW environment set correctly, and neither JOBID nor TASKID is set as defined in submit_HTCondor.sub. This causes the following error and subsequent job failure:

Traceback (most recent call last):
  File "/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8_g-2/src/TauFW/PicoProducer/python/processors/picojob.py", line 8, in <module>
    from PhysicsTools.NanoAODTools.postprocessing.framework.postprocessor import PostProcessor
  File "/usr/lib64/python3.6/site-packages/ROOT/_facade.py", line 150, in _importhook
    return _orig_ihook(name, *args, **kwds)
ModuleNotFoundError: No module named 'PhysicsTools'

Our hacky workaround so far has been to hardcode each user's CMSSW_BASE path in the executable script submit_HTCondor.sh and run cmsenv there...

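The cause appears to be that newer HTCondor versions expect the "new syntax" for the environment command (documented here): the whole value is enclosed in double quotes and the entries are separated by spaces instead of semicolons. We simply have to change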

getenv                = true
environment           = JOBID=$(ClusterId);TASKID=$(ProcId)

to

getenv                = true
environment           = "JOBID=$(ClusterId) TASKID=$(ProcId)"
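
To catch this kind of failure early, the job script could check that the variables from the environment line actually arrived before doing any work. A minimal sketch, assuming it is sourced or pasted at the top of submit_HTCondor.sh; the function name check_job_env is hypothetical and not part of TauFW:

```shell
#!/bin/bash
# Hypothetical helper (not part of TauFW): verify that the variables passed
# via the `environment` line of the submit file reached the job.
check_job_env() {
  status=0
  [ -n "${JOBID:-}" ]  || { echo "ERROR: JOBID is not set in the job environment" >&2; status=1; }
  [ -n "${TASKID:-}" ] || { echo "ERROR: TASKID is not set in the job environment" >&2; status=1; }
  return "$status"
}
```

Calling `check_job_env || exit 1` as the first step of the job would make a misconfigured submit file fail with a clear message instead of a Python import error later on.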

I'll make a PR with a patch asap.

Issue: SLC7/CC7/CentOS7 compatibility on lxplus

CERN's lxplus is phasing out CentOS7 by the end of June 2024 (see this announcement and this page).

To keep using CMSSW 11 or 12 with an SLC7 architecture, we have to run inside a Singularity container, both on the lxplus user nodes and in HTCondor jobs (see this page):

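CMSSW_BASE="/afs/cern.ch/user/i/ineuteli/analysis/CMSSW_12_4_8" # top of the CMSSW release area, so that $CMSSW_BASE/src exists
cmssw-el7 --env "CMSSW_BASE=$CMSSW_BASE" # setup singularity & pass environment variable
cd "$CMSSW_BASE/src" # from here on we are in the container shell
cmsenv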
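
Since the same setup script may end up running both on an EL9 node and inside the EL7 container, it could help to detect which one we are on before starting the container. A rough sketch, assuming the OS major version can be read from VERSION_ID in /etc/os-release; both function names are hypothetical:

```shell
#!/bin/bash
# Hypothetical helpers: decide whether an EL7 container is still needed.

# Print the OS major version, e.g. 9 for VERSION_ID="9.4".
# $1: os-release file to read (defaults to /etc/os-release)
os_major_version() {
  file="${1:-/etc/os-release}"
  sed -n 's/^VERSION_ID="\{0,1\}\([0-9]*\).*/\1/p' "$file"
}

# Succeed (return 0) if we are NOT already on an EL7-like system.
needs_el7_container() {
  [ "$(os_major_version "$@")" != "7" ]
}
```

For example, `if needs_el7_container; then cmssw-el7 --env "CMSSW_BASE=$CMSSW_BASE"; fi` would only start the container when the current shell is not already on an EL7 system.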

I'll add this in a future PR as well, and update the instructions in the documentation...

IzaakWN commented 7 months ago

The environment issue in lxplus HTCondor should be solved with PR https://github.com/cms-tau-pog/TauFW/pull/67. It should now also be possible to request that jobs run in a Singularity container.

One issue, however, is that if you work in a singularity, you lose the ability to submit jobs (condor_submit cannot be found anymore). We need to find a workaround for this... :( It would mean people have to exit the singularity to submit jobs.

However, if you have a CMSSW 11 or 12 setup with SLC7/CC7 inside a cmssw-cc7 singularity on an lxplus EL9 node, C++ libraries like ROOT will stop working, and so we run into a new compatibility issue... Currently,

IzaakWN commented 6 months ago

It now seems possible to submit HTCondor jobs from inside (SLC7/CC7) singularities, with the following instructions: https://gitlab.cern.ch/cms-cat/cmssw-lxplus/-/tree/master

#!/bin/bash
# Paths to bind-mount into the container
export APPTAINER_BINDPATH=/afs,/cvmfs,/cvmfs/grid.cern.ch/etc/grid-security:/etc/grid-security,/cvmfs/grid.cern.ch/etc/grid-security/vomses:/etc/vomses,/eos,/etc/pki/ca-trust,/etc/tnsnames.ora,/run/user,/tmp,/var/run/user,/etc/sysconfig,/etc:/orig/etc
# Look up the current schedd on the host, before entering the container
schedd=$(myschedd show -j | jq .currentschedd | tr -d '"')

# Start a shell in the EL7 container with HTCondor set up and pointed at that schedd
apptainer -s exec /cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus:latest/ sh -c "source /app/setupCondor.sh && export _condor_SCHEDD_HOST=$schedd && export _condor_SCHEDD_NAME=$schedd && export _condor_CREDD_HOST=$schedd && /bin/bash  "

and

export _condor_SCHEDD_HOST=bigbirdXY.cern.ch
export _condor_SCHEDD_NAME=bigbirdXY.cern.ch
export _condor_CREDD_HOST=bigbirdXY.cern.ch
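
The three exports always carry the same host name, so they could be wrapped in a small helper. A sketch only: the function name set_condor_schedd is hypothetical, and I am assuming bigbirdXY above is a placeholder for the actual schedd host:

```shell
#!/bin/bash
# Hypothetical helper: point the HTCondor client at one schedd host.
# $1: schedd host name, e.g. as reported by `myschedd show -j`
set_condor_schedd() {
  if [ -z "${1:-}" ]; then
    echo "usage: set_condor_schedd <schedd host>" >&2
    return 1
  fi
  export _condor_SCHEDD_HOST="$1"
  export _condor_SCHEDD_NAME="$1"
  export _condor_CREDD_HOST="$1"
}
```

Inside the container this could be called with the host name looked up beforehand, e.g. `set_condor_schedd "$schedd"` with `$schedd` obtained as in the script above.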

In HTCondor submit files (such as submit_HTCondor.sub), the container image is requested with:

MY.SingularityImage = "/cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus:latest/"
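
Putting the two fixes from this thread together, a submit file could be generated along these lines. This is a minimal sketch, not the actual submit_HTCondor.sub: the function name write_submit_file is hypothetical, and a real submit file would carry more settings (arguments, log files, and so on):

```shell
#!/bin/bash
# Hypothetical helper: write a minimal submit file that uses the new
# environment syntax and requests a container image.
# $1: output path, $2: container image path
write_submit_file() {
  out="$1"
  image="$2"
  {
    # Quoted heredoc: $(ClusterId) and $(ProcId) are HTCondor macros and
    # must be written literally, not expanded by the shell.
    cat <<'EOF'
executable            = submit_HTCondor.sh
getenv                = true
environment           = "JOBID=$(ClusterId) TASKID=$(ProcId)"
EOF
    printf 'MY.SingularityImage   = "%s"\n' "$image"
    printf 'queue\n'
  } > "$out"
}
```

For example, `write_submit_file test.sub "/cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus:latest/"` writes a five-line submit file that can be inspected before passing it to condor_submit.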

Also see general instructions for using containers with lxplus's HTCondor system: https://batchdocs.web.cern.ch/containers/index.html