cms-sw / genproductions

Generator fragments for MC production
https://twiki.cern.ch/twiki/bin/view/CMS/GitRepositoryForGenProduction

Deprecation of `lxplus7` #3730

Open DickyChant opened 4 months ago

DickyChant commented 4 months ago
qiansitian@sqmbp16 ~> ssh lxplus7
ssh: Could not resolve hostname lxplus7.cern.ch: nodename nor servname provided, or not known

Today I realized that lxplus7 is no longer there...

For MadGraph gridpack generation, the issue is that we need to set up a CMSSW working environment on the fly, which means: 1) if we use the run-in-one-go option, we need to run everything inside an environment that matches the target scram_arch, say "el7" for UL; 2) we could split the steps and try to submit condor jobs from an environment different from the target scram_arch, say by running "CODEGEN" first in a container, then exiting and submitting... but this option does not seem to work with the current implementation (https://github.com/cms-sw/genproductions/blob/3c15d3baaa44018390c75b685cf07c9b2988774e/bin/MadGraph5_aMCatNLO/gridpack_generation.sh#L429, at least I cannot make it work).

I have 3 solutions in mind right now:

Those are the options that I feel are feasible (some are already available, some need a little bit of work), but I'd like to go with a recommendation from GEN since some of them do not really fit the roadmap.

DickyChant commented 4 months ago

I am also not sure whether we could have a dirty workaround by setting scram_arch to something different from the OS that we are working on. I won't say it is not worth trying, though; it would also need some work on the condor wrapping, but that's a relatively easy thing.
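The mismatch discussed here, between the host OS and the OS encoded in the target scram_arch, can be detected before a run is started. The following is only a minimal sketch: the helper names and the `OS_RELEASE_FILE` override are hypothetical and not part of genproductions, and it assumes the scram_arch string starts with an OS tag ending in the major version (as in `slc7_amd64_gcc700` or `el9_amd64_gcc11`).

```shell
#!/bin/sh
# Hypothetical helper: OS major version encoded in a scram_arch string,
# e.g. slc7_amd64_gcc700 -> 7, el9_amd64_gcc11 -> 9.
scram_arch_os_major() {
    os_tag=${1%%_*}                            # "slc7", "el9", ...
    printf '%s\n' "$os_tag" | tr -cd '0-9\n'   # keep only the digits
}

# Hypothetical helper: major version of the running OS.
# Reads VERSION_ID from /etc/os-release (values look like "7" or "9.4").
# OS_RELEASE_FILE is an assumed override, useful for testing.
host_os_major() {
    ( . "${OS_RELEASE_FILE:-/etc/os-release}" && printf '%s\n' "${VERSION_ID%%.*}" )
}

# Succeeds only if the target scram_arch matches the host OS major version.
arch_matches_host() {
    [ "$(scram_arch_os_major "$1")" = "$(host_os_major)" ]
}
```

A wrapper could then call something like `arch_matches_host slc7_amd64_gcc700 || echo "host OS differs from target scram_arch, consider a container"` before deciding how to run.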

lviliani commented 4 months ago

Hi @DickyChant, thanks a lot for these checks!

I think another solution is to use https://gitlab.cern.ch/cms-cat/cmssw-lxplus/ , which emulates lxplus7 with condor support.

lviliani commented 4 months ago

There is also a slightly modified version of the above (https://gitlab.cern.ch/lviliani/mg_cmssw_docker/) which I started working on, including also genproductions with the idea to have a container including everything we need to run gridpacks, but it's just a preliminary test for now.

DickyChant commented 4 months ago

Thanks for the heads up!

Is the container from CAT unpacked to cvmfs? If so, that’s a nice addition! I suffered a lot getting mine set up at lxplus…

One thing that actually worries me is the support for the condor Python API… if it requires a strict IP address check, trouble is inevitable when using a container (I never checked this part for singularity, but for docker it is a well-known mess… I never thought I’d experience a similar thing in my life because I decided to avoid docker as much as possible…)

Having a container setup is actually nice. I started with the dask-lxplus container because I thought it would have better Python API support and be usable out of the box, then I had to add a full set of dependencies copied from the CMSSW containers… Having genproductions be part of it is actually not a bad idea, though I am afraid that we need to address at least two things:

1: For NLO we often have libraries compiled and installed on the fly… which does seem ridiculous because we basically lose many of the benefits of using a container… therefore I believe Dominic’s new PR should come before this!
2: And… we are also downloading MG on the fly…

@sihyunjeon I thought you’d told me about making releases of genproductions. Actually, instead of making a legacy form of release, i.e. a code tarball of genproductions, we could release containers through ghcr. Getting a tarball release requires at least a download and untar/unzip/… etc., whereas a published container seems easier to access and to maintain as well (I could imagine that a natural setup is to have one CI job build the container and a follow-up CI job test whether it is usable).

lviliani commented 4 months ago

Yes the container is unpacked on cvmfs: /cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus

I agree with you in case we decide to include genproductions.

The nice thing about such a container is that it can also easily be used within a CI job to produce gridpacks using gitlab runners. I tested it for a light local gridpack production and it worked. In principle it could be extended to also use condor within the CI job, but that requires more work I guess.
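For illustration, a command could be run inside the unpacked image roughly as follows. This is only a sketch under assumptions: it uses plain `apptainer exec` rather than any start script the cmssw-lxplus repository may ship, the `/cvmfs` and `/afs` bind mounts are guesses at what a gridpack job needs, and `run_in_el7`, `el7_cmd` and `DRY_RUN` are hypothetical names.

```shell
#!/bin/sh
# Image path as quoted in this thread.
IMAGE=/cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/cms-cat/cmssw-lxplus/cmssw-el7-lxplus

# Hypothetical helper: print the command line that would be executed.
# The bind mounts are assumptions, adjust them to the site.
el7_cmd() {
    printf 'apptainer exec --bind /cvmfs --bind /afs %s' "$IMAGE"
    printf ' %s' "$@"
    printf '\n'
}

# Hypothetical helper: run "$@" inside the el7 image,
# or only print the command when DRY_RUN=1.
run_in_el7() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        el7_cmd "$@"
    else
        apptainer exec --bind /cvmfs --bind /afs "$IMAGE" "$@"
    fi
}
```

Calling `DRY_RUN=1 run_in_el7 cat /etc/redhat-release` only prints the command, which is handy in a CI job for checking the invocation before anything is executed.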

DickyChant commented 4 months ago

Actually, I tested my container on a desktop that I had at CERN. Since it is within the CERN network, it has access to the CERN HTCondor schedds and ran condor_q successfully, so in principle it should also be able to submit condor jobs. That is because dask-lxplus basically does what the CERN ABP twiki instructs, especially the part on how to get a local HTCondor setup that can access the CERN HTCondor pool. After a quick glance, I think CAT’s image does the same.

Now, if we think of the powheg CI jobs in this very repo, which depend on a VM from CERN OpenStack (I guess @mseidel42 knows more details), that VM would have afs as long as we activate it via locmaps, as well as the CERN internal web environment. So I do not see a technical issue with having a CI job use HTCondor if we make the interactive solution work at lxplus. It should also not be technically impossible if we could get an account that has condor authorization, like the pdmvserv account. Note that you can basically achieve the same thing with reana.
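The condor_q check described above can be wrapped into a small pre-flight test, for example at the start of an interactive container session or a CI job. This is a rough sketch: `can_query_schedd` is a hypothetical name, and it assumes that condor_q exits non-zero when it cannot contact a schedd.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if condor_q is available
# and is able to query a schedd from the current environment.
can_query_schedd() {
    if ! command -v condor_q >/dev/null 2>&1; then
        echo "condor_q not found" >&2
        return 2
    fi
    # Assumption: condor_q returns non-zero if no schedd can be reached.
    condor_q >/dev/null 2>&1
}
```

A job script could then do `can_query_schedd || exit 1` before attempting any submission.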

DickyChant commented 4 months ago

And for sure there is no issue with doing it in CI. In fact, I imagine our common background team would benefit more, since what has been done there is also basically a CI: new cards are pushed first, then a machine picks them up and executes them by submitting condor jobs.

lviliani commented 4 months ago

Right, technically it is indeed possible. We just have to figure out some details in case we want to do that. Yes, reana could also be an option and I think it can also be integrated with gitlab CI, but I'm not familiar with it.

DickyChant commented 4 months ago

I could give a short report some time in GEN to show how to scale up to 1000 jobs at reana with gitlab CI in the context of tuning, but as a spoiler: it doesn’t scale up well. We’ve been trying to fine-tune it (with @sihyunjeon and @shimashimarin)

and I won’t waste this chance to comment on its super inconvenient condor submission, which requires you to upload a krb5 keytab by hand!