cms-sw / genproductions

Generator fragments for MC production
https://twiki.cern.ch/twiki/bin/view/CMS/GitRepositoryForGenProduction

Condor submission failed at cmslpc #2101

qliphy opened this issue 5 years ago (status: Open)

qliphy commented 5 years ago

@AndreasAlbert @kdlong

It seems the gridpack condor submission script does not work at cmslpc for most of the nodes. This might be due to the "Moving to a central scheduling model" change at cmslpc, as mentioned here.

The error message is below:

    raise ClusterManagmentError, 'could not import htcondor python API: \n%s' % error
    madgraph.various.cluster.ClusterManagmentError: could not import htcondor python API:

It seems we should update "source_condor.sh" so that the HTCondor python bindings are available in the submission step?
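For context, here is a minimal standalone sketch of the check MadGraph effectively performs (this is not the repository's code; the wording just mirrors the error above). Running it in the environment sourced for the submission step shows whether the bindings are actually available there:

```python
# Standalone sketch: verify that the htcondor Python bindings are importable
# in the environment used for gridpack submission.
try:
    import htcondor
except ImportError as error:
    # MadGraph raises its own ClusterManagmentError at this point; here we
    # simply re-raise with the same wording for comparison.
    raise RuntimeError('could not import htcondor python API: \n%s' % error)

print("htcondor bindings found, HTCondor version:", htcondor.version())
```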

Ref: https://hypernews.cern.ch/HyperNews/CMS/get/generators/4243/1/1/1/1/1/1.html

qliphy commented 5 years ago

@Saptaparna is also checking this.

Saptaparna commented 5 years ago

Yes, I have submitted a ticket after verifying that condor submission still works for simple submission scripts. I will provide an update based on feedback from the LPC experts.

Saptaparna commented 5 years ago

The origin of the problem is that condor_submit has changed under the hood. The condor_submit command is no longer the native HTCondor command; the full redefinition of condor_submit can be seen by doing: more /usr/local/bin/condor_submit. In addition, the gridpack generation script assumes a local condor schedd, but the condor refactor at the LPC was specifically geared toward moving away from that setup.
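As a rough illustration of this point (the shebang heuristic and paths are assumptions, not an LPC-documented check), one can inspect what condor_submit on the PATH actually is:

```python
import shutil

# Find the condor_submit that would actually be invoked from the shell.
path = shutil.which("condor_submit")
if path is None:
    raise SystemExit("condor_submit not found on PATH")
print("condor_submit resolves to:", path)

# Heuristic: a native condor_submit is a compiled binary, whereas the
# refactored cmslpc version is a script, so a shebang line is a strong hint.
with open(path, "rb") as handle:
    is_script = handle.read(2) == b"#!"
print("looks like a wrapper script" if is_script else "looks like a compiled binary")
```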

qliphy commented 5 years ago

@Saptaparna Thanks for the information! Do you have a workaround? And, if possible, can you update our script to make it work at cmslpc?

Saptaparna commented 5 years ago

@qliphy The obvious workaround is to replace condor_submit with its cmslpc version. There is the added complication of making sure that the schedd name is not hard-coded and that a list of schedulers is provided (though this list may change over time). I am working on this now, following the suggestions of some of the experts here at the LPC.
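A hedged sketch of what avoiding a hard-coded schedd could look like with the htcondor Python bindings: the pool collector is asked for the currently advertised schedds, and one of them is used instead of assuming a daemon on the interactive node (picking the first entry is purely illustrative):

```python
import htcondor

# Ask the pool collector for all schedds it currently advertises, instead
# of assuming that a schedd runs locally on the interactive node.
collector = htcondor.Collector()
schedd_ads = collector.locateAll(htcondor.DaemonTypes.Schedd)
if not schedd_ads:
    raise SystemExit("no schedds advertised in this pool")

for ad in schedd_ads:
    print("available schedd:", ad.get("Name"))

# Talk to one of the advertised schedds; job submission and queue queries
# would then go through this object rather than a local daemon.
schedd = htcondor.Schedd(schedd_ads[0])
```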

qliphy commented 5 years ago

Suggestions from the FNAL computing experts are below for your reference:


You CAN run on CMS Connect to T3_US_FNALLPC, so that's a really good point of Dave's: if everyone running gridpacks just submits through CMS Connect, their jobs can run at T3_US_FNALLPC without rewriting the code.

One has to be sure that your CMS grid certificate is mapped to your FNAL username so that you are allowed in from CMS Connect and CRAB jobs, which I believe is true for all the gridpack users in this email.

-Marguerite

Dr. Marguerite Tonjes
LPC Computing Support https://lpc.fnal.gov/computing tonjes@fnal.gov
Skype: phMarguerite CMS Experiment Mattermost: @belt office: (630) 840-2859 FNAL WH11E

On Mar 25, 2019, at 10:48 AM, David A Mason dmason@fnal.gov wrote:

Though I would have a better question -- if it works through CMS Connect why is there a rewrite?

On Mar 25, 2019, at 10:03 AM, Marguerite Tonjes tonjes@fnal.gov wrote:

Yes, you are affected by the same gridpack issue Sapta has found here: https://github.com/cms-sw/genproductions/issues/2101

Basically the condor refactor has "condor_submit" as a python wrapper script which talks to the negotiators and then the schedulers. The condor schedulers are no longer located on the interactive nodes.

I was told that gridpacks can be run on CMS Connect, so you won't be out of CPUs while the rewrite is happening.

We do encourage your group to reach out if you need help in re-writing the scripts.

-Marguerite

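For completeness, here is a hedged sketch of a submission in the spirit of the CMS Connect suggestion above, steering jobs toward T3_US_FNALLPC via the htcondor bindings. The +DESIRED_Sites attribute, the executable name, and the use of the login node's schedd are illustrative assumptions, not a documented CMS Connect recipe:

```python
import htcondor

# Sketch: submit a job and ask the CMS glidein infrastructure to run it at
# T3_US_FNALLPC (attribute names and the executable are assumptions).
submit = htcondor.Submit({
    "executable": "run_gridpack.sh",            # hypothetical wrapper script
    "output": "gridpack_$(Cluster).$(Process).out",
    "error": "gridpack_$(Cluster).$(Process).err",
    "log": "gridpack_$(Cluster).log",
    "+DESIRED_Sites": '"T3_US_FNALLPC"',        # site whitelist for the glideins
})

schedd = htcondor.Schedd()  # on CMS Connect this is the login node's schedd
with schedd.transaction() as txn:
    cluster_id = submit.queue(txn)
print("submitted cluster", cluster_id)
```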