cms-sw / genproductions

Generator fragments for MC production
https://twiki.cern.ch/twiki/bin/view/CMS/GitRepositoryForGenProduction

NLO+Reweight+MadSpin in Master Branch (MG260) #2100

Open · qliphy opened this issue 5 years ago

qliphy commented 5 years ago

@kdlong @AndreasAlbert

https://github.com/cms-sw/genproductions/blob/master/bin/MadGraph5_aMCatNLO/runcmsgrid_NLO.sh#L95-L112

It seems to me that reweighting is performed on the LHE file ($runname), independently of MadSpin ($runlabel).

We probably need to add the MadSpin step after reweighting, similar to what is done in runcmsgrid_LO.sh for LO samples.

Ref: https://hypernews.cern.ch/HyperNews/CMS/get/generators/4243/1/1/1/1.html
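A sketch of the proposed ordering (run_reweighting and run_madspin are hypothetical helper names used only for illustration, not functions in the actual script; $runname follows the script's variable naming):

# hypothetical sketch of the desired runcmsgrid_NLO.sh flow
run_reweighting events_${runname}.lhe   # 1. append the reweight weights to the undecayed LHE
run_madspin events_${runname}.lhe       # 2. then decay the reweighted events, so the new weights survive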

qliphy commented 5 years ago

ping

agrohsje commented 5 years ago

Dear @kdlong @lviliani @qliphy, I was checking reweighting in 2.6.5, but my comment is a general one, so I add it to this issue. When evaluating systematics without reweighting we use --remove_wgts=all. For the case of reweighting this is not possible: first the scale variations are added, then the parameter reweighting, then the weights from the systematics tool. We can avoid the redundant scale weights by setting none = systematics_program in run_card.dat when creating the gridpack. It cannot be done earlier, as the option would be overwritten.
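For reference, the corresponding run_card.dat line would be (standard MG5_aMC run-card syntax; the trailing comment is mine):

 none = systematics_program ! disable the systematics program so that only the reweight weights are appended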

qliphy commented 5 years ago

@agrohsje Yes, for the case of reweighting we should remove "--remove_wgts=all"; this is done in this PR: https://github.com/cms-sw/genproductions/pull/2170
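Schematically, the difference in the systematics call (run name illustrative; --remove_wgts is a standard option of the madevent systematics command):

systematics cmsgrid --remove_wgts=all   # no reweighting: safe to strip all pre-existing weights
systematics cmsgrid                     # with reweighting: keep the reweight weights intact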

Setting "none = systematics_program in run_card.dat" sounds a good idea.

BTW: we should really close PR #2170 ASAP.

lviliani commented 5 years ago

Hi, let me also add that NLO reweighting seems not to work at the moment when running this configuration out of the box (this one is without MadSpin) using the master branch: https://github.com/cms-sw/genproductions/tree/master/bin/MadGraph5_aMCatNLO/cards/examples/ggh012j_MassEffects_5f_NLO_FXFX

The problem seems to be related to some missing Fortran files when generating events [1]. Adding the Fortran files back does not solve the problem either, because in that case MadGraph tries to regenerate the code for the diagrams at the reweighting step.

I will try to investigate more, but if you have any insights please let me know.

[1]
REWEIGHT: Extracting the banner ...
REWEIGHT: process: p p > x0
REWEIGHT: options:
REWEIGHT: Running Reweighting
DEBUG: We are in mode False
DEBUG: change rwgt_dir rwgt
DEBUG: change mode NLO_tree
DEBUG: change output 2.0
DEBUG: change model loop_sm-no_b_mass
DEBUG: change process p p > h [QCD]
DEBUG: change process p p > h j [QCD] --add
DEBUG: change process p p > h j j QED<=1 [QCD] --add
DEBUG: launch --rwgt_name=topMassEffects
REWEIGHT: detected model: HC_NLO_X0_UFO-heft. Loading...
REWEIGHT: generating the square matrix element for reweighting
REWEIGHT: generate p p > x0 --no_warning=duplicate; define pert_QCD = -5 -4 -3 -2 -1 1 2 3 4 5 21; add process p p > x0 pert_QCD --no_warning=duplicate; define pert_QCD = -5 -4 -3 -2 -1 1 2 3 4 5 21; add process p p > x0 j pert_QCD / t t~ --no_warning=duplicate; define pert_QCD = -5 -4 -3 -2 -1 1 2 3 4 5 21; add process p p > x0 j j pert_QCD / t t~ QED<=1 --add --no_warning=duplicate;
DEBUG: Command "reweight cmsgrid -from_cards" interrupted with error:
DEBUG: OSError : [Errno 2] No such file or directory: '/tmp/lviliani/mgbasedir/models/template_files/fortran/printout.f'
DEBUG: Please report this bug on https://bugs.launchpad.net/mg5amcnlo
DEBUG: More information is found in '/tmp/lviliani/process/cmsgrid_tag_2_debug.log'.
DEBUG: Please attach this file to your report.
REWEIGHT: Original cross-section: 90.306061 +- 0.16502868 pb
REWEIGHT: Computed cross-section:
DEBUG: Exception AttributeError: "'ReweightInterface' object has no attribute 'id_to_path'" in <bound method ReweightInterface.__del__ of <madgraph.interface.reweight_interface.ReweightInterface object at 0x7f8594d4e690>> ignored
DEBUG: quit
REWEIGHT: gzipping output file: events.lhe

agrohsje commented 5 years ago

Hi @lviliani, usually this happens when the pilot run didn't succeed. Can you post your gridpack generation log?

lviliani commented 5 years ago

You are right indeed, I overlooked the log and didn't realize there was this compilation error:

File "/local-scratch/lviliani/master/genproductions/bin/MadGraph5_aMCatNLO/ggh012j_MassEffects_5f_NLO_FXFX/ggh012j_MassEffects_5f_NLO_FXFX_gridpack/work/MG5_aMC_v2_6_0/madgraph/various/misc.py", line 480, in compile raise MadGraph5Error, error_text MadGraph5Error: A compilation Error occurs when trying to compile /local-scratch/lviliani/master/genproductions/bin/MadGraph5_aMCatNLO/ggh012j_MassEffects_5f_NLO_FXFX/ggh012j_MassEffects_5f_NLO_FXFX_gridpack/work/processtmp/rwgt/rw_me_second/SubProcesses. The compilation fails with the following output message: ar rcs libMadLoop.a MadLoopParamReader.o MadLoopCommons.o P17_uxcx_huxcx/polynomial.o P6_gg_hg/polynomial.o P5_uux_hgg/polynomial.o P2_gg_huux/polynomial.o P11_gu_hu/polynomial.o P8_uux_huux/polynomial.o P14_uc_huc/polynomial.o P12_gux_hux/polynomial.o P10_gg_h/polynomial.o P4_gux_hgux/polynomial.o P0_gg_hgg/polynomial.o P3_gu_hgu/polynomial.o P7_uu_huu/polynomial.o P15_uux_hccx/polynomial.o P13_uux_hg/polynomial.o P16_ucx_hucx/polynomial.o P9_uxux_huxux/polynomial.o P17_uxcx_huxcx/loop_matrix.o P6_gg_hg/loop_matrix.o P5_uux_hgg/loop_matrix.o P2_gg_huux/loop_matrix.o P11_gu_hu/loop_matrix.o P8_uux_huux/loop_matrix.o P14_uc_huc/loop_matrix.o P12_gux_hux/loop_matrix.o P10_gg_h/loop_matrix.o P4_gux_hgux/loop_matrix.o P0_gg_hgg/loop_matrix.o P3_gu_hgu/loop_matrix.o P7_uu_huu/loop_matrix.o P15_uux_hccx/loop_matrix.o P13_uux_hg/loop_matrix.o P16_ucx_hucx/loop_matrix.o P9_uxux_huxux/loop_matrix.o P17_uxcx_huxcx/improve_ps.o P6_gg_hg/improve_ps.o P5_uux_hgg/improve_ps.o P2_gg_huux/improve_ps.o P11_gu_hu/improve_ps.o P8_uux_huux/improve_ps.o P14_uc_huc/improve_ps.o P12_gux_hux/improve_ps.o P10_gg_h/improve_ps.o P4_gux_hgux/improve_ps.o P0_gg_hgg/improve_ps.o P3_gu_hgu/improve_ps.o P7_uu_huu/improve_ps.o P15_uux_hccx/improve_ps.o P13_uux_hg/improve_ps.o P16_ucx_hucx/improve_ps.o P9_uxux_huxux/improve_ps.o P17_uxcx_huxcx/CT_interface.o P6_gg_hg/CT_interface.o P5_uux_hgg/CT_interface.o P2_gg_huux/CT_interface.o P11_gu_hu/CT_interface.o P8_uux_huux/CT_interface.o P14_uc_huc/CT_interface.o P12_gux_hux/CT_interface.o P10_gg_h/CT_interface.o P4_gux_hgux/CT_interface.o P0_gg_hgg/CT_interface.o P3_gu_hgu/CT_interface.o P7_uu_huu/CT_interface.o P15_uux_hccx/CT_interface.o P13_uux_hg/CT_interface.o P16_ucx_hucx/CT_interface.o P9_uxux_huxux/CT_interface.o P17_uxcx_huxcx/loop_num.o P6_gg_hg/loop_num.o P5_uux_hgg/loop_num.o P2_gg_huux/loop_num.o P11_gu_hu/loop_num.o P8_uux_huux/loop_num.o P14_uc_huc/loop_num.o P12_gux_hux/loop_num.o P10_gg_h/loop_num.o P4_gux_hgux/loop_num.o P0_gg_hgg/loop_num.o P3_gu_hgu/loop_num.o P7_uu_huu/loop_num.o P15_uux_hccx/loop_num.o P13_uux_hg/loop_num.o P16_ucx_hucx/loop_num.o P9_uxux_huxux/loop_num.o P17_uxcx_huxcx/helas_calls_ampb_1.o P6_gg_hg/helas_calls_ampb_1.o P5_uux_hgg/helas_calls_ampb_1.o P2_gg_huux/helas_calls_ampb_1.o P11_gu_hu/helas_calls_ampb_1.o P8_uux_huux/helas_calls_ampb_1.o P14_uc_huc/helas_calls_ampb_1.o P12_gux_hux/helas_calls_ampb_1.o P10_gg_h/helas_calls_ampb_1.o P4_gux_hgux/helas_calls_ampb_1.o P0_gg_hgg/helas_calls_ampb_1.o P3_gu_hgu/helas_calls_ampb_1.o P7_uu_huu/helas_calls_ampb_1.o P15_uux_hccx/helas_calls_ampb_1.o P13_uux_hg/helas_calls_ampb_1.o P16_ucx_hucx/helas_calls_ampb_1.o P9_uxux_huxux/helas_calls_ampb_1.o P17_uxcx_huxcx/mp_compute_loop_coefs.o P6_gg_hg/mp_compute_loop_coefs.o P5_uux_hgg/mp_compute_loop_coefs.o P2_gg_huux/mp_compute_loop_coefs.o P11_gu_hu/mp_compute_loop_coefs.o P8_uux_huux/mp_compute_loop_coefs.o P14_uc_huc/mp_compute_loop_coefs.o P12_gux_hux/mp_compute_loop_coefs.o 
P10_gg_h/mp_compute_loop_coefs.o P4_gux_hgux/mp_compute_loop_coefs.o P0_gg_hgg/mp_compute_loop_coefs.o P3_gu_hgu/mp_compute_loop_coefs.o P7_uu_huu/mp_compute_loop_coefs.o P15_uux_hccx/mp_compute_loop_coefs.o P13_uux_hg/mp_compute_loop_coefs.o P16_ucx_hucx/mp_compute_loop_coefs.o P9_uxux_huxux/mp_compute_loop_coefs.o P17_uxcx_huxcx/mp_helas_calls_ampb_1.o P6_gg_hg/mp_helas_calls_ampb_1.o P5_uux_hgg/mp_helas_calls_ampb_1.o P2_gg_huux/mp_helas_calls_ampb_1.o P11_gu_hu/mp_helas_calls_ampb_1.o P8_uux_huux/mp_helas_calls_ampb_1.o P14_uc_huc/mp_helas_calls_ampb_1.o P12_gux_hux/mp_helas_calls_ampb_1.o P10_gg_h/mp_helas_calls_ampb_1.o P4_gux_hgux/mp_helas_calls_ampb_1.o P0_gg_hgg/mp_helas_calls_ampb_1.o P3_gu_hgu/mp_helas_calls_ampb_1.o P7_uu_huu/mp_helas_calls_ampb_1.o P15_uux_hccx/mp_helas_calls_ampb_1.o P13_uux_hg/mp_helas_calls_ampb_1.o P16_ucx_hucx/mp_helas_calls_ampb_1.o P9_uxux_huxux/mp_helas_calls_ampb_1.o P17_uxcx_huxcx/coef_construction_1.o P6_gg_hg/coef_construction_1.o P5_uux_hgg/coef_construction_1.o P2_gg_huux/coef_construction_1.o P11_gu_hu/coef_construction_1.o P8_uux_huux/coef_construction_1.o P14_uc_huc/coef_construction_1.o P12_gux_hux/coef_construction_1.o P10_gg_h/coef_construction_1.o P4_gux_hgux/coef_construction_1.o P0_gg_hgg/coef_construction_1.o P3_gu_hgu/coef_construction_1.o P7_uu_huu/coef_construction_1.o P15_uux_hccx/coef_construction_1.o P13_uux_hg/coef_construction_1.o P16_ucx_hucx/coef_construction_1.o P9_uxux_huxux/coef_construction_1.o P17_uxcx_huxcx/loop_CT_calls_1.o P6_gg_hg/loop_CT_calls_1.o P5_uux_hgg/loop_CT_calls_1.o P2_gg_huux/loop_CT_calls_1.o P11_gu_hu/loop_CT_calls_1.o P8_uux_huux/loop_CT_calls_1.o P14_uc_huc/loop_CT_calls_1.o P12_gux_hux/loop_CT_calls_1.o P10_gg_h/loop_CT_calls_1.o P4_gux_hgux/loop_CT_calls_1.o P0_gg_hgg/loop_CT_calls_1.o P3_gu_hgu/loop_CT_calls_1.o P7_uu_huu/loop_CT_calls_1.o P15_uux_hccx/loop_CT_calls_1.o P13_uux_hg/loop_CT_calls_1.o P16_ucx_hucx/loop_CT_calls_1.o P9_uxux_huxux/loop_CT_calls_1.o P17_uxcx_huxcx/mp_coef_construction_1.o P6_gg_hg/mp_coef_construction_1.o P5_uux_hgg/mp_coef_construction_1.o P2_gg_huux/mp_coef_construction_1.o P11_gu_hu/mp_coef_construction_1.o P8_uux_huux/mp_coef_construction_1.o P14_uc_huc/mp_coef_construction_1.o P12_gux_hux/mp_coef_construction_1.o P10_gg_h/mp_coef_construction_1.o P4_gux_hgux/mp_coef_construction_1.o P0_gg_hgg/mp_coef_construction_1.o P3_gu_hgu/mp_coef_construction_1.o P7_uu_huu/mp_coef_construction_1.o P15_uux_hccx/mp_coef_construction_1.o P13_uux_hg/mp_coef_construction_1.o P16_ucx_hucx/mp_coef_construction_1.o P9_uxux_huxux/mp_coef_construction_1.o P17_uxcx_huxcx/TIR_interface.o P6_gg_hg/TIR_interface.o P5_uux_hgg/TIR_interface.o P2_gg_huux/TIR_interface.o P11_gu_hu/TIR_interface.o P8_uux_huux/TIR_interface.o P14_uc_huc/TIR_interface.o P12_gux_hux/TIR_interface.o P10_gg_h/TIR_interface.o P4_gux_hgux/TIR_interface.o P0_gg_hgg/TIR_interface.o P3_gu_hgu/TIR_interface.o P7_uu_huu/TIR_interface.o P15_uux_hccx/TIR_interface.o P13_uux_hg/TIR_interface.o P16_ucx_hucx/TIR_interface.o P9_uxux_huxux/TIR_interface.o P17_uxcx_huxcx/compute_color_flows.o P6_gg_hg/compute_color_flows.o P5_uux_hgg/compute_color_flows.o P2_gg_huux/compute_color_flows.o P11_gu_hu/compute_color_flows.o P8_uux_huux/compute_color_flows.o P14_uc_huc/compute_color_flows.o P12_gux_hux/compute_color_flows.o P10_gg_h/compute_color_flows.o P4_gux_hgux/compute_color_flows.o P0_gg_hgg/compute_color_flows.o P3_gu_hgu/compute_color_flows.o P7_uu_huu/compute_color_flows.o P15_uux_hccx/compute_color_flows.o 
P13_uux_hg/compute_color_flows.o P16_ucx_hucx/compute_color_flows.o P9_uxux_huxux/compute_color_flows.o make: No rule to make target '../lib/libiregi.a', needed by 'allmatrix2py.so'. Stop. make: Waiting for unfinished jobs.... mv libMadLoop.a ../lib/libMadLoop.a

agrohsje commented 5 years ago

I wanted to check in 2.6.5, but it doesn't like these --add options in the proc/reweight card. What are they for? 2.6.1 seems to accept them, but I had never seen them before.

kdlong commented 5 years ago

This process is different from most other reweight processes I know of because it completely changes the model. The --add syntax was a way to include the additional jet contributions from the new model. Maybe they changed the syntax?
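For context, this is the pattern under discussion as it appears in the reweight card (the lines below mirror the DEBUG output in [1] above):

change model loop_sm-no_b_mass
change process p p > h [QCD]
change process p p > h j [QCD] --add
change process p p > h j j QED<=1 [QCD] --add
launch --rwgt_name=topMassEffects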

agrohsje commented 5 years ago

Hi @kdlong, but why does it also appear in the process card: https://github.com/cms-sw/genproductions/blob/master/bin/MadGraph5_aMCatNLO/cards/examples/ggh012j_MassEffects_5f_NLO_FXFX/ggh012j_MassEffects_5f_NLO_FXFX_proc_card.dat? What is its meaning there?

kdlong commented 5 years ago

Oops, I think that's just a mistake. Is that where it's causing the error? It should just be deleted.

agrohsje commented 5 years ago

Funnily enough, I only get an error about this line in 2.6.5; my 2.6.0 job is still running, and I don't know what it does with the additional syntax. Anyway, probably better to fix it. :-)

agrohsje commented 5 years ago

The line was ignored in 2.6.0, so we should fix the cards. Is it worth understanding why iregi is not there in 2.6.0 or should we just try 2.6.1/2.6.5?

agrohsje commented 5 years ago

I tried 2.6.5 after removing the --add in the proc card but keeping it in the reweight card, and it worked up to a KeyError in the reweighting step:

REWEIGHT: change model loop_sm-no_b_mass
change process p p > h [QCD]
change process p p > h j [QCD]
change process p p > h j j QED<=1 [QCD]
REWEIGHT: Event nb 0 0.027s
Command "reweight pilotrun -from_cards" interrupted with error:
KeyError : ((1, 21), (1, 21, 21, 25))
Please report this bug on https://bugs.launchpad.net/mg5amcnlo

My understanding of these KeyErrors so far is that they come from, e.g., an inconsistent color configuration between the reweighting and the process step. Could that be because the loop is resolved in one case and not in the other?

lviliani commented 5 years ago

Hi, after some more investigation I realized that the iregi compilation error here: https://github.com/cms-sw/genproductions/issues/2100#issuecomment-509527625 is specific to CMS Connect and happens with both MG260 and MG265.

This is due to the set output_dependencies internal option, which is set in our gridpack generation script when running on CMS Connect and which forces iregi to be compiled locally.
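For reference, the option uses the standard MG5_aMC set syntax (the comment is mine):

set output_dependencies internal   # copy and compile dependency sources locally instead of linking a central installation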

@agrohsje tested that it works (well, up to the KeyError) outside CMS Connect.

@khurtado Do you know if there's any workaround for this?

lviliani commented 5 years ago

@khurtado From some tests that @covarell is doing with loop-induced processes, it looks like some loop reduction libraries (like Ninja) can't be used on CMS Connect due to the set output_dependencies internal setting; see, e.g., the message below:

INFO: When using the 'output_dependencies=internal' MG5_aMC option, the (optional) reduction library ninja cannot be employed because it is not distributed with the MG5_aMC code so that it cannot be copied locally.

Do you know why we need this setting on CMS Connect? Is there any workaround?

AndreasAlbert commented 5 years ago

@khurtado @lviliani did you ever get around this issue? I am also facing it.

agrohsje commented 5 years ago

Hi @AndreasAlbert, which problem exactly are you facing? Just to let you know, I am currently trying to get the LO/NLO reweighting syntax fixed in general. There are still several problems with the output storage.

AndreasAlbert commented 5 years ago

@agrohsje I'm referring to the CMS Connect compilation error.

I'm trying to run an NLO gridpack with merging + MadSpin + reweighting. I tried condor submission at lxplus, FNAL and CMS Connect; CMS Connect seems to work best, but then fails as described above (the others also fail, but in different ways). Is the issue you're describing expected to make gridpack_generation fail?

On lxplus, it fails with this error:

DEBUG: reweight pilotrun -from_cards 
DEBUG: For consistency in the FxFx merging, 'jetradius' has been set to 1.0 
DEBUG: Command "reweight pilotrun -from_cards" interrupted with error: 
DEBUG: InvalidCmd : No events file corresponding to pilotrun run.

I can also provide more details on cards + logs if that's helpful.

agrohsje commented 5 years ago

Hi @AndreasAlbert, gridpack generation itself worked for me, but it requires modifications to avoid unnecessary systematic weights. The mandatory modifications are in the runcmsgrid script. Can you check whether the pilot run worked? Whenever I saw the above error message, the pilot run had indeed failed.

lviliani commented 5 years ago

As far as I know, the problems due to set output_dependencies internal on CMS Connect have not been solved yet. This also caused the pilot run to fail in the tests I did.

AndreasAlbert commented 5 years ago

@agrohsje presumably the pilot run failed. The log file from lxplus shows this:

(...)
INFO: P1_bxbx_xdxdxbxbx_no_hwpwpza 
INFO:  Result for test_ME: 
INFO:    Passed. 
INFO:  Result for test_MC: 
INFO:    Passed. 
INFO:  Result for check_poles: 
INFO:    Poles successfully cancel for 20 points over 20 (tolerance=1.0e-05) 
INFO: Starting run 
cluster handling will be done with PLUGIN: CMS_CLUSTER
INFO: Cleaning previous results 
INFO: Generating events without running the shower. 
INFO: Setting up grids 
Start waiting for update. (more info in debug mode)
quit
INFO:  
preparing reweighting step
preparing reweighting step
Running MG5 in debug mode
(...)

No errors are shown, but I guess the part where the actual condor jobs are submitted, waited for, and retrieved is missing between "Generating events..." and "quit". I figure that part fails quietly; I will have to take a closer look to diagnose.

On CMS Connect, I believe the pilot run succeeds and then reweighting fails as described by @lviliani. Given that (in my experience) CMS Connect is overall the most reliable way of generating gridpacks, I think it would be a shame not to support this specific use case. What do you think @agrohsje? Is there any way to work around this @khurtado?

khurtado commented 5 years ago

@AndreasAlbert There is no clean solution for this at present. However, depending on how resource-intensive the CODEGEN step is, you can likely run the standard gridpack_generation script (not the CMS Connect one) in condor mode, which will run the CODEGEN step on the submit node and then submit jobs for the INTEGRATE step. This doesn't use set output_dependencies internal, so it should work. This is obviously not the recommended method, as it won't scale well if too many gridpacks execute the code-generation stage on the submit node, so you have to be careful with the load, but it is a workaround if there is no other solution.
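A sketch of that workaround invocation, assuming the usual ./gridpack_generation.sh <card name> <cards dir> <queue> calling convention (check the script header for the exact arguments on your branch):

./gridpack_generation.sh ggh012j_MassEffects_5f_NLO_FXFX cards/examples/ggh012j_MassEffects_5f_NLO_FXFX condor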

For the CMS Connect script itself, you can also try commenting these lines and replacing the "INTEGRATE" word in this line with "ALL". That will set up the environment variables to deal with failed jobs, so it should be a more robust approach than just running the gridpack_generation bash script in condor mode.
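Schematically, the suggested edit to the CMS Connect submit script (the actual lines sit behind the links above; the variable name here is purely illustrative):

# before: only the integration stage goes through the retry-aware condor path
#   step="INTEGRATE"
# after: run everything, including CODEGEN, through that path
#   step="ALL"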

agrohsje commented 5 years ago

@AndreasAlbert are you available to test, and to modify the TWiki accordingly if the workarounds helped? @khurtado what is needed to solve the set output_dependencies internal problem on CMS Connect?

AndreasAlbert commented 5 years ago

@khurtado Thanks for this suggestion, I will try it and report back.

AndreasAlbert commented 5 years ago

@khurtado using submit_condor instead of submit_cmsconnect works in principle. I do run into issues with the condor jobs, though, where they go into HOLD with the hold message:

Cannot access initial working directory /local-scratch/aalbert/monojet/260/genproductions/bin/MadGraph5_aMCatNLO/DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0/DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0_gridpack/work/processtmp/SubProcesses/P1_gcx_xdxdxcxg_no_hwpwpza: No such file or directory

upon which gridpack generation fails. This error persists after retrying, so it does not seem to be a transient file system issue. Interestingly, the directory P1_gcx_xdxdxcxg_no_hwpwpza does seem to exist:

find DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0 -name P1_gcx_xdxdxcxg_no_hwpwpza

> DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0/DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0_gridpack/work/gridpack/process/SubProcesses/P1_gcx_xdxdxcxg_no_hwpwpza

Since gridpack_generation moves all the folders around while dying, I cannot guarantee that this folder was in the right place pre-mortem, but I would assume so. @khurtado have you encountered (something like) this before?
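For digging into holds like this, the standard HTCondor queries can help (generic commands, not taken from the logs above):

condor_q -hold                # list held jobs together with their HoldReason
condor_release <cluster_id>   # retry the job once the directory issue is understood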

All the relevant files are available for inspection:

/stash2/user/aalbert/public/share/2019-10-29_reweighting

kdlong commented 5 years ago

Is this on lxplus or FNAL? The problem is probably with the permissions to read the file, either to transfer it to the worker node initially or to access it during the job. At lxplus I believe the only mechanism for the nodes or the scheduler to have read access to files is AFS. I'm not sure about FNAL.

AndreasAlbert commented 5 years ago

@kdlong this is using the CMS Connect condor system, but without the CMS Connect-specific script, as proposed by @khurtado above. So everything is on login.uscms.org. You are right though, I hadn't considered incorrect permissions.

AndreasAlbert commented 4 years ago

@khurtado I finally got your suggestion to work by copying some of the condor options from the CMS Connect script to the regular condor script. This works, but is not ideal, because now the main job, and with it a lot of the compilation load, runs on the login node.

My question is this: if the CODEGEN step is the reason we cannot have working reweighting, can we submit what I am now running on the login node as a condor job, without doing CODEGEN separately? I realize that I don't know what the rationale for separate CODEGEN was in the first place.

kdlong commented 4 years ago

@AndreasAlbert if I understand the question, the logic would be to avoid having the central job, which does codegen + subsequent job submission, running on a condor machine. You can't take for granted that the worker machines in the cluster have rights to submit condor jobs to the other machines in the cluster. When I ran jobs at Wisconsin I always just did the codegen + submission on our login machine. Is the problem with this approach that the jobs take too long or are too computing-intensive and get killed?

khurtado commented 4 years ago

@AndreasAlbert @kdlong: I'm out of the office until next Monday (25th), but to quickly reply to the question: processing CODEGEN separately was done in an effort to reduce the load on the login node as much as possible. A couple of gridpacks running is not a problem, but we have had 50+ different variations running at times, which can be a problem and exhaust resources (harming other, non-gridpack users as well). The login node is supposed to be a submission point to condor only. Making CODEGEN run on a worker doesn't fully resolve the issue (you still compile some things, and MADSPIN can also run on the submit node), but it helps.

kdlong commented 4 years ago

Thanks a lot @khurtado! What about @AndreasAlbert's question about submitting the master job itself to condor: is it expected to work? Would a job running on a condor machine be able to submit jobs to the same cluster?

khurtado commented 4 years ago

@kdlong: That approach is not expected to work in the Global Pool. In a regular condor pool, workers often act as schedds for the local pool too, so it can work if such a configuration is deployed. In the Global Pool, workers are glideins submitted through a factory and do not act as schedds for the Global Pool, so you can run work, but you can't schedule more work.
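A quick way to see this on a given pool with standard HTCondor tooling (generic illustration):

condor_status -schedd   # list the machines advertising a schedd, i.e. the only places from which jobs can be submitted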

AndreasAlbert commented 4 years ago

@khurtado thanks for clarifying. The problem you describe, i.e. 'high load on the login node' is the one I was concerned about.

so to recap:

  1. Separating CODEGEN does not work with MadSpin.

  2. Running the master job + CODEGEN on the login node uses too many resources.

  3. Submitting the master job to condor is not possible, because a worker node cannot submit further jobs.

Unless we can go around one of these points, I don't see how to make it work.

kdlong commented 4 years ago

"Separating CODEGEN does not work with MadSpin." --> Is this a fundamental blocker? Sorry, I didn't follow closely.

Option 3 might work at other sites, like Fermilab; option 2 might as well.

kdlong commented 4 years ago

This thread makes it seem like it's not supported by the authors:

https://answers.launchpad.net/mg5amcnlo/+question/288618

We can ask if they have any intention to support it in the future.