qliphy opened 5 years ago
ping
Dear @kdlong @lviliani @qliphy I was checking reweighting in 2.6.5, but my comment is a general one, so I add it to this issue. When evaluating systematics without reweighting we use --remove_wgts=all. For the case of reweighting this is not possible: first the scale variations are added, then the parameter reweighting, then the weights from the systematics tool. We can avoid the redundant scale weights by setting "none = systematics_program" in run_card.dat when creating the gridpack. It cannot be done before, as the option will be overwritten.
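In the run card this corresponds to changing the systematics line, roughly like this (a sketch of the 2.6.x run_card.dat syntax; the part after "!" is just a comment):
systematics = systematics_program ! none / systematics [python] / SysCalc
-->
none = systematics_program ! disable the built-in systematics module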
@agrohsje Yes, for the case of reweighting we should remove "--remove_wgts=all", as done in this PR: https://github.com/cms-sw/genproductions/pull/2170
Setting "none = systematics_program" in run_card.dat sounds like a good idea.
BTW: we should really close PR 2170 ASAP.
Hi, let me also add that NLO reweighting currently seems not to work when running this configuration (the one without MadSpin) out of the box using the master branch: https://github.com/cms-sw/genproductions/tree/master/bin/MadGraph5_aMCatNLO/cards/examples/ggh012j_MassEffects_5f_NLO_FXFX
The problem seems to be related to some missing Fortran files when generating events [1]. Adding the Fortran files back does not solve the problem either, because in that case MadGraph tries to regenerate the code for the diagrams at the reweighting step.
I will try to investigate more, but if you have any insights please let me know.
[1] REWEIGHT: Extracting the banner ...
REWEIGHT: process: p p > x0
REWEIGHT: options:
REWEIGHT: Running Reweighting
DEBUG: We are in mode False
DEBUG: change rwgt_dir rwgt
DEBUG: change mode NLO_tree
DEBUG: change output 2.0
DEBUG: change model loop_sm-no_b_mass
DEBUG: change process p p > h [QCD]
DEBUG: change process p p > h j [QCD] --add
DEBUG: change process p p > h j j QED<=1 [QCD] --add
DEBUG: launch --rwgt_name=topMassEffects
REWEIGHT: detected model: HC_NLO_X0_UFO-heft. Loading...
REWEIGHT: generating the square matrix element for reweighting
REWEIGHT: generate p p > x0 --no_warning=duplicate;define pert_QCD = -5 -4 -3 -2 -1 1 2 3 4 5 21;add process p p > x0 pert_QCD --no_warning=duplicate;define pert_QCD = -5 -4 -3 -2 -1 1 2 3 4 5 21;add process p p > x0 j pert_QCD / t t~ --no_warning=duplicate;define pert_QCD = -5 -4 -3 -2 -1 1 2 3 4 5 21;add process p p > x0 j j pert_QCD / t t~ QED<=1 --add --no_warning=duplicate;
DEBUG: Command "reweight cmsgrid -from_cards" interrupted with error:
DEBUG: OSError : [Errno 2] No such file or directory: '/tmp/lviliani/mgbasedir/models/template_files/fortran/printout.f'
DEBUG: Please report this bug on https://bugs.launchpad.net/mg5amcnlo
DEBUG: More information is found in '/tmp/lviliani/process/cmsgrid_tag_2_debug.log'.
DEBUG: Please attach this file to your report.
REWEIGHT: Original cross-section: 90.306061 +- 0.16502868 pb
REWEIGHT: Computed cross-section:
DEBUG: Exception AttributeError: "'ReweightInterface' object has no attribute 'id_to_path'" in <bound method ReweightInterface.__del__ of <madgraph.interface.reweight_interface.ReweightInterface object at 0x7f8594d4e690>> ignored
DEBUG: quit
REWEIGHT: gzipping output file: events.lhe
REWEIGHT:
Hi @lviliani , usually this happens when the pilot run didn't succeed. Can you post your gridpack generation log?
You are right indeed, I overlooked the log and didn't realize there was this compilation error:
File "/local-scratch/lviliani/master/genproductions/bin/MadGraph5_aMCatNLO/ggh012j_MassEffects_5f_NLO_FXFX/ggh012j_MassEffects_5f_NLO_FXFX_gridpack/work/MG5_aMC_v2_6_0/madgraph/various/misc.py", line 480, in compile raise MadGraph5Error, error_text MadGraph5Error: A compilation Error occurs when trying to compile /local-scratch/lviliani/master/genproductions/bin/MadGraph5_aMCatNLO/ggh012j_MassEffects_5f_NLO_FXFX/ggh012j_MassEffects_5f_NLO_FXFX_gridpack/work/processtmp/rwgt/rw_me_second/SubProcesses. The compilation fails with the following output message: ar rcs libMadLoop.a MadLoopParamReader.o MadLoopCommons.o P17_uxcx_huxcx/polynomial.o P6_gg_hg/polynomial.o P5_uux_hgg/polynomial.o P2_gg_huux/polynomial.o P11_gu_hu/polynomial.o P8_uux_huux/polynomial.o P14_uc_huc/polynomial.o P12_gux_hux/polynomial.o P10_gg_h/polynomial.o P4_gux_hgux/polynomial.o P0_gg_hgg/polynomial.o P3_gu_hgu/polynomial.o P7_uu_huu/polynomial.o P15_uux_hccx/polynomial.o P13_uux_hg/polynomial.o P16_ucx_hucx/polynomial.o P9_uxux_huxux/polynomial.o P17_uxcx_huxcx/loop_matrix.o P6_gg_hg/loop_matrix.o P5_uux_hgg/loop_matrix.o P2_gg_huux/loop_matrix.o P11_gu_hu/loop_matrix.o P8_uux_huux/loop_matrix.o P14_uc_huc/loop_matrix.o P12_gux_hux/loop_matrix.o P10_gg_h/loop_matrix.o P4_gux_hgux/loop_matrix.o P0_gg_hgg/loop_matrix.o P3_gu_hgu/loop_matrix.o P7_uu_huu/loop_matrix.o P15_uux_hccx/loop_matrix.o P13_uux_hg/loop_matrix.o P16_ucx_hucx/loop_matrix.o P9_uxux_huxux/loop_matrix.o P17_uxcx_huxcx/improve_ps.o P6_gg_hg/improve_ps.o P5_uux_hgg/improve_ps.o P2_gg_huux/improve_ps.o P11_gu_hu/improve_ps.o P8_uux_huux/improve_ps.o P14_uc_huc/improve_ps.o P12_gux_hux/improve_ps.o P10_gg_h/improve_ps.o P4_gux_hgux/improve_ps.o P0_gg_hgg/improve_ps.o P3_gu_hgu/improve_ps.o P7_uu_huu/improve_ps.o P15_uux_hccx/improve_ps.o P13_uux_hg/improve_ps.o P16_ucx_hucx/improve_ps.o P9_uxux_huxux/improve_ps.o P17_uxcx_huxcx/CT_interface.o P6_gg_hg/CT_interface.o P5_uux_hgg/CT_interface.o P2_gg_huux/CT_interface.o P11_gu_hu/CT_interface.o P8_uux_huux/CT_interface.o P14_uc_huc/CT_interface.o P12_gux_hux/CT_interface.o P10_gg_h/CT_interface.o P4_gux_hgux/CT_interface.o P0_gg_hgg/CT_interface.o P3_gu_hgu/CT_interface.o P7_uu_huu/CT_interface.o P15_uux_hccx/CT_interface.o P13_uux_hg/CT_interface.o P16_ucx_hucx/CT_interface.o P9_uxux_huxux/CT_interface.o P17_uxcx_huxcx/loop_num.o P6_gg_hg/loop_num.o P5_uux_hgg/loop_num.o P2_gg_huux/loop_num.o P11_gu_hu/loop_num.o P8_uux_huux/loop_num.o P14_uc_huc/loop_num.o P12_gux_hux/loop_num.o P10_gg_h/loop_num.o P4_gux_hgux/loop_num.o P0_gg_hgg/loop_num.o P3_gu_hgu/loop_num.o P7_uu_huu/loop_num.o P15_uux_hccx/loop_num.o P13_uux_hg/loop_num.o P16_ucx_hucx/loop_num.o P9_uxux_huxux/loop_num.o P17_uxcx_huxcx/helas_calls_ampb_1.o P6_gg_hg/helas_calls_ampb_1.o P5_uux_hgg/helas_calls_ampb_1.o P2_gg_huux/helas_calls_ampb_1.o P11_gu_hu/helas_calls_ampb_1.o P8_uux_huux/helas_calls_ampb_1.o P14_uc_huc/helas_calls_ampb_1.o P12_gux_hux/helas_calls_ampb_1.o P10_gg_h/helas_calls_ampb_1.o P4_gux_hgux/helas_calls_ampb_1.o P0_gg_hgg/helas_calls_ampb_1.o P3_gu_hgu/helas_calls_ampb_1.o P7_uu_huu/helas_calls_ampb_1.o P15_uux_hccx/helas_calls_ampb_1.o P13_uux_hg/helas_calls_ampb_1.o P16_ucx_hucx/helas_calls_ampb_1.o P9_uxux_huxux/helas_calls_ampb_1.o P17_uxcx_huxcx/mp_compute_loop_coefs.o P6_gg_hg/mp_compute_loop_coefs.o P5_uux_hgg/mp_compute_loop_coefs.o P2_gg_huux/mp_compute_loop_coefs.o P11_gu_hu/mp_compute_loop_coefs.o P8_uux_huux/mp_compute_loop_coefs.o P14_uc_huc/mp_compute_loop_coefs.o P12_gux_hux/mp_compute_loop_coefs.o 
P10_gg_h/mp_compute_loop_coefs.o P4_gux_hgux/mp_compute_loop_coefs.o P0_gg_hgg/mp_compute_loop_coefs.o P3_gu_hgu/mp_compute_loop_coefs.o P7_uu_huu/mp_compute_loop_coefs.o P15_uux_hccx/mp_compute_loop_coefs.o P13_uux_hg/mp_compute_loop_coefs.o P16_ucx_hucx/mp_compute_loop_coefs.o P9_uxux_huxux/mp_compute_loop_coefs.o P17_uxcx_huxcx/mp_helas_calls_ampb_1.o P6_gg_hg/mp_helas_calls_ampb_1.o P5_uux_hgg/mp_helas_calls_ampb_1.o P2_gg_huux/mp_helas_calls_ampb_1.o P11_gu_hu/mp_helas_calls_ampb_1.o P8_uux_huux/mp_helas_calls_ampb_1.o P14_uc_huc/mp_helas_calls_ampb_1.o P12_gux_hux/mp_helas_calls_ampb_1.o P10_gg_h/mp_helas_calls_ampb_1.o P4_gux_hgux/mp_helas_calls_ampb_1.o P0_gg_hgg/mp_helas_calls_ampb_1.o P3_gu_hgu/mp_helas_calls_ampb_1.o P7_uu_huu/mp_helas_calls_ampb_1.o P15_uux_hccx/mp_helas_calls_ampb_1.o P13_uux_hg/mp_helas_calls_ampb_1.o P16_ucx_hucx/mp_helas_calls_ampb_1.o P9_uxux_huxux/mp_helas_calls_ampb_1.o P17_uxcx_huxcx/coef_construction_1.o P6_gg_hg/coef_construction_1.o P5_uux_hgg/coef_construction_1.o P2_gg_huux/coef_construction_1.o P11_gu_hu/coef_construction_1.o P8_uux_huux/coef_construction_1.o P14_uc_huc/coef_construction_1.o P12_gux_hux/coef_construction_1.o P10_gg_h/coef_construction_1.o P4_gux_hgux/coef_construction_1.o P0_gg_hgg/coef_construction_1.o P3_gu_hgu/coef_construction_1.o P7_uu_huu/coef_construction_1.o P15_uux_hccx/coef_construction_1.o P13_uux_hg/coef_construction_1.o P16_ucx_hucx/coef_construction_1.o P9_uxux_huxux/coef_construction_1.o P17_uxcx_huxcx/loop_CT_calls_1.o P6_gg_hg/loop_CT_calls_1.o P5_uux_hgg/loop_CT_calls_1.o P2_gg_huux/loop_CT_calls_1.o P11_gu_hu/loop_CT_calls_1.o P8_uux_huux/loop_CT_calls_1.o P14_uc_huc/loop_CT_calls_1.o P12_gux_hux/loop_CT_calls_1.o P10_gg_h/loop_CT_calls_1.o P4_gux_hgux/loop_CT_calls_1.o P0_gg_hgg/loop_CT_calls_1.o P3_gu_hgu/loop_CT_calls_1.o P7_uu_huu/loop_CT_calls_1.o P15_uux_hccx/loop_CT_calls_1.o P13_uux_hg/loop_CT_calls_1.o P16_ucx_hucx/loop_CT_calls_1.o P9_uxux_huxux/loop_CT_calls_1.o P17_uxcx_huxcx/mp_coef_construction_1.o P6_gg_hg/mp_coef_construction_1.o P5_uux_hgg/mp_coef_construction_1.o P2_gg_huux/mp_coef_construction_1.o P11_gu_hu/mp_coef_construction_1.o P8_uux_huux/mp_coef_construction_1.o P14_uc_huc/mp_coef_construction_1.o P12_gux_hux/mp_coef_construction_1.o P10_gg_h/mp_coef_construction_1.o P4_gux_hgux/mp_coef_construction_1.o P0_gg_hgg/mp_coef_construction_1.o P3_gu_hgu/mp_coef_construction_1.o P7_uu_huu/mp_coef_construction_1.o P15_uux_hccx/mp_coef_construction_1.o P13_uux_hg/mp_coef_construction_1.o P16_ucx_hucx/mp_coef_construction_1.o P9_uxux_huxux/mp_coef_construction_1.o P17_uxcx_huxcx/TIR_interface.o P6_gg_hg/TIR_interface.o P5_uux_hgg/TIR_interface.o P2_gg_huux/TIR_interface.o P11_gu_hu/TIR_interface.o P8_uux_huux/TIR_interface.o P14_uc_huc/TIR_interface.o P12_gux_hux/TIR_interface.o P10_gg_h/TIR_interface.o P4_gux_hgux/TIR_interface.o P0_gg_hgg/TIR_interface.o P3_gu_hgu/TIR_interface.o P7_uu_huu/TIR_interface.o P15_uux_hccx/TIR_interface.o P13_uux_hg/TIR_interface.o P16_ucx_hucx/TIR_interface.o P9_uxux_huxux/TIR_interface.o P17_uxcx_huxcx/compute_color_flows.o P6_gg_hg/compute_color_flows.o P5_uux_hgg/compute_color_flows.o P2_gg_huux/compute_color_flows.o P11_gu_hu/compute_color_flows.o P8_uux_huux/compute_color_flows.o P14_uc_huc/compute_color_flows.o P12_gux_hux/compute_color_flows.o P10_gg_h/compute_color_flows.o P4_gux_hgux/compute_color_flows.o P0_gg_hgg/compute_color_flows.o P3_gu_hgu/compute_color_flows.o P7_uu_huu/compute_color_flows.o P15_uux_hccx/compute_color_flows.o 
P13_uux_hg/compute_color_flows.o P16_ucx_hucx/compute_color_flows.o P9_uxux_huxux/compute_color_flows.o make: No rule to make target '../lib/libiregi.a', needed by 'allmatrix2py.so'. Stop. make: Waiting for unfinished jobs.... mv libMadLoop.a ../lib/libMadLoop.a
I wanted to check in 2.6.5, but it doesn't like the --add options in the proc/reweight card. What are they for? 2.6.1 seems to accept them, but I had never seen them before.
This process is different from most other reweight processes I know of because it completely changes the model. The --add syntax was a way to include the additional jet contributions from the new model. Maybe they changed the syntax?
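For reference, the reweight card in question looks roughly like this (reconstructed from the debug log earlier in this thread, so treat it as a sketch rather than the verbatim card):
change rwgt_dir rwgt
change mode NLO_tree
change output 2.0
change model loop_sm-no_b_mass
change process p p > h [QCD]
change process p p > h j [QCD] --add
change process p p > h j j QED<=1 [QCD] --add
launch --rwgt_name=topMassEffects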
Hi @kdlong , but why does it also appear in the process card? https://github.com/cms-sw/genproductions/blob/master/bin/MadGraph5_aMCatNLO/cards/examples/ggh012j_MassEffects_5f_NLO_FXFX/ggh012j_MassEffects_5f_NLO_FXFX_proc_card.dat What is its meaning there?
Oops, I think that's just a mistake. Is that where it's causing the error? It should just be deleted.
Funnily enough, I only get an error about this line in 2.6.5; my 2.6.0 job is still running. I don't know what it does with the additional syntax. Anyway, probably better to fix. :-)
The line was ignored in 2.6.0, so we should fix the cards. Is it worth understanding why iregi is not there in 2.6.0 or should we just try 2.6.1/2.6.5?
I tried 2.6.5 after removing the --add in the proc card but keeping it in the reweight card, and it worked up to a KeyError in the reweighting step:
REWEIGHT: change model loop_sm-no_b_mass
change process p p > h [QCD]
change process p p > h j [QCD]
change process p p > h j j QED<=1 [QCD]
REWEIGHT: Event nb 0 0.027s
Command "reweight pilotrun -from_cards" interrupted with error:
KeyError : ((1, 21), (1, 21, 21, 25))
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
My understanding so far was that these KeyErrors come from, e.g., an inconsistent color configuration between the reweighting step and the process step. Could that be because the loop is resolved in one case and not in the other?
Hi, after some more investigation I realized that the iregi compilation error here: https://github.com/cms-sw/genproductions/issues/2100#issuecomment-509527625 is specific to CMS Connect, and happens both with MG 2.6.0 and MG 2.6.5.
It is due to "set output_dependencies internal", which our gridpack generation script sets when running on CMS Connect and which triggers the local compilation of iregi.
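For context, output_dependencies is a global MG5_aMC option (in input/mg5_configuration.txt, or set interactively); as far as I understand it controls how dependencies such as iregi are provided:
set output_dependencies external   # link the libraries compiled in the central MG5_aMC installation (default)
set output_dependencies internal   # copy the sources into the process directory and compile them there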
@agrohsje tested that it works (well, up to the KeyError) outside CMS Connect.
@khurtado Do you know if there's any workaround for this?
@khurtado From some tests that @covarell is doing with loop-induced processes, it looks like some loop reduction libraries (like Ninja) can't be used on CMS Connect due to the "set output_dependencies internal" setting; see e.g. the message below:
INFO: When using the 'output_dependencies=internal' MG5_aMC option, the (optional) reduction library ninja cannot be employed because it is not distributed with the MG5_aMC code so that it cannot be copied locally.
Do you know why we need this setting in CMS Connect? Is there any workaround?
@khurtado @lviliani did you ever get around this issue? I am also facing it.
Hi @AndreasAlbert, which problem exactly are you facing? Just to let you know, I am currently trying to get the LO/NLO reweighting syntax fixed in general. There are still several problems with the output storage.
@agrohsje I'm referring to the CMS Connect compilation error.
I'm trying to run an NLO gridpack with merging + madspin + reweighting. I tried condor submission at lxplus, FNAL and CMS Connect; CMS Connect seems to work best, but then fails as described above (the others also fail, but in different ways). Is the issue you're describing expected to make gridpack_generation fail?
On lxplus, it fails with this error:
DEBUG: reweight pilotrun -from_cards
DEBUG: For consistency in the FxFx merging, 'jetradius' has been set to 1.0
DEBUG: Command "reweight pilotrun -from_cards" interrupted with error:
DEBUG: InvalidCmd : No events file corresponding to pilotrun run.
I can also provide more details on cards + logs if that's helpful.
Hi @AndreasAlbert , gridpack generation itself worked for me, but it requires a modification to avoid unnecessary systematic weights. The mandatory modifications are in the runcmsgrid script. Do you see that the pilot run worked? Whenever I saw the above error message, the pilot run had indeed failed.
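For concreteness, the change discussed earlier in this thread looks roughly like this (a sketch; the exact invocation in the runcmsgrid script may differ, and <run> is a placeholder for the run name):
# before: strips all pre-existing weights, which also deletes the reweighting weights
echo "systematics <run> --remove_wgts=all" | ./bin/aMCatNLO
# after: only append the systematics weights, keeping the reweighting ones
echo "systematics <run>" | ./bin/aMCatNLO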
As far as I know, the problems due to "set output_dependencies internal" on CMS Connect have not been solved yet. This also caused the pilot run to fail in the tests I did.
@agrohsje presumably the pilot run failed. The log file from lxplus shows this:
(...)
INFO: P1_bxbx_xdxdxbxbx_no_hwpwpza
INFO: Result for test_ME:
INFO: Passed.
INFO: Result for test_MC:
INFO: Passed.
INFO: Result for check_poles:
INFO: Poles successfully cancel for 20 points over 20 (tolerance=1.0e-05)
INFO: Starting run
cluster handling will be done with PLUGIN: CMS_CLUSTER
INFO: Cleaning previous results
INFO: Generating events without running the shower.
INFO: Setting up grids
Start waiting for update. (more info in debug mode)
quit
INFO:
preparing reweighting step
preparing reweighting step
Running MG5 in debug mode
(...)
No errors are shown, but I guess the part where the actual condor jobs are submitted, waited for, and retrieved is missing between "Generating events..." and "quit". I figure that part fails quietly? I'll have to take a closer look to diagnose.
On CMS Connect, I believe the pilot run succeeds and then reweighting fails as described by @lviliani. Given that (in my experience) CMS Connect is overall the most reliable way of generating gridpacks, I think it would be a shame not to support this specific use case. What do you think @agrohsje? Is there any way to work around this @khurtado?
@AndreasAlbert There is no clean solution for this at present. However, depending on how resource-intensive the CODEGEN step is, you can likely run the standard gridpack_generation script (not the CMS Connect one) in condor mode, which will run the CODEGEN step on the submit node and then submit jobs for the INTEGRATE mode. This won't use "set output_dependencies internal", so it should work. It is obviously not the recommended method, as it won't scale well if too many gridpacks execute the code generation stage on the submit node, so you have to be careful with the load, but it is a workaround if there is no other solution.
For the CMS Connect script itself, you can also try commenting out these lines and replacing the "INTEGRATE" word in this line with "ALL". That will set up the environment variables to deal with failed jobs, so it should be a more robust approach than just running the gridpack_generation bash script in condor mode.
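In other words, something like this (assuming the usual gridpack_generation.sh calling convention of process name, card directory, and queue; the placeholders are hypothetical):
./gridpack_generation.sh <process_name> <cards_dir> condor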
@AndreasAlbert Are you available to test, and to modify the TWiki accordingly if the workarounds help? @khurtado What is needed to solve the "set output_dependencies internal" problem in CMS Connect?
@khurtado Thanks for this suggestion, I will try it and report back.
@khurtado using submit_condor instead of submit_cmsconnect works in principle. I do run into issues with the condor jobs, though, where they go into HOLD with the hold message:
Cannot access initial working directory /local-scratch/aalbert/monojet/260/genproductions/bin/MadGraph5_aMCatNLO/DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0/DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0_gridpack/work/processtmp/SubProcesses/P1_gcx_xdxdxcxg_no_hwpwpza: No such file or directory
upon which gridpack generation fails. This error persists after retrying, so it does not seem to be a transient file system issue. Interestingly, the directory P1_gcx_xdxdxcxg_no_hwpwpza does seem to exist:
find DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0 -name P1_gcx_xdxdxcxg_no_hwpwpza
> DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0/DMSimp_monojet_NLO_Axial_GQ0p25_GDM1p0_MY1-1000p0_MXd-1p0_gridpack/work/gridpack/process/SubProcesses/P1_gcx_xdxdxcxg_no_hwpwpza
Since gridpack_generation moves all the folders around while dying, I cannot guarantee that this folder was in the right place pre-mortem, but I would assume so. @khurtado have you encountered (something like) this before?
All the relevant files are available for inspection:
/stash2/user/aalbert/public/share/2019-10-29_reweighting
Is this on lxplus or FNAL? The problem is probably with the permissions to read the file, either to initially transfer it to the worker node or to access it during the job. On lxplus I believe the only mechanism for the nodes or the scheduler to have read access to files is via AFS. I'm not sure about FNAL.
@kdlong This is using the CMS Connect condor system, but without the CMS Connect specific script, as proposed by @khurtado above. So everything is on login.uscms.org. You are right though, I hadn't considered incorrect permissions.
@khurtado I finally got your suggestion to work by copying some of the condor options from the cmsconnect script to the regular condor script. This works, but is not ideal, because now the main job and thus a lot of the compilation load runs on the login node.
My question is this: if the codegen step is the reason we cannot have working reweighting, can we submit what I am now running on the login node as a condor job, without doing codegen separately? I realize that I don't know what the rationale for a separate codegen step was in the first place.
@AndreasAlbert If I understand the question, the logic would be to avoid having the central job, which does codegen plus the subsequent job submission, running on a condor machine. You can't take for granted that the worker machines in the cluster will have rights to submit condor jobs to the other machines in the cluster. When I ran jobs at Wisconsin I always just did the codegen + submission on our login machine. Is the problem with this approach that the jobs take too long or are too computing-intensive and get killed?
@AndreasAlbert @kdlong : I'm out of the office until next Monday (25th), but to quickly reply to the question: processing CODEGEN separately was done in an effort to reduce the load on the login node as much as possible. A couple of gridpacks running is not a problem, but we have had 50+ different variations running at times, which can exhaust resources (harming other, non-gridpack users as well). The login node is supposed to be a submission point to condor only. Making CODEGEN run on a worker doesn't fully resolve the issue (you still compile some things, and MADSPIN can also run on the submit node), but it helps.
Thanks a lot @khurtado ! What about @AndreasAlbert's question about submitting the master job itself to condor: is it expected to work? Would a job running on a condor machine be able to submit jobs to the same cluster?
@kdlong: That approach is not expected to work in the Global Pool. In a regular condor pool, workers often act as schedds for the local pool too, so it can work if such a configuration is deployed. In the Global Pool, workers are glideins submitted through a factory and do not act as schedds for the Global Pool, so you can run work, but you can't schedule more work.
@khurtado Thanks for clarifying. The problem you describe, i.e. high load on the login node, is the one I was concerned about.
So to recap:
1. Separating codegen does not work with madspin.
2. Running the master job + codegen on the login node uses too many resources.
3. Submitting the master job to condor is not possible because a worker node cannot submit further jobs.
Unless we can get around one of these points, I don't see how to make it work.
Separating codegen does not work with madspin. --> Is this a fundamental blocker? Sorry, I didn't follow closely.
Option 3 might work at other sites, like Fermilab. Option 2 might work at other sites as well.
This thread makes it seem like it's not supported by the authors:
https://answers.launchpad.net/mg5amcnlo/+question/288618
We can ask if they have any intention to support it in the future.
@kdlong @AndreasAlbert
https://github.com/cms-sw/genproductions/blob/master/bin/MadGraph5_aMCatNLO/runcmsgrid_NLO.sh#L95-L112
It seems to me that reweighting is performed on the LHE file ($runname), independently of madspin ($runlabel).
We probably need to add the madspin step after reweighting, similarly to what is done in runcmsgrid_LO.sh for LO samples.
Ref: https://hypernews.cern.ch/HyperNews/CMS/get/generators/4243/1/1/1/1.html
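A rough, untested sketch of the reordered flow in runcmsgrid_NLO.sh (assuming the decay_events command is available in the aMCatNLO shell, as it is at LO; $runname/$runlabel as used in the existing script):
# 1. generate the undecayed events ($runname), with MadSpin switched off at this stage
# 2. reweight the undecayed LHE file
echo "reweight $runname -from_cards" | ./bin/aMCatNLO
# 3. only then run MadSpin on the reweighted events to produce the decayed LHE ($runlabel)
echo "decay_events $runname" | ./bin/aMCatNLO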