cms-sw / genproductions

Generator fragments for MC production
https://twiki.cern.ch/twiki/bin/view/CMS/GitRepositoryForGenProduction

Validate Zg patch from 2.5.6 on top of 2.5.5 #1219

Closed. kdlong closed this issue 7 years ago.

kdlong commented 7 years ago

This depends on #1218, but can be worked on simultaneously. See initial discussion in [1]. Outline of steps necessary:

1) Start from the mg25x branch [2]. Currently this is version 2.5.4; #1218 will move it to 2.5.5, and this patch should then be added on top of it.

2) The additional patches in [3] should be added to the "MadGraph5_aMC@NLO/patches" folder, following the other examples [4].

3) Test that the patches are applied correctly. If you look at the generation script [5], you'll see that the tarball is downloaded and the patches are applied in one of the first steps. I'd put an exit command after this, then open up the files you tried to patch and look around to make sure things are working [6]. There will also be an output message saying whether each patch succeeded or not (see the sketch after this list).

4) Generate the new gridpack in the usual way. Consult with the Generators Validation group about generating events for validation and how to modify the usual workflow to test the specific issues of this request.
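
A minimal pre-check that the new patches apply cleanly, before running the full script (the untarred MG directory name and the -p level are assumptions, not taken from the script):

    # run from bin/MadGraph5_aMCatNLO after untarring the MG5_aMC tarball by hand
    cd MG5_aMC_v2_5_5                      # hypothetical directory name
    for p in ../patches/*.patch; do
        patch -p1 --dry-run < "$p" && echo "OK: $p" || echo "CHECK: $p"
    done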

@vciulli @qliphy

[1] https://github.com/cms-sw/genproductions/issues/1200
[2] https://github.com/cms-sw/genproductions/tree/mg25x
[3] http://bazaar.launchpad.net/~mg5core1/mg5amcnlo/2.5.6/revision/276
[4] https://github.com/cms-sw/genproductions/tree/mg25x/bin/MadGraph5_aMCatNLO/patches
[5] https://github.com/cms-sw/genproductions/blob/mg25x/bin/MadGraph5_aMCatNLO/gridpack_generation.sh#L173
[6] https://github.com/cms-sw/genproductions/blob/mg25x/bin/MadGraph5_aMCatNLO/gridpack_generation.sh#L173

kdlong commented 7 years ago

Note that step 1 is addressed by #1221

perrozzi commented 7 years ago

any news? are you planning to complete [2] any time soon?

PenHsuanWang commented 7 years ago

Hi all

I am working on creating the gridpack for "ZATo2LA01j_5f_NLO_FXFX", starting from branch [2]. I did a simple test with the example card "wplustest_4f_LO" and it works for me.

So I went on to create the gridpack for "ZATo2LA01j_5f_NLO_FXFX", but I ran into some problems. Some files are missing, such as "work/processtmp/SubProcesses/P0_uux_emepa/GF4/results.dat", "work/processtmp/SubProcesses/P0_uux_tamtapa/GF2/results.dat", "work/processtmp/SubProcesses/P0_ddx_tamtapa/GF6/results.dat", and so on (there are more missing files; I only list these three). I put the complete log file here: logfile

In the end, it tells me to attach the log file "work/processtmp/pilotrun_tag_1_debug.log" to report the problem, but the folder processtmp/ does not exist.

I saw that issue #1221 mentioned you tested a simple NLO card without problems, but did not test any more complicated features. Can you have a look at the process "ZATo2LA01j_5f_NLO_FXFX"?

Many thanks, Pen Hsuan

rekkhan commented 7 years ago

Hi all,

I am also working on the same card as Pen Hsuan, ZATo2LA01j_5f_NLO_FXFX. As Pen Hsuan mentioned in his post, we were not able to generate events from the gridpack. We also got the same error report about missing files when we tried to produce the gridpack. I tried 3 times, submitting the gridpack-creation jobs to LSF. On the first try, I was able to create the gridpack with the missing-files issue, but failed to generate events. On the second try, neither the gridpack nor the LSF log was created; I only received an email from LSF telling me that the job had exited. On the last try I managed to produce both the gridpack and the LHE file.

I'm new to this subject, so I'm posting here in the hope that people can suggest what could have caused the problem that Pen Hsuan and I have encountered.

Thank you, Long Hoa.

kdlong commented 7 years ago

Hi @PenHsuanWang, @rekkhan,

This usually means that a small subset of the jobs failed and now the events can't be generated because of it. I usually have better luck on a condor-based cluster than at lxplus. Do you have access to a Tier 2/3 with this setup that you can try? You might also try at Fermilab or with CMSConnect.

If you make a pull request with the changes made which apply the patches, I can also try in parallel.

rekkhan commented 7 years ago

Hi @kdlong,

Thank you for your reply. We tried to submit the jobs to a Tier 2/3 by using

    ./submit_gridpack_generation.sh

My jobs are still running, but Pen Hsuan already has his results, which show the same problem we had when we submitted the jobs to LSF or ran them locally.

I've also checked the log files from my failed jobs. Our problem is that we are missing a file named results.dat in some GF* folders. I searched the logs and found that the sub-jobs that failed are missing a line like this:

[sender] make_file(GF13/results.dat,*,2)

I attached 3 log files to show the details (58621667 is the master job) STDOUT58621667_Master.txt STDOUT58683670_Successful.txt STDOUT58694445_Failed.txt
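
A quick way to list the channels that are missing their results.dat (a rough sketch; the directory layout is taken from the missing files quoted above):

    cd work/processtmp/SubProcesses
    for d in P*/GF*; do
        [ -f "$d/results.dat" ] || echo "missing: $d/results.dat"
    done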

As you said:

"If you make a pull request with the changes made which apply the patches, I can also try in parallel."

Did you mean that we can send you the card so that you can run it in parallel?

PenHsuanWang commented 7 years ago

Hi @kdlong

I tried to generate the gridpack locally (on lxplus) two times, and I tried the condor-based cluster via the script "./submit_condor_gridpack_generation.sh" seven times. All of the jobs failed.

For the case where I run locally, there seems to be no problem until all the jobs finish. The error messages are:

    Error detected in "launch -n pilotrun"
    aMCatNLOError : An error occurred during the collection of results.

The complete log file is here: logfile_runLocal

The case where I submit the task to a T2/T3 is stranger. I created seven different work areas to run them independently. Some of the tasks have problems like:

    INFO: Setting up grids
    Error reading password from BIO
    Error getting password
    ERROR (RUN2): Cannot run openssl to encrypt data:1:
    ERROR: failed to read any data from /usr/bin/batch_krb5_credential!
    Start waiting for update. (more info in debug mode)
    Error reading password from BIO
    Error getting password
    ERROR (RUN2): Cannot run openssl to encrypt data:1:
    ERROR: failed to read any data from /usr/bin/batch_krb5_credential!

And some of the files cannot be read, like:

    WARNING: File /afs/cern.ch/work/p/pwang/GridPack/RunLocal/gen_ZgGridPack_v3_Jul14/genproductions/bin/MadGraph5_aMCatNLO/ZATo2LA01j_5f_NLO_FXFX/ZATo2LA01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P1_uux_emepag/madevent_mintMC is not readable by condor.

I put the log files of the different attempts here: logfile_v11 logfile_v13 logfile_v14

All the attempts were canceled due to "ClusterManagmentError"; is this related to the problems I have?

Thanks, Pen Hsuan

bendavid commented 7 years ago

Hi, I think there might be some misunderstanding here. This branch, as it stands, is NOT ready to produce fixed Zgamma gridpacks yet. There is an additional patch still needed on top, as @kdlong explained in item 2 of the original issue. @PenHsuanWang @rekkhan, have you added this patch to the branch privately? If you have, please urgently make a pull request. If you have not, then I'm afraid you are wasting time trying to produce the gridpacks with the branch as is. Adding this patch to the branch is a rather critical issue at this point, but should be straightforward.

kdlong commented 7 years ago

Step (2) and (3) addressed by https://github.com/cms-sw/genproductions/pull/1247

Start from this branch to test the generation. @PenHsuanWang, what T2/3 are you using when you get the /usr/bin/batch_krb5_credential error? As far as I know the afs permissions issues are lxplus specific. There seems to be a clash between k5reauth and MadGraph, so I would just not use it on any other setup.

qliphy commented 7 years ago

Hi all,

I have done several tests at the LPC: under 255 w/o patches for ZA with the 25X branch, and also under 242 with the master branch.

./submit_condor_gridpack_generation.sh ZATo2LA01j_5f_NLO_FXFX cards/production/13TeV/ZATo2LA01j_5f_NLO_FXFX

(1) 242 is smooth, and the produced gridpack works well for producing events. (2) 255 gridpack generation is fine, but when trying to produce events with it, I get errors as in [a].

I have tried more than 3 times at the LPC for (2) and several times at lxplus with LSF submission, all with similar problems. My current guess is that there is something suspicious in 25X for FxFx (a previous test with dy01234MLM was fine).

FYI: You can access my areas at the LPC for 25X [b] and 242 [c].

The log files for 25X and 242 are also attached; note that the cross sections are different although the cards are the same: 25X.txt 242.txt

[a] aMCatNLOError : An error occurred during the collection of results. Please check the .log files inside the directories which failed: /uscms_data/d3/qliphy/mg25/genproductions/bin/MadGraph5_aMCatNLO/ZATo2LA01j_5f_NLO_FXFX/ZATo2LA01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P0_ddx_emepa/GF4/log.txt

[b] /uscms_data/d3/qliphy/mg25/genproductions/bin/MadGraph5_aMCatNLO/REFS

[c] /uscms_data/d3/qliphy/mg25/g242/bin/MadGraph5_aMCatNLO/

kdlong commented 7 years ago

Hi @qliphy, all, I don't know if it's expected that the event generation simply fails without the 2.5.6 patch or not. In any case, I think the thing to do is to try it with the patch first and then ask the authors if we see issues. I'm trying it now, but it would be good if someone else can also run.

qliphy commented 7 years ago

Hi @kdlong, indeed I tried under 255 in both cases, with and without the patch. Sorry I didn't make it clear in my previous message.

bendavid commented 7 years ago

Hi, Do the "standard" Z and W + jets processes work?

qliphy commented 7 years ago

@bendavid A previous test with 254 for dy01234MLM was fine.

@kdlong FYI: The patch I used was made by myself, but it should be similar to yours.

0008-vgammafxfx.txt

bendavid commented 7 years ago

I mean for Z/W+jets FXFX.

qliphy commented 7 years ago

@bendavid I have not tried Z/W+jets FXFX yet. But tt012J FxFx failed.

PenHsuanWang commented 7 years ago

Hi @kdlong

Actually, I don't know which T2/T3 I submitted to, because the script "submit_condor_gridpack_generation.sh" does not have an argument for setting a site whitelist, and the log file does not mention which site the jobs went to.

Also, I tried to run it locally (on lxplus): there were no errors when I created the gridpack, but when producing events I get the same error that @qliphy has:

    Error detected in "launch -n pilotrun"
    write debug file /afs/cern.ch/work/p/pwang/GridPack/RunLocal/gen_ZgGridPack_Jul17/v4/genproductions/bin/MadGraph5_aMCatNLO/ZATo2LA01j_5f_NLO_FXFX/ZATo2LA01j_5f_NLO_FXFX_gridpack/work/processtmp/pilotrun_tag_1_debug.log
    If you need help with this issue please contact us on https://answers.launchpad.net/mg5amcnlo
    aMCatNLOError : An error occurred during the collection of results.

rekkhan commented 7 years ago

Hi @bendavid

From Kenneth's first post and your reply on this issue (#1219), I assume that we cannot produce a ZGamma gridpack at the moment. What we should try is privately adding the new patches from [1] to:

    /genproductions/bin/MadGraph5_aMCatNLO/patches/

However, I do not know where I can get the *.patch file(s) like the ones in example [2].

[1] http://bazaar.launchpad.net/~mg5core1/mg5amcnlo/2.5.6/revision/276
[2] https://github.com/cms-sw/genproductions/tree/mg25x/bin/MadGraph5_aMCatNLO/patches
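
My best guess at how to produce such a .patch file from the Launchpad revision would be something along these lines (the branch spec and the output file name are guesses on my part):

    # export the diff introduced by revision 276 of the 2.5.6 bzr branch
    bzr branch lp:~mg5core1/mg5amcnlo/2.5.6 mg5amcnlo-2.5.6
    cd mg5amcnlo-2.5.6
    bzr diff -c 276 > zgamma-256.patch
    # then copy zgamma-256.patch into genproductions/bin/MadGraph5_aMCatNLO/patches/
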
kdlong commented 7 years ago

@rekkhan: this was addressed by #1247 and is now part of the master branch.

@PenHsuanWang to submit to a Tier 2 or Tier 3 cluster directly, you need to run from the interactive nodes at that specific Tier 2/3. If you submit from lxplus, the jobs are submitted to the condor cluster associated with lxplus. There are known problems with this at lxplus at the moment, which is where the errors you see come from.
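
For example, from an interactive node at the site you want to use (this is the same invocation qliphy used at the LPC above; the cd path assumes a fresh checkout):

    cd genproductions/bin/MadGraph5_aMCatNLO
    ./submit_condor_gridpack_generation.sh ZATo2LA01j_5f_NLO_FXFX cards/production/13TeV/ZATo2LA01j_5f_NLO_FXFX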

qliphy commented 7 years ago

Hi all,

A bit more info: I just tested dyellell012j FXFX under 25X and ran into the same situation as the earlier tt012j FXFX test. That is, gridpack generation is fine, but event production (the runcmsgrid.sh step) does not work, with an error message like [1]. I have also tested a simpler process, dyellell01j FXFX, and met the same problem.

The official 25X run_card.dat seems to contain more settings than our cards [2]; I am not sure whether this could make a difference. Maybe we should test more...

[1] aMCatNLOError: An error occurred during the collection of results. Please check the .log files inside the directories which failed: /uscms_data/d3/qliphy/mg25/gTT012/bin/MadGraph5_aMCatNLO/dyellell012j_5f_NLO_FXFX/dyellell012j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P0_ddx_epem/GF2/log.txt

[2]

    -1 = dynamical_scale_choice   ! Choose one (or more) of the predefined
                                  ! dynamical choices. Can be a list; scale choices beyond the
                                  ! first are included via reweighting
    1.0 = muR_over_ref            ! ratio of current muR over reference muR
    1.0 = muF_over_ref            ! ratio of current muF over reference muF
    1.0, 2.0, 0.5 = rw_rscale     ! muR factors to be included by reweighting
    1.0, 2.0, 0.5 = rw_fscale     ! muF factors to be included by reweighting

kdlong commented 7 years ago

I think the obvious thing to do is to check whether it works in standalone MadGraph.
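
Something like the following against a plain MG5_aMC 2.5.5 install, outside the gridpack machinery, would do (the model and process lines are an illustrative dy01j-style guess, not the exact CMS cards):

    # write a minimal NLO multi-jet proc card and run it with plain MG5_aMC
    cat > proc_card_test.dat <<'EOF'
    import model loop_sm
    generate p p > e+ e- [QCD] @0
    add process p p > e+ e- j [QCD] @1
    output dy01j_standalone_test
    launch
    EOF
    ./bin/mg5_aMC proc_card_test.dat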

qliphy commented 7 years ago

I just did standalone tests on my laptop under 255 and 242 for dy01j FXFX.
242 works well, but with 255 I got the same error.

I have posted a report to launchpad: https://bugs.launchpad.net/mg5amcnlo/+bug/1706072

qliphy commented 7 years ago

We get a quick answer from the MG authors: https://bugs.launchpad.net/mg5amcnlo/+bug/1706072

It seems some bugs have been found, and this needs to be tested for our case.

@rekkhan @PenHsuanWang Would you please check with the additional patches mentioned in the answer from the authors? Thanks!

kdlong commented 7 years ago

Great, thanks Qiang! I'll add the patch shortly.

kdlong commented 7 years ago

Actually this is the same patch as was added in #1247. Didn't you say that it didn't work for you even with this patch, @qliphy?

qliphy commented 7 years ago

@kdlong You are right, I overlooked it. But I have not done a standalone test with the patches; I will wait for the result and then reply to the authors.

BTW: I just tested that dy012j FXFX works with MG252 via condor submission. So it seems 255 is really buggy.

rekkhan commented 7 years ago

Hi @qliphy

I read the bug report you mentioned in your post. I assume that what I could try at the moment is editing the driver_mintMC.f file, as mentioned in that report [1].

Since I'm trying to produce the gridpack from lxplus, I wonder whether I can do that on lxplus, or whether I have to do it privately on my local machine.

In case I want to produce the gridpack on my computer, can I use the genproductions repository from git, or should I just use ./bin/mg5_aMC _proc_card.dat?

[1] https://bugs.launchpad.net/mg5amcnlo/+bug/1701612

qliphy commented 7 years ago

Hi @rekkhan
Indeed, as pointed out by Kenneth, the patches are the same as the ones we noticed before and they are already added [1], so you can go directly with the 25x branch:

    git clone git@github.com:cms-sw/genproductions.git genproductions -b 25x

Hi All, I have done some standalone tests for FxFx:

(1) Without the patch, dy01j FXFX failed.
(2) With the patch, dy01j FXFX is OK.
(3) With the patch, ZATo2LA01j is still ongoing, but looks OK for the moment.

So it seems there is indeed something buggy in 25X for 255. (Note I changed 255 to 252 in gridpack_generation.sh and then it works again; see the sketch below.)
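
(The 255 to 252 switch was just an edit of the MG tarball version used by the script; the exact string in gridpack_generation.sh is an assumption here:)

    # hypothetical one-liner; the tarball name referenced in the script is assumed
    sed -i 's/MG5_aMC_v2.5.5.tar.gz/MG5_aMC_v2.5.2.tar.gz/' gridpack_generation.sh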

So, as a summary, the problem is still there for FXFX with 255, although we have more information. @rekkhan maybe you can recheck the above with a simple process like dy01j (it takes about 1-2 hrs).

[1] https://github.com/cms-sw/genproductions/pull/1247

qliphy commented 7 years ago

Hi All, the problem is possibly related to a file [1] in the gridpack.

I did some tests as follows: untar the buggy 255 gridpack and replace that file with the one from 252 or 242 [2] (a rough sketch is given after the references). Then everything works! I have added this info on Launchpad for confirmation from the authors [3].

So maybe @rekkhan @PenHsuanWang can have a try with the above point.

[1] process/bin/internal/amcatnlo_run_interface.py

[2] You can either get it from a 242 gridpack, or go to https://cms-project-generators.web.cern.ch/cms-project-generators, download a MG tarball, and take madgraph/interface/amcatnlo_run_interface.py from it.

[3] https://bugs.launchpad.net/mg5amcnlo/+bug/1706072?comments=all
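
Roughly what I mean by the swap (the tarball and gridpack paths are illustrative):

    # untar the buggy 2.5.5 gridpack into a fresh directory
    mkdir test255 && cd test255
    tar xJf /path/to/the_255_gridpack_tarball.tar.xz
    # overwrite the suspect file with the version taken from a 242/252 gridpack
    cp /path/to/a_242_gridpack/process/bin/internal/amcatnlo_run_interface.py \
       process/bin/internal/amcatnlo_run_interface.py
    # then retry event generation from this directory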

qliphy commented 7 years ago

Hi All, we got an answer from the authors [1]; they seem to have found a bug, and the new patch is here [2]. @rekkhan @PenHsuanWang would you please give it a try? Thanks!

[1] https://bugs.launchpad.net/mg5amcnlo/+bug/1706072
[2] https://launchpadlibrarian.net/330558762/gridpack.patch

kdlong commented 7 years ago

Thanks Qiang! Patch added in #1254 (though I only tested that the patch is applied properly, not that it fixes the problem).

rekkhan commented 7 years ago

Hi @qliphy and @kdlong

I'm trying to produce the gridpacks for 2 processes, [1] and [2], on both versions, 2.5.5 and 2.4.2. My jobs were submitted to LSF.

The jobs on v2.4.2 stopped with the report [3] at the end of the log file; I also notice that the sub-processes of these jobs were not executed. The jobs on version 2.5.5 (with the recent update) are still running, and their sub-processes are being executed.

Could you please tell me how to submit my jobs to a Tier 2/3, since I'm afraid we may still have problems with lxplus?

Thank you.

[1] cards/production/13TeV/ZATo2LA01j_5f_NLO_FXFX

[2] cards/production/13TeV/dyellell1j_5f_NLO_FXFX_M10to50

[3]

    moving generated process to working directory
    WARNING: You've chosen not to use the PDF sets recommended for 2017 production!
    If this isn't intentional, and you prefer to use the recommended sets insert the following lines into your process-name_run_card.dat:
    /afs/cern.ch/work/l/lcaophuc/Task_GP24201/genproductions/bin/MadGraph5_aMCatNLO/gridpack_generation.sh: line 477: DEFAULT_PDF_SETS: unbound variable

kdlong commented 7 years ago

@rekkhan this looks like a bug I introduced with some unrelated changes. I'll push a fix shortly.
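
(The error itself is the usual bash "set -u" unbound-variable failure: the script reads a variable on a code path where it was never set. A generic sketch of the kind of guard that avoids it, not necessarily the actual change in the fix:)

    # give the variable an empty default so 'set -u' does not abort the script
    DEFAULT_PDF_SETS=${DEFAULT_PDF_SETS:-}
    if [ -n "$DEFAULT_PDF_SETS" ]; then
        echo "Recommended PDF sets: $DEFAULT_PDF_SETS"
    fi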

kdlong commented 7 years ago

Should be fixed by #1258

rekkhan commented 7 years ago

Hi @kdlong

All the jobs that I submitted to LSF (on both v242 and v255 with the recent patch) have failed: the v242 jobs failed to produce a gridpack, with the warning I posted in my previous post, and the v255 jobs produced a gridpack with the missing-file error that can't be used to generate events.

Pen Hsuan, however, managed to produce a gridpack on v255 locally (lxplus), and he can generate some events from it.

So I think we have some problem on LSF.

PenHsuanWang commented 7 years ago

Hi @kdlong @qliphy

I can create the ZATo2LA01j_5f_NLO_FXFX events successfully with the updated patch. I created 1M events and checked the distributions. I remember that v2.2.2 had the problem of an asymmetric photon eta distribution, so I quickly checked the photon eta distribution [1] and it looks fine now. I also plotted Mll [2] and Mllg [3].

[1] http://pwang.web.cern.ch/pwang/GridPack_creation/ZATo2LA01j_5f_NLO_FXFX/MG_2_5_5_ZATo2LA01j_5f_NLO_FXFX_phoEta.png

[2] http://pwang.web.cern.ch/pwang/GridPack_creation/ZATo2LA01j_5f_NLO_FXFX/MG_2_5_5_ZATo2LA01j_5f_NLO_FXFX_Mll.png

[3] http://pwang.web.cern.ch/pwang/GridPack_creation/ZATo2LA01j_5f_NLO_FXFX/MG_2_5_5_ZATo2LA01j_5f_NLO_FXFX_Mllg.png

qliphy commented 7 years ago

Thanks @PenHsuanWang @rekkhan ! It is great to see the eta distribution.

Note that, as mentioned in [1], we should probably also look at WGamma to make sure:

"In principle, any 1->3 decay could be affected. In practice, the problem should be most severe for the W boson, since the possible interactions it can have with quarks are more restricted than the Z boson, since it's flavour changing. I think it would be better to also re-generate the l+l-+gamma sample, since the problem might also be there."

[1] https://answers.launchpad.net/mg5amcnlo/+question/631090

perrozzi commented 7 years ago

Can this be considered solved, and the GitHub issue closed?

qliphy commented 7 years ago

@perrozzi The gridpack problem is resolved. The next step may be for @rekkhan @PenHsuanWang to follow up on the validation and ask for production.

kdlong commented 7 years ago

Hi all,

I produced the Wgamma gridpack using CMSConnect in 2.5.5. I generated 1000 events from the gridpack without issues but did not check any results from it. You can find it at

/afs/cern.ch/user/k/kelong/public/WAToLNuA01j_5f_NLO_FXFX_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz

Cross section output:


  Summary:
  Process p p > lep nu a [QCD] @0 ; p p > lep nu j a [QCD] @1
  Run at p-p collider (6500.0 + 6500.0 GeV)
  Number of events generated: 1000
  Total cross section: 8.227e+02 +- 1.2e+00 pb

rekkhan commented 7 years ago

Hi all,

I still have problems creating the gridpack. Pen Hsuan managed to create the gridpack locally on lxplus, but I failed.

I run the jobs in the background, but the log file is empty after the jobs finish, so I cannot tell what happened. I re-ran the process one more time and will let you know the result when the job is done.

The following are the steps I followed to execute the jobs on lxplus:

    git clone git@github.com:cms-sw/genproductions.git genproductions
    cd genproductions
    git checkout remotes/origin/mg25x
    cd bin/MadGraph5_aMCatNLO/
    ./gridpack_generation.sh ZATo2LA01j_5f_NLO_FXFX cards/production/13TeV/ZATo2LA01j_5f_NLO_FXFX/ 2nw &> log255.txt &

kdlong commented 7 years ago

@rekkhan The "2nw" argument tells the script which LSF queue you would like to submit to. If you're running locally you should leave it out; it is probably interfering with other optional arguments.

e.g., just run as:

./gridpack_generation.sh ZATo2LA01j_5f_NLO_FXFX cards/production/13TeV/ZATo2LA01j_5f_NLO_FXFX/ &

You don't need to pipe the output to a file because it's saved in gridpack_generation.log anyway.

But since the gridpack has already been successfully produced, I think the best way forward would be to generate some events from a successful gridpack and compare it to the old buggy gridpack.
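
For instance, untar a successful gridpack somewhere and produce a small test sample (the runcmsgrid.sh argument order and the output file name are assumptions here):

    mkdir zg_test && cd zg_test
    tar xJf /path/to/the_successful_gridpack_tarball.tar.xz
    ./runcmsgrid.sh 5000 1234 4    # assumed: nevents, random seed, ncpus
    # compare the resulting LHE file (cmsgrid_final.lhe, name assumed) with the old buggy sample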

rekkhan commented 7 years ago

Hi @kdlong

I ran the process before you told me to exclude the queue option from the command. The task created a gridpack with some errors, and no events can be generated from that gridpack. [1] is the log file from my newest try; [2] is the log from event generation.

I'll try to generate events from the successful grid packs.

[1] ZATo2LA01j_5f_NLO_FXFX.txt
[2] logtest.txt

kdlong commented 7 years ago

Hi @rekkhan,

I'm not sure what's causing this error; it may be lxplus/LSF specific. We'll investigate and get back to you.

For now can you try to generate events from the successful gridpack?

rekkhan commented 7 years ago

Hi @kdlong

Yes, I'm generating events from the grid pack that Pen Hsuan created.

perrozzi commented 7 years ago

Hi @rekkhan, any news?

kdlong commented 7 years ago

Hi all,

I keep getting an error when running the high-pT Wgamma process. I've double-checked that the patches are still applied and that this error is still there. The error, which you can see in the gridpack below, points to a failure in the subprocess P1_gd_emvexau/GF4.

In process/SubProcesses/P1_gd_emvexau/GF4/log.txt, I see

At line 324 of file driver_mintMC.f (unit = 12, file = 'mint_grids')
Fortran runtime error: End of file

driver_mintMC.f is one of the files patched for this bug, but the patch is correctly applied.
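
A quick sanity check, assuming 'mint_grids' is read from the same GF* directories as the log above, is to look for missing or empty copies of it:

    # list GF channels of the failing subprocess with a missing or empty mint_grids
    for d in process/SubProcesses/P1_gd_emvexau/GF*; do
        [ -s "$d/mint_grids" ] || echo "missing or empty: $d/mint_grids"
    done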

You can find the gridpack here, which includes the .f files so the sample could be rerun from scratch if necessary. I'd appreciate any cross checks and suggestions: /afs/cern.ch/user/k/kelong/public/WAToLNuA01j_5f_pta130_NLO_FXFX_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz

rekkhan commented 7 years ago

Hi all,

I'm sorry for my late response.

I've generated events for ZATo2LA01j_5f_NLO_FXFX from the official gridpack (v2.2.2) and the new gridpack created by Pen Hsuan (v2.5.5). I made some comparison plots, which you can find in [1]. I found no remarkable differences between the two versions.

I will try to generate events from the WGamma gridpack that Kenneth produced, to see whether the bug prevents us from generating events or not.

[1] Report_Temp_002.pdf

qliphy commented 7 years ago

@kdlong I have produced a WAToLNuA01j_5f_pta130_NLO_FXFX gridpack with 25X branch, and everything seems fine. You can find the gridpack here:

/eos/cms/store/user/qili/gridpacks/mg255/WAToLNuA01j_5f_pta130_NLO_FXFX_slc6_amd64_gcc481_CMSSW_7_1_28_tarball.tar.xz

@rekkhan Thanks. You can check with the above-mentioned WGamma gridpack. In the meantime, note that MG260 has just been released [1]. It would be interesting to test this new version and check whether all the patches still work.

[1] https://github.com/cms-sw/genproductions/issues/1290