JeffersonLab / halld_recon

Reconstruction for the GlueX Detector
reaction filter with 4 photon causes OOM/crashes #189

Closed T-Britton closed 5 years ago

T-Britton commented 5 years ago

input file example: /osgpool/halld/tbritton/genr8_030571_000_geant4_smeared.hddm

JANA >> --- Configuration Parameters --
JANA >> JANA:RESOURCE_DEFAULT_PATH =
JANA >> NTHREADS = 1
JANA >> PLUGINS = danarest,monitoring_hists,mcthrown_tree,ReactionFilter
JANA >> Reaction1 = 1_14__1_1_14
JANA >> Reaction2 = 1_14__1_1_1_1_14
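For readers unfamiliar with the reaction strings above, here is a minimal sketch that decodes them, assuming the simple `initial__final` form with underscore-separated Geant3 particle IDs (1 = photon, 14 = proton). The helper names and the partial ID table are illustrative only and not part of halld_recon; any decaying-particle syntax the plugin may accept is not handled.

```python
# Partial Geant3 particle-ID table (illustrative subset, not the full list).
GEANT_PID = {1: "gamma", 7: "pi0", 8: "pi+", 9: "pi-", 14: "p", 17: "eta"}

def decode_reaction(spec):
    """Split 'initial__final' and map each underscore-separated ID to a name.

    Assumes the simple form used in this issue; hypothetical helper.
    """
    initial, final = spec.split("__")

    def names(part):
        return [GEANT_PID[int(token)] for token in part.split("_")]

    return names(initial), names(final)

def pretty(spec):
    """Render a reaction string as 'a b -> c d e'."""
    initial, final = decode_reaction(spec)
    return " ".join(initial) + " -> " + " ".join(final)

print(pretty("1_14__1_1_14"))      # Reaction1
print(pretty("1_14__1_1_1_1_14"))  # Reaction2
```

Under these assumptions Reaction1 reads as gamma p -> gamma gamma p and Reaction2 as gamma p -> gamma gamma gamma gamma p, the four-photon channel discussed below.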

jana config file used: /osgpool/halld/tbritton/REQUESTEDMC_CONFIGS/649_jana.config

software used: version_recon-2017_01-ver03_8.xml

which corresponds to:

out/error:
/osgpool/halld/tbritton/REQUESTEDMC_OUTPUT/wmcginle_Deltapluseta_more5_20190806092821am/log/out_wmcginle_Deltapluseta_more5_20190806092821am_30571_0.log
/osgpool/halld/tbritton/REQUESTEDMC_OUTPUT/wmcginle_Deltapluseta_more5_20190806092821am/log/error_wmcginle_Deltapluseta_more5_20190806092821am_30571_0.log

David discovered that removing Reaction2 allows things to run. David reports this file crashing on event 14. These jobs were crashing worker nodes at UChicago by driving them out of memory (OOM).

When I look at a few jobs of this ilk that finished, I do see some instances where the jobs report using upwards of 81 GB of RAM. The tiny successful fraction (~2.5%) used 2 GB. This would seem to indicate something is FUBAR in the data (or its processing) in a large number of cases. The example file has captured such data.

sdobbs commented 5 years ago

It would be good to do this test again with the new fiducial cuts and see if this problem shows up.

I also don't understand the motivation for doing g p -> gggg p; it is maybe not surprising to be overwhelmed with combinatorics in this channel! It would be better to include the intermediate states one is looking for (e.g. pi0, eta) with some looser invariant mass cut.
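To put a rough number on the combinatorics, here is a sketch that counts how many distinct sets of 4 photons can be drawn from a given number of neutral shower candidates. It is plain counting under the assumption that every shower is a photon candidate; it ignores beam-photon and proton candidates, which would multiply the totals further, and it is not how the analysis library itself builds combos.

```python
from math import comb

def four_photon_selections(n_showers):
    """Number of distinct ways to pick 4 photons out of n_showers candidates."""
    return comb(n_showers, 4)

for n in (4, 5, 6, 8, 10, 15, 20):
    print(f"{n:2d} showers -> {four_photon_selections(n):5d} four-photon selections")
```

The count grows roughly as the fourth power of the shower multiplicity: 10 showers give 210 selections and 20 showers give 4845, so a handful of busy events can dominate the memory use.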

markdalton commented 5 years ago

Another reminder that we need to start implementing a default cut of no more than 1 extra photon per event.

mashephe commented 5 years ago

On Aug 8, 2019, at 10:34 AM, dalton notifications@github.com wrote:

> Another reminder that we need to start implementing a default cut of no more than 1 extra photon per event.

+1

Reconstructing the multi-gamma channels without intermediate requirements can be beneficial. I suspect that once the extra photon requirement is enabled, then this problem will go away.

The maximum number of combos in an event (but evidently not the average number of combos in an event) is much greater when one is also looking for the intermediate states. This sometimes leads to tricky backgrounds and to double counting: cross feed of the signal into itself.
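A small sketch of both points, under simplifying assumptions: each set of 4 photons can be split into two photon pairs in 3 ways before any invariant mass cut (for two identical intermediate states such as pi0 pi0), and a cap of 1 extra photon limits an accepted event to at most 5 photon candidates. The function names are hypothetical and this is only counting, not the analysis library's combo builder.

```python
from math import comb

def pairings(photons):
    """All ways to split four distinct photon labels into two unordered pairs."""
    first, *rest = photons
    result = []
    for partner in rest:
        # The two photons not paired with `first` form the second pair.
        other = tuple(p for p in rest if p != partner)
        result.append(((first, partner), other))
    return result

def max_selections_with_cap(n_signal=4, max_extra=1):
    """Largest number of n_signal-photon selections in an event that passes the cap."""
    return comb(n_signal + max_extra, n_signal)

print(pairings((0, 1, 2, 3)))
print(max_selections_with_cap())
```

With the cap, an accepted event has at most 5 four-photon selections, or 15 pair assignments before mass cuts. Without it, the pair assignments are 3 times the already large selection count, which is the sense in which the maximum can be greater even though mass cuts reduce the average.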

aaust commented 5 years ago

Sounds a lot like issue #95 to me, which was fixed a long time ago. Please try a recent version of the analysis library.

T-Britton commented 5 years ago

We have to be very careful with just "updating" a package. This MC is being produced to match a reconstruction/analysis launch. The version set reproduced below is, as far as recon goes, the latest one corresponding to recon 2017_01-ver03. A workaround in this case may be to stop running ReactionFilter alongside danarest (older code). But this is certainly not a permanent solution, as it will always be vulnerable to misuse. I'll work to develop a stricter testing procedure to hopefully catch these cases more often (it will never/can never be 100%). Personally, I am surprised I haven't seen this before, as 4 gamma is not a crazy channel. Nor is it rare.

<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="https://halldweb.jlab.org/dist/version4.xsl"?>

hdds and halld_recon as used for recon-2017_01-ver03, other packages with latest update releases.

aaust commented 5 years ago

We implemented a procedure to run the reconstruction with the appropriate halld_recon and the analysis with another one, didn't we? Since the channel does not work in this configuration, it was never run with this xml over real data.

T-Britton commented 5 years ago

Yes. This does potentially open up other cans of worms if I try to modify how this functions much, and it basically shows the call for recon and analysis being split (I am not trying to start that debate again!). I am removing the offending lines, and the submitter will run them over the danarest output "offline". When I resubmit the jobs and see they don't OOM worker nodes nationwide, I will close this issue.