JeffersonLab / halld_recon

Reconstruction for the GlueX Detector

Problems with photon energies in MC samples #537

Closed sdobbs closed 1 year ago

sdobbs commented 3 years ago

I'm filing this issue here since I'm not quite sure where the recent errors involving tagger photon energies in MC samples made over the past couple of months are coming from, but the possible solutions involve changes to halld_recon, so this seems as good a place as any. This is also a good place to collect information on the problem and have a broader discussion, since I've been going in circles the past few days.

I am seeing two main problems:

The first is that some REST files being generated fail the column/energy consistency check, giving errors like this:

Error in DEventSourceREST - tagger microscope energy lookup mismatch: TAGM column = 42 Etag = 8.52341 lookup finds TAGM column = 41

This seems to be some issue regarding the random trigger files. For example, compare Project #1951 and #1949 - these are the same except that 1951 doesn't use random triggers and succeeds, and 1949 does use random triggers and fails.

I haven't always been able to reproduce these problems, but some random trigger files do seem to have the wrong photon energy associated with the counter - I have no idea how this could have happened, or why some jobs run without problems. Maybe there's a rounding issue, where the photon energy for some counters was assigned near the counter boundary, so the reverse lookup sometimes picks the wrong paddle, though this also seems unlikely.
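To make that boundary hypothesis concrete, here is a minimal standalone sketch (with made-up bin edges, not the actual halld_recon lookup tables) of how an energy assigned exactly at a counter boundary can reverse-look-up to the neighboring counter after a float round-trip:

    #include <cstdio>
    #include <vector>

    // edges[i]..edges[i+1] is the (hypothetical) energy range of column i+1
    int lookup_column(double E, const std::vector<double>& edges)
    {
        for (size_t i = 0; i + 1 < edges.size(); ++i)
            if (E >= edges[i] && E < edges[i + 1])
                return int(i) + 1;
        return 0;  // no column covers this energy
    }

    int main()
    {
        std::vector<double> edges = {8.51, 8.52, 8.53};  // two adjacent columns
        double E_exact  = 8.52;        // assigned exactly on the boundary
        double E_stored = 8.5199999;   // the same energy after a float round-trip
        std::printf("exact:  column %d\n", lookup_column(E_exact, edges));   // 2
        std::printf("stored: column %d\n", lookup_column(E_stored, edges));  // 1
        return 0;
    }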

Anyway, there are a couple of potential fixes for this, assuming the problem is in the data and not some rounding error. Since we have the counter information stored in the random trigger HDDM files, I could easily write a script to reprocess these HDDM files and assign the "correct" beam energy for each event (see the sketch below). Alternatively, we could ignore this data by merging PR #536 and propagating this change as needed. I don't have a strong preference.
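A rough sketch of what such a reprocessing script could look like, assuming the C++ API that the hddm code generator produces for the simulation record (the class and method names follow the usual hddm_s get/set pattern but should be checked against the generated hddm_s.hpp, and LOOKUP_E() is a hypothetical stand-in for the ccdb counter-to-energy lookup):

    #include <fstream>
    #include <HDDM/hddm_s.hpp>

    double LOOKUP_E(int counter);  // hypothetical: TAGH counter -> correct E

    int main(int argc, char** argv)
    {
        std::ifstream fin(argv[1]);
        std::ofstream fout(argv[2]);
        hddm_s::istream in(fin);
        hddm_s::ostream out(fout);
        hddm_s::HDDM record;
        while (in >> record) {
            // overwrite the stored E with the energy implied by the counter
            hddm_s::TaghBeamPhotonList photons = record.getTaghBeamPhotons();
            for (auto it = photons.begin(); it != photons.end(); ++it) {
                hddm_s::TaghChannelList chans = it->getTaghChannels();
                if (chans.size() > 0)
                    it->setE(LOOKUP_E(chans.begin()->getCounter()));
            }
            out << record;
            record.clear();
        }
        return 0;
    }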

The second problem is that for most (or all) of the bggen samples, there seem to be problems with how the beam energy is being set, at least for hits showing up in the hodoscope.

For example, trying to analyze the file /lustre19/expphy/cache/halld/gluex_simulations/REQUESTED_MC/F2018_ver02_18_bggen_batch02_20210630092841pm/hddm/dana_rest_bggen_051363_000.hddm gives the error

Error in DEventSourceREST - tagger hodooscope energy lookup mismatch:
   TAGH column = 229
   Etag = 7.09578
   lookup finds TAGH column = 203

and looking at the data file, it looks like the TAGGEDMCGEN counter is incorrectly set.

    <taghBeamPhoton E="7.09578" Eunit="GeV" jtag="" t="84.6468" tunit="ns">
      <taghChannel counter="203" />
    </taghBeamPhoton>
[...]
    <taghBeamPhoton E="7.09578" Eunit="GeV" jtag="TAGGEDMCGEN" t="3.34066" tunit="ns">
      <taghChannel counter="229" />
    </taghBeamPhoton>

Now if one tries to analyze /cache/halld/gluex_simulations/REQUESTED_MC/2017_bggen_batch01_ver03_28_20210616121407pm/hddm/dana_rest_bggen_030274_000.hddm for the thrown photon, both the reconstructed and TAGGEDMCGEN photons have their counter set to zero, with the resulting error message:

   TAGH column = 0
   Etag = 3.88208
   lookup finds TAGH column = 275

Again, when trying to reproduce these errors I wasn't able to, and instead got the first error I described above.

Any thoughts are appreciated - I can also post more information on some of the checks that I've done on request (or can try new ideas out).

sdobbs commented 3 years ago

Some more progress: the "good" news is that I haven't been able to reproduce these problems on the OSG in small bggen samples. I'm still looking into the first issue.

On the second issue, I'm not sure exactly where this is coming from, but I did find a couple places where a DBeamPhoton was created from DMCReaction data and the counter wasn't then properly set, so I am going to fix that. (Note that this only affects trying to use the TAGGEDMCGEN and MCGEN photons).

One additional issue that is probably making things weird: thrown photons which fall between tagger counters are still saved by hdgeant4 as belonging to the TAGH or TAGM, but with counter == 0. I'll leave the question of whether this behavior is correct to @rjones30, but I think the correct way to handle this in halld_recon is to set the system of such hits to SYS_NULL, which seems consistent with previous behavior (see the sketch below). So I'll go back and revise my outstanding PR on this topic.
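A minimal sketch of that handling (assuming the DetectorSystem_t enum from GlueX.h; the surrounding DBeamPhoton-filling code is not shown):

    #include "GlueX.h"  // DetectorSystem_t, SYS_NULL, SYS_TAGH, SYS_TAGM

    // A thrown photon that fell between physical counters arrives with
    // counter == 0 and is attributed to no detector system at all.
    static DetectorSystem_t TagSystemForHit(int counter, bool in_microscope)
    {
        if (counter == 0)
            return SYS_NULL;  // between-counter photon: no real counter fired
        return in_microscope ? SYS_TAGM : SYS_TAGH;
    }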

rjones30 commented 3 years ago

Sean and all,

"One additional issue that is probably making things weird is thrown photons which fall between tagger counters are still saved by hdgeant4 as belonging to the TAGH or TAGM, but with counter == 0."

This is not intended behavior. I will look into it and submit a PR for it.

-Richard Jones


sdobbs commented 3 years ago

I updated PR #536 with some fixes that should resolve some of these crashes, and fixed at least one case where some variables were uninitialized. Hopefully this fixes the TAGGEDMCGEN issue going forward; for a couple of the existing problem bggen batches, the data can be analyzed without a problem as long as the user is not looking at this information.

Now back to the original "off by one" error. This only shows up for some runs - for example, in run 50697 the photon energies are incorrect. I tried reanalyzing some data and got the correct energy. So that's weird - my guess is that the wrong electron beam energy was used, but I haven't found any log files or other evidence of how this could have happened for this particular run. So I'm going to go back and try to find and fix any of these "bad" random trigger files. But if PR #536 is merged, then the energy information will be ignored anyway.

sdobbs commented 3 years ago

As of yesterday, I've gone through and fixed/updated the random trigger files that I could identify as having bad tagger photon energies; they've been updated on cache and propagated so that they are available from the OSG. A list of runs is given below. I checked a run on the command line and on the OSG and this seems to work okay, so I'll make an announcement and see if we can get some wider use.

In some of my local tests (not OSG), I ran into the TAGGEDMCGEN error, so I'll work on updating the recon branches with this fix, then will close this issue.

2017-01 (recon-2017_01-ver03.2) - 30274, 30409, 30441, 30598, 30667, 30844, 31018

2018-01 (recon-2018_01-ver02.2) - 41217, 41249, 41250, 41265, 41273, 41467, 41472, 41475, 41491, 41876, 41887, 42032, 42043, 42056, 42173, 42402, 42428, 42439

2018-08 (recon-2018_08-ver02.2) - 50685, 50697, 50698, 50699, 50700, 50701, 50702, 50703, 50704, 50705, 50706, 50707, 50709, 50710, 50798, 50812, 50940, 51214, 51336, 51503, 51512, 51539, 51768

sdobbs commented 2 years ago

The new trigger files should all be there - sorry for not closing this out earlier.

s6pepaul commented 1 year ago

So it seems like this issue returned after recent changes to the ccdb (https://halldweb.jlab.org/cgi-bin/ccdb/show_request?request=/PHOTON_BEAM/microscope/scaled_energy_range:30000:default:2023-01-28_05-41-22). Trying to generate some MC for run 30274 using the environment file /group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_34.xml and the analysis environment file /group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver46.xml fails with a tagger hodooscope energy lookup mismatch followed by a segfault. The error disappears when not including random background, or when setting the calibtime to 2023-01-25 (i.e. before the changes to the ccdb).

Files to reproduce this are in /volatile/halld/home/ppauli/test_jamie/. Copy the MC.config to your own location and change the DATA_OUTPUT_BASE_DIR. Set up your environment to use recon-2017_01-ver03_34.xml (e.g. gxenv /group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_34.xml) and then run '$MCWRAPPER_CENTRAL/gluex_MC.py MC.config 30274 100'. This should do the trick.

rjones30 commented 1 year ago

Aha, so it is related to reading stale TAGM information from the real data (random triggers). Peter, help me track this sequence:

  1. random triggers from EVIO input get converted into HDDM format, where the tagger energy is assigned from a ccdb lookup on the column number from the translation table (both E and the column number are recorded in the HDDM file)
  2. this gets read into hd_root during reconstruction and saved in the REST output as a tagmBeamPhoton, where both E and the column number are recorded
  3. so far there is no detection of any problems
  4. then during the analysis phase, hd_ana reads in the REST data, and the DEventSourceREST code compares the tagmBeamPhoton E value with what it gets by looking up the tagmBeamPhoton column number in ccdb
  5. this comparison fails and the analysis bombs

There is full protection against this sort of cross-threading between variation=mc and variation=default EXCEPT for the case of random triggers. This is a design problem, not just a ccdb inconsistency. There is no guarantee that the different variations (mc vs default) must have exactly the same energy bounds for the tagger counters. But here in the hits merging it is assumed to be the case.
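For concreteness, a minimal sketch of the comparison in steps 4 and 5, with TaghColumnToEnergy() as a hypothetical stand-in for the ccdb lookup that DEventSourceREST performs:

    #include <cmath>
    #include <stdexcept>

    // hypothetical ccdb lookup, evaluated in the currently active variation
    double TaghColumnToEnergy(int column);

    void CheckRestTagEnergy(int column, double E_rest, double tolerance = 1e-3)
    {
        double E_ccdb = TaghColumnToEnergy(column);
        if (std::fabs(E_rest - E_ccdb) > tolerance) {
            // with variation=mc hits merged against variation=default randoms,
            // this can fire even though each value was "correct" when written
            throw std::runtime_error("tagger energy lookup mismatch");
        }
    }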

Here is my recommendation to fix this problem. Others (Alexander, Justin, Sean) should weigh in as well.

  • In DEventSourceHDDM, change the reader of hit tags (not TRUTH tags) to ignore the E value in the hddm file, and do a ccdb lookup on the column number to fill in the tagger E of the in-memory DBeamPhoton object.
  • In the present context, the mc variation will always be used to assign the tagger energy, and only the column number is guaranteed to be consistent as the tagger hit descends through reconstruction and analysis.
  • No inconsistencies will be present in memory between column number and energy at any point in smearing, reconstruction, or analysis.
  • The current state of the ccdb in whatever variation is active will govern the energy assignment, without historical dependency.
  • I would also recommend randomizing the tagger energy within a given column in the DEventSourceHDDM reader, to suppress the annoying picket-fence effect one gets by always assigning the center of the tagger bin (sketched below). This never throws away information, because the column number is always the ultimate determiner of the tagged photon energy, not the current E value in the tagger object.
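A minimal sketch of that last point, assuming a hypothetical ColumnEnergyRange() helper that returns the ccdb energy bounds of a tagger column in the active variation:

    #include <random>

    // hypothetical helper: ccdb energy bounds of a tagger column
    // in the active variation (not an existing halld_recon call)
    void ColumnEnergyRange(int column, double& Elow, double& Ehigh);

    double AssignTagEnergy(int column, std::mt19937& rng)
    {
        double Elow, Ehigh;
        ColumnEnergyRange(column, Elow, Ehigh);  // column is the source of truth
        std::uniform_real_distribution<double> uniform(Elow, Ehigh);
        return uniform(rng);  // flat within the bin: no picket-fence spikes
    }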

-richard jones


rjones30 commented 1 year ago

PS. It might be better to locate the overwriting of the tagger E value (based on ccdb column lookup) in mcsmear rather than in the DEventSourceHDDM reader. That fixes the problem at the source (when the random hits are first being merged into the MC hits) and avoids any confusion later in the analysis where repeated runs might give different results because of randomization of the tagged energy within a tagger bin. Conceptually, mcsmear is the place where this belongs.

-rtj


s6pepaul commented 1 year ago

Hi Richard, I think the sequence that you laid out more or less captures the situation. For most of this, though, @sdobbs is the best person to answer your questions. He generated the random triggers and is more familiar with the whole procedure than I am.

"There is no guarantee that the different variations (mc vs default) must have exactly the same energy bounds for the tagger counters." - I understand that this is not guaranteed in the code, but should this not be the case? If our MC are supposed to capture the conditions on a per run basis. Why do we allow for these inconsistencies?

"I would also recommend randomizing the tagger energy within a given column in the DEventSourceHDDM reader, to suppress the annoying picket-fence effect one gets by always assigning the center of the tagger bin. This never throws away information because the column number is always the ultimate determiner of the tagged photon energy, not the current E value in the tagger object." - Do you mean randomise the tagger energy of the random trigger hit or for every hit on the tagger? I actually think to have the picket fence is good. It reminds us that we have only a sampling tagger below a certain energy and don't actually measure the continuum. To assign the center of the tagger is as good as we can reasonably be.

rjones30 commented 1 year ago

"I understand that this is not guaranteed in the code, but should it not be the case? If our MC is supposed to capture the conditions on a per-run basis, why do we allow these inconsistencies?"

We deliberately don't do that, so that a single MC file can stand in for a run period over which the real-data ccdb tables (the beam energy, for example) can vary from one minute to the next.

-Richard Jones


sdobbs commented 1 year ago

We should already be letting the tagger counter determine the energy that is assigned to any tagger hit when it's read in. I'll look into it this afternoon.

rjones30 commented 1 year ago

Sean,

Yes, but in the processing of the random triggers into HDDM, are you using the mc variation or the default? If my diagnosis is correct, you were using the default, or mc was falling through to default because endpoint_energy or endpoint_calib were not assigned in mc.

-rtj


s6pepaul commented 1 year ago

Many users are reporting this issue to me when they try to produce MC. Is there any recent work on this?

rjones30 commented 1 year ago

Peter,

I submitted a PR for this (branch name rest_small_etag_shifts_tolerance_rtj) and it sat there for a week without action or comment, so I canceled the PR. If it is not a problem, there is no need to fix it, but the branch is there if you want to take a look.

-Richard Jones


s6pepaul commented 1 year ago

Right - if I am not mistaken this would fix things going forward, but most people running simulations to do acceptance corrections use reconstruction and analysis software that is a bit older. Would we patch this into old versions?

rjones30 commented 1 year ago

Peter,

In that case, don't they also simulate with old sqlite versions of the ccdb database, so that the database is synced to the release of the real data they want to compare with? If that is the case, then my recently submitted PR would have fixed the problem by overwriting the (incorrect) tag energies from the merged randoms with the values from the (user-selected) ccdb tagger energy tables, so that the reconstruction and analysis sees what it expects to see. That avoids needing to patch the reconstruction or analysis.

That PR sat pending on halld_sim for too long, so I canceled it.

-Richard Jones


s6pepaul commented 1 year ago

My point was that your pull request might fix this, but only if people use a hd_root version that includes that fix. People reporting this issue do not. They use hd_root compiled with the same software that was used months or even years ago. How could we help them?

rjones30 commented 1 year ago

Peter,

No, it has nothing to do with hd_root. The PR was to halld_sim, right? How does that couple to hd_root?

-Richard J.


s6pepaul commented 1 year ago

I thought you were talking about https://github.com/JeffersonLab/halld_recon/pull/718?

rjones30 commented 1 year ago

Peter,

The PR was https://github.com/JeffersonLab/halld_sim/pull/275

-Richard Jones


s6pepaul commented 1 year ago

Oh ok, I see. For some reason I thought this pull request was already included... Let me run a test and pull it in.

s6pepaul commented 1 year ago

Should now be resolved by PR https://github.com/JeffersonLab/halld_sim/pull/275