JeffersonLab / halld_sim

Simulation for the GlueX Experiment in Hall D

Random trigger events missing from the skim for 73070 #342

Open nsjarvis opened 1 month ago

nsjarvis commented 1 month ago

The random trigger file at /cache/halld/gluex_simulations/random_triggers/recon-2019_11-ver01/run073070_random.hddm contains only 1882 events. Peter confirmed that that's what the MCWrapper database indicates for this run.

Even from the monitoring histogram, which only sees a fraction of the events, one can see that there should be at least 100x more than this.

According to the monitoring plots made along with the REST production, there should be 670692 random triggers for that run.

It's possible that the random trigger files could be too short for other runs as well. I ran demon over the REST production files to count the randoms in the monitoring histogram. The plots are on this page and the CSV data file is here and also attached: monitoring_data_2019-11_verREST1.csv

nsjarvis commented 1 month ago

Similar csv files for the other run period REST files are linked below

There's a script to extract a column here; use it like this: python extract_col.py monitoring_data_2019-11_verREST1.csv Triggers/random_trig > randoms.csv
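
For reference, a minimal sketch of what such a column-extraction script might look like (an assumption, not the actual extract_col.py; it presumes the CSV has a header row naming each column, as the Triggers/random_trig argument suggests):

```python
import csv
import sys

def extract_column(path, column):
    """Print the named column of a CSV file, one value per line."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            print(row[column])

if __name__ == "__main__":
    # e.g. python extract_col.py monitoring_data_2019-11_verREST1.csv Triggers/random_trig
    extract_column(sys.argv[1], sys.argv[2])
```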

aaust commented 1 month ago

Something must have been wrong during the production or copying of this file. The converted random trigger file produced during the REST production is over 400x larger:

cat /mss/halld/RunPeriod-2019-11/recon/ver01/converted_random/merged/converted_random_073070.hddm | grep size
size=3349506500
cat /mss/halld/gluex_simulations/random_triggers/recon-2019_11-ver01/run073070_random.hddm  | grep size
size=7866066
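
The /mss paths are tape-library stub files whose text content includes a size= line, which is why cat piped into grep works above. A small helper to pull out those sizes for comparison (a sketch; it assumes only that the stub contains a size=N line as shown in the output above):

```python
import re

def stub_size(stub_text):
    """Extract the size in bytes from the text of a /mss tape stub file."""
    m = re.search(r"^size=(\d+)$", stub_text, re.MULTILINE)
    if m is None:
        raise ValueError("no size= line found in stub")
    return int(m.group(1))

# For run 73070: 3349506500 / 7866066 is roughly 426, i.e. the skim
# file is a few hundred times smaller than the merged REST file.
```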
nsjarvis commented 1 month ago

I did some digging into MCW's database, thanks to Thomas for hints. There are many runs in that run period with very few random trigger events at all. They're mostly clustered in run numbers that are close together, plus a few now and then later on. Comparing these with the number of events in the trigger count histogram from the REST monitoring files, there are far fewer than there should be.
e.g. (run, randoms in file):

73070 1882
73071 130233
73072 4206
73073 752
73074 752
73075 2659
73076 258118
73077 94242
73078 752
73079 2518
73081 164337
73082 3824
73083 2202
73085 360251
73086 752
73087 752
73088 144075
73089 20431
73090 16623
73092 752
73093 1652
73094 2781
73095 752
73096 752
73099 1734
73102 147
73104 14119
73108 752
73112 2317
73115 752
73116 752
73117 752
73118 366036
73119 3447
73120 752
73121 593
73122 752
73123 58694

nsjarvis commented 1 month ago

Here are counts from the trigger monitoring histogram (filled all the time) and then the random trigger counts (saved only when the beam was on) for a selection of runs. Ignore the *.

73125 135744 752
73126 460372 8574
73127 374715 752
73129 338699 752
73130 606070 3038
73131 217774 752
73132 677615 2051
73143 661752 444
73144 608251 2291
73145 726357 752
73146 469966 752
73147 683583 2449
73148 669812 2202
73149 351891 752
73150 258923 752
73151 698204 752
73160 105814 752 *

nsjarvis commented 1 month ago

I took a look at the earlier run periods to see what is normal. This is for a chunk of fall 2018. The columns are: run, histogram_counts, beam_frac, histogram_counts x beam_frac, random_triggers_in_file, where beam_frac = beam_current/beam_on_current.

41216 357256 0.8 293947 288691 
41217 1154673 0.8 908378 882142 
41218 124420 0.6 74900 72868 
41220 1141974 0.7 829753 807054 
41221 819415 0.0 22980 18117   
41247 724698 0.6 443133 427250 
41250 1102418 0.8 870503 851370 
41251 139910 0.9 128049 127382 
41252 848648 0.9 749357 730721 
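
The expected-count arithmetic behind the fourth column above can be written out explicitly (a sketch; beam_frac is beam_current/beam_on_current as defined in the comment):

```python
def expected_randoms(histogram_counts, beam_current, beam_on_current):
    """Randoms expected in the file: the monitoring-histogram count scaled
    by the fraction of the run for which the beam was at nominal current."""
    beam_frac = beam_current / beam_on_current
    return histogram_counts * beam_frac

# Run 41216: 357256 histogram counts at beam_frac ~0.82 gives ~293k
# expected, close to the 288691 randoms actually in the file.
```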
nsjarvis commented 1 month ago

I've attached a file for each run period showing only the runs where the number of random triggers in the file is less than 0.7 x the expected number. The expected number is the number in the monitoring histogram x beam_current / beam_on_current. If beam_on_current is not in rcdb, it assumes that it is 1.5 x beam_current.

The columns in the file are:

run
monitoring histogram random count
beam_current/beam_on_current
randoms expected (= monitoring histo count x ratio above)
randoms found in MCW's database of random trigger file event counts
number of events in the run / 1e6
randoms found / randoms expected

A '*' is appended if found/expected is less than 0.5 and there are more than 100M events.

spring17.txt spring18.txt fall18.txt spring20.txt
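
The selection described above can be sketched as follows (a hypothetical helper, not the actual script: the 1.5 x beam_current fallback when beam_on_current is missing from rcdb, the 0.7 cutoff, and the '*' marker all follow the description in this comment):

```python
def flag_run(run, histo_count, randoms_found, run_events,
             beam_current, beam_on_current=None):
    """Return a report line if the run has fewer random triggers than
    0.7 x expected, else None. Columns follow the files attached above."""
    if beam_on_current is None:        # not in rcdb: assume 1.5 x beam_current
        beam_on_current = 1.5 * beam_current
    ratio = beam_current / beam_on_current
    expected = histo_count * ratio
    if randoms_found >= 0.7 * expected:
        return None
    # '*' marks badly deficient runs that are also large (> 100M events)
    mark = " *" if randoms_found < 0.5 * expected and run_events > 100e6 else ""
    return (f"{run} {histo_count} {ratio:.2f} {expected:.0f} "
            f"{randoms_found} {run_events/1e6:.0f} "
            f"{randoms_found/expected:.2f}{mark}")
```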

There are a few runs in the earlier run periods which have far fewer randoms than expected, and many from spring 2020. Also, quite a few of the spring 20 runs have beam_current > beam_on_current, which seems odd/wrong.

If anyone else wants to have a play with this, the files that I used are in /work/halld/njarvis/randoms

The runs with missing beam_on_current are listed in this file:

missingbeamon.txt

sdobbs commented 1 month ago

Thanks a lot for looking into the details of this, @nsjarvis !

I looked into the 2019-11 runs, and indeed in many of these there is an issue with the beam fiducial table. I'll look more closely at this, figure out why it's happening, and put in some better checks to catch these problems earlier. We'll probably want to recreate the maps for these runs. As a reminder, the 2019-11 run period was the first one to implement this new beam fiducial method, so it looks like a few runs slipped past our QA as we worked out the bugs of this new technique.

For the earlier run periods, I will fix the runs with bad RCDB settings, or other obvious issues. We should discuss what to do with the other runs - my proposal would be to set a limit (e.g. 10k) below which we should try to recover more random triggers. If we expect a run to have 500k random triggers when the beam is on but only have 250k to fold into simulations, clearly we could improve the situation, but it doesn't seem worth the non-trivial effort, IMO.

For later run periods, we will want to double check things, but I think that the changes in our other procedures (i.e. properly calculating most of the info during REST production) lead to more consistent results, compared to going back and recalculating these values.

sdobbs commented 1 month ago

One quick update - I was originally able to reproduce this issue of bad fiducial tables, but after adding and then removing some debugging statements, the problem no longer shows up with my build. So perhaps we are victims of an over-optimizing compiler. I'll keep checking this and work on reproducing the random files.

sdobbs commented 1 month ago

I've fixed the RCDB entries for 2017-01.

nsjarvis commented 1 month ago

Great, thanks.

sdobbs commented 1 month ago

@nsjarvis - I think that for some of the run periods (earlier than 2019-11), you were looking at the wrong set of random files.

For example, 41488 is listed as having zero events, but if I look at all of the entries for run 41488:

MariaDB [gluex_mc]> select Run_Number,Tag,Path,Num_Events from Randoms where Run_Number=41488;;
+------------+-----------------------+----------------------------------------------------------------------------------------+------------+
| Run_Number | Tag                   | Path                                                                                   | Num_Events |
+------------+-----------------------+----------------------------------------------------------------------------------------+------------+
|      41488 | recon-2018_01-ver02   | /w/halld-scifs17exp/random_triggers/recon-2018_01-ver02/run041488_random.hddm          |          0 |
|      41488 | recon-2018_01-ver02   | /w/osgpool-sciwork18/halld/random_triggers/recon-2018_01-ver02/run041488_random.hddm   |    1183612 |
|      41488 | recon-2018_01-ver02   | /osgpool/halld/random_triggers/recon-2018_01-ver02/run041488_random.hddm               |    1183612 |
|      41488 | recon-2018_01-ver02.2 | /w/osgpool-sciwork18/halld/random_triggers/recon-2018_01-ver02.2/run041488_random.hddm |     881884 |
+------------+-----------------------+----------------------------------------------------------------------------------------+------------+

Only the first has zero events, and it corresponds to some ancient file. I don't know how MCWrapper chooses which file when there are multiple options... but in any case, for 2018-01, the correct tag to look at is "recon-2018_01-ver02.2".

Similarly, for 2018-08, all of the "missing" runs exist under the recon-2018_08-ver02.2 tag. They appear to be missing since they have a tag of "None" (whoops).

nsjarvis commented 1 month ago

That's possible; I thought I looked at the most recent set. I'll dig up my notes. It's easy to rerun the script anyway. If you can tell me which tags to use while I look for my notes, that will help.

sdobbs commented 1 month ago

Thanks, yeah, there is a list on this page: https://halldweb.jlab.org/wiki/index.php/How_to_choose_software_versions_on_the_MC_submission_form

But these are the versions to check:

2017-01 - recon-2017_01-ver03.2
2018-01 - recon-2018_01-ver02.2
2018-08 - recon-2018_08-ver02.2
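
For scripting against the Randoms table shown earlier in the thread, that version list maps naturally onto a lookup (a sketch; PREFERRED_TAGS and pick_entry are hypothetical names, and the rows use the Tag/Path/Num_Events columns from the query above):

```python
# Preferred random-trigger tags per run period, per the list above.
PREFERRED_TAGS = {
    "2017-01": "recon-2017_01-ver03.2",
    "2018-01": "recon-2018_01-ver02.2",
    "2018-08": "recon-2018_08-ver02.2",
}

def pick_entries(rows, run_period):
    """From (tag, path, num_events) rows for one run, keep only the
    entries under the preferred tag for that run period."""
    tag = PREFERRED_TAGS[run_period]
    return [r for r in rows if r[0] == tag]
```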

I'm going to rerun some of the problem 2019-11 runs - my current best guess is that there was some memory error that was overwriting the threshold used to determine if the beam is on or not. Will use additional TLC this time.

nsjarvis commented 1 month ago

I did use older recon tags in MCW, sorry. This is what I find using the correct tags, and after the beam_on_current upload for spring 17: repeat_spring17.txt repeat_spring18.txt repeat_fall18.txt repeat_spring20.txt

sdobbs commented 1 month ago

Thanks! I will look into filling in the missing files for spring/fall 18. Luckily these all seem to be very short runs. I'll fix as many of the problematic runs as I can.

As for the runs missing ~50% of the events, I don't think these are so urgent, since there are still 10s or 100s of thousands of mix-in events. If we wanted to improve on this, it seems like a project for a student.

sdobbs commented 1 month ago

I copied runs 41173, 41386, 42182, 51426, 51172 to the correct locations under /cache/halld/gluex_simulations/random_triggers, so they should be incorporated.

41221 needs to have its fiducial map calculated