JeffersonLab / halld_sim

Simulation for the GlueX Experiment in Hall D

Missing simulated TOF hits in smeared files after adding random triggers #279

Closed jrstevenjlab closed 1 year ago

jrstevenjlab commented 1 year ago

Looking at a recent MCWrapper project I noticed that certain run numbers had significantly fewer TOFPoints than others. The figure below shows the number of TOFPoints/event for 4 nearby runs: the upper two panels show a reasonable number of TOFPoints on average, but the bottom two panels show runs where very few TOFPoints are reconstructed.

exampleTOFpoints_phi2pi_17_2023.pdf

The mean TOFPoints/event is plotted vs run number in the figure below, where it's clear there are large numbers of runs in the range 30614-30800 where the number of observed TOFPoints is well below the expectation.

tofPointsVsRun_phi2pi_17_2023.pdf

Finally, when you look at the matching of these TOFPoints to tracks for the same 4 nearby runs, the two runs with significant numbers of TOFPoints show correlations with tracks as expected, but those with a low number of TOFPoints are not correlated with tracks (very few have distances between track and TOFPoint < 6 cm).

exampleTOFmatch_phi2pi_17_2023.pdf

This seems to indicate that the TOFPoints from random triggers are being included in the simulated events, but the TOF hits from the simulated signal events are not. This appears to occur in several other MCWrapper projects and also in simulations @aaust produced on the JLab farm, but typically for different ranges of run numbers in different samples, so the effect is not directly reproducible.

The plots shown above are from MCWrapper project #3093, which can be found at /cache/halld/gluex_simulations/REQUESTED_MC/phi2pi_17_20230221025001pm/root/monitoring_hists/

@aaust has a separate sample of rho events produced on the JLab farm which shows the same problem, for example in the file /work/halld2/home/aaustreg/Analysis/rho/simulation/sdme/ifarm_2017_ver03_ver50_keep_all/root/monitoring_hists/hd_root_gen_amp_030460_000.root

In this case the hdgeant4 and smeared output were saved (in the same hddm/ path as above) and the simulated TOF hits are included in the hdgeant4 file before smearing

/work/halld2/home/aaustreg//Analysis/rho/simulation/sdme/ifarm_2017_ver03_ver50_keep_all/hddm/gen_amp_030460_000_geant4.hddm

but are missing in the smeared file

/work/halld2/home/aaustreg//Analysis/rho/simulation/sdme/ifarm_2017_ver03_ver50_keep_all/hddm/gen_amp_030460_000_geant4_smeared.hddm

So, is there some mechanism for losing simulated TOF hits when we mix the simulated event with the random trigger hits in the mcsmear step? And if so, why is it run number dependent and not reproducible? Any suggestions for further tests or studies to understand this bug are appreciated.

rjones30 commented 1 year ago

To attempt to reproduce this issue, I carried out the following steps:

  1. Copied all random hits files from version recon-2017_01-ver03.2 for runs 30274 to 31057 from tape at JLab to UConn.
  2. Modified the MCwrapper script to adapt the JLab-specific paths to UConn cluster equivalents.
  3. Ran the same command indicated above: $MCWRAPPER_CENTRAL/gluex_MC.py MC_2k.config 30274-31057 1000000 batch=2 cleangeant=0 cleanmcsmear=0
  4. Waited for the outputs from all of these jobs to appear on my mass storage (329 files of type gen_amp_0XXXXX_000_geant4_smeared.hddm, plus associated logs and root files).
  5. Looped over all of the hd_root_gen_amp_0XXXXX_000.root files and extracted the mean number of TOF hits per event from /Independent/Hist_NumReconstructedObjects/NumTOFHits with statistical error (1/sqrt(N)); a sketch of such a loop is shown after this list.
  6. Plotted the results vs run number -- see below. Looks normal, no problems visible.
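For anyone who wants to repeat this kind of check, a minimal ROOT macro along these lines would do it. This is only a sketch, not the actual script used in the test; it assumes the hd_root files sit in a single directory and follow the hd_root_gen_amp_0RRRRR_000.root naming pattern, and it reads the histogram path quoted in step 5.

// plotMeanTOFHits.C -- hypothetical sketch of the loop described in step 5
#include "TFile.h"
#include "TH1.h"
#include "TGraphErrors.h"
#include "TSystemDirectory.h"
#include "TList.h"
#include "TString.h"
#include "TCanvas.h"
#include <cstdio>

void plotMeanTOFHits(const char* dir = ".")
{
   TGraphErrors* gr = new TGraphErrors();
   TSystemDirectory sysdir("histdir", dir);
   TList* files = sysdir.GetListOfFiles();
   if (!files) return;
   TIter next(files);
   while (TObject* obj = next()) {
      TString name = obj->GetName();
      if (!name.BeginsWith("hd_root_gen_amp_") || !name.EndsWith(".root")) continue;
      int run = 0;
      sscanf(name.Data(), "hd_root_gen_amp_%d", &run);     // run number from the file name
      TFile* f = TFile::Open(Form("%s/%s", dir, name.Data()));
      if (!f || f->IsZombie()) continue;
      TH1* h = (TH1*)f->Get("Independent/Hist_NumReconstructedObjects/NumTOFHits");
      if (h && h->GetEntries() > 0) {
         int ip = gr->GetN();
         gr->SetPoint(ip, run, h->GetMean());               // mean TOF hits per event
         gr->SetPointError(ip, 0, h->GetMeanError());       // statistical error, scales as 1/sqrt(N)
      }
      f->Close();
   }
   TCanvas* c = new TCanvas("c", "mean TOF hits vs run", 800, 500);
   gr->SetMarkerStyle(20);
   gr->Draw("AP");
   gr->GetXaxis()->SetTitle("run number");
   gr->GetYaxis()->SetTitle("mean NumTOFHits per event");
   c->SaveAs("tofHitsVsRun.png");
}

Run it from the directory containing the hd_root output, e.g. root -l -b -q plotMeanTOFHits.C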

image

rjones30 commented 1 year ago

Followup question: it seems like MCwrapper just assumes that the randoms files will be present on the /cache disk. What does MCwrapper do to check this? If they have been removed from cache, does it know what to do? -Richard

T-Britton commented 1 year ago

The files are "permanently pinned", which prevents them from falling off cache. There is also a secondary copy which is rsync'd to the xrootd server.

Problems can still occur, but even if I could guarantee existence anywhere, an internet outage would spoil it. I could be wrong, but every case of missing randoms I have seen came from the asynchronous trigger being disabled. Sean usually replaces them with the PS triggers when notified.


rjones30 commented 1 year ago

Zooming in on this one run 30460 (see Alexander's comment above) I see this run looks normal in my test. image

Thomas, why do you say these drop-out runs are missing async random triggers? See this run 30460 that Alexander highlights above. Here is the randoms file for that run on my disk; it is certainly not missing.

-rw-r--r-- 1 gluex Gluex 547848739 May 8 13:40 run030460_random.hddm

rjones30 commented 1 year ago

On the other hand, this run is missing from the cache disk. Maybe the "pinning" is not working?

ifarm1802.jlab.org> ls /cache/halld/gluex_simulations/random_triggers/recon-2017_01-ver03.2/run030460_random.hddm
ls: cannot access /cache/halld/gluex_simulations/random_triggers/recon-2017_01-ver03.2/run030460_random.hddm: No such file or directory
ifarm1802.jlab.org> ls /mss/halld/gluex_simulations/random_triggers/recon-2017_01-ver03.2/run030460_random.hddm
/mss/halld/gluex_simulations/random_triggers/recon-2017_01-ver03.2/run030460_random.hddm

T-Britton commented 1 year ago

It looks like that is the case. I'll talk to Ying tomorrow. Also, after the 2FA I may need to redo the rsync... adding it to the list.


T-Britton commented 1 year ago

I am putting these thoughts here for our meeting later: if we have runs where TOF hits are not present, and both the hdgeant and mcsmear outputs were saved, should we not use the hdgeant output as input to mcsmear, run mcsmear a bunch of times, and see if we can catch the behavior again? This would take MCwrapper out of the equation (I am unsure of a mechanism in MCwrapper that could cause this). We can use Justin's and Alex's observations to estimate the rate and then do the insane thing of running the same code many times expecting a different outcome. If we catch the behavior, the problem is in mcsmear (probably RNG related); if we have a large enough statistical sample and do not observe the behavior, it is probably something in MCwrapper. It is good that the behavior was also observed in tests not run on the OSG, as that removes a large number of potential sources of "weirdness" which could interfere in mysterious ways.

rjones30 commented 1 year ago

Thomas, here are the changes I had to make to gluex_MCwrapper to make the jobs run on the UConn cluster.

@.*** gluex_MCwrapper]$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   MakeMC.sh
        modified:   osg-container.sh

The osg-container.sh in the GitHub repo is a dummy, so I copied in my own container wrapper script; no need for further discussion of that. All of the other changes are in MakeMC.sh, shown below.

@.*** gluex_MCwrapper]$ git diff MakeMC.sh
diff --git a/MakeMC.sh b/MakeMC.sh
index ae58526..ef8a5c5 100755
--- a/MakeMC.sh
+++ b/MakeMC.sh
@@ -1,5 +1,7 @@
 #!/bin/bash
+source /home/halld/setup.sh
+
 # SET INPUTS
 export BATCHRUN=$1
 shift
@@ -192,6 +194,8 @@
 export USER_BC='/usr/bin/bc'
 export USER_STAT='/usr/bin/stat'
 fi
+cd $_CONDOR_SCRATCH_DIR
+
 printenv
 # necessary to run swif, uses local directory if swif=0 is used
 if [[ "$BATCHRUN" != "0" ]]; then
@@ -209,6 +213,7 @@
 if [[ "$BATCHSYS" == "QSUB" ]]; then
     cd $RUNNING_DIR
 fi
+echo running in workdir $(pwd)
 if [[ ! -d $RUNNING_DIR/${RUNNUMBER}${FILE_NUMBER} ]]; then
@@ -645,7 +650,8 @@ if [[ "$BKGFOLDSTR" == "DEFAULT" || "$bkgloc_pre" == "loc:" || "$BKGFOLDSTR" ==
     bkglocstring="$XRD_RANDOMS_URL/random_triggers/$RANDBGTAG/run$formatted_runNumber""_random.hddm"
     fi
 else
     bkglocstring="/work/osgpool/halld/random_triggers/"$RANDBGTAG"/run"$formatted_runNumber"_random.hddm"
+    bkglocstring="$XRD_RANDOMS_URL/random_triggers/"$RANDBGTAG"/run"$formatted_runNumber"_random.hddm"
     if [[ hostname == ' scosg16.jlab.org' || hostname == 'scosg20.jlab.org' || hostname == ' scosg2201.jlab.org' ]]; then
         bkglocstring="/work/osgpool/halld/random_triggers/"$RANDBGTAG"/run"$formatted_runNumber"_random.hddm"
     fi


rjones30 commented 1 year ago

Hello all,

With the latest version of MCwrapper, I went back and ran the full set of simulation runs that Alexander showed in his demonstration test, 338 files in all. I do not see any examples of dropped TOF hits in my complete sample (see second plot below), whereas Alex sees 17 (first plot below). If this were a random effect, the probability that it is present in my UConn instance but simply not seen due to statistical fluctuation is exp(-17), which is negligible. So either it is not random, or I am very (un)lucky.

-Richard Jones

Screenshot 2023-05-26 7 00 20 AM Screenshot 2023-05-26 6 57 23 AM

s6pepaul commented 1 year ago

Hi Richard,

just to confirm: your randoms also come from JLab, right (export XRD_RANDOMS_URL=root://sci-xrootd.jlab.org//osgpool/halld/)? Or did the JLab connection test fail and yours were all streamed from UConn?

I can't remember how the files get to root://nod25.phys.uconn.edu/Gluex/rawdata/, i.e. if they are exactly the same as the ones at JLab or could be older versions.

Cheers, Peter

PS I couldn't see the two plots you attached, was that just me?

s6pepaul commented 1 year ago

Can you share your MC.config file, Richard?

jrstevenjlab commented 1 year ago

I also can't see the plots you refer to in the post, Richard.

rjones30 commented 1 year ago

Peter,

It gets the randoms file from wherever MCwrapper looks for it. I didn't give it any hints about UConn stash locations, but it may know about them. Here is a line from the gluex_MC.py script in MCwrapper that seems relevant.

os.system("scp sci-xrootd.jlab.org:/osgpool/halld/"+"/random_triggers/"+RANDBGTAG+"/run"+formattedRUNNUM+"_random.hddm /tmp/"+RANDBGTAG)

The version of the randoms is identified by the full directory path, which ends with recon-2017_01-ver03.2; this ver03.2 is a unique identifier of the source of these randoms, I believe. I preserved it when I made copies from JLab to UConn.

-Richard Jones


rjones30 commented 1 year ago

Here is my MC.config file, which I took from Alexander and made minimal changes for the UConn cluster context.

# THESE TWO ARE OPTIONAL IF THE STANDARD RUNNING DOESN'T SUIT YOUR NEEDS
# CUSTOM_MAKEMC=use-this-script-instead
# CUSTOM_GCONTROL=use-this-Gcontrol-instead
#========================================================================
VARIATION=default # calibtime=timegoeshere #set your jana calib context here with or without calibtime. Default is variation=mc
RECON_CALIBTIME=2021-01-01-00-00-01
# RUNNING_DIRECTORY=/run/in/this/directory #where the code should run. This is defaulted to ./. Use only when NEEDED
ccdbSQLITEPATH=/home/halld/HDGeant4/jlab/test/makeMC/ccdb_NewAlignment_Vertex.sqlite
rcdbSQLITEPATH=/cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/dist/rcdb.sqlite
# @._production and @status_approved and (polarization_angle==0.0 or polarization_angle==90.0 or polarization_angle==45.0 or polarization_angle==135.0)
# @._production and @status_approved
# TAG=my-custom-prefix-tag
# CHANGE HERE!!!
DATA_OUTPUT_BASE_DIR=/home/halld/HDGeant4/jlab/test
# DONE
NCORES=4 # Number of CPU threads to use or nodes:node-id:ppn or nodes:ppn depending on your system
GENERATOR=gen_amp #or you may specifile file:/.../file-to-use.hddm
GENERATOR_CONFIG=/home/halld/HDGeant4/jlab/test/makeMC/gen_2k_4_4.cfg
# common parameters for generators
eBEAM_ENERGY=12 #either use rcdb or do not set to pull the value for the chosen run number from the rcdb
RADIATOR_THICKNESS=50.e-06 #either use rcdb or do not set to pull the value for the chosen run number from the rcdb
COHERENT_PEAK=9 #either use rcdb or do not set to pull the value for the chosen run number from the rcdb
GEN_MIN_ENERGY=6.0
GEN_MAX_ENERGY=11.4
GEN_MIN_ENERGY=8.2
GEN_MAX_ENERGY=8.8
GEANT_VERSION=4
BKG=Random:recon-2017_01-ver03.2 #[None, BeamPhotons, TagOnly, custom e.g bg.hddm:1.8] Can be stacked eg Random+TagOnly:.123 where the :[num] defines BGRATE
# BKG=None #[None, BeamPhotons, TagOnly, custom e.g bg.hddm:1.8] Can be stacked eg Random+TagOnly:.123 where the :[num] defines BGRATE
# optional additional plugins that will be run along side danarest and hd_root. This should be a comma separated list (e.g. plugin1,plugin2)
CUSTOM_PLUGINS=file:/home/halld/HDGeant4/jlab/test/makeMC/jana_analysis_2k.config # or /.../file-to-use which is a configuration file for jana
#====================================================================================
# EVERYTHING BELOW FOR BATCH ONLY
VERBOSE=True
BATCH_SYSTEM=condor #can be swif or condor or osg or qsub adding :[name] will pass -q [name] into PBS.
# environment file location
ENVIRONMENT_FILE=/cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/halld_versions/recon-2019_11-ver01_8.xml
ANA_ENVIRONMENT_FILE=/cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver61.xml
WORKFLOW_NAME=2k_mc #SWIF WORKFLOW NAME
PROJECT = halld # http://scicomp.jlab.org/scicomp/#/projects
TRACK= production # https://scicomp.jlab.org/docs/batch_job_tracks
# RESOURCES for swif jobs
DISK=5GB # Max Disk usage
RAM=5GB # Max RAM usage
TIMELIMIT=300minutes # Max walltime. This may be of the form xx:xx:xx depending on your system
OS=general # Specify CentOS65 machines


rjones30 commented 1 year ago

Making a new entry, copied from above, with plots. I guess GitHub does not accept email attachments in replies. Screenshot 2023-05-26 7 00 20 AM Screenshot 2023-05-26 6 57 23 AM

s6pepaul commented 1 year ago

Thanks for posting these plots, Richard. I find it interesting that even for the runs where Alex sees TOF hits, you have more: your average is systematically above 10^2, while Alex has systematically fewer than 10^2 hits. I wonder if that points to something more fundamental than just a "random" issue in some files.

s6pepaul commented 1 year ago

One more request: Can you provide me with an MCwrapper log file?

rjones30 commented 1 year ago

Peter,

There are many possible interpretations of "the MCwrapper log file"; which one of these do you mean:

  1. the stdout/stderr spit out by the original MCwrapper command that spawns the jobs.
  2. the stdout from a job completed by condor
  3. the stderr from a job completed by condor
  4. the "condor log file" that records jobs being scheduled, tracked during runtime, and completing with exit code

-Richard Jones

On Fri, May 26, 2023 at 9:23 AM Peter Hurck @.***> wrote:

One more request: Can you provide me with an MCwrapper log file?

— Reply to this email directly, view it on GitHub https://github.com/JeffersonLab/halld_sim/issues/279#issuecomment-1564389502, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3YKWFRVM5RSGF7W44Z47DXICVFFANCNFSM6AAAAAAV3XHNX4 . You are receiving this because you commented.Message ID: @.***>

T-Britton commented 1 year ago

2 and 3.

Having the terminal command you used to run everything would also be useful.

rjones30 commented 1 year ago

That may be because I am using the special sqlite file ccdb_NewAlignment_Vertex.sqlite whereas Alexander is hitting the ccdb live server in his jobs, according to the job logs. If I try to do that, after the first 100 or so jobs start up, the external-facing mysql server stops talking to me. -Richard Jones

On Fri, May 26, 2023 at 9:19 AM Peter Hurck @.***> wrote:

Thanks for posting these plots Richard. I find it interesting that even for the runs where Alex sees TOF hits, you have more. Your average is systematically above 10^2 Alex has systematically less than 10^2 hits. I wonder if that points to something more fundamental then just a "random" issue in some files.

— Reply to this email directly, view it on GitHub https://github.com/JeffersonLab/halld_sim/issues/279#issuecomment-1564384180, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3YKWDVUTUHQOZSTJFVHQ3XICUU3ANCNFSM6AAAAAAV3XHNX4 . You are receiving this because you commented.Message ID: @.***>

T-Britton commented 1 year ago

There is a copy that, provided you don't supply a special file, gets used out on the OSG.

export CCDB_CONNECTION=sqlite:////group/halld/www/halldweb/html/dist/ccdb.sqlite (note this is the oasis group...group as seen from the container)

rjones30 commented 1 year ago

Here is a sample stdout and stderr file from a typical run of Alexander's test on the uconn site. The command I used to initiate the jobs is below. -Richard

$MCWRAPPER_CENTRAL/gluex_MC.py MC_2k.config 30274-31057 10000000 batch=2 cleangeant=0 cleanmcsmear=0

out_2k_mc_31002_3.log error_2k_mc_31002_3.log

jrstevenjlab commented 1 year ago

Hi Richard,

I just noticed a couple of possibly relevant differences in your log file in terms of software versions.

From your log file:

Environment file: /cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/halld_versions/recon-2019_11-ver01_8.xml
Analysis Environment file: /cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver61.xml

From Alex's log file /work/halld2/home/aaustreg//Analysis/rho/simulation/sdme/ifarm_2017_ver03_ver50_keep_all/log/30460_stdout.30460_0.out:

Environment file: /group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_34.xml
Analysis Environment file: /group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver50.xml

This may not be the cause of the missing hits, but it could be related to the systematic difference in average TOF hits for the good runs. Could you retry at least a subset of the test with versions matching Alex's log file?

-Justin

rjones30 commented 1 year ago

Ok, I can do that. I also see differences in the versions of the GlueX software being used. Before I rerun, I would like to figure out how to get MCwrapper to start the same versions of the software as it is giving Alex.

Alex:
=======SOFTWARE USED=======
MCwrapper version v2.6.1
MCwrapper location /group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/gluex_MCwrapper/gluex_MCwrapper-v2.6.2
Streaming via xrootd? 1
Event Count: 219324
BC /usr/bin/bc
python /apps/bin/python
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/halld_sim/halld_sim-4.43.0^rec170139/Linux_CentOS7.7-x86_64-gcc4.8.5/bin/gen_amp
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/hdgeant4/hdgeant4-2.34.0^rec170139/bin/Linux-g++/hdgeant4
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/halld_sim/halld_sim-4.43.0^rec170139/Linux_CentOS7.7-x86_64-gcc4.8.5/bin/mcsmear
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/halld_recon/halld_recon-recon-2017_01-ver03.9/Linux_CentOS7.7-x86_64-gcc4.8.5/bin/hd_root

me:
=======SOFTWARE USED=======
MCwrapper version v2.7.0
MCwrapper location /home/halld/gluex_MCwrapper
LDPRELOAD: /usr/lib64/libXrdPosixPreload.so
Streaming via xrootd? 1
Event Count: -1
BC /usr/bin/bc
python /usr/bin/python
/home/halld/halld_sim/Linux_CentOS7-x86_64-gcc4.8.5/bin/gen_amp
/home/halld/HDGeant4/jlab/bin/Linux-g++/hdgeant4
/home/halld/halld_sim/Linux_CentOS7-x86_64-gcc4.8.5/bin/mcsmear
/home/halld/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/bin/hd_root

T-Britton commented 1 year ago

Maybe Alex can post his MC config file. Then just a few modifications (eg paths) but you will know all the files that are being referenced.

That environment file can be an xml. If MCwrapper sees a .xml it assumes build scripts and will try to find it in /group/ (oasis if on OSG).


aaust commented 1 year ago

For the test above, I used the following xml files:

ENVIRONMENT_FILE=/group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_34.xml
ANA_ENVIRONMENT_FILE=/group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver50.xml

Since then, I have repeated the test with newer version, but experienced the same issue in a few cases:

ENVIRONMENT_FILE=/group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_35.xml
ANA_ENVIRONMENT_FILE=/group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver60.xml

rjones30 commented 1 year ago

Going back to the actual binaries that were used in Alex's demonstration of this issue, I am now able to reproduce it at UConn. See below: there are actually 3 files out of 1138 that produced a mean TOF hits count around 5. I now need to figure out what these builds are formed from; I presume tagged versions? -Richard Jones

image

T-Britton commented 1 year ago

Great!


They are indeed tagged versions. I think Alex's post of the MC.config will have the xml used, which lists all the versions. I can try to pull up the list in a bit.

T-Britton commented 1 year ago

more /group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_34.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="https://halldweb.jlab.org/halld_versions/version7.xsl"?>
<gversions file="recon-2017_01-ver03_34.xml" date="2022-11-23">
  <description>Recon launch compatible version based on 5.9.0</description>
  <package name="amptools" version="0.14.5" dirtag="root60806"/>
  <package name="ccdb" version="1.06.07"/>
  <package name="cernlib" version="2005" word_length="64-bit"/>
  <package name="diracxx" version="2.0.1"/>
  <package name="evio" version="4.4.6"/>
  <package name="evtgen" version="01.07.00"/>
  <package name="geant4" version="10.02.p02"/>
  <package name="gluex_MCwrapper" version="v2.6.2"/>
  <package name="gluex_root_analysis" version="1.23.0" dirtag="rec170139"/>
  <package name="halld_recon" version="recon-2017_01-ver03.9"/>
  <package name="halld_sim" version="4.43.0" dirtag="rec170139"/>
  <package name="hdds" version="4.6.0" dirtag="bs221"/>
  <package name="hdgeant4" version="2.34.0" dirtag="rec170139"/>
  <package name="hd_utilities" version="1.44"/>
  <package name="hepmc" version="2.06.10"/>
  <package name="jana" version="0.7.9p1" dirtag="bs221"/>
  <package name="lapack" version="3.6.0"/>
  <package name="photos" version="3.61"/>
  <package name="rcdb" version="0.06.00"/>
  <package name="root" version="6.08.06" dirtag="bs221"/>
  <package name="sqlitecpp" version="2.2.0" dirtag="bs130"/>
  <package name="sqlite" version="3.13.0" year="2016" dirtag="bs130"/>
  <package name="xerces-c" version="3.1.4"/>
</gversions>

Note that the MCwrapper version cited here isn't used for submitting, only for ancillary files needed by some generators.

rjones30 commented 1 year ago

There is a longstanding bug in TOFSmear.cc, coming from uninitialized calibration constants in the tof_calib_t object. It happens because the compiler switch DTOFGEOMETRY is not defined, and the section of code that is selected when DTOFGEOMETRY is undefined does not load values from ccdb into certain constants in tof_calib_t, nor does it assign default values for these constants, with the result that they take on whatever values were in memory when the constructor was called. This affects all simulations of the TOF, not just the small handful of runs where the simulated TOF hits are missing altogether. It does not usually kill off the TOF hits entirely; usually it just applies an arbitrary attenuation correction. If we don't use the TOF pulse height for anything, it may not have a big impact.
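A schematic illustration of this kind of bug (hypothetical names and values, not the actual TOFSmear.cc code): when the preprocessor symbol is absent, the fallback branch of the constructor forgets to fill some members, which then keep whatever garbage happened to be in memory.

// Hypothetical sketch of the uninitialized-constant pattern described above.
struct tof_calib_sketch_t {          // stand-in for tof_calib_t, names are made up
   double attenuation_length;        // used later to attenuate the smeared dE
   double threshold;

   tof_calib_sketch_t() {
#ifdef DTOFGEOMETRY
      // intended branch: every constant comes from ccdb (placeholder values here)
      attenuation_length = 400.0;    // cm
      threshold          = 0.0005;   // GeV
#else
      // fallback branch: threshold is set but attenuation_length never is,
      // so it keeps an arbitrary value and the attenuation correction is garbage
      threshold = 0.0005;
#endif
   }
};
// The fix is to load (or at least default) every constant in both branches.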

In looking for this bug, I found a couple of others too, related to the drift chamber hits. They may be less serious, but I am not sure. I think I will submit them one by one, as separate pull requests, since they can have different effects.

rjones30 commented 1 year ago

As a side comment, debugging this was made more difficult by the fact that all debug symbols are being stripped from the binaries in the container. This means that I had to disassemble the compiled code and follow the execution at the machine instruction level to find the bug. It did not occur in the code that was compiled with symbols enabled, at least not at the same places. The advantage is that I have now refreshed my memory of the x86_64 instruction set, and what the main cpu registers are used for by the gcc optimizer. Unfortunately I will forget it all again tomorrow.

jrstevenjlab commented 1 year ago

Thanks a lot for tracking this down, Richard, and for the PR #287.

So if I understand properly, arbitrary attenuation lengths have been used for 2017-2018 simulation since the attenuation length was added to the TOF energy deposition in July 2020 https://github.com/JeffersonLab/halld_sim/commit/c5a1b0f771dea9f1fd80f8b18a8973cc7b57ef04

We don't use the TOF dE for much in user analysis but there is a threshold applied when building DTOFPoints

https://github.com/JeffersonLab/halld_recon/blob/master/src/libraries/TOF/DTOFPoint_factory.cc#L47

that would be affected by this. But since the hits may be missing anyway based on the cut in mcsmear, I don't think the wrong dE values have a big impact beyond the missing hits.
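To make the mechanism concrete, here is a hedged sketch (hypothetical names and threshold, not the halld_recon code) of how a garbage attenuation length can push a hit's corrected dE below a threshold so the hit never contributes to a TOF point:

#include <cmath>

// Standard exponential attenuation of the deposited energy along the paddle.
double attenuated_dE(double dE, double dist_to_pmt, double attenuation_length) {
   return dE * std::exp(-dist_to_pmt / attenuation_length);
}

// A hit only survives if its corrected dE exceeds the (hypothetical) threshold;
// an uninitialized attenuation_length can make this fail for every hit in a run.
bool keep_hit(double dE, double dist_to_pmt, double attenuation_length,
              double dE_threshold) {
   return attenuated_dE(dE, dist_to_pmt, attenuation_length) > dE_threshold;
}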

That said, these missing hits could create an inefficiency in charged track reconstruction from worse start times, or from not having the associated fast-detector match that's required in the default analysis library. For analyses that rely on the TOF for strict PID the effect could be more significant.

rjones30 commented 1 year ago

I have now submitted a second, independent pull request for a second bug fix. This one has been there since the introduction of random hits merging a long time ago. In every few events per 10k that I looked at, there was a situation in the random hits merging (usually tagger or drift chamber hits) where the algorithm accessed memory past the end of an array. The results are unpredictable, but they should only affect a few events per 10k. Still, it should be fixed. This PR has been tested and shown to fix the problem.
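For readers who have not looked at the mcsmear internals, the failure mode is the classic fixed-size-buffer overrun; the sketch below uses hypothetical names and sizes and is not the actual merging code, it just illustrates the pattern and the bounds-checked fix.

#include <cstddef>

const std::size_t MAX_HITS = 100;     // hypothetical per-detector capacity

struct HitBuffer {
   double t[MAX_HITS];                // hit times already in the event
   std::size_t nhits = 0;
};

// Buggy pattern: random-trigger hits are appended without checking capacity,
// so a busy event writes past the end of the array with unpredictable results.
void merge_hits_unsafe(HitBuffer& buf, const double* random_t, std::size_t nrandom) {
   for (std::size_t i = 0; i < nrandom; ++i)
      buf.t[buf.nhits++] = random_t[i];
}

// Fixed pattern: stop at the array bound (or switch to a growable container).
void merge_hits_safe(HitBuffer& buf, const double* random_t, std::size_t nrandom) {
   for (std::size_t i = 0; i < nrandom && buf.nhits < MAX_HITS; ++i)
      buf.t[buf.nhits++] = random_t[i];
}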

sdobbs commented 1 year ago

Thanks a lot for tracking this down, Richard. One note - I think this new pull request, #288, still includes the proposed changes to the TOF smearing classes.

rjones30 commented 1 year ago

One note - I think this new pull request, #288 https://github.com/JeffersonLab/halld_sim/pull/288, still includes the proposed changes to the TOF smearing classes.

Yes, the two PRs are sequential; if you like them both you can apply them both at once, or just the first one with changes, and I can resubmit the second one. -Richard J.


s6pepaul commented 1 year ago

Ok, just to summarise: Sean removed the switches from the TOF in mcsmear in PR #289. This supersedes Richard's PR #287. But we still need to include the fixes in PR #288 (minus the 287 parts). This requires Richard to resubmit the second bugfix. Sean made the necessary change to recon-2017_01-ver03-sim so we can tag a new version of the old 2017_01-ver03 recon code. Changes to newer versions are not required as this is already included (albeit coded slightly differently, e.g. https://github.com/JeffersonLab/halld_recon/blob/cf30b83b743e9216cb06ce8a0b28291447804d08/src/libraries/TOF/DTOFGeometry.h#L33)

So, as soon as we merge Richard's bug fix we can tag new versions and make new builds for MC production. All correct?

jrstevenjlab commented 1 year ago

Thanks for the summary @s6pepaul. That's my understanding with one addition: the function DTOFGeometry::Get_CCDB_DirectoryName() needed to be added to both recon-2018_01-ver02-sim and recon-2018_08-ver02-sim, which @sdobbs did yesterday. So I believe all of the 2017 and 2018 sim branches of halld_recon need new tags and that could be done in parallel to Richard's resubmission of #288

rjones30 commented 1 year ago

Very good, my PR for the fix to the hits-merging bug has been updated; it should be ready to go as soon as it passes checks. -Richard Jones


jrstevenjlab commented 1 year ago

Thanks @rjones30, #288 is merged. Looks like we're ready for new tagged versions.

jrstevenjlab commented 1 year ago

With the patched halld_recon branch and the latest halld_sim master, here is a short test with 10M rho events from all runs in the 2017_01 period (similar to Alex's original test). Plotted is the average # of TOF points for each run in the sample, which doesn't show any outlier runs with very few hits, which was the original symptom of the TOF hits being missing from the signal MC. We should still check some larger samples produced by MCWrapper with the new tagged versions, but I'm closing this issue at this point.

toffPointsVsRun

Output files can be found at: /work/halld2/home/jrsteven/analysisGluexI/tof_test/recon-2017_01-ver03-sim/

sdobbs commented 1 year ago

Thanks a lot to everyone involved in tracking this down!

jrstevenjlab commented 1 year ago

Closing the loop on this: @s6pepaul tagged new version sets which fix this issue that should be used for any new MC production https://mailman.jlab.org/pipermail/halld-offline/2023-June/008917.html

For analyzers who would like to check if this issue affects their existing simulation samples, you can use a ROOT macro to plot TOF hits vs run number from the standard monitoring_hists produced by MCWrapper. An example of such a macro can be found at /work/halld2/home/jrsteven/forTOF_test/plotTOF_MissingHits.C
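If you cannot reach that work-disk path, a quick check along the same lines (a hypothetical sketch, not a copy of plotTOF_MissingHits.C) is to compute the mean NumTOFHits for each monitoring file and flag runs that sit far below the sample median:

// checkMissingTOFHits.C -- hedged sketch; assumes the MCWrapper monitoring files
// live in "dir" and contain the usual NumTOFHits histogram path.
#include "TFile.h"
#include "TH1.h"
#include "TSystemDirectory.h"
#include "TList.h"
#include "TString.h"
#include "TMath.h"
#include <vector>
#include <cstdio>

void checkMissingTOFHits(const char* dir = ".")
{
   std::vector<TString> names;
   std::vector<double> means;
   TSystemDirectory sysdir("mondir", dir);
   TList* files = sysdir.GetListOfFiles();
   if (!files) return;
   TIter next(files);
   while (TObject* obj = next()) {
      TString name = obj->GetName();
      if (!name.EndsWith(".root")) continue;
      TFile* f = TFile::Open(Form("%s/%s", dir, name.Data()));
      if (!f || f->IsZombie()) continue;
      TH1* h = (TH1*)f->Get("Independent/Hist_NumReconstructedObjects/NumTOFHits");
      if (h && h->GetEntries() > 0) { names.push_back(name); means.push_back(h->GetMean()); }
      f->Close();
   }
   if (means.empty()) return;
   double median = TMath::Median(means.size(), means.data());
   for (size_t i = 0; i < means.size(); ++i)   // flag files well below the sample median
      if (means[i] < 0.2 * median)
         printf("suspect file: %s  mean NumTOFHits = %.2f (median %.2f)\n",
                names[i].Data(), means[i], median);
}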

Below are results from some recent samples which show the problem affects 2017 and 2018 periods and varies a lot from run to run. The 2018_08 period for these samples appears to be particularly affected, but this is a random effect from an uninitialized variable so no generalizations should be made.

toffHitsVsRun_2017_01 toffHitsVsRun_2018_01 toffHitsVsRun_2018_08

Paths for monitoring_hists (replace in the macro with your project location):
/cache/halld/gluex_simulations/REQUESTED_MC/phi2pi_17_20230221025001pm/root/monitoring_hists/
/cache/halld/gluex_simulations/REQUESTED_MC/phi2pi_18_20230321014504pm/root/monitoring_hists/
/cache/halld/gluex_simulations/REQUESTED_MC/phi2pi_18l_20230321014210pm/root/monitoring_hists/