jrstevenjlab closed this issue 1 year ago.
To attempt to reproduce this issue, I carried out the following steps:
Followup question: it seems like MCwrapper just assumes that the randoms files will be present on the /cache disk. What does MCwrapper do to check this? If they have been removed from cache, does it know what to do? -Richard
The files are “permanently pinned”, which prevents them from falling off cache. There is also a secondary copy which is rsync’d to the xrootd server.
Problems can still occur, but even if I could guarantee existence somewhere, an internet outage would spoil it. I could be wrong, but every case of missing randoms comes from the asynchronous trigger being disabled. Sean usually replaces them with the PS triggers when notified.
Thomas Britton
Zooming in on this one run 30460 (see Alexander's comment above) I see this run looks normal in my test.
Thomas, why do you say these drop-out runs are missing async randoms triggers? See this run 30460 that Alexander highlights above. Here is the randoms file for that run on my disk. It is certainly not missing.

-rw-r--r-- 1 gluex Gluex 547848739 May 8 13:40 run030460_random.hddm
On the other hand, this run is missing from the cache disk. Maybe the "pinning" is not working?

ifarm1802.jlab.org> ls /cache/halld/gluex_simulations/random_triggers/recon-2017_01-ver03.2/run030460_random.hddm
ls: cannot access /cache/halld/gluex_simulations/random_triggers/recon-2017_01-ver03.2/run030460_random.hddm: No such file or directory
ifarm1802.jlab.org> ls /mss/halld/gluex_simulations/random_triggers/recon-2017_01-ver03.2/run030460_random.hddm
/mss/halld/gluex_simulations/random_triggers/recon-2017_01-ver03.2/run030460_random.hddm
It looks like that is the case. I’ll talk to Ying tomorrow. Also, after the 2FA I may need to redo the rsync… adding it to the list.
Thomas Britton
I am putting these thoughts here for our meeting later: if we have runs where TOF hits are not present, and both the hdgeant and mcsmear output were saved, should we not be using the hdgeant output as input to mcsmear? Run mcsmear a bunch of times and see if we can catch the behavior again. This would take MCwrapper out of the equation (I am unsure of a mechanism by which MCwrapper could cause this). We can use Justin's and Alex's observations to estimate the rate, and then do the insane thing of running the same code a bunch of times expecting a different outcome. If we catch the behavior, the problem is in mcsmear (probably rng related); if we have a large enough statistical sample and do not observe the behavior, it is probably something in MCwrapper. It is good that tests run off the OSG also observed the behavior, as that removes a large amount of potential "weirdness" which could interfere in mysterious ways.
Thomas, here are the changes I had to make to gluex_MCwrapper to make the jobs run on the UConn cluster.
@.*** gluex_MCwrapper]$ git status
The osg-container.sh in the github repo is a dummy, so I copied in my own container wrapper script, no need for further discussion of that. All of the other changes are in MakeMC.sh, shown below.
@.*** gluex_MCwrapper]$ git diff MakeMC.sh
diff --git a/MakeMC.sh b/MakeMC.sh
index ae58526..ef8a5c5 100755
--- a/MakeMC.sh
+++ b/MakeMC.sh
@@ -1,5 +1,7 @@
+source /home/halld/setup.sh
+
 export BATCHRUN=$1
 shift
@@ -192,6 +194,8 @@
 export USER_BC='/usr/bin/bc'
 export USER_STAT='/usr/bin/stat'
 fi
+cd $_CONDOR_SCRATCH_DIR
+
 if [[ "$BATCHRUN" != "0" ]]; then
@@ -209,6 +213,7 @@ if [[ "$BATCHSYS" == "QSUB" ]]; then
 cd $RUNNING_DIR
 fi
+echo running in workdir $(pwd)
 if [[ ! -d $RUNNING_DIR/${RUNNUMBER}${FILE_NUMBER} ]]; then
@@ -645,7 +650,8 @@ if [[ "$BKGFOLDSTR" == "DEFAULT" || "$bkgloc_pre" == "loc:" || "$BKGFOLDSTR" ==
-    bkglocstring="/work/osgpool/halld/random_triggers/"$RANDBGTAG"/run"$formatted_runNumber"_random.hddm"
+
+    bkglocstring="$XRD_RANDOMS_URL/random_triggers/"$RANDBGTAG"/run"$formatted_runNumber"_random.hddm"
     if [[ `hostname` == 'scosg16.jlab.org' || `hostname` == 'scosg20.jlab.org' || `hostname` == 'scosg2201.jlab.org' ]]; then
         bkglocstring="/work/osgpool/halld/random_triggers/"$RANDBGTAG"/run"$formatted_runNumber"_random.hddm"
     fi
Hello all,
With the latest version of MCwrapper, I went back and ran the full set of simulation runs that Alexander showed in his demonstration test, 338 files in all. I do not see any examples of dropping tof hits in my complete sample (see second plot below) whereas Alex sees 17 (first plot below). If this is a random thing, the probability that it is there in my UConn instance and just not seen by statistical fluctuation is exp(-17), which is negligible. So either it is not random, or I am very (un)lucky.
-Richard Jones
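For reference, the exp(-17) quoted above is the Poisson probability of observing zero affected files when 17 are expected:

```latex
P(0 \mid \mu = 17) = \frac{\mu^{0} e^{-\mu}}{0!}\bigg|_{\mu=17} = e^{-17} \approx 4.1 \times 10^{-8}
```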
Hi Richard,
just to confirm: your randoms come also from JLab, right (export XRD_RANDOMS_URL=root://sci-xrootd.jlab.org//osgpool/halld/)? Or did the JLab connection test fail and yours were all streamed from UConn?
I can't remember how the files get to root://nod25.phys.uconn.edu/Gluex/rawdata/, i.e. whether they are exactly the same as the ones at JLab or could be older versions.
Cheers, Peter
PS I couldn't see the two plots you attached, was that just me?
Can you share your MC.config file, Richard?
I also can't see the plots you refer to in the post, Richard.
Peter,
It gets the randoms file from wherever MCwrapper looks for it. I didn't give it any hints about UConn stash locations, but it may know about them. Here is a line from the gluexMC.py script in MCwrapper that seems relevant.
os.system("scp sci-xrootd.jlab.org:/osgpool/halld/"+"/random_triggers/"+RANDBGTAG+"/run"+formattedRUNNUM+"_random.hddm /tmp/"+RANDBGTAG)
The version of the randoms is identified by the full directory path, which ends with recon-2017_01-ver03.2; this ver03.2 is a unique identifier of the source of these randoms, I believe. I preserved it when I made copies from JLab at UConn.
-Richard Jones
Here is my MC.config file, which I took from Alexander and made minimal changes for the UConn cluster context.
# … here with or without calibtime. Default is variation=mc
# … is defaulted to ./. Use only when NEEDED
ccdbSQLITEPATH=/home/halld/HDGeant4/jlab/test/makeMC/ccdb_NewAlignment_Vertex.sqlite
rcdbSQLITEPATH=/cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/dist/rcdb.sqlite
@._production and @status_approved and (polarization_angle==0.0 or polarization_angle==90.0 or polarization_angle==45.0 or polarization_angle==135.0)
@._production and @status_approved
DATA_OUTPUT_BASE_DIR=/home/halld/HDGeant4/jlab/test
NCORES=4 # Number of CPU threads to use or nodes:node-id:ppn or nodes:ppn depending on your system
GENERATOR=gen_amp # or you may specify file:/.../file-to-use.hddm
GENERATOR_CONFIG=/home/halld/HDGeant4/jlab/test/makeMC/gen_2k_4_4.cfg
# … chosen run number from the rcdb
# … for the chosen run number from the rcdb
# … chosen run number from the rcdb
GEN_MIN_ENERGY=8.2
GEN_MAX_ENERGY=8.8
GEANT_VERSION=4
BKG=Random:recon-2017_01-ver03.2 # [None, BeamPhotons, TagOnly, custom e.g. bg.hddm:1.8] Can be stacked e.g. Random+TagOnly:.123 where the :[num] defines BGRATE
# … hd_root. This should be a comma separated list (e.g. plugin1,plugin2)
CUSTOM_PLUGINS=file:/home/halld/HDGeant4/jlab/test/makeMC/jana_analysis_2k.config
# … jana
#====================================================================================
BATCH_SYSTEM=condor # can be swif or condor or osg or qsub; adding :[name] will pass -q [name] into PBS
ENVIRONMENT_FILE=/cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/halld_versions/recon-2019_11-ver01_8.xml
ANA_ENVIRONMENT_FILE=/cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver61.xml
WORKFLOW_NAME=2k_mc # SWIF WORKFLOW NAME
PROJECT = halld # http://scicomp.jlab.org/scicomp/#/projects
TRACK= production # https://scicomp.jlab.org/docs/batch_job_tracks
DISK=5GB # Max Disk usage
RAM=5GB # Max RAM usage
TIMELIMIT=300minutes # Max walltime. This may be of the form xx:xx:xx depending on your system
OS=general # Specify CentOS65 machines
Making a new entry, copied from above with plots. I guess github does not accept email attachments in replies.
Thanks for posting these plots Richard. I find it interesting that even for the runs where Alex sees TOF hits, you have more. Your average is systematically above 10^2 hits, while Alex's is systematically below 10^2. I wonder if that points to something more fundamental than just a "random" issue in some files.
One more request: Can you provide me with an MCwrapper log file?
Peter,
There are many possible interpretations of "the MCwrapper log file"; which one of these do you mean:
-Richard Jones
2./3.
Having the terminal command you used to run everything could also be useful.
That may be because I am using the special sqlite file ccdb_NewAlignment_Vertex.sqlite whereas Alexander is hitting the ccdb live server in his jobs, according to the job logs. If I try to do that, after the first 100 or so jobs start up, the external-facing mysql server stops talking to me. -Richard Jones
There is a copy that, provided you don't supply a special file, gets used out on the OSG.
export CCDB_CONNECTION=sqlite:////group/halld/www/halldweb/html/dist/ccdb.sqlite (note this is the oasis group...group as seen from the container)
Here is a sample stdout and stderr file from a typical run of Alexander's test on the uconn site. The command I used to initiate the jobs is below. -Richard
$MCWRAPPER_CENTRAL/gluex_MC.py MC_2k.config 30274-31057 10000000 batch=2 cleangeant=0 cleanmcsmear=0
Hi Richard,
I just noticed a couple of possibly relevant differences in your log file in terms of software versions.
From your log file:
Environment file: /cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/halld_versions/recon-2019_11-ver01_8.xml
Analysis Environment file: /cvmfs/oasis.opensciencegrid.org/gluex/group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver61.xml
From Alex's log file /work/halld2/home/aaustreg//Analysis/rho/simulation/sdme/ifarm_2017_ver03_ver50_keep_all/log/30460_stdout.30460_0.out:
Environment file: /group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_34.xml
Analysis Environment file: /group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver50.xml
This may not be the cause of the missing hits but could be related to the systematic difference in average TOF hits for the good runs. Could you retry at least a subset of the test with the matching versions from Alex's log file?
-Justin
Ok, I can do that. I also see differences in the versions of the GlueX software being used. Before I rerun, I would like to figure out how to get MCwrapper to start the same versions of the software as it is giving Alex.
Alex:
=======SOFTWARE USED=======
MCwrapper version v2.6.1
MCwrapper location /group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/gluex_MCwrapper/gluex_MCwrapper-v2.6.2
Streaming via xrootd? 1
Event Count: 219324
BC /usr/bin/bc
python /apps/bin/python
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/halld_sim/halld_sim-4.43.0^rec170139/Linux_CentOS7.7-x86_64-gcc4.8.5/bin/gen_amp
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/hdgeant4/hdgeant4-2.34.0^rec170139/bin/Linux-g++/hdgeant4
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/halld_sim/halld_sim-4.43.0^rec170139/Linux_CentOS7.7-x86_64-gcc4.8.5/bin/mcsmear
/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/halld_recon/halld_recon-recon-2017_01-ver03.9/Linux_CentOS7.7-x86_64-gcc4.8.5/bin/hd_root
me:
=======SOFTWARE USED=======
MCwrapper version v2.7.0
MCwrapper location /home/halld/gluex_MCwrapper
LDPRELOAD: /usr/lib64/libXrdPosixPreload.so
Streaming via xrootd? 1
Event Count: -1
BC /usr/bin/bc
python /usr/bin/python
/home/halld/halld_sim/Linux_CentOS7-x86_64-gcc4.8.5/bin/gen_amp
/home/halld/HDGeant4/jlab/bin/Linux-g++/hdgeant4
/home/halld/halld_sim/Linux_CentOS7-x86_64-gcc4.8.5/bin/mcsmear
/home/halld/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/bin/hd_root
Maybe Alex can post his MC config file. Then you would only need a few modifications (e.g. paths), but you would know all the files that are being referenced.
That environment file can be an xml. If MCwrapper sees a .xml it assumes build scripts and will try to find it in /group/ (oasis if on OSG).
For the test above, I used the following xml files:
ENVIRONMENT_FILE=/group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_34.xml
ANA_ENVIRONMENT_FILE=/group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver50.xml
Since then, I have repeated the test with newer versions, but experienced the same issue in a few cases:
ENVIRONMENT_FILE=/group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_35.xml
ANA_ENVIRONMENT_FILE=/group/halld/www/halldweb/html/halld_versions/analysis-2017_01-ver60.xml
Going back to the actual binaries that were used in Alex's demonstration of this issue, I am now able to reproduce it at UConn. See below: there are actually 3 files out of 1138 that produced a mean tof hits count around 5. I now need to figure out what these builds are formed out of; I presume tagged versions? -Richard Jones
Great!
They are indeed tagged versions. I think Alex’s post of the MC.config will have the xml used…which lists all the versions. I can try to pull up the list in a bit
more /group/halld/www/halldweb/html/halld_versions/recon-2017_01-ver03_34.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="https://halldweb.jlab.org/halld_versions/version7.xsl"?>
<gversions file="recon-2017_01-ver03_34.xml" date="2022-11-23">
  <description>Recon launch compatible version based on 5.9.0</description>
  <package name="amptools" version="0.14.5" dirtag="root60806"/>
  <package name="ccdb" version="1.06.07"/>
  <package name="cernlib" version="2005" word_length="64-bit"/>
  <package name="diracxx" version="2.0.1"/>
  <package name="evio" version="4.4.6"/>
  <package name="evtgen" version="01.07.00"/>
  <package name="geant4" version="10.02.p02"/>
  <package name="gluex_MCwrapper" version="v2.6.2"/>
  <package name="gluex_root_analysis" version="1.23.0" dirtag="rec170139"/>
  <package name="halld_recon" version="recon-2017_01-ver03.9"/>
  <package name="halld_sim" version="4.43.0" dirtag="rec170139"/>
  <package name="hdds" version="4.6.0" dirtag="bs221"/>
  <package name="hdgeant4" version="2.34.0" dirtag="rec170139"/>
  <package name="hd_utilities" version="1.44"/>
  <package name="hepmc" version="2.06.10"/>
  <package name="jana" version="0.7.9p1" dirtag="bs221"/>
  <package name="lapack" version="3.6.0"/>
  <package name="photos" version="3.61"/>
  <package name="rcdb" version="0.06.00"/>
  <package name="root" version="6.08.06" dirtag="bs221"/>
  <package name="sqlitecpp" version="2.2.0" dirtag="bs130"/>
  <package name="sqlite" version="3.13.0" year="2016" dirtag="bs130"/>
  <package name="xerces-c" version="3.1.4"/>
</gversions>
Note the MCwrapper version cited here isn't used for submitting, only for ancillary files needed by some generators.
There is a longstanding bug in TOFSmear.cc, coming from uninitialized calibration constants in the tof_calib_t object. It happens because the compiler switch DTOFGEOMETRY is not defined, and the section of code that is selected when DTOFGEOMETRY is undefined neither loads values from ccdb into certain constants in tof_calib_t nor assigns default values for them, with the result that they take on whatever values happened to be in memory when the constructor was called. This affects all simulations of the TOF, not just the small handful of runs where the simulated tof hits are missing altogether. It does not usually kill off the TOF hits altogether; usually it just applies an arbitrary attenuation correction. If we don't use the TOF pulse height for anything, it may not have a big impact.
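The failure mode can be sketched in a few lines of C++. This is purely illustrative: the struct, field names, and numbers below are placeholders and not the actual halld_sim code. The point is that a default member initializer guarantees a defined value even when the branch that loads constants from ccdb is compiled out, whereas an uninitialized garbage value can suppress the attenuated pulse height to essentially zero.

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch only -- names and values are placeholders, not the
// real tof_calib_t from halld_sim. The essence of the bug: if no code path
// assigns attenuation_length, it holds whatever bytes were already in
// memory; an absurd value then wipes out the attenuated pulse height.
struct TofCalibSketch {
    // The fix, in spirit: an in-class default keeps the value well-defined
    // even when the ccdb-loading branch is disabled at compile time.
    double attenuation_length = 400.0;  // cm, placeholder default
};

// Attenuation correction applied to a simulated energy deposition a
// distance dist_cm from the readout.
double attenuated_dE(double dE, double dist_cm, const TofCalibSketch& calib) {
    return dE * std::exp(-dist_cm / calib.attenuation_length);
}
```

With a sane attenuation length, a hit 100 cm from the readout keeps most of its energy; if attenuation_length instead holds a tiny garbage value, the same hit's dE underflows to zero and falls below any threshold, which is consistent with hits disappearing only for unlucky memory contents.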
In looking for this bug, I found a couple of others too, related to the drift chamber hits. They may be less serious, but I am not sure. I think I will submit them one by one, as separate pull requests, since they can have different effects.
As a side comment, debugging this was made more difficult by the fact that all debug symbols are being stripped from the binaries in the container. This means that I had to disassemble the compiled code and follow the execution at the machine instruction level to find the bug. It did not occur in the code that was compiled with symbols enabled, at least not at the same places. The advantage is that I have now refreshed my memory of the x86_64 instruction set, and what the main cpu registers are used for by the gcc optimizer. Unfortunately I will forget it all again tomorrow.
Thanks a lot for tracking this down, Richard, and for PR #287.
So if I understand properly, arbitrary attenuation lengths have been used for 2017-2018 simulation since the attenuation length was added to the TOF energy deposition in July 2020 https://github.com/JeffersonLab/halld_sim/commit/c5a1b0f771dea9f1fd80f8b18a8973cc7b57ef04
We don't use the TOF dE for much in user analysis but there is a threshold applied when building DTOFPoints
https://github.com/JeffersonLab/halld_recon/blob/master/src/libraries/TOF/DTOFPoint_factory.cc#L47
that would be affected by this, but since the hits may be missing anyway based on the cut in mcsmear, I don't think the wrong dE values have a big impact beyond the missing hits.
That said, these missing hits could create an inefficiency in charged-track reconstruction from worse start times or from not having an associated fast-detector match, which is required in the default analysis library. For analyses that rely on the TOF for strict PID the effect could be more significant.
I have now submitted a second, independent, pull request for a second bug fix. This one has been there since the introduction of random hits merging a long time ago. Every few events in 10k that I looked at, there was a situation in the random hits merging (usually tagger or drift chamber hits) where the algorithm was accessing memory past the end of an array. The results are unpredictable, but they should only affect a few events per 10k. Still, it should be fixed. This PR has been tested and shown to fix this problem.
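The pattern of that second bug can be illustrated with a small sketch (hypothetical names, not the actual halld_sim merging code): the broken form indexes one hit array with a loop bound taken from the other, reading past the end whenever the random-trigger record carries more hits; the safe form bounds each loop by its own container.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch only -- merge_hits and the int "hits" stand in for
// the real hddm hit records. The buggy pattern looked roughly like
//   for (i = 0; i < background.size(); ++i) merged.push_back(signal[i]);
// which reads past the end of `signal` whenever the random-trigger event
// has more hits than the simulated one, with unpredictable results.
std::vector<int> merge_hits(const std::vector<int>& signal,
                            const std::vector<int>& background) {
    std::vector<int> merged;
    merged.reserve(signal.size() + background.size());
    for (std::size_t i = 0; i < signal.size(); ++i)      // bound by signal
        merged.push_back(signal[i]);
    for (std::size_t i = 0; i < background.size(); ++i)  // bound by background
        merged.push_back(background[i]);
    return merged;
}
```

Because the overrun only bites when the background record happens to be longer and the adjacent memory happens to hold bad values, it would plausibly affect only a few events per 10k, matching the observation above.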
Thanks a lot for tracking this down, Richard. One note - I think this new pull request, #288 https://github.com/JeffersonLab/halld_sim/pull/288, still includes the proposed changes to the TOF smearing classes.
Yes, the two PRs are sequential; if you like them both you can merge them at once, or just the first one with its changes, and I can resubmit the second one. -Richard J.
Ok, just to summarise: Sean removed the switches from the TOF in mcsmear in PR #289. This supersedes Richard's PR #287. But we still need to include the fixes in PR #288 (minus the 287 parts). This requires Richard to resubmit the second bugfix. Sean made the necessary change to recon-2017_01-ver03-sim so we can tag a new version of the old 2017_01-ver03 recon code. Changes to newer versions are not required as this is already included (albeit coded slightly differently, e.g. https://github.com/JeffersonLab/halld_recon/blob/cf30b83b743e9216cb06ce8a0b28291447804d08/src/libraries/TOF/DTOFGeometry.h#L33)
So, as soon as we merge Richard's bug fix we can tag new versions and make new builds for MC production. All correct?
Thanks for the summary @s6pepaul. That's my understanding with one addition: the function DTOFGeometry::Get_CCDB_DirectoryName() needed to be added to both recon-2018_01-ver02-sim and recon-2018_08-ver02-sim, which @sdobbs did yesterday. So I believe all of the 2017 and 2018 sim branches of halld_recon need new tags and that could be done in parallel to Richard's resubmission of #288
Very good, my PR for the fix to the hits-merging bug has been updated; it should be ready to go as soon as it passes checks. -Richard Jones
Thanks @rjones30, #288 is merged. Looks like we're ready for new tagged versions.
With the patched halld_recon branch and the latest halld_sim master, here is a short test with 10M rho events from all runs in the 2017_01 period (similar to Alex's original test). Plotted is the average # of TOF points for each run in the sample, which doesn't show any outlier runs with very few hits, which was the original symptom of the TOF hits being missing from the signal MC. We should still check some larger samples produced by MCWrapper with the new tagged versions, but I'm closing this issue at this point.
Output files can be found at: /work/halld2/home/jrsteven/analysisGluexI/tof_test/recon-2017_01-ver03-sim/
Thanks a lot to everyone involved in tracking this down!
Closing the loop on this: @s6pepaul tagged new version sets which fix this issue that should be used for any new MC production https://mailman.jlab.org/pipermail/halld-offline/2023-June/008917.html
For analyzers who would like to check if this issue affects their existing simulation samples, you can use a ROOT macro to plot TOF hits vs run number from the standard monitoring_hists produced by MCWrapper. An example of such a macro can be found at /work/halld2/home/jrsteven/forTOF_test/plotTOF_MissingHits.C
Below are results from some recent samples which show the problem affects 2017 and 2018 periods and varies a lot from run to run. The 2018_08 period for these samples appears to be particularly affected, but this is a random effect from an uninitialized variable so no generalizations should be made.
Paths for monitoring_hists (replace in the macro with your project location):
/cache/halld/gluex_simulations/REQUESTED_MC/phi2pi_17_20230221025001pm/root/monitoring_hists/
/cache/halld/gluex_simulations/REQUESTED_MC/phi2pi_18_20230321014504pm/root/monitoring_hists/
/cache/halld/gluex_simulations/REQUESTED_MC/phi2pi_18l_20230321014210pm/root/monitoring_hists/
Looking at a recent MCWrapper project I noticed that certain run numbers had significantly fewer TOFPoints than others. The figure below shows the number of TOFPoints/event for 4 nearby runs, where the upper panels show a reasonable number of TOFPoints on average, but the bottom two panels show runs where there are very few TOFPoints reconstructed.
exampleTOFpoints_phi2pi_17_2023.pdf
The mean TOFPoints/event is plotted vs run number in the figure below, where it's clear there are large numbers of runs in the range 30614-30800 where the number of observed TOFPoints is well below the expectation.
tofPointsVsRun_phi2pi_17_2023.pdf
Finally, when you look at the matching of these TOFPoints to tracks for these 4 nearby runs, the two with significant numbers of TOFPoints show correlations with tracks as expected, but those with a low number of TOFPoints are not correlated with tracks (very few have distances between track and TOFPoint < 6 cm).
exampleTOFmatch_phi2pi_17_2023.pdf
This seems to indicate that the TOFPoints from random triggers are being included in the simulated events, but the TOF hits from the simulated signal events are not. This appears to occur in several other MCWrapper projects and also in simulations @aaust produced on the JLab farm, but typically for different ranges of run numbers in different samples. So the effect is not directly reproducible.
The plots shown above are from MCWrapper project #3093, which can be found at /cache/halld/gluex_simulations/REQUESTED_MC/phi2pi_17_20230221025001pm/root/monitoring_hists/
@aaust has a separate sample of rho events produced on the JLab farm which shows the same problem, for example in the file /work/halld2/home/aaustreg/Analysis/rho/simulation/sdme/ifarm_2017_ver03_ver50_keep_all/root/monitoring_hists/hd_root_gen_amp_030460_000.root
In this case the hdgeant4 and smeared output were saved (in the same hddm/ path as above) and the simulated TOF hits are included in the hdgeant4 file before smearing
/work/halld2/home/aaustreg//Analysis/rho/simulation/sdme/ifarm_2017_ver03_ver50_keep_all/hddm/gen_amp_030460_000_geant4.hddm
but are missing in the smeared file
/work/halld2/home/aaustreg//Analysis/rho/simulation/sdme/ifarm_2017_ver03_ver50_keep_all/hddm/gen_amp_030460_000_geant4_smeared.hddm
So, is there some mechanism for losing simulated TOF hits when we mix the simulated event with the random trigger hits in the mcsmear step? And if so, why is it run number dependent and not reproducible? Any suggestions for further tests or studies to understand this bug are appreciated.