Closed rlcee closed 9 months ago
@rlcee, yes could you send instructions for us to reproduce the error?
source /cvmfs/mu2e.opensciencegrid.org/setupmu2e-art.sh
git clone https://github.com/Mu2e/Offline
git clone https://github.com/Mu2e/Production
muse setup
muse build -j 20
scp mu2egpvm01:/mu2e/data/personal/rlc/shReco.fcl .
mu2e -c shReco.fcl -s /pnfs/mu2e/tape/phy-rec/rec/mu2e/TRK_VST-cosmics/vst002-r01/art/bf/56/rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art -o temp.art
the exe runs 1h and the error occurs at endjob
This is a wild guess and may be completely irrelevant. But just in case:
Ray: are these the events with very sparse event numbers? If so, is the problem alleviated if you use compactEventRanges when reading the files?
compactEventRanges is involved. I take the raw binary file, and using a custom source module, make X.art, which has 20% of its disk space used in the RootFileDB. I then run artcat (a fcl which just reads and writes events) with compactEventRanges set, and write Y.art, which has a small db. It is Y.art which is the input to this exe. I processed 25 other files like this and did not get this error, this is the only file. It is also the largest file I've processed, so the size is where I'd look first. The event numbers are very sparse, with only one run and subrun number.
Some working notes before running the job:
$ count_events --hr rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art
rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art 1 run, 1 subrun, 6820422 events, and 0 results.
Counted events successfully for 1 specified files.
$ file_info_dumper --db-to-file rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art
$ mv db rootfile.db
$ for table in `sqlite3 rootfile.db '.tables'`; do size=$(sqlite3 rootfile "select count(*) from ${table}"); echo $size $table; done
1 EventRanges
1 RunRangeSets_EventRanges
13 FileCatalog_metadata
1 SubRunRangeSets
91 ParameterSets
1 SubRunRangeSets_EventRanges
1 RunRangeSets
The input RootFileDB
is thus fairly modest in size.
$ file_info_dumper --range-of-validity full rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art
==============================
File: rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art
Representation: full (default)
------------------------------
Run: 100004
SubRun: 0 Event range: [616,243764655)
The range-of-validity is very compact.
The above job results in this error:
%MSG-s ArtException: PostEndJob 23-Nov-2021 15:13:56 CST ModuleEndJob
---- Configuration BEGIN
The process name strawHitReco was previously used on these products.
Please modify the configuration file to use a distinct process name.
---- Configuration END
%MSG
Art has completed and will exit with status 1.
Will modify the process name to test
.
Ah, sorry, the process name problems makes the point that I grabbed the wrong file. The rec.. file is the output, please use this as the input file: /pnfs/mu2e/tape/phy-raw/raw/mu2e/TRK_VST-cosmics/vst002/art/64/41/raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art sorry!
Thanks, @rlcee. Will rerun.
@rlcee, I am unable to reproduce the failure on our development machine:
$ art -c full.fcl -o test-compact.root
Conditions file: /scratch/knoepfel/mu2e-sqlite-bug/Offline/ConditionsService/data/conditions_01.txt
Conditions lines: 75 hash: 8230064490884710176
DbService TRACKER_VST v1_2 mu2e_conditions_prd
DbService textFiles:
GlobalConstants file: /scratch/knoepfel/mu2e-sqlite-bug/Offline/GlobalConstantsService/data/globalConstants_01.txt
GlobalConstants lines: 128 hash: 10648863269909984210
--------------- HepPDT Version 3.04.01 ---------------
found 534 particles
--------------- HepPDT Version 3.04.01 ---------------
found 275 particles
found 3116 particles
24-Nov-2021 10:49:25 CST Initiating request to open input file "raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art"
24-Nov-2021 10:49:25 CST Opened input file "raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art"
%MSG-w PROVENANCE: TriggerResultInserter:TriggerResults@Construction 24-Nov-2021 10:49:36 CST ModuleConstruction
Source parameter compactEventRanges was set to true: enabling compact event ranges
creates a history that can cause file concatenation problems if a given SubRun spans
multiple input files. Use with care.
%MSG
Geometry file: /scratch/knoepfel/mu2e-sqlite-bug/Offline/Mu2eG4/geom/geom_common.txt
Geometry lines: 18377 hash: 3636970994802260716
DbReader::fillValCache took 0.584268 s
DbEngine confirmed purpose and version TRACKER_VST 1/2/-1
DbEngine will use max extension 3
DbEngine found 3 groups and 24 IOV for 11 tables
DbEngine inclusive beginJob time was 0.584647 s
Begin processing the 1st record. run: 100004 subRun: 0 event: 616 at 24-Nov-2021 10:49:52 CST
24-Nov-2021 10:49:52 CST Opened output file with pattern "test-compact.root"
Begin processing the 2nd record. run: 100004 subRun: 0 event: 646 at 24-Nov-2021 10:49:52 CST
Begin processing the 3rd record. run: 100004 subRun: 0 event: 726 at 24-Nov-2021 10:49:52 CST
Begin processing the 5th record. run: 100004 subRun: 0 event: 769 at 24-Nov-2021 10:49:52 CST
Begin processing the 9th record. run: 100004 subRun: 0 event: 973 at 24-Nov-2021 10:49:52 CST
Begin processing the 17th record. run: 100004 subRun: 0 event: 1104 at 24-Nov-2021 10:49:52 CST
Begin processing the 33rd record. run: 100004 subRun: 0 event: 1605 at 24-Nov-2021 10:49:52 CST
Begin processing the 65th record. run: 100004 subRun: 0 event: 2652 at 24-Nov-2021 10:49:52 CST
Begin processing the 129th record. run: 100004 subRun: 0 event: 4544 at 24-Nov-2021 10:49:52 CST
Begin processing the 257th record. run: 100004 subRun: 0 event: 9110 at 24-Nov-2021 10:49:52 CST
Begin processing the 513th record. run: 100004 subRun: 0 event: 17903 at 24-Nov-2021 10:49:53 CST
Begin processing the 1025th record. run: 100004 subRun: 0 event: 35526 at 24-Nov-2021 10:49:53 CST
Begin processing the 2049th record. run: 100004 subRun: 0 event: 68003 at 24-Nov-2021 10:49:53 CST
Begin processing the 4097th record. run: 100004 subRun: 0 event: 138306 at 24-Nov-2021 10:49:54 CST
Begin processing the 8193rd record. run: 100004 subRun: 0 event: 274689 at 24-Nov-2021 10:49:56 CST
Begin processing the 16385th record. run: 100004 subRun: 0 event: 541156 at 24-Nov-2021 10:50:00 CST
Begin processing the 32769th record. run: 100004 subRun: 0 event: 1093169 at 24-Nov-2021 10:50:08 CST
Begin processing the 65537th record. run: 100004 subRun: 0 event: 2184761 at 24-Nov-2021 10:50:23 CST
Begin processing the 131073rd record. run: 100004 subRun: 0 event: 4354406 at 24-Nov-2021 10:50:54 CST
Begin processing the 262145th record. run: 100004 subRun: 0 event: 8703499 at 24-Nov-2021 10:51:55 CST
Begin processing the 524289th record. run: 100004 subRun: 0 event: 17506083 at 24-Nov-2021 10:53:58 CST
Begin processing the 1048577th record. run: 100004 subRun: 0 event: 34570216 at 24-Nov-2021 10:58:04 CST
Begin processing the 2097153rd record. run: 100004 subRun: 0 event: 70775089 at 24-Nov-2021 11:06:44 CST
Begin processing the 4194305th record. run: 100004 subRun: 0 event: 146507330 at 24-Nov-2021 11:23:17 CST
24-Nov-2021 11:44:16 CST Closed output file "test-compact.root"
24-Nov-2021 11:44:16 CST Closed input file "raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art"
================================================================================================================================
TimeTracker printout (sec) Min Avg Max Median RMS nEvts
================================================================================================================================
Full event 0.00029886 0.000435749 0.0723855 0.000354407 0.00044723 6820422
--------------------------------------------------------------------------------------------------------------------------------
source:RootInput(read) 4.6897e-05 5.40729e-05 0.000956523 5.1309e-05 1.14399e-05 6820422
p1:PBTFSD:ProtonBunchTimeFromStrawDigis 2.8465e-05 3.27951e-05 0.00461781 3.1308e-05 8.08617e-06 6820422
p1:makeSH:StrawHitReco 3.6555e-05 4.23785e-05 0.00464055 3.9857e-05 1.33639e-05 6820422
p1:makePH:CombineStrawHits 9.054e-06 1.05354e-05 0.00101552 1.0215e-05 1.78173e-06 6820422
p1:FlagBkgHits:FlagBkgHits 1.8437e-05 3.36267e-05 0.00054504 3.8628e-05 9.58938e-06 6820422
[art]:TriggerResults:TriggerResultInserter 4.88e-06 5.6965e-06 0.000244168 5.548e-06 7.46937e-07 6820422
end_path:outfile:RootOutput 3.342e-06 4.0227e-06 0.000242044 3.907e-06 6.19641e-07 6820422
end_path:outfile:RootOutput(write) 8.2478e-05 0.000162243 0.0719405 9.2108e-05 0.00042238 6820422
================================================================================================================================
DbEngine::endJob
Total time in reading DB: 1.21966 s
Total time waiting for locks: 3e-06 s
Total time in locks: 0.690771 s
cache memory : 500589 b
valcache memory: 20698 b
TrigReport ---------- Event summary -------------
TrigReport Events total = 6820422 passed = 6820422 failed = 0
TrigReport ---------- Trigger-path summary ------------
TrigReport Path ID Run Passed Failed Error Name
TrigReport 0 6820422 6820422 0 0 p1
TrigReport ---------- End-path summary ---------
TrigReport Run Success Error
TrigReport 6820422 6820422 0
TrigReport ---------- Modules in path: p1 ------------
TrigReport Path ID Visited Passed Failed Error Name
TrigReport 0 6820422 6820422 0 0 PBTFSD
TrigReport 0 6820422 6820422 0 0 makeSH
TrigReport 0 6820422 6820422 0 0 makePH
TrigReport 0 6820422 6820422 0 0 FlagBkgHits
TrigReport ---------- Modules in End-path ----------
TrigReport Run Success Error Name
TrigReport 6820422 6820422 0 outfile
TrigReport ---------- Module summary ------------
TrigReport Visited Run Passed Failed Error Name
TrigReport 6820422 6820422 6820422 0 0 FlagBkgHits
TrigReport 6820422 6820422 6820422 0 0 PBTFSD
TrigReport 6820422 6820422 6820422 0 0 makePH
TrigReport 6820422 6820422 6820422 0 0 makeSH
TrigReport 6820422 6820422 6820422 0 0 outfile
TimeReport ---------- Time summary [sec] -------
TimeReport CPU = 3783.421631 Real = 3787.711239
MemReport ---------- Memory summary [base-10 MB] ------
MemReport VmPeak = 2788.81 VmHWM = 2069.2
Art has completed and will exit with status 0.
See the attached configuration file, which I generated by typing:
art -c shReco.fcl raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art -o test.root --debug-config full.fcl
I then modified the source
configuration to use compactEventRanges: true
.
@rlcee, I also ran the same job without compact event ranges, and I also had no problems. Sometimes, SQLite does not work well with nfs systems. Which machine were you using when running your job?
It was running on mu2egpvm04. In the processing of these files I did hit conditions where other users were limiting resources. Here is the link to the node monitor at the time (11/19 17:15) and it looks OK to me. monitor. Should I try it again there on 04?
Hmm...if you don't mind trying it on 04 one more time, I'd appreciate it. If it's still a problem, then I'll move over to that machine for debugging.
Running the command on mu2egpvm04 today does reproduce the error.
Thanks, @rlcee. I will take a look when I get back from vacation.
@rlcee, I have reproduced the error on mu2egpvm04
when running over all 6M events. If I run the job with 3M events instead of all 6M, I do not have a problem. I am debugging right now.
@rlcee, I believe I understand what is happening. The problem is actually not with the RootFileDB
; the problem is with the TimeTracker
database usage. Important points:
TimeTracker
for the job (--no-timing
) or give the service an explicit database name to use (--timing-db test.db
).I will provide more information later--at the very least, there is a bug in how the exception is being reported (i.e. the context of RootOutput:outfile@EndJob
in the logged message is incorrect).
@rlcee, the problem is understood. As mentioned above, this is an exception triggered by the TimeTracker
service. Notice that the time-tracker summary is missing from the end-of-job summary:
...
Begin processing the 4194305th record. run: 100004 subRun: 0 event: 146507330 at 03-Dec-2021 12:49:22 CST
03-Dec-2021 13:21:49 CST Closed output file "test.root"
03-Dec-2021 13:21:49 CST Closed input file "/pnfs/mu2e/tape/phy-raw/raw/mu2e/TRK_VST-cosmics/vst002/art/64/41/raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art"
[[---> TimeTracker summary would go here <---]]
TrigReport ---------- Event summary -------------
TrigReport Events total = 6820422 passed = 6820422 failed = 0
TrigReport ---------- Trigger-path summary ------------
TrigReport Path ID Run Passed Failed Error Name
TrigReport 0 6820422 6820422 0 0 p1
TrigReport ---------- End-path summary ---------
TrigReport Run Success Error
TrigReport 6820422 6820422 0
...
%MSG-s ArtException: RootOutput:outfile@EndJob 03-Dec-2021 13:24:21 CST ModuleEndJob
---- SQLExecutionError BEGIN
database or disk is full
---- SQLExecutionError END
%MSG
Art has completed and will exit with status 1.
Ostensibly, this occurs due to a full disk. This is explained below.
Consider the configuration used for this job:
services.TimeTracker: { printSummary: true }
The TimeTracker
uses an SQLite database to record execution times of each module for each event. A summary of the execution times is generated by querying the database and calculating averages, maxima, minima, etc. For jobs with many events, this database becomes very large, and the query statements that calculate meaningful metrics for the summary create additional temporary tables that take up disk space.
For this particular configuration, no filename is provided. SQLite thus creates a temporary database file in the /var/tmp
subdirectory, which is removed after the last connection to the SQLite database has been closed. On mu2egpvm04
, the size of /var/tmp/
is 7.8 GB. Consider the growth of /var/tmp
throughout the course of the job:
Using df -h /var/tmp |
Used | Avail | Use% |
---|---|---|---|
Begin job | 3.5G | 4.0G | 47% |
Done processing events/before end job | 6.5G | 896M | 89% |
The size of the TimeTracker
database is thus 3.0G before end job, and during the end job stage, the TimeTracker
summary generation consumes the remaining 896M, thus ending the program with the above exception. To avoid the exception see https://github.com/art-framework-suite/art/issues/116#issuecomment-985730151.
As mentioned earlier:
[T]here is a bug in how the exception is being reported (i.e. the context of
RootOutput:outfile@EndJob
in the logged message is incorrect).
So that needs to be addressed--but that is a separate bug report. Could you fill one out for us? Specifically that the context of the logged message is incorrect.
TimeTracker
?The TimeTracker
was intended to provide rudimentary profiling information for identifying event- and module-processing bottlenecks. Recording 6M events' worth of profiling data has clearly brought it to its knees. Although we could conceive of ways to minimize the TimeTracker
footprint for jobs that only want the summary (and not the entire database file), it's not clear to me that this is the correct approach. We will discuss this issue at the next art stakeholder meeting.
Closing because the proximate issue is resolved. The exception message problem is documented in #118, and if additional discussion is required for where to place the SQLite database, we can have that at a separate time.
Received this error message
I'm sure I did not run out of disk space. The input and output file had 6.8M small events. The output file seems to be useable. It seems we hit a limit - I wonder what we hit and how to avoid it in the future. I can provide a setup to reproduce if needed.