art-framework-suite / art

The implementation of the art physics event processing framework
Other
2 stars 7 forks source link

SQL database full error #116

Closed rlcee closed 9 months ago

rlcee commented 2 years ago

Received this error message

%MSG-s ArtException:  RootOutput:outfile@EndJob 19-Nov-2021 17:16:26 CST  ModuleEndJob
---- SQLExecutionError BEGIN
  database or disk is full
---- SQLExecutionError END
%MSG
Art has completed and will exit with status 1.

I'm sure I did not run out of disk space. The input and output file had 6.8M small events. The output file seems to be useable. It seems we hit a limit - I wonder what we hit and how to avoid it in the future. I can provide a setup to reproduce if needed.

knoepfel commented 2 years ago

@rlcee, yes could you send instructions for us to reproduce the error?

rlcee commented 2 years ago
source /cvmfs/mu2e.opensciencegrid.org/setupmu2e-art.sh
git clone https://github.com/Mu2e/Offline
git clone https://github.com/Mu2e/Production
muse setup
muse build -j 20
scp mu2egpvm01:/mu2e/data/personal/rlc/shReco.fcl .
mu2e -c shReco.fcl -s /pnfs/mu2e/tape/phy-rec/rec/mu2e/TRK_VST-cosmics/vst002-r01/art/bf/56/rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art -o temp.art
the exe runs 1h and the error occurs at endjob
kutschke commented 2 years ago

This is a wild guess and may be completely irrelevant. But just in case:

Ray: are these the events with very sparse event numbers? If so, is the problem alleviated if you use compactEventRanges when reading the files?

rlcee commented 2 years ago

compactEventRanges is involved. I take the raw binary file, and using a custom source module, make X.art, which has 20% of its disk space used in the RootFileDB. I then run artcat (a fcl which just reads and writes events) with compactEventRanges set, and write Y.art, which has a small db. It is Y.art which is the input to this exe. I processed 25 other files like this and did not get this error, this is the only file. It is also the largest file I've processed, so the size is where I'd look first. The event numbers are very sparse, with only one run and subrun number.

knoepfel commented 2 years ago

Some working notes before running the job:

Number of events in input file

$ count_events --hr rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art 
rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art 1 run, 1 subrun, 6820422 events, and 0 results.
Counted events successfully for 1 specified files.

Number of elements in the RootFileDB tables

$ file_info_dumper --db-to-file rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art
$ mv db rootfile.db
$ for table in `sqlite3 rootfile.db '.tables'`; do size=$(sqlite3 rootfile "select count(*) from ${table}"); echo $size $table; done
1 EventRanges
1 RunRangeSets_EventRanges
13 FileCatalog_metadata
1 SubRunRangeSets
91 ParameterSets
1 SubRunRangeSets_EventRanges
1 RunRangeSets

The input RootFileDB is thus fairly modest in size.

Range of validity of input file

$ file_info_dumper --range-of-validity full rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art 
==============================
File: rec.mu2e.TRK_VST-cosmics.vst002-r01.100004_00000000.art
Representation: full (default)
------------------------------
 Run: 100004
  SubRun: 0 Event range: [616,243764655)

The range-of-validity is very compact.

knoepfel commented 2 years ago

The above job results in this error:

%MSG-s ArtException:  PostEndJob 23-Nov-2021 15:13:56 CST ModuleEndJob
---- Configuration BEGIN
  The process name strawHitReco was previously used on these products.
  Please modify the configuration file to use a distinct process name.
---- Configuration END
%MSG
Art has completed and will exit with status 1.

Will modify the process name to test.

rlcee commented 2 years ago

Ah, sorry, the process name problems makes the point that I grabbed the wrong file. The rec.. file is the output, please use this as the input file: /pnfs/mu2e/tape/phy-raw/raw/mu2e/TRK_VST-cosmics/vst002/art/64/41/raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art sorry!

knoepfel commented 2 years ago

Thanks, @rlcee. Will rerun.

knoepfel commented 2 years ago

@rlcee, I am unable to reproduce the failure on our development machine:

$ art -c full.fcl -o test-compact.root                                                                                                                                                                
Conditions file: /scratch/knoepfel/mu2e-sqlite-bug/Offline/ConditionsService/data/conditions_01.txt                                                                                                                                           
Conditions lines: 75  hash: 8230064490884710176                                                                                                                                                                                               
DbService  TRACKER_VST v1_2   mu2e_conditions_prd                                                                                                                                                                                             
DbService  textFiles:                                                                                                                                                                                                                         
GlobalConstants file: /scratch/knoepfel/mu2e-sqlite-bug/Offline/GlobalConstantsService/data/globalConstants_01.txt                                                                                                                            
GlobalConstants lines: 128  hash: 10648863269909984210                                                                                                                                                                                        
 --------------- HepPDT Version 3.04.01 ---------------                                                                                                                                                                                       
found 534 particles                                                                                                                                                                                                                           
 --------------- HepPDT Version 3.04.01 ---------------                                                                                                                                                                                       
found 275 particles                                                                                                                                                                                                                           
found 3116 particles                                                                                                                                                                                                                          
24-Nov-2021 10:49:25 CST  Initiating request to open input file "raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art"                                                                                                                         
24-Nov-2021 10:49:25 CST  Opened input file "raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art"                                                                                                                                             
%MSG-w PROVENANCE:  TriggerResultInserter:TriggerResults@Construction  24-Nov-2021 10:49:36 CST ModuleConstruction                                                                                                                            
Source parameter compactEventRanges was set to true: enabling compact event ranges                                                                                                                                                            
creates a history that can cause file concatenation problems if a given SubRun spans                                                                                                                                                          
multiple input files.  Use with care.                                                                                                                                                                                                         

%MSG                                                                                                                                                                                                                                          
Geometry file: /scratch/knoepfel/mu2e-sqlite-bug/Offline/Mu2eG4/geom/geom_common.txt                                                                                                                                                          
Geometry lines: 18377  hash: 3636970994802260716                                                                                                                                                                                              
DbReader::fillValCache took 0.584268 s                                                                                                                                                                                                        
DbEngine confirmed purpose and version TRACKER_VST 1/2/-1                                                                                                                                                                                     
DbEngine will use max extension 3                                                                                                                                                                                                             
DbEngine found 3 groups and 24 IOV for 11 tables                                                                                                                                                                                              
DbEngine inclusive beginJob time was 0.584647 s                                                                                                                                                                                               
Begin processing the 1st record. run: 100004 subRun: 0 event: 616 at 24-Nov-2021 10:49:52 CST                                                                                                                                                 
24-Nov-2021 10:49:52 CST  Opened output file with pattern "test-compact.root"                                                                                                                                                                 
Begin processing the 2nd record. run: 100004 subRun: 0 event: 646 at 24-Nov-2021 10:49:52 CST                                                                                                                                                 
Begin processing the 3rd record. run: 100004 subRun: 0 event: 726 at 24-Nov-2021 10:49:52 CST                                                                                                                                                 
Begin processing the 5th record. run: 100004 subRun: 0 event: 769 at 24-Nov-2021 10:49:52 CST                                                                                                                                                 
Begin processing the 9th record. run: 100004 subRun: 0 event: 973 at 24-Nov-2021 10:49:52 CST                                                                                                                                                 
Begin processing the 17th record. run: 100004 subRun: 0 event: 1104 at 24-Nov-2021 10:49:52 CST                                                                                                                                               
Begin processing the 33rd record. run: 100004 subRun: 0 event: 1605 at 24-Nov-2021 10:49:52 CST                                                                                                                                               
Begin processing the 65th record. run: 100004 subRun: 0 event: 2652 at 24-Nov-2021 10:49:52 CST                                                                                                                                               
Begin processing the 129th record. run: 100004 subRun: 0 event: 4544 at 24-Nov-2021 10:49:52 CST                                                                                                                                              
Begin processing the 257th record. run: 100004 subRun: 0 event: 9110 at 24-Nov-2021 10:49:52 CST                                                                                                                                              
Begin processing the 513th record. run: 100004 subRun: 0 event: 17903 at 24-Nov-2021 10:49:53 CST                                                                                                                                             
Begin processing the 1025th record. run: 100004 subRun: 0 event: 35526 at 24-Nov-2021 10:49:53 CST                                                                                                                                            
Begin processing the 2049th record. run: 100004 subRun: 0 event: 68003 at 24-Nov-2021 10:49:53 CST                                                                                                                                            
Begin processing the 4097th record. run: 100004 subRun: 0 event: 138306 at 24-Nov-2021 10:49:54 CST                                                                                                                                           
Begin processing the 8193rd record. run: 100004 subRun: 0 event: 274689 at 24-Nov-2021 10:49:56 CST                                                                                                                                           
Begin processing the 16385th record. run: 100004 subRun: 0 event: 541156 at 24-Nov-2021 10:50:00 CST                                                                                                                                          
Begin processing the 32769th record. run: 100004 subRun: 0 event: 1093169 at 24-Nov-2021 10:50:08 CST
Begin processing the 65537th record. run: 100004 subRun: 0 event: 2184761 at 24-Nov-2021 10:50:23 CST
Begin processing the 131073rd record. run: 100004 subRun: 0 event: 4354406 at 24-Nov-2021 10:50:54 CST
Begin processing the 262145th record. run: 100004 subRun: 0 event: 8703499 at 24-Nov-2021 10:51:55 CST
Begin processing the 524289th record. run: 100004 subRun: 0 event: 17506083 at 24-Nov-2021 10:53:58 CST
Begin processing the 1048577th record. run: 100004 subRun: 0 event: 34570216 at 24-Nov-2021 10:58:04 CST
Begin processing the 2097153rd record. run: 100004 subRun: 0 event: 70775089 at 24-Nov-2021 11:06:44 CST
Begin processing the 4194305th record. run: 100004 subRun: 0 event: 146507330 at 24-Nov-2021 11:23:17 CST
24-Nov-2021 11:44:16 CST  Closed output file "test-compact.root"
24-Nov-2021 11:44:16 CST  Closed input file "raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art"

================================================================================================================================
TimeTracker printout (sec)                        Min           Avg           Max         Median          RMS         nEvts   
================================================================================================================================
Full event                                    0.00029886    0.000435749    0.0723855    0.000354407   0.00044723     6820422  
--------------------------------------------------------------------------------------------------------------------------------
source:RootInput(read)                        4.6897e-05    5.40729e-05   0.000956523   5.1309e-05    1.14399e-05    6820422  
p1:PBTFSD:ProtonBunchTimeFromStrawDigis       2.8465e-05    3.27951e-05   0.00461781    3.1308e-05    8.08617e-06    6820422  
p1:makeSH:StrawHitReco                        3.6555e-05    4.23785e-05   0.00464055    3.9857e-05    1.33639e-05    6820422  
p1:makePH:CombineStrawHits                     9.054e-06    1.05354e-05   0.00101552    1.0215e-05    1.78173e-06    6820422  
p1:FlagBkgHits:FlagBkgHits                    1.8437e-05    3.36267e-05   0.00054504    3.8628e-05    9.58938e-06    6820422  
[art]:TriggerResults:TriggerResultInserter     4.88e-06     5.6965e-06    0.000244168    5.548e-06    7.46937e-07    6820422  
end_path:outfile:RootOutput                    3.342e-06    4.0227e-06    0.000242044    3.907e-06    6.19641e-07    6820422  
end_path:outfile:RootOutput(write)            8.2478e-05    0.000162243    0.0719405    9.2108e-05    0.00042238     6820422  
================================================================================================================================
DbEngine::endJob
    Total time in reading DB: 1.21966 s
    Total time waiting for locks: 3e-06 s
    Total time in locks: 0.690771 s
    cache memory   : 500589 b
    valcache memory: 20698 b

TrigReport ---------- Event summary -------------
TrigReport Events total = 6820422 passed = 6820422 failed = 0

TrigReport ---------- Trigger-path summary ------------
TrigReport    Path ID        Run     Passed     Failed      Error Name
TrigReport          0    6820422    6820422          0          0 p1

TrigReport ---------- End-path summary ---------
TrigReport        Run    Success      Error
TrigReport    6820422    6820422          0

TrigReport ---------- Modules in path: p1 ------------
TrigReport    Path ID    Visited     Passed     Failed      Error Name
TrigReport          0    6820422    6820422          0          0 PBTFSD
TrigReport          0    6820422    6820422          0          0 makeSH
TrigReport          0    6820422    6820422          0          0 makePH
TrigReport          0    6820422    6820422          0          0 FlagBkgHits

TrigReport ---------- Modules in End-path ----------
TrigReport        Run    Success      Error Name
TrigReport    6820422    6820422          0 outfile

TrigReport ---------- Module summary ------------
TrigReport    Visited        Run     Passed     Failed      Error Name
TrigReport    6820422    6820422    6820422          0          0 FlagBkgHits
TrigReport    6820422    6820422    6820422          0          0 PBTFSD
TrigReport    6820422    6820422    6820422          0          0 makePH
TrigReport    6820422    6820422    6820422          0          0 makeSH
TrigReport    6820422    6820422    6820422          0          0 outfile

TimeReport ---------- Time summary [sec] -------
TimeReport CPU = 3783.421631 Real = 3787.711239

MemReport  ---------- Memory summary [base-10 MB] ------
MemReport  VmPeak = 2788.81 VmHWM = 2069.2

Art has completed and will exit with status 0.

See the attached configuration file, which I generated by typing:

art -c shReco.fcl raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art -o test.root --debug-config full.fcl

I then modified the source configuration to use compactEventRanges: true.

knoepfel commented 2 years ago

@rlcee, I also ran the same job without compact event ranges, and I also had no problems. Sometimes, SQLite does not work well with nfs systems. Which machine were you using when running your job?

rlcee commented 2 years ago

It was running on mu2egpvm04. In the processing of these files I did hit conditions where other users were limiting resources. Here is the link to the node monitor at the time (11/19 17:15) and it looks OK to me. monitor. Should I try it again there on 04?

knoepfel commented 2 years ago

Hmm...if you don't mind trying it on 04 one more time, I'd appreciate it. If it's still a problem, then I'll move over to that machine for debugging.

rlcee commented 2 years ago

Running the command on mu2egpvm04 today does reproduce the error.

knoepfel commented 2 years ago

Thanks, @rlcee. I will take a look when I get back from vacation.

knoepfel commented 2 years ago

@rlcee, I have reproduced the error on mu2egpvm04 when running over all 6M events. If I run the job with 3M events instead of all 6M, I do not have a problem. I am debugging right now.

knoepfel commented 2 years ago

@rlcee, I believe I understand what is happening. The problem is actually not with the RootFileDB; the problem is with the TimeTracker database usage. Important points:

I will provide more information later--at the very least, there is a bug in how the exception is being reported (i.e. the context of RootOutput:outfile@EndJob in the logged message is incorrect).

knoepfel commented 2 years ago

@rlcee, the problem is understood. As mentioned above, this is an exception triggered by the TimeTracker service. Notice that the time-tracker summary is missing from the end-of-job summary:

...
Begin processing the 4194305th record. run: 100004 subRun: 0 event: 146507330 at 03-Dec-2021 12:49:22 CST
03-Dec-2021 13:21:49 CST  Closed output file "test.root"
03-Dec-2021 13:21:49 CST  Closed input file "/pnfs/mu2e/tape/phy-raw/raw/mu2e/TRK_VST-cosmics/vst002/art/64/41/raw.mu2e.TRK_VST-cosmics.vst002.100004_00000000.art"

[[---> TimeTracker summary would go here <---]]

TrigReport ---------- Event summary -------------
TrigReport Events total = 6820422 passed = 6820422 failed = 0

TrigReport ---------- Trigger-path summary ------------
TrigReport    Path ID        Run     Passed     Failed      Error Name
TrigReport          0    6820422    6820422          0          0 p1

TrigReport ---------- End-path summary ---------
TrigReport        Run    Success      Error
TrigReport    6820422    6820422          0

...

%MSG-s ArtException:  RootOutput:outfile@EndJob 03-Dec-2021 13:24:21 CST  ModuleEndJob
---- SQLExecutionError BEGIN
  database or disk is full
---- SQLExecutionError END
%MSG
Art has completed and will exit with status 1.

Ostensibly, this occurs due to a full disk. This is explained below.

TimeTracker behavior

Consider the configuration used for this job:

services.TimeTracker: {  printSummary: true }

The TimeTracker uses an SQLite database to record execution times of each module for each event. A summary of the execution times is generated by querying the database and calculating averages, maxima, minima, etc. For jobs with many events, this database becomes very large, and the query statements that calculate meaningful metrics for the summary create additional temporary tables that take up disk space.

For this particular configuration, no filename is provided. SQLite thus creates a temporary database file in the /var/tmp subdirectory, which is removed after the last connection to the SQLite database has been closed. On mu2egpvm04, the size of /var/tmp/ is 7.8 GB. Consider the growth of /var/tmp throughout the course of the job:

Using df -h /var/tmp Used Avail Use%
Begin job 3.5G 4.0G 47%
Done processing events/before end job 6.5G 896M 89%

The size of the TimeTracker database is thus 3.0G before end job, and during the end job stage, the TimeTracker summary generation consumes the remaining 896M, thus ending the program with the above exception. To avoid the exception see https://github.com/art-framework-suite/art/issues/116#issuecomment-985730151.

Action items

Exception reporting

As mentioned earlier:

[T]here is a bug in how the exception is being reported (i.e. the context of RootOutput:outfile@EndJob in the logged message is incorrect).

So that needs to be addressed--but that is a separate bug report. Could you fill one out for us? Specifically that the context of the logged message is incorrect.

What should be the expected behavior of TimeTracker?

The TimeTracker was intended to provide rudimentary profiling information for identifying event- and module-processing bottlenecks. Recording 6M events' worth of profiling data has clearly brought it to its knees. Although we could conceive of ways to minimize the TimeTracker footprint for jobs that only want the summary (and not the entire database file), it's not clear to me that this is the correct approach. We will discuss this issue at the next art stakeholder meeting.

knoepfel commented 9 months ago

Closing because the proximate issue is resolved. The exception message problem is documented in #118, and if additional discussion is required for where to place the SQLite database, we can have that at a separate time.