art-framework-suite / art

The implementation of the art physics event processing framework

A bug in the TimeTracker service? #132

Closed pavel1murat closed 1 year ago

pavel1murat commented 1 year ago

Describe the bug

I'm using art v3_11_01 as a framework to run parameterized MC jobs with 50M events per job.

The job configuration has the TimeTracker service enabled.

After 8.9M events, the jobs consistently crash with the TimeTracker trying to insert a record into one of its tables (the arrow below marks the line in question):

    void
    TimeTracker::postEventProcessing(Event const&, ScheduleContext const sc)
    {
      auto const& d = data[key(sc.id())];
      auto const t = chrono::duration{now() - d.eventStart}.count();
      --> timeEventTable.insert(d.eventID.run(),
                                d.eventID.subRun(),
                                d.eventID.event(),
                                t);
    }

Here is the gdb printout of what is being inserted (all libraries are built in optimized mode):

    const art::TimeTracker::PerScheduleData & d @0x7f0002287fd8:
      {eventID = {subRun = {run = {run = 1}, subRun = 0}, event_ = 8900001},
       eventStart = {d = {r = 4596628376493865}},
       moduleStart = {d = {r = 0}}}
    const double t

To Reproduce

It is likely that any job with the TimeTracker enabled and 10M events requested will crash at the same point.

If this is a high-priority issue

No.

knoepfel commented 1 year ago

@pavel1murat, thank you for the report. By crash, do you mean an exception is thrown, a segmentation violation occurs, or ...? Without seeing the error itself and the job configuration you used, it's a little difficult to provide guidance. However, a (perhaps) similar issue occurred for Ray Culbertson (https://github.com/art-framework-suite/art/issues/116, and https://github.com/art-framework-suite/art/issues/116#issuecomment-985830394 in particular).

If you are not explicitly specifying a TimeTracker DB file, I suspect you are running into disk-size limitations for a temporary database. See https://www.sqlite.org/tempfiles.html#temporary_file_storage_locations for more details on how SQLite decides where to put temporary databases. If this is indeed the problem, then we can improve the documentation...as usual.
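
If it helps, explicitly pointing the TimeTracker at a database file in the job's .fcl configuration might look something like the sketch below; this assumes the dbOutput parameter set of recent art releases, and the file name is just an example (a relative path is resolved in the job's working directory):

    services.TimeTracker: {
      printSummary: true           # print the timing summary at end of job
      dbOutput: {
        filename: "timing.db"      # example name; DB is written here instead of a temporary location
        overwrite: true            # replace an existing file of the same name, if present
      }
    }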

pavel1murat commented 1 year ago

Hi Kyle, thanks!

a) I think you diagnosed the issue correctly. Perhaps the best default location for the TimeTracker SQLite DB would be the current directory of the job - and that is entirely under your control. In that case, a failed job would leave a huge .db file in the working directory, making the diagnostics straightforward.

b) When I did that - specified the location of the TimeTracker file explicitly in the .fcl configuration - I immediately saw that the created SQLite file was getting huge.

c) Turning the TimeTracker service off resolves the issue (a sketch of the .fcl override follows below), so simply documenting the issue should do.
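
For reference, the override I mean is something like the following (assuming the standard FHiCL @erase directive; the included file name is made up):

    # hypothetical base configuration that enables the TimeTracker
    #include "myJobConfig.fcl"

    services.TimeTracker: @erase   # drop the TimeTracker service entirely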

pavel1murat commented 1 year ago

And, yes - for production-quality software this is a bug: the software should be able to detect the limit, stop inserting new records, and print error diagnostics instead (at least once).

kutschke commented 1 year ago

@pavel1murat I agree with Kyle's assessment that this is primarily a user-education issue. I also agree that it would be good if art handled a full filesystem or an exceeded quota with a graceful message. Of course, if the sqlite file and the job log file are on the same disk, you will usually lose the error message.

Do you know if you were filling up /var/tmp, /usr/tmp or /tmp? If so, did you clean up?

knoepfel commented 1 year ago

Thank you for your comments, @pavel1murat and @kutschke. Responses below:

[Pasha] Perhaps the best default location for the TimeTracker SQLite DB would be the current directory of the job - and that is entirely under your control. In that case, a failed job would leave a huge .db file in the working directory, making the diagnostics straightforward.

Yes, this is within art's control, and it may make sense for cases where people need to debug what's happening. It would, however, also clutter up the working directory, which someone might have to deal with in (e.g.) a grid job or some other situation. This is a change we could discuss among the art stakeholders.

[Pasha] c) Turning the TimeTracker service off resolves the issue, so simply documenting the issue should do.

TimeTracker was intended to be used in situations where basic profiling of a framework job is necessary to identify bottlenecks. The fact that it has been used in production environments suggests, I guess, that its CPU/memory overhead is generally not noticeable (or noticed).

[Pasha] And, yes - for production-quality software this is a bug: the software should be able to detect the limit, stop inserting new records, and print error diagnostics instead (at least once).

I see this as a compliment (referring to our software as "production-quality"...or at least it should be) and also as a valid criticism. I think we can adjust the behavior without too much development while still meeting your needs.

[Rob] I also agree that it would be good if art handled a full filesystem or an exceeded quota with a graceful message. Of course, if the sqlite file and the job log file are on the same disk, you will usually lose the error message.

For various reasons (efficiency, platform portability, etc.), art cannot generally inspect quota usage while processing events. However, art should be able to detect a failed DB insertion and then disable subsequent DB interactions (while printing a reasonable warning message). Note that when a DB insertion fails, the TimeTracker summary will not be printed—forming the summary requires the creation of temporary SQLite tables, and that requires disk space...which is used up by that point.
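
Roughly, the behavior I have in mind is sketched below. It is not art's actual implementation (the class and its names are made up, and it uses the raw sqlite3 C API rather than art's own DB wrappers), but it shows the idea: after the first failed insertion, stop touching the database and warn exactly once.

    #include <sqlite3.h>
    #include <iostream>
    #include <string>

    // Hypothetical wrapper: disable database writes after the first failed
    // insertion (e.g. SQLITE_FULL when the disk holding the DB fills up) and
    // emit a single warning instead of aborting the job.
    class GuardedInserter {
    public:
      explicit GuardedInserter(sqlite3* db) : db_{db} {}

      void insert(std::string const& sql)
      {
        if (!enabled_) {
          return;                        // DB interactions already disabled
        }
        char* errmsg = nullptr;
        int const rc = sqlite3_exec(db_, sql.c_str(), nullptr, nullptr, &errmsg);
        if (rc != SQLITE_OK) {
          enabled_ = false;              // skip all subsequent DB interactions
          std::cerr << "TimeTracker: database insertion failed ("
                    << (errmsg ? errmsg : "unknown error")
                    << "); no further timing records will be written.\n";
        }
        sqlite3_free(errmsg);            // safe to call on a null pointer
      }

    private:
      sqlite3* db_;
      bool enabled_{true};
    };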

TL;DR: There are a few action items here. We will: