DUNE-DAQ / ehn1-operations-issues

Non-code repo used specifically for keeping track of DAQ operations-related issues

tpwriter errors on TimeSliceHeader dataset creation: "Object already exists" #30

Open wesketchum opened 7 months ago

wesketchum commented 7 months ago

Getting fairly frequent tpwriter errors when running with TP generation at NP04 in fddaq-v4.3.0

Unable to create the dataset "TR_Builder_0x00000001_TimeSliceHeader": (Links) Object already exists

This seems to have come up when noise levels dropped and the TP rate therefore became very low: we see many warnings about tardy input sets and some data request timeouts from the readout application TP buffers. Looking at the tpstream files, these datasets do exist, so I'm wondering whether this error comes from late-arriving data that prompts an attempt to write a timeslice header that already exists? Or something like that?

wesketchum commented 7 months ago

We lowered the TP threshold (from 120 to 100 in the SimpleThreshold alg) and now see much higher TP rates, and these errors seem to have gone away.

bieryAtFnal commented 7 months ago

We can add debug messages to help understand what is happening when the tpwriter generates these error messages, but my theory is that the rate of TPs is so low that the tp_datahandler doesn't get any for some number of seconds, and when it does get one, it sends out a TPSet that includes one or more TPs that are significantly stale.

To support this theory, I found that the nominal start and end times of TimeSlice 80 for run 24491 in the TPStream file were 12:02:14 and 12:02:14 local time (i.e. 11:02:14 UTC), whereas the error message about the problematic TP for this TimeSlice was emitted at 11:02:29 UTC, some 15 seconds later:

2024-Mar-11 11:02:29,716 ERROR [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.13.0-kwalxqo6zzx622yjpvwxldsqeqhwld35/spack-src/plugins/TPStreamWriter.cpp:216] A problem was encountered when writing TimeSlice number 80 in run 24491 DAQModule: tpswriter

We could look into increasing the time interval that the TPWriter uses to accumulate TPs before writing them out...
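
The theory above can be illustrated with a toy simulation. This is only a sketch, not the actual tp_datahandler code: the `TPSet` class, the `accumulate` and `latencies` helpers, the 1000 ms window, and the assumption that a TP arrives at its own timestamp are all hypothetical. It shows how far behind its nominal end time a window gets emitted when the next TP is the only thing that can close it.

```python
from dataclasses import dataclass, field


@dataclass
class TPSet:
    """Hypothetical stand-in for a TPSet: a nominal time window plus its TPs."""
    start: int                      # nominal window start (ms)
    end: int                        # nominal window end (ms)
    tps: list = field(default_factory=list)


def accumulate(tp_times, window=1000):
    """Data-driven accumulation: yield (emit_time, TPSet) pairs.

    A set is closed only when a TP arrives whose time lies outside the
    current window. There is no timer, so emit_time is simply the time
    of that next TP (assumed here to equal its arrival time).
    """
    current = None
    for t in tp_times:
        if current is not None and t >= current.end:
            yield t, current
            current = None
        if current is None:
            start = (t // window) * window
            current = TPSet(start, start + window)
        current.tps.append(t)
    # The last open set is never emitted unless another TP shows up.


def latencies(tp_times, window=1000):
    """How long after its nominal end each set was actually emitted (ms)."""
    return [emit - tpset.end for emit, tpset in accumulate(tp_times, window)]


high_rate = list(range(0, 5000, 10))     # one TP every 10 ms
low_rate = [200, 400, 15900, 16100]      # a ~15 s gap between TPs

print(latencies(high_rate))   # every window closes as soon as it ends
print(latencies(low_rate))    # the first window is emitted ~15 s late
```

In the low-rate case the first window is emitted 14900 ms after its nominal end, which is the same order as the ~15 s gap seen for TimeSlice 80.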

wesketchum commented 7 months ago

Thanks @bieryAtFnal for looking at this. If I understand the code, this is in part a consequence of the TPSet generation being entirely data-driven: there's no mechanism for 'closing' a TPSet until a new TP arrives that is outside the TP accumulation time window.

Unless we make significant changes to that design, I don't think there's much of a way to change this. We are unlikely to be able to increase the accumulation window enough to handle low rates without adversely affecting the collection/performance. So the error on the tpwriter will remain, and I think it should somehow be handled?
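
One possible shape for "handling" it, as a rough sketch only: track which TimeSlice headers have already been written and turn a second attempt into a counted warning instead of an error. Everything here is hypothetical (the `DuplicateAwareWriter` class, its `write_timeslice` method, and the in-memory dict standing in for the HDF5 file); it is not the dfmodules or HDF5 API.

```python
import logging

log = logging.getLogger("tpwriter_sketch")


class DuplicateAwareWriter:
    """Hypothetical stand-in for a TimeSlice writer; not the dfmodules API."""

    def __init__(self):
        self.datasets = {}      # (timeslice number, source) -> header payload
        self.duplicates = 0     # late attempts that were skipped

    def write_timeslice(self, number, header, source="TR_Builder_0x00000001"):
        """Store the header once; warn and skip if it was already written."""
        key = (number, source)
        if key in self.datasets:
            # Late-arriving data for a slice whose header already exists:
            # keep the first header, count the event, and report a warning.
            self.duplicates += 1
            log.warning("TimeSliceHeader for slice %d (%s) already exists; "
                        "skipping late write", number, source)
            return False
        self.datasets[key] = header
        return True


writer = DuplicateAwareWriter()
first = writer.write_timeslice(80, b"header")
late = writer.write_timeslice(80, b"late header")
```

Whether the late TPs themselves should be dropped, appended to the existing slice, or written elsewhere is a separate design question that this sketch does not address.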

wesketchum commented 7 months ago

Tested today with. Still seeing some errors:

Unable to create the dataset "TR_Builder_0x00000001_TimeSliceHeader": (Links) Object already exists

These are reported as ers::StdIssue; the other errors coming directly from dfmodules/tpwriter are now warnings and look good.