Open wesketchum opened 8 months ago
We lowered the TP thresholds (from 120 to 100 in the SimpleThreshold alg) and now see much higher TP rates, and these errors seem to have gone away.
We can add debug messages to help understand what is happening when the tpwriter generates these error messages, but my theory is that the rate of TPs is so low that the tp_datahandler doesn't get any for some number of seconds, and when it does get one, it sends out a TPSet that includes one or more TPs that are significantly stale.
To support this theory, I found that the nominal start and end times of TimeSlice 80 for run 24491 in the TPStream file were both 12:02:14 local time, whereas the error message about the problematic TP for this TimeSlice was emitted at 11:02:29 UTC, some 15 seconds later:
2024-Mar-11 11:02:29,716 ERROR [void dunedaq::dfmodules::TPStreamWriter::do_work(std::atomic<bool>&) at /tmp/root/spack-stage/spack-stage-dfmodules-v2.13.0-kwalxqo6zzx622yjpvwxldsqeqhwld35/spack-src/plugins/TPStreamWriter.cpp:216] A problem was encountered when writing TimeSlice number 80 in run 24491 DAQModule: tpswriter
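The delay quoted above can be checked with a few lines of Python. This is only a sketch: it assumes that "local time" at NP04 on that date was CET (UTC+1) and that the TimeSlice timestamp falls on the same day as the log entry, neither of which is stated explicitly in the thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: NP04 local time on 2024-03-11 is CET, i.e. UTC+1 (no DST yet).
CET = timezone(timedelta(hours=1))

# Nominal TimeSlice 80 time from the TPStream file (local time, date assumed).
timeslice_time = datetime(2024, 3, 11, 12, 2, 14, tzinfo=CET)

# Timestamp of the ERROR line from the tpwriter log (UTC).
error_time = datetime(2024, 3, 11, 11, 2, 29, 716000, tzinfo=timezone.utc)

# Aware datetimes subtract correctly across time zones.
delay = error_time - timeslice_time
print(f"error emitted {delay.total_seconds():.3f} s after the TimeSlice time")
```

Under those assumptions the error was emitted roughly 15.7 seconds after the nominal TimeSlice time, consistent with the "some 15 seconds later" estimate.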
We could look into increasing the time interval that the TPWriter uses to accumulate TPs before writing them out...
Thanks @bieryAtFnal for looking at this. If I understand the code, this is in part a consequence of the TPSet generation being entirely data-driven: there is no mechanism for 'closing' a TPSet until a new TP arrives that is outside the TP accumulation time window.
Unless we make significant changes to that design, I don't think there's much we can do to change this. We are unlikely to be able to increase the accumulation window enough to handle low rates without adversely affecting collection and performance. So the error on the tpwriter will remain, and I think it should be handled somehow.
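To make the data-driven behaviour concrete, here is a toy model in Python. It is not the readout code: `TPSetAccumulator`, `WINDOW_TICKS`, and the tick values are all hypothetical, and each TP is assumed to arrive at its own timestamp. The only thing it illustrates is that a set is emitted only when a later TP lands outside the current window, so the emission delay scales with the gap between TPs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

WINDOW_TICKS = 1000  # hypothetical accumulation window, arbitrary tick units


@dataclass
class TPSetAccumulator:
    """Toy data-driven accumulator: a set closes only when a later TP arrives."""
    window: int = WINDOW_TICKS
    window_start: Optional[int] = None
    pending: List[int] = field(default_factory=list)

    def add_tp(self, tp_time: int, now: int) -> Optional[Tuple[List[int], int]]:
        """Add a TP; return (closed_set, staleness) if it closes the open window."""
        closed = None
        if self.window_start is not None and tp_time >= self.window_start + self.window:
            # Staleness: time between the oldest TP in the set and its emission.
            closed = (self.pending, now - self.pending[0])
            self.pending = []
            self.window_start = None
        if self.window_start is None:
            # Open a new window aligned to a multiple of the window length.
            self.window_start = tp_time - tp_time % self.window
        self.pending.append(tp_time)
        return closed


def emitted_staleness(tp_times, window=WINDOW_TICKS):
    """Return the staleness of every set emitted for a stream of TP times."""
    acc = TPSetAccumulator(window=window)
    result = []
    for t in tp_times:
        closed = acc.add_tp(t, now=t)  # assumption: TP arrives at its own timestamp
        if closed is not None:
            result.append(closed[1])
    return result


high_rate = list(range(0, 5000, 100))  # one TP every 100 ticks
low_rate = [0, 15000, 30000]           # one TP every 15000 ticks

print("high rate:", emitted_staleness(high_rate))
print("low rate: ", emitted_staleness(low_rate))
```

With these toy numbers the high-rate stream emits each set one window length (1000 ticks) after its first TP, while the low-rate stream emits each set 15000 ticks after its only TP, which is the kind of stale TPSet the theory above describes.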
Tested today with. Still seeing some errors:
Unable to create the dataset "TR_Builder_0x00000001_TimeSliceHeader": (Links) Object already exists
which are ers::StdIssue, though the other errors directly from dfmodules/tpwriter are now warnings and look good.
Getting fairly frequent tpwriter errors when running with TP generation at NP04 in fddaq-v4.3.0
This seems to have come up when noise levels dropped and the TP rate consequently became very low: we see lots of warnings on tardy input sets and some data request timeouts from readout application TP buffers. Looking at the tpstream files, these datasets do exist, so I'm wondering if this error is coming from late-arriving data that ends up prompting an attempt at writing a timeslice header that already exists? Or something like that?
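The hypothesis above (late data triggering a second write of an existing TimeSlice header) can be illustrated with a small stand-in. This sketch does not use HDF5 or any dfmodules code: `ToyTimeSliceFile`, `write_timeslice`, and the dataset path layout are hypothetical, and the dictionary only mimics the rule that a dataset path can be created once.

```python
class ToyTimeSliceFile:
    """Stand-in for an output file in which dataset paths must be unique."""

    def __init__(self):
        self.datasets = {}

    def create_dataset(self, path, payload):
        # Mimic the "object already exists" failure on a duplicate path.
        if path in self.datasets:
            raise FileExistsError(
                f'Unable to create the dataset "{path}": object already exists'
            )
        self.datasets[path] = payload


def write_timeslice(outfile, number, payload):
    """Write a TimeSlice header; return False if that TimeSlice already exists."""
    path = f"TimeSlice{number}/TimeSliceHeader"  # hypothetical path layout
    try:
        outfile.create_dataset(path, payload)
    except FileExistsError as exc:
        # In the real writer this would presumably become a warning, not a crash.
        print(f"warning: skipping duplicate write ({exc})")
        return False
    return True


outfile = ToyTimeSliceFile()
first = write_timeslice(outfile, 80, b"on-time TPs")
second = write_timeslice(outfile, 80, b"stale TPs arriving later")
print(first, second)
```

If stale TPs are mapped onto a TimeSlice number that has already been written, the second call fails in the same way as the reported error, while the dataset from the first write remains in the file. That matches the observation that the datasets do exist in the tpstream files.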