Open PALoizeau opened 2 years ago
Thanks for the excellent find and the awesome documentation.
The missing timeslices obviously affect most of the 12 worker processes. On first analysis, they correlate perfectly with error messages "INFO: worker disconnected" in the flesctl logs.
There is a known glitch in the timeslice interface concerning the connection heartbeating. It seems that a worker disconnects if it sees no new timeslice within the heartbeat timeout interval. This was first discovered recently by Esteban. Although the worker subsequently reconnects automatically, timeslices built in the meantime are not stored. This behaviour needs further investigation.
In this run, the timeslice duration was 128 ms. With 12 workers, we expect a new timeslice for each worker on average every 12 × 128 ms = 1.536 s. The connection heartbeat timeout is currently set to 2 s. This explains why the setup "usually" works, but in situations where the timeslice buffer is actually used to smooth out the data rate, it seems plausible that the gap between two timeslices for a given worker exceeds the 2 s.
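As a rough illustration of the numbers above, the following sketch computes the average interval between timeslices seen by one worker and the remaining margin before the heartbeat timeout. It assumes the timeslices are distributed evenly (round-robin) over the workers; the function and constant names are hypothetical and not part of flesnet or flesctl.

```python
# Values quoted in this thread.
TIMESLICE_DURATION_S = 0.128   # timeslice duration in this run
N_WORKERS = 12                 # number of worker processes
HEARTBEAT_TIMEOUT_S = 2.0      # current connection heartbeat timeout


def mean_interval_per_worker(ts_duration_s: float, n_workers: int) -> float:
    """Average time between two timeslices delivered to the same worker.

    Assumption: timeslices are handed out evenly across all workers.
    """
    return ts_duration_s * n_workers


def timeout_margin(ts_duration_s: float, n_workers: int, timeout_s: float) -> float:
    """Extra delay a worker can tolerate, on average, before the timeout expires."""
    return timeout_s - mean_interval_per_worker(ts_duration_s, n_workers)


interval = mean_interval_per_worker(TIMESLICE_DURATION_S, N_WORKERS)
margin = timeout_margin(TIMESLICE_DURATION_S, N_WORKERS, HEARTBEAT_TIMEOUT_S)

print(f"mean interval per worker: {interval:.3f} s")
print(f"margin before timeout:    {margin:.3f} s")
```

Under these assumptions the margin is less than half a second, so even a modest stall in timeslice delivery while the buffer drains could push a worker past the timeout.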
For the time being, I will increase the timeout to circumvent the issue.
During the Nickel recording of May 25th 2022, the recording mode used was expected to distribute the timeslices (TS) in equal shares over all 12 disks in use. The archiving settings were the following (flesctl):
However, when running the BMon detector monitor software in offline mode over the full dataset with TS re-ordering and looking at the printout of the TS indices, I observed that a small proportion of the TS are missing. The printout for the last TS in each of the runs with target gives the following losses:
I then had a look at one of the runs with higher statistics to see which timeslices were missing and whether there was a pattern, and observed the following sequences:
Caveat: I have not yet had the time to check whether the timeslice header time also shows a gap or whether only the index is jumping. So I could imagine two possible explanations/issues:
Probably something for @cuveland