cbm-fles / flesnet

CBM FLES Timeslice Building

Missing timeslices or skipped indices during the parallel acquisition with 2 nodes and 6 equal share disks #102

Open PALoizeau opened 2 years ago

PALoizeau commented 2 years ago

During the Nickel recording of May 25th, 2022, the recording mode used was expected to result in an equal share of timeslices (TS) distributed over all 12 disks in use. The archiving settings (flesctl) were the following:

  node8|node9)
    HOSTADDR=${1}ib2

    ######################################################################
    # stride/offset parameters here are selected to balance the write load
    # to the mass store devices (5x HDD + 1x SSD) for high capacity

    # analyze internal format and record all timeslices
    TSCLIENT_PARAM[0]="-i shm:%shm_id%?stride=6&offset=0 -a -m -o /storage/6/data/%run_id%_${1}_0_%n.tsa --output-archive-bytes 10000000000"
    TSCLIENT_PARAM[1]="-i shm:%shm_id%?stride=6&offset=1 -a -m -o /storage/1/data/%run_id%_${1}_1_%n.tsa --output-archive-bytes 10000000000"
    TSCLIENT_PARAM[2]="-i shm:%shm_id%?stride=6&offset=2 -a -m -o /storage/2/data/%run_id%_${1}_2_%n.tsa --output-archive-bytes 10000000000"
    TSCLIENT_PARAM[3]="-i shm:%shm_id%?stride=6&offset=3 -a -m -o /storage/3/data/%run_id%_${1}_3_%n.tsa --output-archive-bytes 10000000000"
    TSCLIENT_PARAM[4]="-i shm:%shm_id%?stride=6&offset=4 -a -m -o /storage/4/data/%run_id%_${1}_4_%n.tsa --output-archive-bytes 10000000000"
    TSCLIENT_PARAM[5]="-i shm:%shm_id%?stride=6&offset=5 -a -m -o /storage/5/data/%run_id%_${1}_5_%n.tsa --output-archive-bytes 10000000000"

    # publish a subsample of timeslices
    TSCLIENT_PARAM[6]="-i shm:%shm_id%?queue=skip -P"

    # analyze internal format for all timeslices
    # => To see a single line of analysis per node, use this one and remove the "-a -m" in the recorder lines
    TSCLIENT_PARAM[7]="-i shm:%shm_id% -a -m"
    ;;
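
For illustration, the intended distribution can be sketched as a mapping from a global timeslice index to a node and a disk. The following sketch assumes (this is an assumption about the scheduling, not something stated in the config) that timeslices alternate between node8 and node9 and that stride=6/offset=k hands every 6th locally built timeslice to tsclient k; under this assumption each disk receives every 12th timeslice.

    // Illustrative sketch (not flesnet code): how a global timeslice index
    // would map to a build node and a tsclient/disk, assuming round-robin
    // distribution over 2 nodes and stride=6/offset interleaving per node.
    #include <cstdint>
    #include <iostream>
    #include <string>

    struct Destination {
        std::string node;  // "node8" or "node9" (assumed round-robin)
        unsigned client;   // tsclient offset 0..5, i.e. one of the six disks
    };

    Destination destination_for(uint64_t ts_index) {
        Destination d;
        // Assumption: even indices go to node8, odd indices to node9.
        d.node = (ts_index % 2 == 0) ? "node8" : "node9";
        // Assumption: within a node, the n-th locally built timeslice goes to
        // the tsclient with offset == n % 6 (stride=6).
        d.client = static_cast<unsigned>((ts_index / 2) % 6);
        return d;
    }

    int main() {
        // Under these assumptions, each disk receives every 12th timeslice,
        // which matches the step of 12 between missing indices in run 2395.
        for (uint64_t i = 0; i < 24; ++i) {
            const Destination d = destination_for(i);
            std::cout << "TS " << i << " -> " << d.node
                      << " tsclient offset " << d.client << '\n';
        }
        return 0;
    }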

When running the BMon detector monitor software in offline mode over the full dataset with TS re-ordering and looking at the printout of the TS indices, I observed, however, that a small proportion of the TS are missing. The printout for the last TS in each of the runs with target gives the following losses:

                                                TS proc     TS index from header
                                                   |            |
2389: [INFO] SourceTsArchive: Reading time slice 27651 (index 27674) 
      => 23 Missing (~8 / 10000)
2390: [INFO] SourceTsArchive: Reading time slice 15507 (index 15513)
      => 6 Missing (~4 / 10000)
2391: [INFO] SourceTsArchive: Reading time slice 54711 (index 54756)
      => 45 Missing (~8 / 10000)
2393: [INFO] SourceTsArchive: Reading time slice 745 (index 747)
      => 2 Missing (~27 / 10000)
2394: [INFO] SourceTsArchive: Reading time slice 14375 (index 14399)
      => 24 Missing (~17 / 10000)
2395: [INFO] SourceTsArchive: Reading time slice 54199 (index 54239)
      => 40 Missing (~7 / 10000)
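
For reference, each "Missing" count above is simply the difference between the index read from the timeslice header and the number of timeslices actually processed, and the quoted rate is that difference per 10000 indices. A minimal sketch of this bookkeeping, with the values copied from the printout above:

    // Minimal sketch: derive the "Missing" count and the per-10000 rate
    // from the processed-timeslice count and the header index of the last TS.
    #include <cstdint>
    #include <cstdio>

    int main() {
        struct Run { int run; uint64_t processed; uint64_t last_index; };
        // Values copied from the log excerpt above.
        const Run runs[] = {
            {2389, 27651, 27674}, {2390, 15507, 15513}, {2391, 54711, 54756},
            {2393,   745,   747}, {2394, 14375, 14399}, {2395, 54199, 54239},
        };
        for (const auto& r : runs) {
            const uint64_t missing = r.last_index - r.processed;
            const double per_10k = 1e4 * static_cast<double>(missing) / r.last_index;
            std::printf("%d: %llu missing (~%.0f / 10000)\n", r.run,
                        static_cast<unsigned long long>(missing), per_10k);
        }
        return 0;
    }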

I then had a look at one of the runs with longer statistics to see which timeslices were missing and whether there was a pattern, and observed the following sequences:

2395: 40 Missing (~7 / 10000)
=>  1659,  1671, 
    6001,  6013,  6025, 
    6329,  6341, 
    8545,  8557, 
   11772, 11784, 11796, 
   11863, 11875, 11887, 
   13823, 13835, 13847,
   19036, 19048, 19060,
   25576, 25588, 
   28416, 28428, 
   31547, 31559, 31571,
   33307, 33319, 33331,
   33706, 33718, 33730,
   38943, 38955, 38967,
   50874, 50886, 50898
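
The pattern becomes clearer when noting that within each group the missing indices differ by exactly 12, i.e. each group consists of consecutive timeslices from one and the same of the 12 tsclient streams (2 nodes x 6 disks). A small sketch that groups the run 2395 list above accordingly:

    // Sketch: group the missing timeslice indices of run 2395 into runs of
    // consecutive timeslices belonging to a single worker (step of 12, since
    // there are 12 tsclients in total over both nodes).
    #include <cstdint>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        const std::vector<uint64_t> missing = {
            1659, 1671, 6001, 6013, 6025, 6329, 6341, 8545, 8557,
            11772, 11784, 11796, 11863, 11875, 11887, 13823, 13835, 13847,
            19036, 19048, 19060, 25576, 25588, 28416, 28428,
            31547, 31559, 31571, 33307, 33319, 33331, 33706, 33718, 33730,
            38943, 38955, 38967, 50874, 50886, 50898};
        const uint64_t step = 12;  // number of tsclients (2 nodes x 6 disks)

        std::size_t group_start = 0;
        for (std::size_t i = 1; i <= missing.size(); ++i) {
            if (i == missing.size() || missing[i] != missing[i - 1] + step) {
                std::cout << "group of " << (i - group_start)
                          << " consecutive TS of one worker, starting at "
                          << missing[group_start] << '\n';
                group_start = i;
            }
        }
        return 0;
    }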

Caveat: I have not yet had time to check whether the timeslice header time also shows a gap or whether only the index is jumping. So I could imagine two possible explanations/issues.

Probably something for @cuveland

cuveland commented 2 years ago

Thanks for the excellent find and the awesome documentation.

The missing timeslices obviously affect most of the 12 worker processes. On first analysis, they correlate perfectly with "INFO: worker disconnected" messages in the flesctl logs.

There is a known glitch in the timeslice interface concerning the connection heartbeating. It seems that a worker will disconnect if it sees no new timeslice within the heartbeat timeout interval. This was first discovered recently by Esteban. While the worker subsequently reconnects automatically, timeslices built in the meantime will not be stored. This behaviour will need further investigation.

In this run, the timeslice duration was 128 ms. With 12 workers, we expect a new timeslice for each worker on average every 1.536 s. The connection heartbeat timeout is currently set to 2 s. This explains why this "usually" works, but in situations where the timeslice buffer is actually used to smooth out the data rate, it seems plausible that the 2 s will be exceeded.
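
To make the timing concrete (illustrative numbers only, not the actual flesnet heartbeat code): with 128 ms timeslices and 12 workers, each worker sees a new timeslice on average every 0.128 s x 12 = 1.536 s, leaving only about 0.46 s of slack before the 2 s heartbeat timeout; any buffering delay larger than that would trigger a disconnect.

    // Illustrative numbers only, not flesnet code: average time between
    // timeslices seen by one worker vs. the connection heartbeat timeout.
    #include <iostream>

    int main() {
        const double ts_duration_s = 0.128;     // timeslice duration, 128 ms
        const int n_workers = 12;               // 2 nodes x 6 tsclients
        const double heartbeat_timeout_s = 2.0; // current timeout setting

        const double avg_interval_s = ts_duration_s * n_workers;  // 1.536 s
        const double slack_s = heartbeat_timeout_s - avg_interval_s;

        std::cout << "average interval per worker: " << avg_interval_s << " s\n"
                  << "heartbeat timeout:           " << heartbeat_timeout_s << " s\n"
                  << "slack before a disconnect:   " << slack_s << " s\n";
        // If the timeslice buffer delays delivery by more than the slack,
        // the worker times out, reconnects, and the timeslices built in the
        // meantime are not stored.
        return 0;
    }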

For the time being, I will increase the timeout to circumvent the issue.

cuveland commented 2 years ago

Addressed in https://github.com/cbm-fles/flesnet/commit/2e9e60e44b6b8450a7a3fd9d32dd2c6a7a2d8306.