SainsburyWellcomeCentre / aeon_experiments

Experiment workflows for Project Aeon
BSD 3-Clause "New" or "Revised" License

Risk of losing ephys data on robocopy race condition #547

Open glopesdev opened 1 month ago


We observed an unexpected mismatch in the size of one of the ephys chunks, as shown below:

[image: screenshot showing the ephys chunk size mismatch]

Such a mismatch should never originate at the logging level, even if ephys data were dropped, since chunking is based on the number of samples and not on time.

The most likely explanation is that this was a race condition when running robocopy with the /MOVE option. As described in the docs, this command "Moves files and directories, and deletes them from the source after they're copied."

From the size of the partial file, we estimate that around 4 minutes of data at the end were lost. If robocopy was called to copy the file close to the end of the chunk, the partial ~75 GB of data could have taken 4 minutes to copy over (taking into account all the other experimental files potentially in the folder). If the file was closed while this copy was in progress, then by the time robocopy tried to delete the source file the handle would already have been released, and the final flushed file would have been deleted without being copied again.
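As a rough plausibility check on the 4-minute figure, the sustained transfer rate it implies can be computed directly. This is only a sketch: it assumes decimal gigabytes and ignores the other files in the folder, and `implied_throughput_mb_s` is a hypothetical helper, not part of any Aeon tooling.

```python
def implied_throughput_mb_s(size_gb: float, minutes: float) -> float:
    """Sustained rate (MB/s) needed to copy size_gb gigabytes in the given minutes.

    Assumes decimal units (1 GB = 1e9 bytes, 1 MB = 1e6 bytes).
    """
    size_bytes = size_gb * 1e9
    seconds = minutes * 60.0
    return size_bytes / seconds / 1e6


# Figures taken from the estimate above: ~75 GB copied in ~4 minutes.
rate = implied_throughput_mb_s(75, 4)
print(f"Implied sustained rate: {rate:.1f} MB/s")
```

This works out to roughly 312.5 MB/s for the ephys file alone, so whether the estimate holds depends on what the storage and network path can actually sustain.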

This does not happen with other AEON data, since chunking is aligned to hourly boundaries while the copy runs on the half-hours, which prevents this kind of race condition. For ephys data we have no such guarantee, because the current grouping is exclusively by sample count.
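The difference between the two chunking schemes can be illustrated with a small simulation. This is a sketch under stated assumptions: the 30 kHz sampling rate and 100,000,000 samples per chunk are hypothetical values chosen for illustration, and the 4-minute copy window starting at half past each hour is taken from the estimate above. All function names are made up for this example.

```python
def closes_during_copy(close_time_s, copy_offset_s=1800.0, copy_duration_s=240.0):
    """True if a file closing at close_time_s (seconds from the top of an hour)
    falls inside a copy window that starts copy_offset_s past every hour."""
    seconds_into_hour = close_time_s % 3600.0
    return copy_offset_s <= seconds_into_hour < copy_offset_s + copy_duration_s


def chunk_close_times(chunk_duration_s, n_chunks, start_s=0.0):
    """Close times of n_chunks consecutive chunks of fixed duration."""
    return [start_s + (i + 1) * chunk_duration_s for i in range(n_chunks)]


# Hourly-aligned chunks always close on the hour, never at half past.
hourly = chunk_close_times(3600.0, 24)

# Sample-count chunks (hypothetical: 100e6 samples at 30 kHz per chunk)
# have a duration that does not divide the hour, so close times drift.
ephys_chunk_s = 100_000_000 / 30_000
ephys = chunk_close_times(ephys_chunk_s, 24)

hourly_hits = [t for t in hourly if closes_during_copy(t)]
ephys_hits = [t for t in ephys if closes_during_copy(t)]

print("hourly chunks closing during a copy window:", len(hourly_hits))
print("ephys chunks closing during a copy window:", len(ephys_hits))
```

With these assumed numbers the hourly chunks never close inside a copy window, while the sample-count chunks drift through the hour and eventually do.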

To prevent this, we would need to find a way to guarantee non-overlap in time between closing of ephys files and robocopy transfers.
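One possible shape for such a guarantee is to postpone a transfer whenever an ephys file is expected to close during it. The sketch below assumes the close times can be predicted (for example from the sample count and sampling rate) and that an upper bound on the transfer duration is known; both are assumptions, and `transfer_overlaps_close` and `next_safe_start` are hypothetical helpers rather than existing functions.

```python
def transfer_overlaps_close(transfer_start_s, transfer_duration_s,
                            close_time_s, margin_s=60.0):
    """True if close_time_s falls within the transfer window, padded by margin_s
    on both sides. All times are in seconds on a common clock."""
    transfer_end_s = transfer_start_s + transfer_duration_s
    return transfer_start_s - margin_s < close_time_s < transfer_end_s + margin_s


def next_safe_start(earliest_start_s, transfer_duration_s,
                    close_times_s, margin_s=60.0):
    """Earliest start time at or after earliest_start_s for which no predicted
    file close overlaps the padded transfer window."""
    start = earliest_start_s
    for close in sorted(close_times_s):
        if transfer_overlaps_close(start, transfer_duration_s, close, margin_s):
            # Postpone until margin_s after the file is expected to close.
            start = close + margin_s
    return start


# A transfer scheduled at half past the hour (1800 s), assumed to take up to
# 240 s, with an ephys file predicted to close at 2000 s.
start = next_safe_start(1800.0, 240.0, [2000.0, 7200.0])
print("transfer should start at", start, "s")
```

The file that closes during the postponed window is then complete by the time the transfer starts, so it is copied in full before it is deleted from the source.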