SainsburyWellcomeCentre / aeon_experiments

Experiment workflows for Project Aeon
BSD 3-Clause "New" or "Revised" License

Duplicate filenames can cause issues with downstream analysis #586

Closed: jkbhagatio closed this issue 3 weeks ago

jkbhagatio commented 1 month ago

Not a huge deal, but we realized it is possible to have duplicate filenames across different epochs: e.g. if an epoch starts at 17:01, is stopped, and restarts at 17:02, we end up with two different epoch folders, each containing its own copy of the ...T17-00-00... files.

For some analysis we've done in the past, we aggregate all files of certain types across an experiment's epochs into a single folder, and this aggregation poses an issue, since two files can now have the exact same full path. Of course there are other ways we can handle this, but in general I don't love the idea of having duplicate filenames, even if they live in different directories.
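A minimal sketch of the collision, using hypothetical epoch folder and chunk file names (the actual Aeon naming may differ): two epoch folders created a minute apart both contain a file named for the same hourly chunk, so flattening by filename collapses them.

```python
from pathlib import Path

# Hypothetical layout: the epoch restarted at 17:02, so both epoch folders
# contain a chunk file named for the same 17:00-18:00 chunk.
files = [
    Path("2023-06-01T17-01-00/CameraTop/CameraTop_2023-06-01T17-00-00.csv"),
    Path("2023-06-01T17-02-00/CameraTop/CameraTop_2023-06-01T17-00-00.csv"),
]

# Aggregating into one flat folder keys on the bare filename, which is
# no longer unique even though the full paths are.
names = {f.name for f in files}
print(len(files), len(names))  # 2 distinct paths, only 1 distinct filename
```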

One proposal would be to give the first files in an epoch the epoch timestamp rather than the chunk timestamp, and give all subsequent files the chunk timestamp. Not sure how easy this would be to implement.

Thoughts @glopesdev @aspaNeuro ?

glopesdev commented 1 month ago

@jkbhagatio this was anticipated and must be allowed for; otherwise you would have to wait up to one hour to restart an epoch after a crash. The raw loader can deal with this.

If this is an issue at ingestion time, I would suggest aggregating data frames by chunk key instead of by filename. This should always work because time is monotonic even if the epoch restarted.
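The suggestion above can be sketched as follows, assuming hourly chunk keys and pandas data frames with a time index (the frames and timestamps are made up for illustration; the real loader's readers are not shown):

```python
import pandas as pd

# Hypothetical data from the same 17:00 chunk, split across two epochs
# because the epoch was restarted at 17:02.
epoch1 = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.to_datetime(["2023-06-01 17:00:30", "2023-06-01 17:01:10"]),
)
epoch2 = pd.DataFrame(
    {"value": [3, 4]},
    index=pd.to_datetime(["2023-06-01 17:02:05", "2023-06-01 17:30:00"]),
)

# Concatenate all frames, then group by the chunk key (the hour floor of
# each timestamp) rather than by filename; time is monotonic across the
# restart, so sorting by index is safe.
data = pd.concat([epoch1, epoch2]).sort_index()
chunks = {key: df for key, df in data.groupby(data.index.floor("h"))}
```

Here both epochs fold into the single 17:00 chunk, so the duplicate-filename problem never surfaces in the aggregated view.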

jkbhagatio commented 1 month ago

@glopesdev is something like this possible?

> One proposal would be to have the first files in an epoch be given the epoch timestamp, rather than the chunk timestamp, and have all subsequent files be given the chunk timestamp

glopesdev commented 1 month ago

This would break the entire loader routine and, even worse, risk data inconsistency, since epoch timestamps are not Harp timestamps.

glopesdev commented 1 month ago

It's not just the loader routine: the logging and chunking routines also work around the concept of grouping by time chunk. If we break that, we will have to rethink everything.

I don't think it's worth it for an edge case that has a clear solution: group data by time chunk instead of by filename.