SainsburyWellcomeCentre / aeon_mecha

Project Aeon's main library for interfacing with acquired data. Contains modules for raw data file io, data querying, data processing, data qc, database ingestion, and building computational data pipelines.
BSD 3-Clause "New" or "Revised" License
6 stars 6 forks source link

Incorrect prioritization of multiple input root directories when calling `api.load()` #390

Closed ttngu207 closed 3 months ago

ttngu207 commented 3 months ago

Related to #227

The current implementation of api.load, when multiple root dirs being specified, has the prioritization of the input directories being reversed. This code block in particular

I think the simple fix is to change this line as below:

for path in root -> for path in root[::-1]

Code to reproduce the error

from aeon.schema import schemas

chunk_start = datetime.datetime(2024, 6, 19, 11, 10, 12)
chunk_end = datetime.datetime(2024, 6, 19, 12, 10, 12)

data_dirs = [PosixPath('/ceph/aeon/aeon/data/processed/AEON4/social0.3'),
 PosixPath('/ceph/aeon/aeon/data/raw/AEON4/social0.3')]

stream_reader = schemas.social03.CameraTop.Pose

pose_data = io_api.load(
    root=data_dirs,
    reader=stream_reader,
    start=pd.Timestamp(chunk_start),
    end=pd.Timestamp(chunk_end),
)

Data files exist in both processed and raw, with processed containing the fullpose SLEAP data. Running the above will load the data from raw instead of processed, even though the processed dir is first in the data_dirs list.

Running the below will achieve the expected behavior, actually loading from processed dir

pose_data = io_api.load(
    root=data_dirs[::-1],
    reader=stream_reader,
    start=pd.Timestamp(chunk_start),
    end=pd.Timestamp(chunk_end),
)
glopesdev commented 3 months ago

@ttngu207 This is by design, where folders are scanned in the order they are added in the root list. Is there a particular reason why using the reverse order is interesting? If we do this it will be a breaking change for older notebooks and scripts.

glopesdev commented 3 months ago

Also the proposed change would break if root is an iterable sequence rather than a list, since generic iterables may have no indexer, and hence no ability to reverse using the [::-1] syntax.

jkbhagatio commented 3 months ago

@ttngu207, actually is it ok to just switch the 'processed' and 'raw' dir listings here when you set them?

data_dirs = [PosixPath('/ceph/aeon/aeon/data/processed/AEON4/social0.3'), PosixPath('/ceph/aeon/aeon/data/raw/AEON4/social0.3')]
ttngu207 commented 3 months ago

@glopesdev @jkbhagatio it's very straightforward to switch the order, I can do that.

It's the end behavior of the load that's a bit unexpected. For instance, when I specified the root to be [processed, raw] in that order, I'd expect the files found in processed to take precedence over files found in raw, but with the current implementation, the reverse is true (which is a bit unexpected). But if this is the desired behavior, then yes it's all correct (perhaps my assumption and expectation for the behavior of the function is off).

Yes, the folders are scanned the in order added in root list, but because of the dictionary comprehension below, the files found in the last folder will "overwrite" files found in the first folder for the same chunk_key

    fileset = {
        chunk_key(fname): fname
        for path in root
        for fname in path.glob(f"{epoch_pattern}/**/{reader.pattern}.{reader.extension}")
    }
jkbhagatio commented 3 months ago

Yes, the folders are scanned the in order added in root list, but because of the dictionary comprehension below, the files found in the last folder will "overwrite" files found in the first folder for the same chunk_key

Yes, I guess this behavior is ok and is what we want - I guess the docstring is a bit unclear

:param str or PathLike root: The root path, or prioritised sequence of paths, where epoch data is stored.

but we can there just make it clear that the provided sequence is searched in the provided order, thus latter paths will have prioritization over earlier paths?

ttngu207 commented 3 months ago

OK, agreed. I think I just have an incorrect understanding of the load function and its multi-rootpaths usage. I'll close this issue.