Closed ttngu207 closed 3 months ago
@ttngu207 This is by design, where folders are scanned in the order they are added in the root list. Is there a particular reason why using the reverse order is interesting? If we do this it will be a breaking change for older notebooks and scripts.
Also the proposed change would break if root
is an iterable sequence rather than a list, since generic iterables may have no indexer, and hence no ability to reverse using the [::-1]
syntax.
@ttngu207, actually is it ok to just switch the 'processed' and 'raw' dir listings here when you set them?
data_dirs = [PosixPath('/ceph/aeon/aeon/data/processed/AEON4/social0.3'), PosixPath('/ceph/aeon/aeon/data/raw/AEON4/social0.3')]
@glopesdev @jkbhagatio it's very straightforward to switch the order, I can do that.
It's the end behavior of the load
that's a bit unexpected. For instance, when I specified the root
to be [processed, raw]
in that order, I'd expect the files found in processed
to take precedence over files found in raw
, but with the current implementation, the reverse is true (which is a bit unexpected). But if this is the desired behavior, then yes it's all correct (perhaps my assumption and expectation for the behavior of the function is off).
Yes, the folders are scanned the in order added in root list, but because of the dictionary comprehension below, the files found in the last folder will "overwrite" files found in the first folder for the same chunk_key
fileset = {
chunk_key(fname): fname
for path in root
for fname in path.glob(f"{epoch_pattern}/**/{reader.pattern}.{reader.extension}")
}
Yes, the folders are scanned the in order added in root list, but because of the dictionary comprehension below, the files found in the last folder will "overwrite" files found in the first folder for the same chunk_key
Yes, I guess this behavior is ok and is what we want - I guess the docstring is a bit unclear
:param str or PathLike root: The root path, or prioritised sequence of paths, where epoch data is stored.
but we can there just make it clear that the provided sequence is searched in the provided order, thus latter paths will have prioritization over earlier paths?
OK, agreed. I think I just have an incorrect understanding of the load
function and its multi-rootpaths usage. I'll close this issue.
Related to #227
The current implementation of
api.load
, when multiple root dirs being specified, has the prioritization of the input directories being reversed. This code block in particularI think the simple fix is to change this line as below:
for path in root
->for path in root[::-1]
Code to reproduce the error
Data files exist in both
processed
andraw
, withprocessed
containing the fullpose SLEAP data. Running the above will load the data fromraw
instead ofprocessed
, even though theprocessed
dir is first in thedata_dirs
list.Running the below will achieve the expected behavior, actually loading from
processed
dir