Open adelavega opened 1 year ago
ancpbids seems to load all derivatives eagerly, even if derivatives=True is not set. This actually is fairly important as you may specifically not want to load derivatives. We may want to discuss if its better to "filter" derivatives prior to instantiation of a layout, or at the time of a query. I think the latter is more flexible and generic, but I think there may be times you really don't want to index a derivative.
I think parsing the dataset structure (on file system) is fast enough in ancpbids, such that there is no need to omit parsing of derivatives. I would prefer to ignore derivatives when querying a dataset which is already possible via the scope parameter.
Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.
This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).
I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.
That said, this will be a breaking change
Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/
) with ancpbids?
Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.
This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).
I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.
That said, this will be a breaking change
What could be done is to ignore the derivatives subtree (in the in-memory graph) by default from queries if derivatives=False is provided.
Also a question: Can you load derivatives that are in a different directory (i.e. not under
derivatives/
) with ancpbids?
You mean in a deeply nested folder within derivatives? For example, detivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json), if not, it will still be able to query files/artifacts.
Also a question: Can you load derivatives that are in a different directory (i.e. not under
derivatives/
) with ancpbids?You mean in a deeply nested folder within derivatives? For example, detivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json), if not, it will still be able to query files/artifacts.
No I mean a directory not inside of the main directory at all.
For example the main dataset may be at /data/dataset
and the derivative may be at /data/fmriprep/derivative_1
Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time. This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant). I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time. That said, this will be a breaking change
What could be done is to ignore the derivatives subtree (in the in-memory graph) by default from queries if derivatives=False is provided.
This might work. In such cases, I'd like to guide these decisions by the cost of implementing/maintenance vs cost of making a backwards breaking change.
There is a recent/related discussion:
https://github.com/ANCPLabOldenburg/ancp-bids/discussions/68
@ghisvail
This is actually currently how you load a derivative dataset as a primary dataset:
layout = BIDSLayout(dataset_path, derivatives=dataset_path)
In a way you say its the main raw and a derivative dataset. Not ideal
In a way you say its the main raw and a derivative dataset. Not ideal
I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the pybids
and ancpbids
APIs and the spirit of the spec.
There are a few specific aspects that caught my attention:
Storage conventions. Derivatives may be stored in-tree (under /derivatives
) or out-of-tree. The current API fails at covering the out-of-tree convention.
Non-compliant derivatives may be stored in-tree under the derivatives
folder, for instance FreeSurfer's output. How are these taken into account by the derivatives loading logic?
Provenance of derivatives are stored under the SourceDatasets
field in dataset_description.json
.
So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions.
I can only answer for pybids (and agree it's suboptimal), but I have a few comments:
In a way you say its the main raw and a derivative dataset. Not ideal
I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the
pybids
andancpbids
APIs and the spirit of the spec.There are a few specific aspects that caught my attention:
1. Storage conventions. Derivatives may be stored in-tree (under `/derivatives`) or out-of-tree. The current API fails at covering the out-of-tree convention.
I believe pybids fully supports this. You can list an out of tree derivative under derivatives=
2. Non-compliant derivatives may be stored in-tree under the `derivatives` folder, for instance FreeSurfer's output. How are these taken into account by the derivatives loading logic?
I believe if there is no dataset_description.json
it will not index. If it does have one, but the files are non compliant, the behavior will depend on the validate
argument.
3. Provenance of derivatives are stored under the `SourceDatasets` field in `dataset_description.json`.
So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions.
The problem with this suggestion is you can have a complete stand alone derivative, and that may be sufficient enough for an analysis. For example, an fmriprep
output is all that is needed for many subsequent fMRI analyses. I think you should be able to load it as a primary dataset, and pybids/ancpbids should detect that its a derivative and index it appropriately.
Let me know if I misunderstood you, because I'm not sure I understand the point of "loading derivatives which have nothing to do with the source dataset".
The problem with this suggestion is you can have a complete stand alone derivative [...] you should be able to load it as a primary dataset.
I agree with every word. AFAIK however, this is not the case for neither pybids
nor ancpbids
.
Which means you need to start with a raw dataset and "attach" a derivative to it whether it's in-tree or out-of-tree. My point was that the BIDS provenance metadata suggest the inverse relationship, i.e. you start from a derivative and load it's upstream dataset listed in SourceDatasets
.
layout = BIDSLayout(dataset_path, derivatives=dataset_path)
You can with pybids awkwardly, but not the "correct" way.
I have to think more about the SourceDatasets
suggestion. Isn't this a URL rather than a relative path? I'm not sure how that would work in practice.
I will note this discussion happened a long time ago when BIDS Derivatives were not a thing, and some expressed preference for sources being nested within derivatives dataset (i.e. derivatives/sources/{datalad_folder
). but this organizational scheme did not win out. I would suggest the presence of a derivatives/
folder suggest the current way of doing things.
Regardless, in the meantime we will mimic current behavior but its something to consider for a future breaking release.
I have to think more about the SourceDatasets suggestion. Isn't this a URL rather than a relative path? I'm not sure how that would work in practice.
My understanding is that it could either be a URL or a relative path.
I would suggest the presence of a derivatives/ folder suggest the current way of doing things.
I agree that the in-tree storage convention is probably the most used. That being said, storing derivatives out-of-tree has its benefits, like allowing raw datasets be shared read-only to avoid accidental file modifications by processing pipelines.
Agree, I'm a big fan of out-of-tree derivatives, but specifically the part about loading a derivative and then loading the referenced raw dataset is the part I'm unsure about. I suppose if its a relative path it could try to load it, and not attempt if its a remote URL
I suppose if its a relative path it could try to load it, and not attempt if its a remote URL.
Sounds appropriate to me.
Perhaps you'd want to make loading optional with something like:
dataset = BIDSLayout("/path/to/derivative/dataset", load_source_datasets=True) # False by default.
@effigies suggests keeping multple ancpbids datasets for each out of tree derivative.
That is pybids would handle the multiple derivatives by setting up various ancpbids datasets.
ANCPBIds eagery loads derivatives, whereas pybids only loads when specifically asked to (e.g.
derivatives=True
)