Derivative loading differences

adelavega commented 1 year ago

ANCPBIds eagery loads derivatives, whereas pybids only loads when specifically asked to (e.g. derivatives=True)

adelavega commented 1 year ago

ancpbids seems to load all derivatives eagerly, even if derivatives=True is not set. This actually is fairly important as you may specifically not want to load derivatives. We may want to discuss if its better to "filter" derivatives prior to instantiation of a layout, or at the time of a query. I think the latter is more flexible and generic, but I think there may be times you really don't want to index a derivative.

erdalkaraca commented 1 year ago

I think parsing the dataset structure (on file system) is fast enough in ancpbids, such that there is no need to omit parsing of derivatives. I would prefer to ignore derivatives when querying a dataset which is already possible via the scope parameter.

adelavega commented 1 year ago

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.

This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).

I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.

That said, this will be a breaking change

adelavega commented 1 year ago

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

erdalkaraca commented 1 year ago

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.

This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).

I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.

That said, this will be a breaking change

What could be done is to ignore the derivatives subtree (in the in-memory graph) by default from queries if derivatives=False is provided.

erdalkaraca commented 1 year ago

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

You mean in a deeply nested folder within derivatives? For example, detivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json), if not, it will still be able to query files/artifacts.

adelavega commented 1 year ago

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

You mean in a deeply nested folder within derivatives? For example, detivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json), if not, it will still be able to query files/artifacts.

No I mean a directory not inside of the main directory at all.

For example the main dataset may be at /data/dataset and the derivative may be at /data/fmriprep/derivative_1

adelavega commented 1 year ago

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time. This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant). I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time. That said, this will be a breaking change

What could be done is to ignore the derivatives subtree (in the in-memory graph) by default from queries if derivatives=False is provided.

This might work. In such cases, I'd like to guide these decisions by the cost of implementing/maintenance vs cost of making a backwards breaking change.

erdalkaraca commented 1 year ago

There is a recent/related discussion:

https://github.com/ANCPLabOldenburg/ancp-bids/discussions/68

@ghisvail

adelavega commented 1 year ago

This is actually currently how you load a derivative dataset as a primary dataset:

layout = BIDSLayout(dataset_path, derivatives=dataset_path)

In a way you say its the main raw and a derivative dataset. Not ideal

ghisvail commented 1 year ago

In a way you say its the main raw and a derivative dataset. Not ideal

I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the pybids and ancpbids APIs and the spirit of the spec.

There are a few specific aspects that caught my attention:

Storage conventions. Derivatives may be stored in-tree (under /derivatives) or out-of-tree. The current API fails at covering the out-of-tree convention.
Non-compliant derivatives may be stored in-tree under the derivatives folder, for instance FreeSurfer's output. How are these taken into account by the derivatives loading logic?
Provenance of derivatives are stored under the SourceDatasets field in dataset_description.json.

So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions.

adelavega commented 1 year ago

I can only answer for pybids (and agree it's suboptimal), but I have a few comments:

In a way you say its the main raw and a derivative dataset. Not ideal

I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the pybids and ancpbids APIs and the spirit of the spec.

There are a few specific aspects that caught my attention:
1. Storage conventions. Derivatives may be stored in-tree (under `/derivatives`) or out-of-tree. The current API fails at covering the out-of-tree convention.

I believe pybids fully supports this. You can list an out of tree derivative under derivatives=

2. Non-compliant derivatives may be stored in-tree under the `derivatives` folder, for instance FreeSurfer's output. How are these taken into account by the derivatives loading logic?

I believe if there is no dataset_description.json it will not index. If it does have one, but the files are non compliant, the behavior will depend on the validate argument.

3. Provenance of derivatives are stored under the `SourceDatasets` field in `dataset_description.json`.
So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions.

The problem with this suggestion is you can have a complete stand alone derivative, and that may be sufficient enough for an analysis. For example, an fmriprep output is all that is needed for many subsequent fMRI analyses. I think you should be able to load it as a primary dataset, and pybids/ancpbids should detect that its a derivative and index it appropriately.

Let me know if I misunderstood you, because I'm not sure I understand the point of "loading derivatives which have nothing to do with the source dataset".

ghisvail commented 1 year ago

The problem with this suggestion is you can have a complete stand alone derivative [...] you should be able to load it as a primary dataset.

I agree with every word. AFAIK however, this is not the case for neither pybids nor ancpbids.

Which means you need to start with a raw dataset and "attach" a derivative to it whether it's in-tree or out-of-tree. My point was that the BIDS provenance metadata suggest the inverse relationship, i.e. you start from a derivative and load it's upstream dataset listed in SourceDatasets.

adelavega commented 1 year ago

layout = BIDSLayout(dataset_path, derivatives=dataset_path)

You can with pybids awkwardly, but not the "correct" way.

I have to think more about the SourceDatasets suggestion. Isn't this a URL rather than a relative path? I'm not sure how that would work in practice.

I will note this discussion happened a long time ago when BIDS Derivatives were not a thing, and some expressed preference for sources being nested within derivatives dataset (i.e. derivatives/sources/{datalad_folder). but this organizational scheme did not win out. I would suggest the presence of a derivatives/ folder suggest the current way of doing things.

Regardless, in the meantime we will mimic current behavior but its something to consider for a future breaking release.

ghisvail commented 1 year ago

I have to think more about the SourceDatasets suggestion. Isn't this a URL rather than a relative path? I'm not sure how that would work in practice.

My understanding is that it could either be a URL or a relative path.

I would suggest the presence of a derivatives/ folder suggest the current way of doing things.

I agree that the in-tree storage convention is probably the most used. That being said, storing derivatives out-of-tree has its benefits, like allowing raw datasets be shared read-only to avoid accidental file modifications by processing pipelines.

adelavega commented 1 year ago

Agree, I'm a big fan of out-of-tree derivatives, but specifically the part about loading a derivative and then loading the referenced raw dataset is the part I'm unsure about. I suppose if its a relative path it could try to load it, and not attempt if its a remote URL

ghisvail commented 1 year ago

I suppose if its a relative path it could try to load it, and not attempt if its a remote URL.

Sounds appropriate to me.

Perhaps you'd want to make loading optional with something like:

dataset = BIDSLayout("/path/to/derivative/dataset", load_source_datasets=True)  # False by default.

adelavega commented 1 year ago

@effigies suggests keeping multple ancpbids datasets for each out of tree derivative.

That is pybids would handle the multiple derivatives by setting up various ancpbids datasets.

bids-standard / pybids-refactor

Derivative loading differences #10