bids-standard / pybids

Python tools for querying and manipulating BIDS datasets.
https://bids-standard.github.io/pybids/
MIT License
222 stars 122 forks source link

RFC: PyBIDS 1.0 layout API #989

Open effigies opened 1 year ago

effigies commented 1 year ago

Context: Over the last year or so, we've looked into replacing the SQLAlchemy-based BIDSLayout with one based on ANCP-BIDS. In the process, we've found issues in the current API where we are forced to decide whether it's worth shimming the old behavior in to avoid breaking things for users (if there are any).

Ultimately, we've decided that the API is too large and has features that seem to have more maintenance burden than value to users. So we want to propose a new API.

We would like this API to be:

1) Small. Understanding it should take ~10 minutes. 2) Reimplementable. If you have a better implementation, it should be cheap for people to switch to yours. 3) Schema-based. The BIDS standard now has a schema; it should be possible to swap in an updated schema to support unmerged BEPs.

PyBIDS 1.0 layout API

Base types (see PaddedInt):

Index = PaddedInt
Value = str
Entity = Literal['subject', 'session', ...]  # Could be an enum...

Data structures:

class Schema:
    """Representation of the state of BIDS schema

    This replaces the concept of a config file, allowing PyBIDS to accept
    BEP or application-specific schemas.

    For the time being, this should be considered opaque to users.
    """

class File:
    """Generic file representation

    This permits unparseable filenames to be represented.
    """
    path: Path
    _layout: BIDSLayout | None

    # Implement os.PathLike
    def __fspath__(self) -> str:
        return str(self.path)

    @cached_property
    def relative_path(self) -> Path:
        if not self._layout:
            raise ValueError
        return self.path.relative_to(self._layout.root)

class BIDSFile(File):
    entities: dict[Entity, Value | Index]
    datatype: str | None
    suffix: str | None
    extension: str | None

    @cached_property
    def metadata(self) -> dict[str, Any]:
        """Sidecar metadata aggregated according to inheritance principle"""
        if not self._layout:
            raise ValueError
        ...

class BIDSLayout:
    # Inspired by context: https://github.com/bids-standard/bids-specification/blob/master/src/schema/meta/context.yaml
    schema: Schema
    root: Path
    dataset_description: dict[str, Any]

    # Non-conforming files, maybe better name?
    # could be property
    ignored: list[File]

    @cached_property
    def files(self) -> list[BIDSFile]: ...

    @cached_property
    def datatypes(self) -> list[str]: ...

    @cached_property
    def modalities(self) -> list[str]: ...

    @cached_property
    def subjects(self) -> list[str]: ...

    @cached_property
    def entities(self) -> list[Entity]: ...

    def get_entities(
        self,
        entity: Entity,
        **filters,
    ) -> list[Value | Index]: ...

    def get_metadata(self, term: str, **filters) -> list[Any]: ...
    def get_files(self, **filters) -> list[BIDSFile]: ...

It can be useful to group multiple datasets into a single logical layout, for example, when querying raw data and derivatives together. Therefore we also propose an explicit collection of layouts, such that each method/property is implemented by aggregating over all of the collected layouts.

class LayoutCollection(BIDSLayout):
    primary: BIDSLayout
    layouts: list[BIDSLayout]

mylayout = LayoutCollection('/path/to/ds')
mylayout.primary.entities  # Only once
mylayout.entities  # Iterative

layout.utils

A module will be provided to provide utilities that only rely on the BIDSLayout API. The goal is to provide implementations of common tasks that will survive refactors and serve as exemplar code for working with the API.

def get_bval(x: BIDSFile, layout: BIDSLayout | None = None) -> BIDSFile: ...
def get_bvec(x: BIDSFile, layout: BIDSLayout | None = None) -> BIDSFile: ...
def get_fieldmap(x: BIDSFile, layout: BIDSLayout | None = None) -> list[BIDSFile]: ...

Some (potentially layout-free) parsing/creation seems useful:

def parse_file(x: Path | str) -> BIDSFile:
    """Construct a layout-free BIDSFile"""

def new_file(
    template: BIDSFile | None = None,
    layout: BIDSLayout | None = None,
    *,
    datatype: str | None = None,
    suffix: str | None = None,
    extension: str | None = None,
    **entities,
) -> BIDSFile:
    """Generate new file

    If given a template, additional arguments are overrides.
    """

Please let us know if there are things you need to do that cannot be done with this API. We recognize that this will break things, but if there are features you use that require more than 10-20 lines of code to reimplement without access to BIDSLayout internals, please let us know.

pvandyken commented 1 year ago

Is the .get() method being eliminated? Not sure if this is an oversight, but if so, it seems a shame, as it's by far the method I most use and get_files adds an extra 6 characters of typing.

effigies commented 1 year ago

Dropping get() was intentional because it currently is an omnibus interface that retrieves files objects, file paths, entity keys, entity values, metadata keys, metadata values, subject dirs and session dirs. It seems much cleaner to separate out the current uses that we want to keep into separate functions. We could keep one of them as get(), and I think it would have to be what we currently have as get_files(), as that was the default return type for the current .get(). Anything else would likely lead to silent failures.

effigies commented 1 year ago

Question for the group: Do we want to architect this as abstract base classes, to facilitate reimplementation, or keep everything concrete for now?

pvandyken commented 1 year ago

From a typing perspective, the most flexible (and Pythonic?) would probably be a Protocol, which would allow you to sub out the implementation without requiring the implementation to explicitly be a subclass of the pybids BIDSLayout. So my preference would be Protocol > ABC > Concrete implementation.

effigies commented 1 year ago

Okay. I'll try coding it up as a collection of protocols. I've only used them as a hack before, I'm curious to try this.

effigies commented 1 year ago

Realizing I should have tagged @erdalkaraca and @ghisvail on this. I suspect you will have opinions.

wasciutto commented 1 year ago

Often I am wanting to query and compare at the conceptual level of images instead of having to deal with particular files and their extensions.

To this end, something I would find useful is a BIDSImage class which abstracts away file-level information and lets me deal with all the information pertaining to a particular image - specifically, combining .nii images with their .json sidecar and .tsv timing files, although I could see cases where it would be nice to join more information. For example, joining .nii with its corresponding confounds in the fmriprep derivative dataset.

Rough example of what I'm thinking:

class BIDSImage():
    # least common denominator of source entities
    entities: dict[Entity, Value | Index]
    # suffix of the root image
    suffix: str | None

    def __init__(root_image: BIDSFile, related_files: BIDSFile): pass
    # Return path to the root image file
    def get_root_file() -> Path: pass
    # Return list of all files this class was derived from
    def get_files() -> list[Path]: pass
    # Shared metadata from all files aggregated
    def metadata(self) -> dict[str, Any]: pass
effigies commented 1 year ago

Thanks for the feedback!

I'm not sure that we want this additional abstraction in pybids; more new classes is more cognitive overhead for downstream users.

But making sure you have the tools to quickly build it yourself if it is useful to you is in scope. entities, suffix and get_root_file() come for free from BIDSFile, and possibly metadata, though I'm not sure what "shared metadata from all files aggregated" means. As to get_files(), this seems like the list of inputs passed to __init__, so if we make sure you can do that, we should be good.