bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International
264 stars 154 forks source link

feat(schema): Provide default JSON column definition for "conventional" columns #1838

Open effigies opened 1 month ago

effigies commented 1 month ago

In BIDS we have three types of columns:

1) Fully-defined columns, such as participant_id or the type column in electrodes.tsv. These are ones where having JSON-schema style definitions of values is convenient, as we can rely on a well-established way of validating values. There is no good use case for overriding these, since the meaning should not vary from dataset to dataset. 2) Example columns used in the specification, what I am calling "conventional" columns, such as age or handedness. These are used in the spec and it was decided some time ago that we will let people use them without complaining about being ill-defined, as long as they use them consistently with the specification. 3) User-defined columns are not provided in the specification, and the descriptions of the columns SHOULD go in the JSON sidecar.

For (1) there is no good use case of overriding, since the meaning should not vary from dataset to dataset. For (2) these definitions may be broader or narrower than is really useful, and we would like people to define what they mean. For convenience, tools should fall back to the BIDS spec for interpretation. For (3) the data are unlikely to be usable in an automated fashion without a JSON definition.

Currently (1) and (2) are defined in the same way in the schema, which leaves a gap because our column description object is not as powerful as JSON schema. When overriding conventional columns, the user has no obvious way to produce an equivalent representation, and tools have no way to help the user by injecting the default definition and inviting the user to refine it.

This PR moves type (2) to align with type (3), so that conventional columns are validated the same way as user-defined columns, and examples are available in the specification. We do this by adding an optional definition field to entries in objects.columns which is the JSON default. It is intentionally written as a JSON object inside YAML. This change also allows a validator to identify (1) and (2) columns purely through structure, rather than adding a new flag.

We have already implemented this in the schema validator in https://github.com/bids-standard/bids-validator/pull/1987 and https://github.com/bids-standard/bids-validator/pull/2003, so this is known to work with the existing BIDS examples.

The logic is as follows:

1) If a column has a schema definition (type 1), validate using that schema. Warn on attempted override. 2) If a column has a JSON definition (type 2), use JSON column definition validation. Use an overriding definition from the JSON sidecar, if found, but fall back to the in-schema JSON definition. 3) If a column has no JSON definition, but there is a JSON definition in the sidecar, validate using JSON validation. 4) If a column has none of the above, warn/error about missing definitions, depending on whether the file type permits undefined additional columns.

This is an edited post. Expand for original text.

Some TSV columns are very well defined, while others are conventional. As an example, the `type` column of `electrodes.tsv` has a very explicit meaning and list of possible values; there's no reason that someone would add a description in `electrodes.json`, and if they did, the validator should not allow it to override the specification definition. In contrast the `handedness` column of `participants.tsv` is conventional in that we provide suggested values and their interpretations, but we would really prefer if people provide a full definition in `participants.json`. There are handedness definitions that are not refinements or subsets; for example, the Edinburgh Handedness Inventory is a numeric scale, not a set of labels. I would like to propose that for conventional columns, we provide an explicit example JSON definition for the column. Validators would be expected to look first for a sidecar definition, fall back to the schema-defined example, and use whatever definition is available to validate the values in the column. In the case where neither is available, we would raise a warning, as we currently do. If I get some agreement here, I will do the same thing for all other conventional columns I can find. For fully-defined columns, we will continue using JSON schema-style type definitions, which are more explicit.

cc @sappelhoff who I believe was involved in helping to formalize these last time.

effigies commented 1 month ago

This column definition was initially defined in https://github.com/bids-standard/bids-specification/pull/827 by @tsalo which was also approved by @Remi-Gau. Feedback from any of you would be appreciated.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 88.06%. Comparing base (63ef041) to head (ab41d81).

:exclamation: Current head ab41d81 differs from pull request most recent head 019c20f

Please upload reports for the commit 019c20f to get more accurate results.

Files Patch % Lines
tools/schemacode/bidsschematools/render/utils.py 86.66% 2 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1838 +/- ## ========================================== + Coverage 88.04% 88.06% +0.02% ========================================== Files 16 16 Lines 1380 1391 +11 ========================================== + Hits 1215 1225 +10 - Misses 165 166 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

sappelhoff commented 3 weeks ago

I would like to propose that for conventional columns, we provide an explicit example JSON definition for the column. Validators would be expected to look first for a sidecar definition, fall back to the schema-defined example, and use whatever definition is available to validate the values in the column. In the case where neither is available, we would raise a warning, as we currently do. If I get some agreement here, I will do the same thing for all other conventional columns I can find.

For fully-defined columns, we will continue using JSON schema-style type definitions, which are more explicit.

That sounds reasonable to me. Shall there be some "info" logging for these cases? 👇

fall back to the schema-defined example

effigies commented 3 weeks ago

Shall there be some "info" logging for these cases? 👇

fall back to the schema-defined example

I don't think so. debug logging perhaps, but the standing expectation is that if you use approved values for these columns, the validator won't complain.