bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International
280 stars 164 forks source link

[ENH] use of n/a in `RECOMMENDED` fields #1982

Open dominikwelke opened 2 weeks ago

dominikwelke commented 2 weeks ago

Your idea

Hi all,

related to #876

i checked the issue tracker, but perhaps i missed something or it has been discussed before.. in this case, apologies!

I encountered a problem (in my perspective) with RECOMMENDED sidecar entries. BIDS validator complains about n/a entries, IF n/as are not explicitly allowed AND the required datatype is something else than string (e.g. numerical)

in my case, i encountered it with head circumference measures in EEG recordings

there are 2 separate points in this:

  1. BIDS validator doesn't seem to act consistently (as it seems to miss not explicitly allowed n/as if the required datatype is string)
  2. the BIDS standards might be enhanced, imo, by (automatically or explicitly) allowing n/a for all RECOMMENDED fields other than perhaps with the REQUIRED fields, allowing n/a entries shouldn't risk integrity or interpretability of the BIDS dataset. it might in the contrary increase consistency of the dataset: if i'm not mistaken, the current rules imply that i would have to remove n/a fields in a subject, even if it is just missing data and all other subjects have this information. allowing n/a enables users to have an identical set of data entries across all subjects (including the explicit information if something is missing).

what are your thoughts on that?

effigies commented 2 weeks ago

n/a was permitted for TSV files, as there needs to be a way of marking missing data in a table. In a dictionary, absence is a better marker than special values.

For some BEPs, there were conflicts between metadata that were considered enough of a best practice that it should be REQUIRED and known cases where the metadata were absent, so requiring its presence but permitting "n/a" as an escape hatch was determined to be the best way to thread that needle.

For RECOMMENDED and OPTIONAL fields, allowing that escape hatch only increases the likelihood of a tool dumping "Key": "n/a" with every permissible Key for that sidecar. Users then immediately get a dataset with no warnings, and it's not until they try to run a tool that needs a piece of metadata that anything breaks. Many tools assume that the validator has validated the data available, and so I can safely say metadata[field] and assume that I either get valid data or nothing (a KeyError is raised in Python, undefined is returned in Javascript...). Overall, this assumption is good because it reduces duplication of effort and the opportunity for tools to have mutually incompatible validation mechanisms.

  1. BIDS validator doesn't seem to act consistently (as it seems to miss not explicitly allowed n/as if the required datatype is string)

"n/a" is a valid string. If a field does not have a pattern or an enumerated set of permitted values, for all the validator can know, it's permissible. We could use a pattern (^(?!n/a$)) that specifically excludes that one value, for fields where we know it's inadmissible, but I don't think we can do it universally. There may be some fields where "n/a" would have been carved out as permissible if it were not already a valid string.

  1. the BIDS standards might be enhanced, imo, by (automatically or explicitly) allowing n/a for all RECOMMENDED fields other than perhaps with the REQUIRED fields, allowing n/a entries shouldn't risk integrity or interpretability of the BIDS dataset. it might in the contrary increase consistency of the dataset: if i'm not mistaken, the current rules imply that i would have to remove n/a fields in a subject, even if it is just missing data and all other subjects have this information. allowing n/a enables users to have an identical set of data entries across all subjects (including the explicit information if something is missing).

I don't understand why an identical set of data entries is particularly desirable. Could you elaborate on that? If I'm writing a tool to accept BIDS data and collate entries into, e.g., a dataframe, I need to be equally prepared to deal with missing fields as ones with special missing-value indicators. Otherwise I'm writing to a stricter specification than BIDS, and my tool will not work on all BIDS datasets.

dominikwelke commented 2 weeks ago

thanks for the explanations @effigies - I thought this should have come up at some point but couldnt find anything.

I don't understand why an identical set of data entries is particularly desirable. Could you elaborate on that?

you are right from a tool perspective. but BIDS datasets are human readable, and the special entry n/a has more information than a missing value (i.e. this info is supposed to be here, but is missing). it is not a strong argument, though, and compromised or even invalidated by procedures like default adding of n/as, as you described above might also be relevant with people who format their datasets themselves and write their own tools, as then the current standard and descriptions/explanations can be confusing or seem inconsistent (why is n/a allowed in one case but not another) or, like me, simply missing the mismatch for years :) . again, probably no argument for changing the definitions. it can lead to people believing they have BIDS compliant data, when they haven't, though, and the (older?) versions of bids validator that only checked filenames wont catch it.

We could use a pattern (^(?!n/a$)) that specifically excludes that one value

was my immediate idea, too, in case the standard remains unchanged. but i'm not in the validator workings.. can you only check regex rules, or would it be an option to include flags, e.g. via dictionaries with (in this case) n/a allowed or not allowed?

effigies commented 2 weeks ago

Metadata fields and their values are defined here in schema.objects.metadata (with formats in schema.objects.formats) and their requirement levels in schema.rules.sidecars. The validator constructs JSON schema from these inputs and applies it to the collated sidecar data.

Anything much more complicated than that is going to become brittle quickly. Validating "n/a" conditionally would be a serious pain because it is sometimes valid, so should not be removed.

One thing I wonder about is allowing null to be equivalent to missing. This would allow a tool to dump all permissible fields into a sidecar without affecting its validation status. The validator could filter null-valued fields before applying JSON schema validation, and all warnings and errors would be produced as normal. Tools would need to do the same, or be written in such a way as to treat null as equivalent to absent (e.g., metadata.get(field) in Python) to continue working as they do.

This may be unworkable, since the inheritance principle says that there is no such thing as unsetting, and null hasn't been permitted, so there's a very strong possibility of divergent treatment of datasets across tools. Would probably need to be a BIDS 2.0 thing.

dominikwelke commented 1 week ago

ok, so I take away that:

fine with me. should i close the issue, or do you want to keep it open @effigies ?

my only workable suggestion would then be that the docs might be a bit clearer about the inconsistent treatment of n/a fillers (allowed but discouraged in tsvs, not allowed in jsons - this last bit is not explicitly mentioned, i think) e.g. here (no need to mention that the validator wont catch all cases of unallowed n/as in jsons)