bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International

[BUG] clarify/re-consider use of "n/a" in numeric columns #1938

Open yarikoptic opened 19 hours ago

yarikoptic commented 19 hours ago

Describe your problem in detail.

At the moment, the specification for tabular files (https://bids-specification.readthedocs.io/en/stable/common-principles.html#tabular-files) states:

String values containing tabs MUST be escaped using double quotes. Missing and non-applicable values MUST be coded as n/a. Numerical values MUST employ the dot (.) as decimal separator and MAY be specified in scientific notation, using e or E to separate the significand from the exponent. TSV files MUST be in UTF-8 encoding.

Read strictly, this mandates an explicit n/a for a missing value in every column, not only in "String values" columns. Because n/a is not a standard placeholder for missing numeric data, this unnecessarily complicates loading such files with anything that expects purely numeric values in a column (e.g. onset).
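To illustrate the extra handling involved, here is a minimal sketch using only the Python standard library. The table contents, column names, and the helper parse_numeric are hypothetical and chosen for illustration; the point is only that a plain float() call rejects n/a, so a loader has to map it explicitly.

```python
import csv
import io
import math

# Hypothetical events table; column names and values are illustrative only.
TSV_TEXT = "onset\tduration\ttrial_type\n0.5\t1.2\tgo\nn/a\t2e-1\tn/a\n"


def parse_numeric(cell):
    """Convert a TSV cell to float, mapping the n/a placeholder to NaN.

    Calling float("n/a") directly raises ValueError, which is why the
    placeholder has to be handled before conversion.
    """
    return math.nan if cell == "n/a" else float(cell)


# csv.DictReader with a tab delimiter also honours double-quoted cells.
rows = list(csv.DictReader(io.StringIO(TSV_TEXT), delimiter="\t"))
onsets = [parse_numeric(row["onset"]) for row in rows]
durations = [parse_numeric(row["duration"]) for row in rows]
```

Here the second onset ends up as NaN rather than raising an error, while the string column trial_type keeps n/a as plain text unless the caller decides otherwise.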

Describe what you expected.

I have not investigated this further yet and have no specific recommendation at the moment (e.g. I have not yet looked at how pandas expects a float NaN to be written in a TSV file). I am just raising a possible discussion point.

At the least, we might want to reorder the sentences to avoid a possible misassociation with string-only columns, i.e. to have it read:

Missing and non-applicable values MUST be coded as n/a. String values containing tabs MUST be escaped using double quotes. Numerical values MUST employ the dot (.) as decimal separator and MAY be specified in scientific notation, using e or E to separate the significand from the exponent. TSV files MUST be in UTF-8 encoding.
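One way to make this reading concrete is a small cell-level check. This is a sketch under stated assumptions, not the validator's actual logic: the regular expression NUMERIC_CELL and the function is_valid_numeric_cell are hypothetical, and the pattern is only one interpretation of "dot as decimal separator, optional e or E exponent".

```python
import re

# Assumed interpretation of the quoted rules: optional sign, digits with a
# dot as the decimal separator, and an optional e/E exponent.
NUMERIC_CELL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def is_valid_numeric_cell(cell):
    """Accept either the n/a placeholder or a number matching NUMERIC_CELL."""
    return cell == "n/a" or NUMERIC_CELL.fullmatch(cell) is not None
```

Under this interpretation, 1.5, 1e-3 and n/a pass, while 1,5 (comma as decimal separator), nan and an empty cell do not.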

BIDS specification section

https://bids-specification.readthedocs.io/en/latest/...

effigies commented 19 hours ago

Yes, n/a applies to all columns, and that is how the validator has handled it the whole time. Proposing nan or another alternative for numeric columns would not change the need for tools to work with n/a in historical datasets. I'm okay with the suggested reordering, if that clarifies things.