bids-standard / bids-specification

Brain Imaging Data Structure (BIDS) Specification
https://bids-specification.readthedocs.io/
Creative Commons Attribution 4.0 International

[BUG] clarify/re-consider use of "n/a" in numeric columns #1938

Open yarikoptic opened 2 months ago

yarikoptic commented 2 months ago

Describe your problem in detail.

At the moment, the specification for Tabular files (https://bids-specification.readthedocs.io/en/stable/common-principles.html#tabular-files) states:

String values containing tabs MUST be escaped using double quotes. Missing and non-applicable values MUST be coded as n/a. Numerical values MUST employ the dot (.) as decimal separator and MAY be specified in scientific notation, using e or E to separate the significand from the exponent. TSV files MUST be in UTF-8 encoding.

So, on my best reading, it mandates an explicit n/a for a missing value in every column, not only in "String values" columns. Since n/a is not a standard placeholder, that unnecessarily complicates loading such files with any tool that expects purely numeric values in a column (e.g. onset).
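To make the concern concrete, here is a minimal sketch (standard-library Python only) of the extra step a reader needs when a numeric column may contain n/a. The helper name `read_numeric_column` and the sample rows are hypothetical and used only for illustration; they do not come from the specification or the validator.

```python
import csv
import io
import math


def read_numeric_column(tsv_text, column):
    """Return the named column as floats, mapping the BIDS "n/a" marker to NaN."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    values = []
    for row in reader:
        cell = row[column]
        # A bare float(cell) would raise ValueError on "n/a",
        # so the marker has to be special-cased before conversion.
        values.append(math.nan if cell == "n/a" else float(cell))
    return values


# Hypothetical events-like table with one missing onset.
example = "onset\tduration\ttrial_type\n0.5\t1.2\tgo\nn/a\t2e-1\tstop\n"
onsets = read_numeric_column(example, "onset")
durations = read_numeric_column(example, "duration")
```

Libraries that accept a configurable list of missing-value markers can usually do the same mapping at load time; the point of the sketch is only that the marker must be handled somewhere before the column can be treated as numeric.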

Describe what you expected.

I have not investigated this further yet and do not have a specific recommendation at the moment (that would need, e.g., a look at how pandas expects a float NaN to be written in a TSV). Just raising a possible discussion point.

At the very least, we might want to reorder the sentences so that the n/a rule is not misread as applying only to string columns, i.e. to have:

Missing and non-applicable values MUST be coded as n/a. String values containing tabs MUST be escaped using double quotes. Numerical values MUST employ the dot (.) as decimal separator and MAY be specified in scientific notation, using e or E to separate the significand from the exponent. TSV files MUST be in UTF-8 encoding.
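As an illustration of what the rules in that paragraph ask of a writer, here is a hedged sketch in standard-library Python. The helpers `format_cell` and `write_tsv` are hypothetical names, and treating both `None` and a float NaN as "missing" is an assumption made for this example, not something the specification states.

```python
import csv
import io
import math


def format_cell(value):
    """Render one value; None and float NaN are treated as missing (assumption)."""
    if value is None:
        return "n/a"
    if isinstance(value, float):
        if math.isnan(value):
            return "n/a"
        # repr() of a float always uses "." as the decimal separator,
        # independent of the system locale.
        return repr(value)
    return str(value)


def write_tsv(header, rows):
    """Return TSV text; fields containing a tab are wrapped in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerow(header)
    for row in rows:
        writer.writerow([format_cell(v) for v in row])
    return buffer.getvalue()


tsv_text = write_tsv(
    ["onset", "duration", "note"],
    [[0.5, None, "a\tb"], [float("nan"), 1.25, "plain"]],
)
```

When saving the result to disk, opening the file with `encoding="utf-8"` covers the last rule in the paragraph.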

BIDS specification section

https://bids-specification.readthedocs.io/en/latest/...

effigies commented 2 months ago

Yes, n/a applies to all columns, and that is how the validator has handled it the whole time. Proposing nan or another alternative for numeric columns would not change the need for tools to work with n/a in historical datasets. I'm okay with the suggested reordering, if that clarifies things.

VisLab commented 1 month ago

I think tools have adapted to n/a in all columns and such a change would trigger a lot of changes.

yarikoptic commented 1 month ago

Cool, let's then plan #1940 to fix this issue with just a minor tune-up to the wording.