bids-standard / bids-2-devel

Discussions and suggestions of backwards incompatible changes to BIDS
https://bids.neuroimaging.io/
Creative Commons Attribution 4.0 International
10 stars 1 forks source link

N/A vs NaN for missing values #12

Open tsalo opened 3 years ago

tsalo commented 3 years ago

So this may seem like a minor gripe, but the specification is currently very clear on the “Missing and non-applicable values MUST be coded as n/a” [overt caps in the original]. I am currently battling to BIDs align large multimodal datasets across many different types of pipelines/sources (not just imaging), and for several reasons this edict has proven to be particularly unhelpful. I would suggest relaxing (or changing) the rule in 2.0 (I personally have given up and will simply adding a caveat to my readme and .json files).

Original authors: @CPLambert

tsalo commented 3 years ago

@sappelhoff wrote:

Could you briefly list the most important reasons why "n/a" has "proven to be particularly unhelpful"?

tsalo commented 3 years ago

@CPLambert wrote:

Could you briefly list the most important reasons why "n/a" has "proven to be particularly unhelpful"?

Thanks Stefan:

In a nutshell (and in my experience): NaNs are trivial to use in MATLAB, can be handled natively as numeric vectors across a whole range of operations with supported functions including logical indexing (e.g. isnan) - This is also true for a wide range of other languages including Python (numpy.isnan), R(is.nan) etc, etc., 'n/a', on the other hand, is not and requires additional conversions because it is treated as a string and needs replacing (NaN<-> 'n/a') in vectors where it occurs when writing/reading/using.

This may seem trivial, but when scaled to large mixed multimodal phenotyping datasets with >5000 fields including exports from other databases (e.g. RedCap), I have personally found the capitalised edict to become, as stated, "particularly unhelpful", not to mention un-necessarily restrictive and surely counter productive to the broader aim of BIDS?

Are there any good reasons not to relax (or change) this in version 2.0?

I can, of course, provide more important reasons if required.

BW

-Chris

jbteves commented 3 years ago

This is widely used to mean unknown/unavailable for floating-points, I think this would be a good change that makes things easier to parse.

neurolabusc commented 3 years ago

Technically, the JSON format requires numbers to be finite. I recognize that SPM intentionally uses NaNs to mean out of brain, and Matlab and NumPy handle infinity and NaN values appropriately. However, since BIDS is based on the JSON format, it is likely that many JSON tools will raise an exception if a value is reported as NaN, +Inf, -Inf (as they should, if they strictly observe the standard).

This is actually an issue with dcm2niix, which directly transcribes DICOM tags to JSON. At the moment, if a value is recorded as Infinity in the DICOM, the JSON will list infinity. What should the proper behavior of DICOM to BIDS converters be (e.g. dcm2niix and @xiangruili's dicm2nii):

  1. Transcribe DICOM values directly to JSON to retain all data. This is the general dcm2niix approach, to always assume that the DICOM file is truthful.
  2. Generate an exception, as in reality when a DICOM Tag (e.g. EchoTime) is not finite, it virtually always suggests a corrupt DICOM file.
  3. Simply omit the specific BIDS tag, as the value can not be represented (e.g. if a DICOM image reports a Echo Time of Inf+, the BIDS EchoTime tag is not generated, but all other valid tags are generated).

I have no preference, but wonder if there is a consensus. I generally consider @effigies the arbitrator for all things BIDS, so happy to hear his thoughts.

robertoostenveld commented 3 years ago

N/A vs NaN not only applies to information that is obtained from DICOM files, but also for information obtained from other raw data files (MEG, EEG, iEEG). I am not aware of information from the MEG, EEG and iEEG raw data files that needs to be specified in the sidecar JSON files where this forms a problem. MEG, EEG and iEEG raw data file headers are simpler than DICOM and usually well-behaved.

Furthermore, it can apply to information that is acquired by the researcher not using the scanner. E.g. questionnaire data, like age, sex, etc. I believe in that case that the missing data always ends up in a TSV file. Missing data in TSV files can be coded in any form since the TSV can be accompanied by a JSON data dictionary (which for example might specify that -1 is used to code age when it is not known).

I think it would help to clarify whether "N/A vs NaN" is considered a problem in the JSON files, and/or in the TSV files.

effigies commented 3 years ago

I think I agree with @robertoostenveld that there's a JSON/TSV conflation happening here. So I will address both:

JSON

From the StackOverflow that @neurolabusc posted, the "correct" JSON transcription of nan, inf and -inf is null, which will turn into a Python None. I'm not sure how MATLAB or R handle those, but making it clear that we follow this convention would allow any BIDS tooling to handle it. It does not appear BIDS says anything about missing values in JSON, probably because you can always omit them.

Reading among Chris' options, I would say option 1 is bad because it produces malformed JSON, but either of the other two would be fine. I would add a further option of normalizing to null, which has the advantage of being conformant JSON, marking that a value was present but non-finite, but the disadvantage of possibly being more complicated to tools than dropping.

TSV

To summarize my understanding of the issue:

1) "n/a" is a missing value, which is a way that curators can say "It is up to the reader to interpret this field". 2) nan is a valid value, which is a way that curators can say "this entry is intended to be a nan". 3) When people always interpret "n/a" as nan, then this can be annoying to add the interpretation step to all analysis code.

Apps may fill in nan, but they may equally validly do some interpolation. If nan can be generated for another reason, then a tool intending to interpolate either nan or there is a loss of information that can lead to incorrect behavior.

Assuming others agree with my interpretation, I'm not sure how best to describe this in the spec: There is flexibility to specify a non-finite value such as NaN, but that imposes the curator's interpretation on all downstream tools.

FWIW there are MATLAB tools to handle heterogeneous missing values: www.mathworks.com/help/matlab/matlab_prog/clean-messy-and-missing-data-in-tables.html

CPLambert commented 3 years ago

Hi there,

So @robertoostenveld JSON/TSV point is spot on. My gripe was pragmatic and arose from the .tsv side, trying to do things correct by BIDs but getting mildly irritated with n/a -> NaN conversions in tsvs being pulled from different data formats (string, numeric, genetic,imaging) as it started causing problems in downstream (non-neuro) tools, and was making things overly complicated. The wording of the BIDS guidelines seemed a tad restrictive for .tsv data, especially if I could always define my interpretation/use a data dictionary. Plus there's a lot of handy functions to handle NaNs in large matrices across languages e.g. nanmean, but less so for n/a

Regarding @effigies point, personally I look at/use NaN as "unusable data" for whatever reason (and clearly the reason will be context dependent). Admitted, I am a bit uncertain of a scenario when interpreting n/a as NaN would strictly be wrong, and a counter-example would be useful here.

fangq commented 3 years ago

Re @effigies : From the StackOverflow that @neurolabusc posted, the "correct" JSON transcription of nan, inf and -inf is null, which will turn into a Python None. I'm not sure how MATLAB or R handle those

MATLAB's jsonencode converts nan/inf/-inf to null by default, but can convert to NaN, Infinity and -Infinity if ConvertInfAndNaN is set to false (R2017b or newer)

>> jsonencode([1,nan,-inf,inf])
ans =
    '[1,null,null,null]'

>> jsonencode([1,nan,-inf,inf], 'ConvertInfAndNaN', false)
ans =
    '[1,NaN,-Infinity,Infinity]'

I am not a fan for either of these options, because the first approach loses information, and the 2nd approach produces non-compliant JSON notations.

in comparison, I've been using string forms "_NaN_", "_Inf_" and "-_Inf_" to convert these values in my JSONLab toolbox since 2011, also formalized in this specification

https://github.com/fangq/jdata/blob/master/JData_specification.md#special-constants

A similar string form was also adopted in the JSON outputs from the MATLAB Production Server, except it does not have the underscores, see

https://www.mathworks.com/help/mps/restfuljson/json-representation-of-matlab-data-types.html#bu6ii47