Differentiating model fit/derived parameters

This builds on the discussions had during the recent BIDS Workshop and follows up after #68. This would be a counter-argument against bids-standard/bids-specification#1282.

I'm a bit worried about the factual encoding of provenance by using mfd vs mdp or model vs mdp, and alternatives of the sort.

Despite how reluctant BIDS is about provenance (a lot), there is some rough prescription in the BIDS-Derivatives specifications that could be of use in this particular case:

Each Derivatives filename MUST be of the form: <source_entities>[_keyword-<value>]_<suffix>.<ext> (where <value> could either be an <index> or a <label> depending on the keyword; see Definitions).

In other words, the filename must encode the critical provenance information to understand where each file comes from reasonably.

So, we can think of the "model fit parameters" of a DTI model as:

pipeline-22.1.0/
  sub-01/
    dwi/
      sub-01_fit-dti_dwi.nii.gz                  # Contains the tensors (model fit parameters)
      sub-01_fit-dti_dwi.json
      sub-01_fit-dti_mdp-FA_dwi.nii.gz           # Contains the FA
      sub-01_fit-dti_mdp-FA_dwi.json
      sub-01_fit-dti_mdp-MD_dwi.nii.gz           # Contains the MD
      sub-01_fit-dti_mdp-MD_dwi.json

Where the sidecar of the model fit parameters has:

{"FitMethod": "WLS"}

And the model derived parameters may have:

{"Units": "mm^2/sec"}

If one is particularly scrupulous and doesn't like this indirect provenance tracking given by BIDS-Derivatives (because the mdps unequivocally point at a model fit parameter file), one can find the following in the BIDS spec, which does not apply directly (because it talks about cascading pipelines) but I think is helpful:

When chaining derivative pipelines, any JSON fields that were specified as mandatory in the input files SHOULD be propagated forward in the output file’s JSON provided they remain valid.

Then, it could be suggested (and IMHO discouraged) that the model-derived parameters have the following sidecar instead:

{
  "FitMethod": "WLS",
  "Units": "mm^2/sec"
}

If bids-standard/bids-specification#1280 were accepted in some form, the proposed fit-<label> entity would not be necessary anymore:

pipeline-22.1.0/
  mdp-MD_dwi.json
  sub-01/
    dwi/
      model-DTI1/
        sub-01_model-DTI1_dwi.nii.gz                  # Contains the tensors (model fit parameters)
        sub-01_model-DTI1_dwi.json
        sub-01_model-DTI1_mdp-FA_dwi.nii.gz           # Contains the FA
        sub-01_model-DTI1_mdp-MD_dwi.nii.gz           # Contains the MD
      model-DTI2/
        sub-01_model-DTI2_dwi.nii.gz                  # Contains the tensors (model fit parameters)
        sub-01_model-DTI2_dwi.json
        sub-01_model-DTI2_mdp-FA_dwi.nii.gz           # Contains the FA
        sub-01_model-DTI2_mdp-MD_dwi.nii.gz           # Contains the MD

Where the units of the MD have been defined for all subjects at the same time (see mdp-MD_dwi.json at the top), as that doesn't seem a reasonable thing to vary between subjects (nothing you could not do without model-<label>), and it would just contain:

{"Units": "mm^2/sec"}

In the case of several "mfp" files, the proposal works well if the model-<label> directory is usable:

pipeline-22.1.0/
  mdp-MD_dwi.json
  sub-01/
    dwi/
      model-DTI1/
        sub-01_model-DTI1_dwi.nii.gz                  # Contains the tensors (model fit parameters)
        sub-01_model-DTI1_dwi.json
        sub-01_model-DTI1_mdp-FA_dwi.nii.gz           # Contains the FA
        sub-01_model-DTI1_mdp-MD_dwi.nii.gz           # Contains the MD
      model-DTI2/                                     # This is not DTI, this is the "DoubleTroubleIndex" model
        sub-01_model-DTI2_fit-all1_dwi.nii.gz         # Contains the model fit part 1
        sub-01_model-DTI2_fit-all1_dwi.json
        sub-01_model-DTI2_fit-all2_dwi.nii.gz         # Contains the model fit part 2
        sub-01_model-DTI2_fit-all2_dwi.json
        sub-01_model-DTI2_mdp-ID_dwi.nii.gz           # Contains the ID parameter

That said, I want to conclude that this far, IMHO, the only practical problems I have learned of that have been described as sourced from the lack of multiple inheritances in BIDS precluding proper metadata composition actually are provenance tracking issues.

I acknowledge the theoretical grounds of the multiple inheritances problem, but I haven't seen a practical situation where it is unavoidable. Addressing it would make BIDS less brittle (as in, a user mistakenly doing something different from what they think they are doing, assuming metadata will come from two or more files), but not for this particular reason discussed in BEP016 -- which means we could think of putting it off to BIDS 2.0.

cc/ @arokem @Lestropie @francopestilli @effigies @poldrack

BTW, depending on #68 and upstream, please mentally replace the suffix _dwi with anything you like (except _mfp/_mdp and variants because then this doesn't make sense).

Then, it could be suggested (and IMHO discouraged) that the model-derived parameters have the following sidecar instead:

I've thought about that and not liked the outcome terribly. If fields like "FitMethod" were encapsulated in a dict dedicated to "model input / fitting parameters" it wouldn't be quite so bad, but maybe it's better that any MDP only have metadata relating to how they were derived from the model fit (eg. I'm thinking of something like extracting discrete fibre directions from SH ODFs, which can be done a couple of different ways), and greater upstream detail deferred explicitly to provenance.

That said, I want to conclude that this far, IMHO, the only practical problems I have learned of that have been described as sourced from the lack of multiple inheritances in BIDS precluding proper metadata composition actually are provenance tracking issues.

I think the downstream distinction between model fit and model-derived has obscured the upstream origin of the multiple inheritance problem; the latter can pop its head up when discussing the former, but they're different issues.

Let's take the example explained in #50, but remove the fit / derived parameter distinction:

"We have a hypothetical DWI model called ABC. This model is represented using parameters X and Y. X and Y are of fundamentally different data types, such that it is not possible to store both in a single NIfTI image, and they must be split across multiple images. For metadata, there is information that is relevant to model ABC as a whole, and there is additionally information that is specific to parameter X and parameter Y separately."

For "information that is specific to parameter X and parameter Y separately", think about "OrientationRepresentation" and "ReferenceAxes". Within a single model, you could have a combination of different representations; e.g ball-and-sticks has some scalars and some discrete fibre orientations. The presence / values of these metadata fields is therefore different between the images used to store those data. But information about the model and the input parameters that affect fitting of the model are the same between the two. So ideally you want both NIfTIs to inherit the same JSON that encodes model name / URL / input parameters, but each to then have their own JSON that encodes the presence & representation of any orientation information.

I think the downstream distinction between model fit and model-derived has obscured the upstream origin of the multiple inheritance problem; the latter can pop its head up when discussing the former, but they're different issues.

Agreed. What I'm bringing up here is that the multiple inheritance problem is not an urgent one to solve IMHO and could be pushed to BIDS 2.0 where breaking backward compatibility with BIDS 1.0 should not be a problem. I also argue that I believe we can finalize this BEP without any substantial changes to the current BIDS specs.

As a result of the above thinking, the core of this proposal is to use BIDS (and BIDS-Derivatives) as already accepted by the community and encode derived parameters as derivatives themselves of the fit model (hence, using the <source_entities>_key-value mechanism).

"We have a hypothetical DWI model called ABC. This model is represented using parameters X and Y. X and Y are of fundamentally different data types, such that it is not possible to store both in a single NIfTI image, and they must be split across multiple images. For metadata, there is information that is relevant to model ABC as a whole, and there is additionally information that is specific to parameter X and parameter Y separately."

I think this is the wrong approach. Instead of going from abstract (hypothetical) to concrete, I prefer to follow the BIDS workshop's motto (and @francopestilli's suggestion) and go from specific to general.

This BEP has been stalled by inactivity mostly (one principal actor of that inactivity being myself, apologies), but also for trying to establish solutions that require substantial modifications to the rest of the BIDS specs. In other words, let's take the most common DWI models (e.g., everything currently supported by MRTrix and DIPY) and prescribe how to store them. If we hit such a difficult situation that there's no way around having two suffixes or inheriting from multiple files, then go for the bigger fish. But at this point, I'm pretty sure the only way we finish this BEP in a reasonable timeframe is by abiding by the current BIDS principles and definitions.

One of the reasons for the success of BIDS is the very scoped advance of the early drafts. First, let's solve fMRI (because, lest we forget, everything started from OpenfMRI), then a few more modalities received support, and now we have EEG or PET. If BIDS attempted to capture everything from the beginning, I agree it would be hypothetically much more harmonic, consistent, and beautiful, but it would not exist at this point.

+1 to focusing in on some specific commonly-used examples to make progress.

On Wed, Sep 21, 2022 at 12:20 AM Oscar Esteban @.***> wrote:

I think the downstream distinction between model fit and model-derived has obscured the upstream origin of the multiple inheritance problem; the latter can pop its head up when discussing the former, but they're different issues.

Agreed. What I'm bringing up here is that the multiple inheritance problem is not an urgent one to solve IMHO and could be pushed to BIDS 2.0 where breaking backward compatibility with BIDS 1.0 should not be a problem. I also argue that I believe we can finalize this BEP without any substantial changes to the current BIDS specs.

As a result of the above thinking, the core of this proposal is to use BIDS (and BIDS-Derivatives) as already accepted by the community and encode derived parameters as derivatives themselves of the fit model (hence, using the _key-value mechanism).

"We have a hypothetical DWI model called ABC. This model is represented using parameters X and Y. X and Y are of fundamentally different data types, such that it is not possible to store both in a single NIfTI image, and they must be split across multiple images. For metadata, there is information that is relevant to model ABC as a whole, and there is additionally information that is specific to parameter X and parameter Y separately."

I think this is the wrong approach. Instead of going from abstract (hypothetical) to concrete, I prefer to follow the BIDS workshop's motto (and @francopestilli https://github.com/francopestilli's suggestion) and go from specific to general.

This BEP has been stalled by inactivity mostly (one principal actor of that inactivity being myself, apologies), but also for trying to establish solutions that require substantial modifications to the rest of the BIDS specs. In other words, let's take the most common DWI models (e.g., everything currently supported by MRTrix and DIPY) and prescribe how to store them. If we hit such a difficult situation that there's no way around having two suffixes or inheriting from multiple files, then go for the bigger fish. But at this point, I'm pretty sure the only way we finish this BEP in a reasonable timeframe is by abiding by the current BIDS principles and definitions.

One of the reasons for the success of BIDS is the very scoped advance of the early drafts. First, let's solve fMRI (because, lest we forget, everything started from OpenfMRI), then a few more modalities received support, and now we have EEG or PET. If BIDS attempted to capture everything from the beginning, I agree it would be hypothetically much more harmonic, consistent, and beautiful, but it would not exist at this point.

— Reply to this email directly, view it on GitHub https://github.com/bids-standard/bids-bep016/issues/69#issuecomment-1253306199, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUVEHWHK3T6HDD52VMIB3V7KZLJANCNFSM6AAAAAAQQOP6IA . You are receiving this because you were mentioned.Message ID: @.***>

-- Russell A. Poldrack Albert Ray Lang Professor of Psychology Associate Director, Stanford Data Science Director, SDS Center for Open and Reproducible Science Building 420 Stanford University Stanford, CA 94305

@. @.> http://www.poldracklab.org/

Writing this comment on PeerHerholz/bids_bep16_conv#3 made me think of an interesting alternative perspective on the MFP / MDP distinction.

I've spoken about how the distinction here is that the latter can be regenerated from the former, but regeneration of any of the former would necessitate re-performing the model fit to the empirical data. The retort to that is that it's more likely that a single BIDS App will both fit the model and generate a bunch of derivatives of such, and save them all into a single dataset, and that users shouldn't be manually interacting with the contents of the BIDS Derivatives dataset to eg. generate additional quantitative images (ie. the BIDS Derivatives dataset should be considered immutable).

There's an implicit assumption within this argument regarding the BIDS App interface and the grouping of processing steps. Currently, you run a single "participant" level analysis, on a BIDS Raw dataset, and get a derivatives dataset with a large set of files. In the context of eg. PeerHerholz/bids_bep16_conv#3, we've been proceeding on the assumption that execution of a single analysis level of a single BIDS App will produce both model fit and derived parameters and save them all together, and we've debated whether they necessitate distinction in naming within that single dataset. We're also assuming that something like eg. tractography is such a drastic change in computation that it shouldn't be encompassed within the same App (indeed not even within the same BEP).

But consider the following two hypothetical implementations:

There are two separate BIDS Apps:
1. The first takes as input a BIDS Derivatives dataset containing pre-processed DWI data, fits the diffusion model to the data, and saves in its output dataset the model fit parameters.
2. The second takes as input the output dataset of the first, calculates from the model fit various derivative parameters, and saves those model-derived parameters in its output dataset.

(Note that this echoes the modular command structure philosophy of MRtrix3, where dwi2tensor and tensor2metric are different commands)

There is a single BIDS App, but unlike the constraints of "BIDS Apps 1.0", where one is limited to a single "participant"-level analysis, there are multiple different analysis levels that operate on individual participants.
1. "fit"-level analysis takes as input a BIDS Derivatives dataset containing pre-processed DWI data, fits the diffusion model to the data, and saves in its output dataset the model fit parameters.
2. "metrics"-level analysis takes as input a BIDS Derivatives dataset containing a model fit (which may have come from its own "fit"-level analysis), calculates from the model fit various derivative parameters, and saves those model-derived parameters in its output dataset.

Both of these:

Defer the distinction between MFP and MDP from filenames to dataset directory locations.
Unlock what IMO should be a key power of standardizing the way in which such data are stored. It should be possible to compute the tensor from the data using one software package, and compute the quantitative metrics from that tensor fit using another software package, and for all of the data at all stages to be BIDS compliant. This might seem pointless for the tensor since there's no ambiguity in how the various metrics are calculated, but for other models there are. Or you could take a model calculated by one package, standardise its representation via BIDS, then load it via some other package and compute metrics from it that were intended for one model but are in fact entirely applicable to the other (eg. I'm thinking of the commonality between ball-and-sticks and fixel-based derivatives of CSD).

I'm not necessarily advocating for such a solution. But I do think it illuminates the nature of the problem.

BEP is currently proceeding on the premise of not differentiating between model fit and model-derived parameters; major change in #92 following https://github.com/bids-standard/bids-specification/issues/1602. If that distinction is to be useful in any other context, eg. see https://github.com/bids-standard/bids-specification/pull/1282, it would be better placed outside of the context of BEP016.

bids-standard / bids-bep016

Differentiating model fit/derived parameters #69