PeerHerholz / bids_bep16_conv

A small package to implement conversions between BEP-16 (Diffusion Derivatives) compliant datasets and other/existing software outputs.
https://peerherholz.github.io/bids_bep16_conv
BSD 3-Clause "New" or "Revised" License

"Processing" vs. "Conversion" & capturing relevant metadata #7

Open Lestropie opened 1 year ago

Lestropie commented 1 year ago

Think it's worth being explicit about the terminology here, particularly if people are going to be working on different components independently.

When I see the word "conversion" in the context of issues listed thus far, I think of taking an existing output that has been generated using some other software, and manipulating those data / generating requisite metadata in order to conform to BEP016. However in the description of those issues, what's being requested is both "processing" (ie. taking as input DWI data and from that estimating parameters of a model) and "conversion" from the native output of that external software to BEP016.

Now doing the latter in isolation is only possible if that external software yields adequate metadata to fill in the requisite / desired BIDS sidecar fields. There are some instances where this is true, others where it is not. Eg. Imagine a tensor fit using MRtrix3 (#3). I can always infer that the reference axes are xyz. However I can't infer from the output image whether the default fitting parameters (iterative weighted least squares w. 2 iterations) or user-specified parameters were used, since these are (currently) not embedded in the header information of the output image.
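To illustrate the problem, the non-inferable fitting parameters could be captured at fit time by writing a JSON sidecar alongside the output image. This is only a sketch; the field names below are hypothetical placeholders, not confirmed BEP016 keys:

```python
import json

# Hypothetical sidecar written at fit time, since fitting parameters
# (e.g. the estimator and iteration count) cannot be recovered from
# the output image header afterwards. Field names are illustrative only.
def write_fit_sidecar(path, estimator="iwls", iterations=2, axes="xyz"):
    sidecar = {
        "ModelFitMethod": estimator,   # e.g. iterative weighted least squares
        "Iterations": iterations,      # NOT inferable from the output image
        "ReferenceAxes": axes,         # inferable for MRtrix3 outputs
    }
    with open(path, "w") as f:
        json.dump(sidecar, f, indent=2)
    return sidecar

meta = write_fit_sidecar("dti_fit.json")
```

The point is that whichever component performs the fit is the only one that reliably knows these values, so it has to be the one that records them.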

Therefore, I see that there are two choices:

  1. We support default usage only. The operation of each feature request thus far can be broken into two independent steps: applying the relevant external software command to fit the model, and a secondary function that performs any requisite conversion and fills whatever metadata that can be reasonably inferred.

  2. We support custom usage. Essentially each feature request thus far requires something akin to a BIDS App. It requires an interface where users can specify relevant command-line options, which the App would use both to forward parameters to the external software command being invoked and to fill in relevant metadata appropriately.
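A minimal sketch of what option 2 might look like: a single argparse front end that both forwards user options to the external fitting command and records them for the sidecar. The option names and forwarding target are placeholders, not a real App interface (though `dwi2tensor` does accept an `-iter` option):

```python
import argparse

# Sketch of option 2: one interface that forwards parameters to the
# external fitting command AND records them for the BIDS sidecar.
# The option names below are placeholders, not an agreed-upon interface.
def build_parser():
    p = argparse.ArgumentParser(description="fit model + convert to BEP016")
    p.add_argument("--estimator", default="iwls")
    p.add_argument("--iterations", type=int, default=2)
    return p

def plan_run(argv):
    args = build_parser().parse_args(argv)
    # Forward the user's choice to the external command...
    cmd = ["dwi2tensor", "-iter", str(args.iterations)]
    # ...and record the same choice as metadata, so nothing is lost.
    sidecar = {"ModelFitMethod": args.estimator, "Iterations": args.iterations}
    return cmd, sidecar

cmd, sidecar = plan_run(["--iterations", "4"])
```

Because the App owns both the invocation and the metadata, custom usage no longer breaks the conversion step.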

arokem commented 1 year ago

I agree that the myriad customized usages may prove tricky. I propose that we start by limiting ourselves to option 1. But nevertheless, I think that it would add some clarity if these were designed as little BIDS apps that take a BIDS raw dataset and produce BEP16 outputs for each of the software libraries. Meaning that the act of conversion would be one of the final steps in a pipeline. The reason I think this adds clarity is that it shows us exactly what the target software did, specifically avoiding the ambiguity of the options that were selected. WDYT?

Lestropie commented 1 year ago

I think that makes the most sense. Partly what I've been thinking about here is whether there should be an individual BIDS App for every possible conversion (bearing in mind that longer term I'd like to also have the ability to convert from BEP016-compliant data back to software native representation, so the list could get long), or whether there should be a single BIDS App, where the desired processing / conversion is specified at the command-line, and we all contribute individual capabilities to that one App, which encapsulates all dependencies for all conversions.

Lestropie commented 1 year ago

Redirecting response to comment from #10:

right now we need the processing part in order to test and implement the conversion part.

OK, I hadn't formulated it in my own head that way, but I suppose that it makes sense:

  1. Have multiple Apps that operate on the same pre-processed DWI data and produce BEP016-compliant data.

  2. To the greatest extent possible, take the final portion of each tool (being the conversion of the software-native output to BEP016), turn it into its own little module that could potentially be used in isolation, and modify the BIDS App to use that module rather than having the code internal.

That strategy will inherently highlight the issue described in the original post here. If both steps are done together, then you can manually track any metadata that you want to keep; whereas if the only input to the standalone conversion tool is the image data and nothing else, some of the metadata of interest may not be accessible. And potentially there are ways of manually providing that information to such a conversion tool. :+1:
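The split in step 2 might look like the following sketch: a conversion function usable in isolation, with an optional argument through which the combined App (or a user) can supply the metadata that the image alone cannot provide. All names here are hypothetical:

```python
# Hypothetical standalone conversion module: takes the software-native
# output plus whatever metadata the caller can supply, and returns a
# BEP016-style sidecar dict. When called from the combined App, the
# processing step passes the parameters it actually used; when called
# standalone, extra_metadata may simply be absent.
def convert_to_bep16(native_image_path, extra_metadata=None):
    sidecar = {"SourceFile": native_image_path}
    sidecar.update(extra_metadata or {})
    return sidecar

# Combined use: the processing step tracked the parameters it used.
full = convert_to_bep16("dt.mif", {"Iterations": 2})

# Standalone use: only what can be inferred from the image is available.
bare = convert_to_bep16("dt.mif")
```

The difference between `full` and `bare` is exactly the metadata gap described above.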

PeerHerholz commented 1 year ago

Thx @Lestropie! Just to be sure I understand correctly: the idea would be to conduct both steps together in order to get the conversion "right", i.e. as good as possible re metadata, etc., and at a certain point of development create the respective Python library specific to conversions, right?

If so, the current state/structure of the code can "simply" be adapted such that we take out the processing part/module and keep the conversion part/module. In a way, the current state would also support a rather focused usage re conversions, with the processing part/module being simply an add-on that could be ignored if folks don't need it. Similarly, if they do need it, we can just add the disclaimer that they need extra dependencies (e.g. MRtrix). However, if a clear cut between processing and conversion should be applied in the library, that's also no problem.

Lestropie commented 1 year ago

Well, that's what I thought you were thinking, based on your description πŸ˜†

But yes, I think that this approach does make sense. Each of those things individually could be of use. A combined processing & conversion is necessary to get from the exemplar data to exemplar BEP016-compliant data. From there, the "conversion" bit in and of itself would also be useful (though that separation may introduce some difficulties, as described above). And the "conversion" bit could over time be expanded to do things like converting from BEP016 back to data that can be read by either the originating software tool or others.

PeerHerholz commented 1 year ago

Hi everyone,

while working on adding new processing pipelines for dipy and mrtrix, I kept wondering about what should be "expected" as input for the conversion tool. Obviously, we can't support an arbitrary output of processing tools/pipelines as input for the conversion (at least currently). What we said so far was that the input of the processing tool/pipeline is a BIDS common derivatives compliant dataset, and I think that's a restriction we want to keep. However, as soon as we process the data via e.g. dipy or mrtrix, there are not only options to alter the processing but also to save different output files. Regarding the latter, we would need to decide on a "minimal" set of files needed for BEP16 compliance, but this might differ from the default usage of the respective tools. For example, when running dipy_fit_csd in its default manner (without specifying optional arguments) I get the following output:

bids_bep16_datasets/HBN/derivatives/dipy_csd/
└── sub-NDARYM277DEA
    └── ses-HBNsiteCBIC
        └── dwi
            └── peaks.pam5

However, when I add --extract_pam_values, I get way more files:

bids_bep16_datasets/HBN/derivatives/dipy_csd/
└── sub-NDARYM277DEA
    └── ses-HBNsiteCBIC
        └── dwi
            β”œβ”€β”€ gfa.nii.gz
            β”œβ”€β”€ peaks.pam5
            β”œβ”€β”€ peaks_dirs.nii.gz
            β”œβ”€β”€ peaks_indices.nii.gz
            β”œβ”€β”€ peaks_values.nii.gz
            └── shm.nii.gz

Comparable cases arise with other tools and analyses as well. Thus, I wanted to start/revamp the respective discussion.
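One way to cope with the variable output set would be for the converter to map whichever files are present to BIDS-style names and report the rest as missing. A sketch, assuming the dipy_fit_csd filenames shown above; the target names are invented for illustration, not taken from BEP016:

```python
from pathlib import Path

# Hypothetical mapping from dipy_fit_csd output names to BIDS-style
# targets; the target entities are illustrative only.
CSD_OUTPUTS = {
    "gfa.nii.gz": "sub-X_desc-gfa_dwimap.nii.gz",
    "shm.nii.gz": "sub-X_desc-shm_dwimap.nii.gz",
    "peaks_dirs.nii.gz": "sub-X_desc-peaksDirs_dwimap.nii.gz",
}

def plan_conversion(out_dir):
    """Return (present, missing) conversion plans for one dwi directory."""
    out_dir = Path(out_dir)
    present = {src: dst for src, dst in CSD_OUTPUTS.items()
               if (out_dir / src).exists()}
    missing = sorted(set(CSD_OUTPUTS) - set(present))
    return present, missing
```

Run against the default dipy_fit_csd output above, this would convert nothing from the mapping; run against the `--extract_pam_values` output, it would convert all three.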

Lestropie commented 1 year ago

Obviously, we can't support an arbitrary output of processing tools/pipelines as input for the conversion (at least currently)

I don't see why this can't be supported at least for the case where an App requires as input a pre-processed DWI series. Where I've done this myself the expectations have been:

  • No more than one DWI series
  • "_desc-preproc" must be in the filename
  • Optional "_desc-brain_mask"

Where things get trickier is if the input to some BIDS App is a BEP016-compliant dataset. It's projecting to that use case that's motivating the level of complexity I've been adding to BEP016.

However, as soon as we process the data via e.g. dipy or mrtrix, there are ... options ... to save different output files ... we would need to decide on a "minimal" set of files needed for BEP16 compliance but this might differ from the default usage of the respective tools.

I would expect that here we would want the scope of outputs to be as broad as possible using the respective tools, to maximise the scope of demonstrative examples. It's much easier to derive a compliant subset than a compliant superset. If the Apps were truly "conversion" rather than "processing", it would be necessary to detect the presence or absence of such files and convert where appropriate. This would be much easier for something like the example you provide here; for e.g. MRtrix, where multiple derivative files can be generated, each with its filesystem path under user control, this becomes considerably more challenging.

PeerHerholz commented 1 year ago

Hi @Lestropie,

thx for the reply.

I don't see why this can't be supported at least for the case where an App requires as input a pre-processed DWI series.

My question was referring to cases where users provide datasets that were already processed with dipy/mrtrix outside our App and thus might not have all the necessary files. I.e. if we at some point split the processing and conversion parts of the App to make it a conversion-only App, then the input to the conversion might/will differ drastically between users (given that they might run the respective processing (e.g. DTI or CSD) differently). Right now we're controlling the processing, and thus the input to the conversion, directly through the App, but this might change in the future.

Where I've done this myself the expectations have been:

  • No more than one DWI series
  • "_desc-preproc" must be in the filename
  • Optional "_desc-brain_mask"

Yeah, we think we should definitely make clear that the App is intended to work on BIDS common derivatives compliant datasets. However, we would also need to provide a list of files required for the conversion, as we can't reconstruct/obtain those retroactively. I.e.: "If you processed your data with dipy, you should have done it the following way .... and need to provide the following files."

Lestropie commented 1 year ago

the input to the conversion might/will differ drastically between users

Yup. The command-line interfaces to such things could potentially be quite complex. Another one of the motivating factors for creating this Issue in the first place. In retrospect I may not have elucidated this in full. But for instance, one of the things I thought about in the context of MRtrix is that some provenance information can be tracked within file headers, which a conversion tool could hypothetically use to figure out any custom parameters that were used and hunt for relevant files that were used along the way. It wouldn't be bullet-proof, but there's some scope for coding cleverness.
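As a toy example of that kind of provenance mining: given a command-history line of the sort MRtrix3 embeds in its image headers, one could recover user-specified options for the sidecar. The history format here is deliberately simplified, and the recognised option names are placeholders:

```python
import shlex

# Toy provenance parser: recover user-specified options from a
# command-history line (format simplified relative to real MRtrix
# headers). Not bullet-proof, as noted; just "coding cleverness".
def options_from_history(history_line, option_names=("-iter", "-mask")):
    tokens = shlex.split(history_line)
    opts = {}
    for i, tok in enumerate(tokens):
        # Each recognised option is assumed to take exactly one value.
        if tok in option_names and i + 1 < len(tokens):
            opts[tok.lstrip("-")] = tokens[i + 1]
    return opts

opts = options_from_history("dwi2tensor dwi.mif dt.mif -iter 4 -mask mask.mif")
```

A real implementation would have to handle per-command option tables, flags without values, and histories spanning multiple commands.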

and thus might not have all the necessary files

Arguably a feature rather than a bug. If you no longer have the information necessary to robustly describe your model fit data, then you can't represent your model fit data in a robust way.

we would also need to provide a list of files required for the conversion

Separate lists for required vs. optional. There would need to be some minimum viable data for conversion, but then the ability to capture additional "model-derived parameters" only if they are present based on some fixed filesystem path / provided explicitly by the user (hence the prior attempt at distinction between MFP and MDP...).
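That required-vs-optional distinction could be encoded directly. A sketch, with placeholder filenames borrowed from the dipy example earlier in the thread:

```python
# Sketch of required vs. optional inputs for one conversion; the
# filenames are placeholders. Conversion proceeds only if everything
# required is present; optional files enrich the output when found.
REQUIRED = {"peaks.pam5"}
OPTIONAL = {"gfa.nii.gz", "shm.nii.gz"}

def check_inputs(available):
    available = set(available)
    missing_required = sorted(REQUIRED - available)
    usable_optional = sorted(OPTIONAL & available)
    return missing_required, usable_optional

missing, extras = check_inputs(["peaks.pam5", "gfa.nii.gz"])
```

An empty `missing` list means the minimum viable data for conversion is there; `extras` lists the model-derived parameters that happen to be capturable for this particular dataset.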