Open discussion RE fixel data handling

Lestropie commented 1 year ago

This is intended to be a bit of a centralizing thread where I can demonstrate how a few different proposed capabilities join together to form my vision of how I'd like to see the handling of fixel data change going forward, as well as its potential relevance to external projects.

1. `.mif` is sub-optimal for fixel data

In the development of the fixel data directory format and its utilisation in the FBA implementation, the .mif format was used for 1D fixel data files, since it:

Was already implemented in terms of back-end API
Can store sidecar header information
Can be manipulated using many existing MRtrix3 commands
Can be memory-mapped
Can manipulate strides (eg. making the direction 3-vector contiguous in memory)

However:

It contains compulsory fields (eg. transform, voxel size) that are wholly inapplicable
It (currently at least) requires constructing a 3D image with two axes of unity size, which is not a faithful representation of the 1D nature of the data
Such files can be opened in mrview, which makes no sense
While external software could implement / use functionality to read .mif data and therefore access fixel data, the format is not necessarily "native" to such projects

Edit: Some of the annoyances here are discussed in #1664, but the focus there is on improving the use of the .mif format for fixel data rather than superseding it.

In #2437 I implemented back-end support for Python .npy files. This to me is a good candidate for storage of fixel data.

The data it stores can be 1D (one value per fixel) or 2D (could be multiple quantitative values per fixel, eg. data across participants, or it could be the N x 3 directions file)
The format does contain a Python dictionary that could theoretically be used to encapsulate header information such as command_history, though typically that dictionary holds only its three compulsory fields, and I don't know whether adding other fields to it could bork other software attempting to read such data.
The data can be memory-mapped
The data can be row-major or column-major, providing appropriate data contiguity in the 2D case
Interfaces would exist in many other programming environments, making fixel data more accessible
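
As a minimal sketch of the properties listed above (the file names and values are placeholders, not part of any MRtrix3 convention), NumPy can write 1D and 2D fixel-like data, read back the header dictionary with its three compulsory fields (`descr`, `fortran_order`, `shape`), and memory-map the result:

```python
import os
import tempfile

import numpy as np

n_fixels = 5
tmpdir = tempfile.mkdtemp()

# 1D: one value per fixel (placeholder values).
fd = np.linspace(0.1, 0.5, n_fixels, dtype=np.float32)
fd_path = os.path.join(tmpdir, "fd.npy")
np.save(fd_path, fd)

# 2D: N x 3 directions, stored row-major so each 3-vector is contiguous.
directions = np.tile(np.array([0.0, 0.0, 1.0], dtype=np.float32), (n_fixels, 1))
dir_path = os.path.join(tmpdir, "directions.npy")
np.save(dir_path, directions)

# Read only the header: it reports shape, memory order and data type.
with open(dir_path, "rb") as f:
    version = np.lib.format.read_magic(f)
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
print(version, shape, fortran_order, dtype)

# Memory-map the data instead of loading it into RAM.
fd_mapped = np.load(fd_path, mmap_mode="r")
```

Whether extra keys in that header dictionary would be tolerated by other readers is exactly the open question raised above; this sketch does not attempt it.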

Further, in #2435 I discuss how in contexts such as fixel data handling, within MRtrix3 there could be an abstraction whereby the 1D / 2D data being manipulated could be .mif, .npy, .txt / .csv / .tsv. This would mean that 1D / 2D fixel data could use any of these formats and would still be valid under the fixel directory format conventions, so retrospective fixel data would still be valid but prospectively alternative file formats would be acceptable (and IMO preferable).
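
To illustrate the kind of abstraction meant here, below is a rough Python sketch of a format-agnostic loader. `load_fixel_values` is a hypothetical helper, not an MRtrix3 function, and it deliberately leaves out .mif, which would need a dedicated reader:

```python
import os
import tempfile

import numpy as np


def load_fixel_values(path):
    """Load 1D / 2D numerical data, choosing the reader from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".npy":
        data = np.load(path, mmap_mode="r")
    elif ext in (".csv", ".tsv", ".txt"):
        # Assumed delimiters: comma, tab, and any whitespace respectively.
        delimiter = {".csv": ",", ".tsv": "\t", ".txt": None}[ext]
        data = np.loadtxt(path, delimiter=delimiter, ndmin=1)
    else:
        # .mif support is outside the scope of this sketch.
        raise ValueError(f"Unsupported fixel data format: {ext}")
    if data.ndim not in (1, 2):
        raise ValueError(f"Expected 1D or 2D data, got {data.ndim}D")
    return data


# The same values stored in two formats load to the same array.
tmpdir = tempfile.mkdtemp()
values = np.array([0.25, 0.5, 0.75])
np.save(os.path.join(tmpdir, "fd.npy"), values)
np.savetxt(os.path.join(tmpdir, "fd.csv"), values, delimiter=",")
from_npy = load_fixel_values(os.path.join(tmpdir, "fd.npy"))
from_csv = load_fixel_values(os.path.join(tmpdir, "fd.csv"))
```

Downstream code then only ever sees a 1D or 2D array, regardless of which format the fixel directory happens to contain.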

2. Memory representation of GLM data

Couple of separate points in this one:

2.1. Scratch allocation of all fixel data

fixelcfestats shares much of its command-line interface and internal code structure with other MRtrix3 statistical inference commands. This includes:

Taking as input a text file that provides filesystem paths to one file per input to the GLM
Using a derived class to import the specific form of data that the code is responsible for (ie. 1D fixel data files in the case of fixelcfestats) and load that data into scratch memory. In addition to being generalized across the different commands, this is also used in the scenario of element-wise design matrix factors, which are requested on demand as each element is processed. For each element-wise regressor, an input file is provided with a list of filesystem paths, just as is done for the GLM inputs.

A disadvantage here is that if the experimental design is exceptionally large (eg. many fixels, many inputs), the scratch storage space of those data may become non-negligible. It requires a very big experiment for this to become a problem, but it's nevertheless feasible.
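
For a sense of scale, here is a back-of-the-envelope calculation. All numbers are assumptions chosen for illustration (fixel count, cohort size, 32-bit floating-point storage, one element-wise regressor of the same size as the main input), not measurements from any real experiment:

```python
import numpy as np

# Hypothetical experiment size.
n_fixels = 1_000_000
n_inputs = 1_000
n_elementwise_regressors = 1
dtype = np.dtype(np.float32)  # assumed storage precision

# One fixels-by-inputs matrix for the GLM data...
bytes_per_matrix = n_fixels * n_inputs * dtype.itemsize
# ...plus one matrix of the same size per element-wise regressor.
total_bytes = bytes_per_matrix * (1 + n_elementwise_regressors)

print(f"{total_bytes / 2**30:.1f} GiB held in scratch memory")
```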

It would therefore be preferable in this instance to have the input data to fixelcfestats be in a form that can be immediately memory-mapped.

2.2. Natural extra dimension of statistical inference data

For fixel data, which are represented as 1D per model input, the totality of the model data is 2D.

Similarly, for voxel data you have a 3D image per input and therefore the totality of the input is 4D, and for connectome data you have a 2D matrix per input and so the totality of the data is 3D. Now in both of these cases the data are further manipulated in preparation for the GLM (each is vectorized into a 1D stripe of data per input), and so that will likely need to be done in RAM regardless, so I'll focus here on fixel data exclusively.
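
The vectorization mentioned here can be pictured with a short NumPy sketch. The specific choices (selecting voxels with a binary mask, and taking the upper triangle of a symmetric connectome) are assumptions for illustration and are not claimed to match the MRtrix3 implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Voxel data: a 3D image per input, reduced to a 1D stripe via a mask.
image = rng.random((4, 4, 3))
mask = image > 0.5  # placeholder mask
voxel_stripe = image[mask]

# Connectome data: a symmetric 2D matrix per input, reduced to a 1D stripe
# by taking the upper triangle (including the diagonal).
n_nodes = 5
connectome = rng.random((n_nodes, n_nodes))
connectome = (connectome + connectome.T) / 2
rows, cols = np.triu_indices(n_nodes)
edge_stripe = connectome[rows, cols]
```

Fixel data needs no such step, since each input is already 1D.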

2.3. Possible implementation

If one wanted to fully encapsulate all GLM data, particularly for fixel data, this is just concatenation of 1D fixel data across the second dimension.

This could be done currently with mrcat since the .mif format is used. For something like .npy, this could be done very easily in Python; or for those mr* commands where such manipulations are well-posed for 1D / 2D data (eg. yes for mrcat, but not for mrtransform), those commands could be abstracted to permit operation on such non-"image" formats.
The ordering of inputs to the model, which is currently determined by the order of the entries in the input text file, would instead be determined by the order of this explicit concatenation.
The data could be memory-mapped
The resulting file could be stored with the appropriate memory ordering such that data for one fixel across all inputs is contiguous in memory.

(Note that all of the above applies to element-wise regressors in addition to the main GLM input)
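
A minimal Python sketch of the concatenation described above, assuming per-input 1D `.npy` files (the file names and values are placeholders):

```python
import os
import tempfile

import numpy as np

n_fixels, n_inputs = 6, 4
tmpdir = tempfile.mkdtemp()

# Hypothetical per-input 1D fixel data; input i holds the value i for every fixel.
paths = []
for i in range(n_inputs):
    path = os.path.join(tmpdir, f"input_{i}.npy")
    np.save(path, np.full(n_fixels, float(i), dtype=np.float32))
    paths.append(path)

# Concatenate across the second dimension: rows are fixels, columns are inputs.
# The column order now defines the order of inputs to the model.
glm_data = np.stack([np.load(p) for p in paths], axis=1)

# Force row-major storage so that, for each fixel, values across all inputs
# are contiguous in memory.
glm_data = np.ascontiguousarray(glm_data)
glm_path = os.path.join(tmpdir, "glm_data.npy")
np.save(glm_path, glm_data)

# Memory-map the combined file and pull out one fixel across all inputs.
mapped = np.load(glm_path, mmap_mode="r")
fixel_across_inputs = mapped[2, :]
```

With this layout, reading the data for one fixel touches a single contiguous run of `n_inputs` values, which suits element-by-element processing of a memory-mapped file.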

Therefore, if one ties together points 1. and 2. above, what I would envisage in the future is that:

The input to fixelcfestats would be a 2D .npy file, and could be memory-mapped in place
The directions file within the corresponding fixel directory would also be in .npy format
The fixel-fixel connectivity matrix would also present its data in .npy format

This would go quite some way to disentangling the representation of fixel data from the MRtrix3 software and its .mif image format, particularly since the index file can be stored in NIfTI-1.

I am curious to hear thoughts from both @MRtrix3/mrtrix3-devs and external invested parties, in particular those over at https://github.com/PennLINC/ModelArray (tagging listed contributors @zhao-cy @TinasheMTapera @mattcieslak @scovitz).

mattcieslak commented 1 year ago

Very exciting to see this! The npy format would be great. If you go this route it will make reading in Python extremely easy, and for reading in R we would use the RcppCNPy library. Based on this package's documentation I'm not sure how flexible it is in handling Python dictionaries stored in npy files. How would you feel about keeping the metadata in a JSON text file? That would make the data and metadata both very easy to access in Python and R.

Lestropie commented 1 year ago

Theoretically we could allow users to configure automated read / write of sidecar JSON data for .npy just as we do for NIfTI. It's not as "faithful" to MRtrix3's general approach of embedding sidecar information within headers such that there's typically one file per input / output, but it's not entirely out of the ordinary (eg. .mih has technically been doing it for a good couple of decades, though that involves explicitly stating the corresponding data file name rather than inferring correspondence from common file basenames).
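
As a rough sketch of what inferring correspondence from a common file basename could look like on the Python side (`save_with_sidecar` and `load_with_sidecar` are hypothetical helpers, and the metadata content is a placeholder):

```python
import json
import os
import tempfile

import numpy as np


def save_with_sidecar(path, data, metadata):
    """Write data to a .npy file and metadata to <basename>.json alongside it."""
    np.save(path, data)
    with open(os.path.splitext(path)[0] + ".json", "w") as f:
        json.dump(metadata, f, indent=2)


def load_with_sidecar(path):
    """Memory-map a .npy file and read its JSON sidecar if one exists."""
    data = np.load(path, mmap_mode="r")
    sidecar = os.path.splitext(path)[0] + ".json"
    metadata = {}
    if os.path.isfile(sidecar):
        with open(sidecar) as f:
            metadata = json.load(f)
    return data, metadata


tmpdir = tempfile.mkdtemp()
npy_path = os.path.join(tmpdir, "fd.npy")
save_with_sidecar(
    npy_path,
    np.array([0.25, 0.5, 0.75], dtype=np.float32),
    {"command_history": ["placeholder_command input output"]},
)
data, metadata = load_with_sidecar(npy_path)
```

The `.npy` file itself stays a plain array that any reader can open, at the cost of having two files per input / output.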

Alternatively, happy to hear suggestions for other file formats for nD numerical data that ideally:

Would be easy to support in C++, Python, R, Matlab
Permits memory-mapping of data
Can embed arbitrary sidecar data within the file header
Doesn't have huge scope for complexity over and above the prior points (which would increase management / sanity checking overhead at read time particularly for C++).
