sheljohn commented 5 years ago

This issue follows a discussion started on the community forum. I would like to submit a proposal for an enhancement, namely the support of multiple input .tck files (as opposed to a single one currently) wherever possible.

Problem

Tractography on large datasets is typically run as parallel jobs on computing clusters. There are a number of advantages to dividing the computation of a given number of tracts into smaller batches:

Size: files larger than 4GB cannot be stored on all filesystems.
Speed: exploit the parallel aspect of the problem to obtain the final results faster.
Data-safety: storing a single large file means that any data-corruption potentially leads to the loss of all tracts, whereas the risk of losing a smaller subset can be managed.
No duplication: merging several .tck files into a single one (e.g. with tckedit) requires -- at least temporarily -- double the disk-space, and this becomes problematic for large datasets; the merge operation has to be executed serially in that case.

Unfortunately, most of the commands in MRtrix (and indeed the source-code itself) do not support multiple .tck files as input.

Proposals

Wherever possible, tractography-related commands should support multiple .tck files as input. There are several ways this could be implemented in practice:

0. Variable number of arguments

As with the tckedit command, for instance. I think this is a bad idea: extending support for multiple command-line arguments disrupts the current interface of several commands, implies a lot of replicated effort to modify each command individually, and potentially breaks backwards compatibility. I don't think this is a viable solution.

1. Built-in support for multiple tract files

This is what I would personally prefer, but it involves modifying/extending the existing source code. The idea is to introduce a new file-format with extension .lst, which contains one filename per line, and detect this extension internally in order to iterate over the files. The commands remain exactly the same, and the change does not necessarily apply only to .tck files.
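To make the .lst idea concrete, here is a minimal Python sketch of the expansion step (the real change would live in the C++ reader). The function name expand_track_inputs is hypothetical, and two behaviours are assumptions not specified above: blank lines are skipped, and relative entries are resolved against the directory containing the list file.

```python
from pathlib import Path
from typing import List


def expand_track_inputs(path: str) -> List[Path]:
    """Return the track files designated by `path`.

    A path ending in .lst is read as a text file with one filename per line;
    any other path is returned unchanged as a single-element list.
    """
    p = Path(path)
    if p.suffix.lower() != ".lst":
        return [p]

    files: List[Path] = []
    for line in p.read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # assumption: blank lines are ignored
        entry = Path(name)
        # assumption: relative entries are relative to the list file itself
        files.append(entry if entry.is_absolute() else p.parent / entry)
    return files
```

With something like this in place, a command would iterate over the returned paths regardless of whether it was given a single .tck file or a list, which is why the command-line interface does not need to change.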

Broadly, this involves a rewrite of the class MR::DWI::Tractography::Reader, to include a behaviour similar to the current implementation of MR::DWI::Tractography::Editing::Loader.

The difficult parts are:

Consensus of properties across .tck files. This is pretty much already implemented in the command tckedit.
Handling of associated weights files. This could in theory be handled either with one weights file associated with each .tck file, or with a single weights file for all .tck files. I think the first option is the easiest to implement (and it is the one I chose), but the second option might be more practical because commands called with a list of .tck files (e.g. tcksift2) would still produce a single output file.
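As a rough illustration of both points, here is a Python sketch that treats each .tck header as a plain dictionary of strings. The merge policy is an assumption made for the example (keep a key only if every file agrees on its value, and sum the count field); it is not a description of what tckedit actually does. The helper concatenate_weights is hypothetical and shows the first option, where per-file weights are joined in the same order as the track files.

```python
from typing import Dict, List


def properties_consensus(headers: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge several track-file headers into one.

    Assumed policy: a key survives only if all headers hold the same value;
    'count' is the exception and is summed across files.
    """
    if not headers:
        return {}

    first, rest = headers[0], headers[1:]
    consensus: Dict[str, str] = {}
    for key, value in first.items():
        if key == "count":
            continue  # handled separately below
        if all(h.get(key) == value for h in rest):
            consensus[key] = value

    total = sum(int(h.get("count", "0")) for h in headers)
    consensus["count"] = str(total)
    return consensus


def concatenate_weights(per_file_weights: List[List[float]]) -> List[float]:
    """Join one weights list per track file, preserving file order."""
    merged: List[float] = []
    for weights in per_file_weights:
        merged.extend(weights)
    return merged
```

The ordering matters: weights only stay aligned with streamlines if the weights files are concatenated in exactly the order in which the reader walks the track files.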

I have started implementing this on a fork of the master branch, and should have a compiling version today. This mainly involves:

extension of src/dwi/tractography/properties.(h|cpp),
rewrite of __ReaderBase__ in src/dwi/tractography/file_base.(h|cpp),
rewrite of Reader in src/dwi/tractography/file.h,
addition of a small utility in core/file/utils.h,
minor modifications due to extension of Properties class in several commands.

2. Piping for streamlines

This relates to the discussion in issue #480. I could not do a better job of summarising this idea than @jdtournier and @Lestropie, so perhaps they can elaborate on my short description; but as far as I understand, this leverages the ability of the host system to stream data, in order to virtually concatenate the streamlines at runtime. I am not sure:

how this would be implemented in practice (i.e. what needs to be done),
whether this can be supported on any platform (or only unices),
and whether this solves the issue of interfacing with existing commands, or whether this implies the creation of an additional file-type (as with proposal 1).

jhadida commented 5 years ago

Wrong user account, sorry! I'll keep using the wrong one to avoid reposting...

Lestropie commented 5 years ago

Given I've recently been working on #1555, my first instinct is that this same syntax could be used for streamlines data. So one would rename the individual files to conform to the necessary requirements, and then simply use the square-bracket notation when specifying the input track file at the command-line. Tractography::Reader would then be responsible for parsing the headers of all input files in order to fill Tractography::Properties with the consensus contents, and moving from one file to the next as streamlines data are loaded.

This isn't actually incompatible with providing a text file with a list of file names; it's maybe a little more consistent with existing capabilities, but conversely it's maybe a little less flexible.

#411 is also relevant; particularly for piping of streamlines data, but also I suppose that in merging this with approach 1, there could then be one particular "handler" for when the input is detected as being a text file with a list of filenames, and that would then be able to invoke the appropriate handler for each track file independently in turn.

As far as the piping is concerned, my thinking was as follows:

A lone - in place of a filename indicates piped data, just as it does for image data (the command-line parsing code already knows which arguments / options correspond to track / image data, so no ambiguity there).
On write, this would create a temporary file in the appropriate location & named as such, just as for piped images. This would be a .tck file, but would not include any data after the header; after having created this file, and written its filesystem location to stdout, the writer would then start dumping the raw track data on stdout.
On read, the location of the temporary file would be read from stdin, and the header of that file loaded in order to populate Tractography::Properties; this file would then be immediately deleted. The reader would then start reading binary data from stdin. The NaN and Inf delimiters in this stream are sufficient for separating streamlines & detecting end of data.
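The read side of this can be sketched in a few lines of Python. This only covers the binary part that would follow the header: it assumes the stream carries little-endian float32 (x, y, z) triplets, with an all-NaN triplet closing each streamline and an Inf triplet marking the end of the data. The function name read_streamlines is hypothetical, and a real reader would take the data type from the header rather than hard-coding it.

```python
import io
import math
import struct
from typing import BinaryIO, Iterator, List, Tuple

Point = Tuple[float, float, float]

# Assumption: little-endian float32 triplets; the header would normally say.
_TRIPLET = struct.Struct("<3f")


def read_streamlines(stream: BinaryIO) -> Iterator[List[Point]]:
    """Yield streamlines from a raw track-data stream, one list of points each."""
    current: List[Point] = []
    while True:
        chunk = stream.read(_TRIPLET.size)
        if len(chunk) < _TRIPLET.size:
            return  # stream ended without an Inf marker; drop any partial streamline
        x, y, z = _TRIPLET.unpack(chunk)
        if math.isinf(x):
            return  # Inf triplet: end of data
        if math.isnan(x):
            yield current  # NaN triplet: end of the current streamline
            current = []
        else:
            current.append((x, y, z))


# Usage: two streamlines followed by the end-of-data marker.
nan, inf = float("nan"), float("inf")
payload = b"".join(
    _TRIPLET.pack(*p)
    for p in [(0, 0, 0), (1, 2, 3), (nan, nan, nan),
              (4, 5, 6), (nan, nan, nan),
              (inf, inf, inf)]
)
for streamline in read_streamlines(io.BytesIO(payload)):
    print(len(streamline), "points")
```

In the piped case the same function would be handed a buffered binary stdin (sys.stdin.buffer in Python) once the temporary header file has been read and deleted; because the delimiters are in-band, the reader never needs to know the streamline count in advance.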

Don't see any reason why this wouldn't be constrained to Unix only. There wouldn't be any new filetype required: the - at the command-line for either read or write of track data would be sufficient.

sheljohn commented 5 years ago

@jdtournier @Lestropie

I have something working here.

There are 3 main commits:

Minor: introduce readlines utility (here)
Major: extend Tractography::Properties and adapt commands (here)
Major: introduce TrackFileInfo object, properties_consensus function, and extend Tractography::Reader (here)

This code:

Compiles without warning or error.
Passes the tests using ./run_tests with the same output as the untouched MRtrix3 version cloned from the original repo (actually, both fail certain tests, but the output is the same in both cases).
Accepts .lst files in place of .tck files and seems to behave as expected with commands such as tcksift2.

Please let me know if that would be good for a PR or not :)

sheljohn commented 5 years ago

FYI, the version that was online this afternoon had a typo in it (which would have made compilation fail); I messed up something with the interactive rebase this morning, and didn't notice it until I pulled it somewhere else and tried to build it there. All should be in order now; 4 commits ahead of master, builds and tests fine.

jhadida commented 5 years ago

I went ahead and opened PR #1569: happy to extend / amend / retract.

Lestropie commented 5 years ago

properties_consensus function

I'm still in the process of catching up on this, but just wanted to comment on this bit specifically before I read the rest:

In #1555, I make more extensive use of the Header::merge() function to construct the "consensus" header (which is also modified in that PR in order to support its use in this way; previously that function was exclusively for managing the square-bracket notation). It might be preferable, given the functionality for handling multiple instances of Tractography::Properties is a very similar operation, to have the same functional interface.

MRtrix3 / mrtrix3

Supporting multiple .tck files with tractography-related commands #1567
