MRtrix3 / mrtrix3

MRtrix3 provides a set of tools to perform various advanced diffusion MRI analyses, including constrained spherical deconvolution (CSD), probabilistic tractography, track-density imaging, and apparent fibre density
http://www.mrtrix.org
Mozilla Public License 2.0
291 stars 179 forks source link

Fixels: separate size & value fields (un)necessary? #317

Closed thijsdhollander closed 9 years ago

thijsdhollander commented 9 years ago

So it seems there's this heated debate that got (re)kindled in #302 , after @jdtournier brought up this:

I notice you mention the fixel size & value as distinct entities - I remember we'd discussed this a while back, although I can't find the thread. Didn't we come to a consensus that having separate size & value fields per fixel was unnecessary? I thought we were getting rid of one of them, since it's currently unused (as far as I know?). What are the use cases where we need the additional entry? And are they common enough to warrant adding UI controls for them (along with a good deal of confusion)? Personally I think this needs to be kept as simple as possible, we can always revisit later if we come up with a killer app that needs it - and if and when we do, we'd probably need to come up with a better/cleaner way to handle these data UI-wise...

@draffelt and @Lestropie indicated that there's a history to the debate, and brought up more arguments than I was already able to read in the mean time; all can be found in #302 up to now as well. I feel this is a very important discussion, and we still have the unique momentum of the "ISMRM" release going on. We really should get this right before the release. It's more than a file format or a class structure or a "set of fields", etc... This is really about what a fixel image _is_, and thus how it will be perceived and understood by our users. We'd be naive if we didn't see that or take it into account. This is not just our internal software playground any more, it's something we're seriously putting out there for the whole community to see, use, interpret, ...

I can go on for hours like this, but I think it's pretty clear at this point I feel strongly about getting this right. :wink:

So it deserves its own issue!! :loud_sound:

I'll be adding in opinions from my point of view as well at some point. This one is just to "create the issue". I guess it's best to use it rather than get useful arguments/information lost in #302 ...

thijsdhollander commented 9 years ago

K guys, this debate is so heated I can simply not miss out on all the fun. :wink: I've only browsed the arguments in the other issue super diagonally, but I just want to raise some things already so I simply don't forget about them (I'm in the middle of a super hectic house move so I'm cut short on time sometimes). I do think we should handle this one professionally: it's not about racing to be the first with arguments and blasting each other with lengthy posts (although I'm personally awfully good at the latter...). Quickly pushing things towards a consensus because of want-to-win is not going to yield the best solution for MRtrix.

So first things first:

Second things second, this is actually my main one for now:

Third things third, just small things:

That's it for now (since time constraints). I'll try to do a bit more than just diagonal reading into all of you guys' comments later, and see if I could add things to the discussion based on that.

draffelt commented 9 years ago

So unfortunately the initial discussion we had on Gitlab has been lost. It was an issue tied to the fixel_based_stats repository that was deleted when I merged the code in the the main repo. Rob and Donald, you can probably piece it together from your email if you were receiving alerts, however it's probably not worth copy and pasting in here since the gist is the same.

On Wed, Aug 5, 2015 at 7:51 PM, Thijs Dhollander notifications@github.com wrote:

K guys, this debate is so heated I can simply not miss out on all the fun. [image: :wink:] I've only browsed the arguments in the other issue super diagonally, but I just want to raise some things already so I simply don't forget about them (I'm in the middle of a super hectic house move so I'm cut short on time sometimes). I do think we should handle this one professionally: it's not about racing to be the first with arguments and blasting each other with lengthy posts (although I'm personally awfully good at the latter...). Quickly pushing things towards a consensus because of want-to-win is not going to yield the best solution for MRtrix.

So first things first:

  • I cannot play broken records, because I'm the new record in town. [image: :relieved:]
  • I'm just going to be blatantly honest/clear all over the place; don't take it personally. It's just passion for science/MRtrix sake if you will. Better to be clear than be bothered with subtle wording.
  • Sorry if I repeat some of the previous arguments in my first entry to this discussion; I admit I have only read the previous arguments diagonally up till now. From another point of view: this post is my most "unbiased" opinion (but incomplete, since lack of time) if you will.
  • To be honest already up front: after realising this issue, seeing how it's currently done, taking into account what was in my mind about the whole fundamental concept of fixels before (based on how I learned the concept through whatever publications/presentations/materials we have already out there towards the community, before I was actually a part of that "we" if you will), and letting it all sink in for a bit... my opinion is that a fixel image should only have 1 value associated with each fixel (apart from the fixel's angle, which is basically its equivalent of the position that a voxel has). And to make it even more clear: I think it should be called "value". (not "size")

Second things second, this is actually my main one for now:

  • I have always perceived and understood the concept of a "fixel" as an "equivalent", or maybe rather "generalisation", to/of a voxel & a pixel. It just adds another dimension, but one that is not of fixed length and can not simply be indexed with the same meaning as compared to the other 3 spatial dimensions. Classical pixel/voxel images just have a value associated with each pixel/voxel position. We don't tell the user what that value is meant to be. You can have an FA map (an FA voxel image, if you will), and the value is FA. You can have an ADC map, and the value is ADC. You can have a DEC map, and the "value" is an RGB triplet (even though that one needs 3 numbers, they're really only meaningful together, never alone; the triplet is the "value"). Fixels allow for positions that go beyond 3 spatial coordinates; but that's it (for me). They extend the positioning mechanism of a voxel image. I feel we should not tell the user what actual dat a is sto red in them. It's up to the user to be creative with what "value"'s he/she can come up with, that are fixel specific; and are thus best stored in a fixel image. One (fixel) image for one (fixel) value that is. Giving or suggesting a specific meaning for a value (e.g. calling it "size") is overplaying our assumptions on what users/scientists are able to come up with. Same thing for having exactly 2 values. Maybe we can envision an application/meaning for them now; but what if I (or they, out there) come up with a 3rd other-kind-of-value then? So yeah, we happen to get a "size-ish" kind of thing extracted from FODs now. But you can even question if the fixel concept is even bound to, or in need of the FOD concept (strictly speaking). Don't forget what is out there (on the very wiki if you will): a fixel is "a specific fibre bundle within a specific voxel". So all it needs is a concept of bundles going through voxels. You can cluster/segment or in gen eral ext ract your notion of bundles-in-a-voxel by doing FOD segmention; but you don't have to. Maybe you want to cluster streamlines going through the voxel, based on where they go, and get your "fixels" in that voxel based on that. No sensible "size" pops up in that scenario necessarily as well. So before I end up dwelling, I think this big argument is quite clear now. [image: :smile:]

Third things third, just small things:

  • I don't think the whole "status quo" kind of reasoning is a strong argument at this point. There's been bigger and potentially much more "annoying" changes in the past as well... the SH basis change comes to mind here. I'm happy that one got pushed through, even though it cost a lot of hassle. But it's there now, for the future (for the children! [image: :stuck_out_tongue_winking_eye:] ). We're still before the release, this is the moment to be critical of what ends up being the release. We're (brave) scientists, we shouldn't be afraid to change things for the better.
  • If this issue/question didn't get an opportunity to be openly discussed in the past, there's even more of a reason to do so now. I'm sure there was some good thinking of the designer back then, but I'm also convinced we can do an even greater amount of thinking by combining all our (awesome) minds now. [image: :wink:]

That's it for now (since time constraints). I'll try to do a bit more than just diagonal reading into all of you guys' comments later, and see if I could add things to the discussion based on that.

— Reply to this email directly or view it on GitHub https://github.com/MRtrix3/mrtrix3/issues/317#issuecomment-127941161.

Lestropie commented 9 years ago

I have always perceived and understood the concept of a "fixel" as an "equivalent", or maybe rather "generalisation", to/of a voxel & a pixel. It just adds another dimension, but one that is not of fixed length and can not simply be indexed with the same meaning as compared to the other 3 spatial dimensions.

(not quoting the whole paragraph for the sake of brevity)

This is the concept behind the sparse image format: Allowing a variable number of elements along a particular dimension. Not specific to fixels. A crucial aspect of the implementation of the image format is that the underlying class is templated: You can use any data storage class you choose (and even use member functions if you choose to); the only restrictions being that the class must be of a fixed size (i.e. no pointers to more data), and there's no cross-endian-ness compatibility. The image format checks to make sure that the class you are expecting to see in the image for your particular command is in fact the class stored in the data, and that the endian-ness is correct. So in that respect there's no 'assumptions on what users/scientists are able to come up with' (and I put the extra effort into designing it that way precisely for that reason).

What's at debate here is whether or not the FixelMetric class specifically, which is currently the only sparse image class in use (but won't be in the future), should or should not have a 'size' field.

Since having a large number of different fixel classes would not be preferable, I designed an initial fixel class that would cover a large number of initial use cases: per-subject FOD segmentation, fixel-based statistics (though in retrospect it's not ideal), fixel-based TWI, initial implementation of my fixel-based tracking, ... Some of these don't need the size field, some could potentially benefit from having it (e.g. more sophisticated fixel matching in group statistics), some already use it (e.g. fixel2voxel, fixel TWI when it gets merged), some are almost impossible without it (e.g. visualizing single-subject fixel images). But since it's a lot easier to have data and not need it than to not have data and need it, I included the size field knowing that it would be redundant in some applications but useful in others.

If you really wanted to, you could create a new fixel class: one that contains only an XYZ triplet and a value; and the sparse image format would handle it seamlessly. It would do exactly what you want it to: store just one value for each fixel, without that pesky extra 'size' field (that for whatever reason is so damn offensive to everyone, even though storage size is apparently not a factor). That way, all the functionalities I have implemented and planned can still be used / done using the existing FixelMetric class, and we don't have image backwards compatibility problems.

But then you'd have to justify to the user base why there are two fixel classes; one being a fractional subset of the other, that provides no benefit other than a 20% reduction in size, at the expense of increased command complexity and compatibility problems. (Which, by the way, if the size field were removed from the FixelMetric class, this is precisely what I'd do in order to restore the functionality that would be lost by said removal.)

Alternatively, if all people want is a direction and a value, you might as well just use XYZ images with the vector length providing the value and leave the fixel class untouched. As we've all agreed, storage size isn't an issue, so just NaN-pad a 4D float image.

Maybe you want to cluster streamlines going through the voxel, based on where they go, and get your "fixels" in that voxel based on that. No sensible "size" pops up in that scenario necessarily as well.

... You mean, apart from the streamlines density? Because the 'value' field is in no way guaranteed to be the streamline count, it could be anything; and the streamline count (aka the 'size') may be useful in a range of contexts. The fixel TWI branch already does this.

... status quo...

FFS, I put one silly comment in there and you both latch on to it like it spells the death of my entire argument.

The FixelMetric class as it stands right now contains a size class. To remove it after the fact requires re-coding, changing command interfaces, removing (used!) capabilities from the viewer. Therefore the burden is on providing an adequate argument on why it should be removed, because that's the course of action that requires effort and justification. And you should have to prove to me why that change is so important that it's worth breaking things I've implemented and preventing work I intend to do.

We're (brave) scientists, we shouldn't be afraid to change things for the better.

No argument there. But what's 'better' about a fixel class without a size field? What are you gaining? Apart from perceived philosophical consistency, all I see is precluding current and future functionality for no benefit. And that philosophical argument to me is mis-placed from the outset: we are in diffusion MRI after all, so I'll just repeat this here (with changes):

Fixels represent an atomic fibre population element. Almost every single crossing-fibre diffusion model out there will attribute some form of 'density' / 'volume fraction' to each fibre population in a voxel. This is an intrinsic property of the fixel that comes from the diffusion model, independently of any other parameter that may be investigated (but the same argument holds if the fixels are derived from e.g. streamlines). If the parameter of interest happens to be the fixel size, so be it: the redundancy between those two fields makes sense. If you don't make use of the density in your processing, fine; it'll just sit there unused. But no matter what processing you apply to your fixels, there will always be some underlying measure of density associated with each; it's as fundamental as the direction. And sometimes it might just come in handy... so why go to such great lengths to avoid keeping it?

Lestropie commented 9 years ago

Since I now can't sleep, I'm adding this in case it helps get people on the same page.

The point of the sparse image format is not to merely extend 3D images to permit a variable number of values per voxel. It allows you to store a variable number of instantiations of a data structure per voxel. This can be anything, including data types other than float, and does not have to correspond to a fixel, or even store direction information.

But given we're using this format to deal with fixel data, the question becomes: What data do we want to store per fixel? We don't want to have too many different classes used in sparse images, as that will introduce a lot of code and compatibility issues. Conversely, we don't want to have such a small class that every time we derive a new application, we need to create a new class, because the previous one(s) don't give the support we require. So you make a compromise.

To me, the current FixelMetric class is the perfect compromise: it supports all current applications and many future ones, without having so much extraneous data so as to make it unwieldy in applications that don't use all of its features. Future applications for which it is not adequate require so much additional data that creating a separate class will then makes sense. Deleting the 'size' field will break this balance.

This isn't about 'what is a fixel'. It's about storing an appropriate amount of relevant data per fixel, so as to minimize the number of different sparse image storage classes we have to support. Which right now is one, but if 'size' is deleted will become two because I'll have to re-implement FixelMetric and call it something else to get back the functionality that would be lost by said deletion.

jdtournier commented 9 years ago

Man, I'm trying to enjoy my beach holiday in Greece here, keep the noise down!

Seriously though, if I'd known my comment was going to cause such an outpouring of emotion, I might have delayed until I was back at my desk... Instead of which I've had to go to the phone shop to get myself connected properly, just so I can add my 2 cents to the discussion. Anyway, I'll try not to add too much oil on the fire in the process. I've also tried to keep it brief, but as usual that didn't really work out - better grab yourselves a brew...

So to start off with, I guess we've all had different ideas about what the sparse format should be, and on top of that about how fixel data should be stored. I remember arguing that the sparse format should be sparse in all dimensions, not just the 4th. But then that wouldn't really impact the current discussion... The point is, we've all got our own ideas, and they've all got their own merits. This would be a lot easier to deal with if these discussions had been taken to completion before all this codebase evolved around the current implementation, but at the same time we all have to realise that sometimes the best (and possibly only) way to assessing how good a solution is is simply to try it and see. MRtrix is where it is now by just such a process (I think I'm probably at iteration 5 or 6 by now...) - just witness the current updated_syntax branch if you need any further demonstration of the ongoing evolution of MRtrix... Anyway, at least we have a solution in place now that fill most of our current needs, although there clearly is a lot of scope for improvement on many levels.

Right, onto the specifics. There's a lot of open questions, I'll try to address them in a sensible order...

What should a fixel image contain?

I have to admit, initially I was leaning towards the much simpler interpretation that @thijsdhollander spelt out very clearly: a fixel image contains a value per fixel. What that fixel value represents is up to the user (or at least the software that generated the image), and the filename can be used to identify this information appropriately.

But having heard @Lestropie's arguments for the fixel size having a special meaning as the weight of the fixel, I'm no longer so sure. When dealing with voxels, there is no ambiguity as to the volume each voxel represents. But now we're breaking each voxel up into corresponding fixels, it's clear that each fixel no longer represents comparable 'amounts' of stuff. As Rob says, these weights can be used in various contexts in addition to the value of interest (e.g. the fixel2voxel command). A large part of my initial reluctance to the having both a size and a value was about the vagueness of the terminology - it was not clear to me which field represented what, and why you'd need two of them. I still think it's not sufficiently clear which is which, and for new users, why both would be required. We've already debated whether size was the right term before, and I don't think we've settled the issue... But at least I can now properly understand the rationale for having an explicit size/weight/volume.

But then, as @draffelt pointed out, this is still nothing that can't be done by providing both an image of the weights, and an image of the values. So whether we give the 'weight' such a prominent role (as is currently the case) remains more about ideology (for want of a better word) than actual capabilities. We can do everything we need with a simplified fixel format, and I personally would agree that having each fixel image refer to a single quantity keeps everything much more intelligible for Joe User (myself included), since it would (or at least should) be clear which operations require the weights and why.

The obvious downside to having one value per file is the massive amount of redundancy this leads to, in terms of the storage of the fixel mapping. This is I guess @Lestropie's main objection to this - why use a sparse format to then replicate most of its information across many different fixel files? I agree, but it's also clear that getting around this requires a little bit more complexity - we've already discussed that one to death before, but I guess we need to go for one more round...

What information needs to be stored in fixel images?

So let's go back to basics: what we actually need here is a 2-level relational database: first level maps each voxel to a list of corresponding fixels, second level maps each fixel to its corresponding values. We're essentially trying to set up our own database to store our fixel information... There is no simple fix for this, it's always going to be a bit more complex than we'd like. We have to try to find a solution we're all happy with that remains as brain-dead simple as possible, yet flexible enough for our current and future needs.

My first observation here is that there will be many values we want to store per fixel, and this number is going to vary between different studies and applications. From this point of view, storing 2 variables seems very odd - even if one of them (the size) can be argued to be more fundamental than the others. If we stored just one value per fixel, that wouldn't seem strange. If we could store any number of variables, that would be fine too (although that would introduce its own problems, see later). But two just seems plain odd to newcomers (and clearly to others too...), and even though the logic behind that decision does make sense once explained, I'm not sure I agree that it makes enough sense...

So the current argument about removing the size field is not just about keeping things simple, it's actually more about maintaining clarity - which file contains what information. Personally, I think that's a very good argument, and one I'd be in favour of.

The alternative approach is to store multiple variables per fixel (not just 2). The problem then is one of visibility into what data are actually stored in these files. We'd need to somehow make it clear which entry corresponds to what data, most likely through header entries, and design the UI in such a way as to expose this information to the user to allow them to choose which they'd like to use for what aspect of the display - quite a bit of work. Not to mention the fact that adding more entries (e.g. if you redo the stats using some other contrasts) would be ridiculously messy...

So I'd personally lean towards having one file per value, and rely on the filesystem to provide the naming and 'growable' aspects of our database system. But like @Lestropie, I don't want this to be at the expense of negating all benefits of sparse storage. The obvious and only solution to this is to have a file that defines the fixel mapping (basically @Lestropie's FixelIndex format), and store the values themselves in raw flat binary files (I don't see the point in CSV files here, more on that later). The fixel mapping would basically provide the offset into the data to retrieve the relevant data, much like it currently does. But making the mapping separate from the data allows each data file to maintain a reference to its mapping, and so keep redundancy down to a minimum. Again, this is something that we'd discussed before, but couldn't really come to an agreement about.

The sparse image format

So I think this prompts a re-examination of the sparse image format. I'm in the middle of updating the code for this to merge into the updated_syntax branch, so I think a discussion about this is very timely. There's a few points that I'd like to raise. The first is that currently, the sparse format is designed to take essentially a memory-dump of arbitrary structures (or objects). This means it is utterly un-interpretable by anything that doesn't already have knowledge of this structure. I appreciate the aim of keeping the format generic and very flexible, but ultimately I think it only make usage more obscure. In fact, I think this actually makes the format incomplete - a bit like vendors adding private tags to DICOM images. I would much rather see a format that restricts the data stored to a fixed number of pure types (int32, float32, etc), just the same as standard images. I'm not sure what applications there are currently for the sparse format that require storing data as distinct types for the same fixel, but I expect just about anything would be able to make use of a format whereby each fixel contains a fixed number of values of the same type. This could be a single float for the size or t-value, or a 3-vector of uint8 for an RGB triplet, or a full vector of floats for the SH coefficients of a representation of that FOD lobe. I suspect anything that needs to store more complex information than that would benefit from splitting that information across different files, so that each file can be named appropriately, typed appropriately, and used independently in different contexts. Complex applications that would have needed to use the full-blown class aspects of the current framework would just need more inputs.

How do we proceed from here

Clearly, the simplest thing to do is simply to carry on using the existing infrastructure, with all the frustrations that this would entail for all parties involved. Although @Lestropie is currently arguing that the size field should stay for various reasons, I reckon he'd also agree that the current (necessary) practice of storing a whole new fixel image (i.e. including the full fixel mapping) for each output is not very satisfactory. Those arguing that the size field is more confusing than anything else will keep getting mildly annoyed by its remaining there, and newcomers will keep getting puzzled about this.

The FixelIndex idea is more in line with what I had in mind the last time we discussed this idea, and I think there are now very compelling reasons to try an approach of that type. Where I'd disagree with @Lestropie is in the details: I don't think the size field should be given any special treatment, and I don't think we should support the kind of arbitrary data structures currently allowed by the sparse format. I also don't quite see the rationale for storing the data using CSV files - surely any application that realistically wants to use the information in those files also needs access to the fixel mapping? This would negate any benefit of making the data file readily parsed by 3rd party software, since a crucial part of the format would remain specific to MRtrix.

So I propose we consider an approach like this:

I think this has the advantage of simplicity, clarity, compactness, and being fully defined, which should tick most of the right boxes for most people in this discussion. The issues that are likely to arise if we do this would relate to the amount of coding effort required to do this, ongoing support for the current format, some concerns about problems should people delete or neglect to backup the fixel map image (and related issues), and some potential future applications @Lestropie mentioned, namely handling mismatches between fixel mappings. So about these concerns:

Anyway, I'll stop now. By all means let's keep discussing this, but FFS let's please keep this civil - there's no need to get personal about this or lose any sleep over it. There's nothing wrong with healthy debate, and we'd all be shitty scientists if we didn't engage in it. I'm personally really proud of the fact that we're happy to have such a frank discussion about the very guts of our latest and greatest developments in such an open forum, and I think so should all of you.

draffelt commented 9 years ago

I think the solution you proposed is a good compromise. I'm happy as long as we can view a colour-mapped fixel file in the viewer by loading a single image. Hopefully this should be possibly if the data file references the index, and the index always as a reference to the directions. It would be a pain to have to load the directions first, then colour by having to load the data file. If we really want to support dynamic thresholding by one file, then colouring by another, we could still have an option to right click and 'colour by another fixel file'. Similar to what is done for TSF files.

For simplicity I'd be happy to ditch the existing fixel format for release and provide a conversion tool.

On Fri, Aug 7, 2015 at 6:35 AM, J-Donald Tournier notifications@github.com wrote:

Man, I'm trying to enjoy my beach holiday in Greece here, keep the noise down!

Seriously though, if I'd known my comment was going to cause such an outpouring of emotion, I might have delayed until I was back at my desk... Instead of which I've had to go to the phone shop to get myself connected properly, just so I can add my 2 cents to the discussion. Anyway, I'll try not to add too much oil on the fire in the process. I've also tried to keep it brief, but as usual that didn't really work out - better grab yourselves a brew...

So to start off with, I guess we've all had different ideas about what the sparse format should be, and on top of that about how fixel data should be stored. I remember arguing that the sparse format should be sparse in all dimensions, not just the 4th. But then that wouldn't really impact the current discussion... The point is, we've all got our own ideas, and they've all got their own merits. This would be a lot easier to deal with if these discussions had been taken to completion before all this codebase evolved around the current implementation, but at the same time we all have to realise that sometimes the best (and possibly only) way to assessing how good a solution is is simply to try it and see. MRtrix is where it is now by just such a process (I think I'm probably at iteration 5 or 6 by now...) - just witness the current updated_syntax branch if you need any further demonstration of the ongoing evolution of MRtrix... Anyway, at least we have a solution in place now that fill most of our current needs, although there clearly is a lot of scope for improvement on many levels.

Right, onto the specifics. There's a lot of open questions, I'll try to address them in a sensible order...

What should a fixel image contain?

I have to admit, initially I was leaning towards the much simpler interpretation that @thijsdhollander https://github.com/thijsdhollander spelt out very clearly: a fixel image contains a value per fixel. What that fixel value represents is up to the user (or at least the software that generated the image), and the filename can be used to identify this information appropriately.

But having heard @Lestropie https://github.com/Lestropie's arguments for the fixel size having a special meaning as the weight of the fixel, I'm no longer so sure. When dealing with voxels, there is no ambiguity as to the volume each voxel represents. But now we're breaking each voxel up into corresponding fixels, it's clear that each fixel no longer represents comparable 'amounts' of stuff. As Rob says, these weights can be used in various contexts in addition to the value of interest (e.g. the fixel2voxel command). A large part of my initial reluctance to the having both a size and a value was about the vagueness of the terminology - it was not clear to me which field represented what, and why you'd need two of them. I still think it's not sufficiently clear which is which, and for new users, why both would be required. We've already debated whether size was the right term before, and I don't think we've settled the issue... But at least I can now properly understand the rationale for having an explicit size/weight/volume.

But then, as @draffelt https://github.com/draffelt pointed out, this is still nothing that can't be done by providing both an image of the weights, and an image of the values. So whether we give the 'weight' such a prominent role (as is currently the case) remains more about ideology (for want of a better word) than actual capabilities. We can do everything we need with a simplified fixel format, and I personally would agree that having each fixel image refer to a single quantity keeps everything much more intelligible for Joe User (myself included), since it would (or at least should) be clear which operations require the weights and why.

The obvious downside to having one value per file is the massive amount of redundancy this leads to, in terms of the storage of the fixel mapping. This is I guess @Lestropie https://github.com/Lestropie's main objection to this - why use a sparse format to then replicate most of its information across many different fixel files? I agree, but it's also clear that getting around this requires a little bit more complexity - we've already discussed that one to death before, but I guess we need to go for one more round...

What information needs to be stored in fixel images?

So let's go back to basics: what we actually need here is a 2-level relational database: first level maps each voxel to a list of corresponding fixels, second level maps each fixel to its corresponding values. We're essentially trying to set up our own database to store our fixel information... There is no simple fix for this, it's always going to be a bit more complex than we'd like. We have to try to find a solution we're all happy with that remains as brain-dead simple as possible, yet flexible enough for our current and future needs.

My first observation here is that there will be many values we want to store per fixel, and this number is going to vary between different studies and applications. From this point of view, storing 2 variables seems very odd - even if one of them (the size) can be argued to be more fundamental than the others. If we stored just one value per fixel, that wouldn't seem strange. If we could store any number of variables, that would be fine too (although that would introduce its own problems, see later). But two just seems plain odd to newcomers (and clearly to others too...), and even though the logic behind that decision does make sense once explained, I'm not sure I agree that it makes enough sense...

So the current argument about removing the size field is not just about keeping things simple, it's actually more about maintaining clarity - which file contains what information. Personally, I think that's a very good argument, and one I'd be in favour of.

The alternative approach is to store multiple variables per fixel (not just 2). The problem then is one of visibility into what data are actually stored in these files. We'd need to somehow make it clear which entry corresponds to what data, most likely through header entries, and design the UI in such a way as to expose this information to the user to allow them to choose which they'd like to use for what aspect of the display - quite a bit of work. Not to mention the fact that adding more entries (e.g. if you redo the stats using some other contrasts) would be ridiculously messy...

So I'd personally lean towards having one file per value, and rely on the filesystem to provide the naming and 'growable' aspects of our database system. But like @Lestropie https://github.com/Lestropie, I don't want this to be at the expense of negating all benefits of sparse storage. The obvious and only solution to this is to have a file that defines the fixel mapping (basically @Lestropie https://github.com/Lestropie's FixelIndex format), and store the values themselves in raw flat binary files (I don't see the point in CSV files here, more on that later). The fixel mapping would basically provide the offset into the data to retrieve the relevant data, much like it currently does. But making the mapping separate from the data allows each data file to maintain a reference to its mapping, and so keep redundancy down to a minimum. Again, this is something that we'd discussed before, but couldn't really come to an agreement about.

The sparse image format

So I think this prompts a re-examination of the sparse image format. I'm in the middle of updating the code for this to merge into the updated_syntax branch, so I think a discussion about this is very timely. There's a few points that I'd like to raise. The first is that currently, the sparse format is designed to take essentially a memory-dump of arbitrary structures (or objects). This means it is utterly un-interpretable by anything that doesn't already have knowledge of this structure. I appreciate the aim of keeping the format generic and very flexible, but ultimately I think it only make usage more obscure. In fact, I think this actually makes the format incomplete - a bit like vendors adding private tags to DICOM images. I would much rather see a format that restricts the data stored to a fixed number of pure types (int32, float32, etc), just the same as standard images. I'm not sure what applications there are currently for the sparse format that require storing data as distinct types for the same fixel, but I expect just about anything would be able to make use of a format whereby each fixel contains a fixed number of values of the same type. This could be a single float for the size or t-value, or a 3-vector of uint8 for an RGB triplet, or a full vector of floats for the SH coefficients of a representation of that FOD lobe. I suspect anything that needs to store more complex information than that would benefit from splitting that information across different files, so that each file can be named appropriately, typed appropriately, and used independently in different contexts. Complex applications that would have needed to use the full-blown class aspects of the current framework would just need more inputs.

How do we proceed from here

Clearly, the simplest thing to do is simply to carry on using the existing infrastructure, with all the frustrations that this would entail for all parties involved. Although @Lestropie https://github.com/Lestropie is currently arguing that the size field should stay for various reasons, I reckon he'd also agree that the current (necessary) practice of storing a whole new fixel image (i.e. including the full fixel mapping) for each output is not very satisfactory. Those arguing that the size field is more confusing than anything else will keep getting mildly annoyed by its remaining there, and newcomers will keep getting puzzled about this.

The FixelIndex idea is more in line with what I had in mind the last time we discussed this idea, and I think there are now very compelling reasons to try an approach of that type. Where I'd disagree with @Lestropie https://github.com/Lestropie is in the details: I don't think the size field should be given any special treatment, and I don't think we should support the kind of arbitrary data structures currently allowed by the sparse format. I also don't quite see the rationale for storing the data using CSV files - surely any application that realistically wants to use the information in those files also needs access to the fixel mapping? This would negate any benefit of making the data file readily parsed by 3rd party software, since a crucial part of the format would remain specific to MRtrix.

So I propose we consider an approach like this:

-

we have a fixel index map and several fixel value data sets, each linking explicitly to their corresponding (typically the same) fixel index map.

the fixel index map is a simple 4D, 2 volume image, storing uint32 types - first volume is the number of fixels in that voxel, second is the offset to the first of those in any of the corresponding data files. In this case, a zero in the first volume (number of fixels) is sufficient to denote no fixels, so there is no need for special handling in the case of no fixels. Note the offset is in terms of numbers of fixels, rather than a byte offset as such - this means any associated value file can store however many of whatever datatype it needs to, as long as those quantities are accessible in its header.

the data files use the standard MRtrix header format, with a distinct magic first line to identify its type as a set of fixel values. It stores a reference to its fixel index image by name, a count of entries per fixel, and a data type for each entry. This information is sufficient to convert an offset from the fixel map to a byte offset in the data file.

I think this has the advantage of simplicity, clarity, compactness, and being fully defined, which should tick most of the right boxes for most people in this discussion. The issues that are likely to arise if we do this would relate to the amount of coding effort required to do this, ongoing support for the current format, some concerns about problems should people delete or neglect to backup the fixel map image (and related issues), and some potential future applications @Lestropie https://github.com/Lestropie mentioned, namely handling mismatches between fixel mappings. So about these concerns:

-

I'm already putting in a lot of effort into the updated_syntax branch, and the amount of code that needs to change is unreal. The biggest contributor to this is the switch to Eigen, which affects every aspect of the codebase. As far as I'm concerned, I already need to put in a fair bit of effort updating the sparse format anyway, I don't really mind adding an extra layer on top of that...

There is no doubt we need ongoing support for the current format - there's too much data already generated using it. The simplest would be to add the new format as an alternative to the old one, but drop support for writing in the current format. Another option is to create a dedicated converter to port old data to the new format, which has the advantage of keeping the core of the codebase lean. I don't really mind either way, although the former is a safer bet, in case there is a nasty bug in the new codebase causing loss of data (we don't want to delete old data once converted only to find the newly converted data are corrupt).

I don't think we'll have too much trouble with users getting confused about the need to maintain their fixel mapping co-located with their fixel data files. A number of image formats already have similar requirements (NIfTI / Analyse, mih, for example), along with other formats we've come up with such as track scalar files or track weights files - I don't think this is any different in kind, really.

@Lestropie https://github.com/Lestropie mentioned using the FixelIndex idea to handle mismatches. I guess the big difference here is the FixelIndex's proposed use of unique ID's per fixel rather than just an offset, which would open the possibility of skipping or adding extra fixels in one data file compared to another, and have the same mapping handle both. I'm not sure I can see how the details would work exactly, and under what circumstances this would allow operations that would otherwise not be possible. The main thing is I don't see how you could use the same mapping with data generated from different mappings (e.g. native vs. template), since there would be no guarantee that the unique ID's would be comparable in any way, unless you'd somehow already matched them across mappings in the first place. I guess we'd need to discuss this, at face value this would seem like a lot of additional storage and runtime operations (due to the need f or a lookup table of some form) for some gain that I've yet to fully grasp...

Anyway, I'll stop now. By all means let's keep discussing this, but FFS let's please keep this civil - there's no need to get personal about this or lose any sleep over it. There's nothing wrong with healthy debate, and we'd all be shitty scientists if we didn't engage in it. I'm personally really proud of the fact that we're happy to have such a frank discussion about the very guts of our latest and greatest developments in such an open forum, and I think so should all of you.

— Reply to this email directly or view it on GitHub https://github.com/MRtrix3/mrtrix3/issues/317#issuecomment-128499728.

draffelt commented 9 years ago

The more I think about it, I'm still not a huge fan of having a separate index, directions, and value files. When I compute fixel AFD for all subjects in a population analysis, it means the folder will contain 3 files for each patient. It's not a big deal, and I'm happy to go with it if that's what everyone else wants. I just have memories of working with large datasets using the old analyse .hdr .img format. Yes, everyone knows they go together, but it was still annoying having multiple files when moving, sharing, listing large directories, and tab-completing file names. I'd happily sacrifice 4.5Mb to have the directions conveniently stored inside a typical upsampled fixel file.

On Fri, Aug 7, 2015 at 7:59 AM, David Raffelt draffelt@gmail.com wrote:

I think the solution you proposed is a good compromise. I'm happy as long as we can view a colour-mapped fixel file in the viewer by loading a single image. Hopefully this should be possibly if the data file references the index, and the index always as a reference to the directions. It would be a pain to have to load the directions first, then colour by having to load the data file. If we really want to support dynamic thresholding by one file, then colouring by another, we could still have an option to right click and 'colour by another fixel file'. Similar to what is done for TSF files.

For simplicity I'd be happy to ditch the existing fixel format for release and provide a conversion tool.

On Fri, Aug 7, 2015 at 6:35 AM, J-Donald Tournier < notifications@github.com> wrote:

Man, I'm trying to enjoy my beach holiday in Greece here, keep the noise down!

Seriously though, if I'd known my comment was going to cause such an outpouring of emotion, I might have delayed until I was back at my desk... Instead of which I've had to go to the phone shop to get myself connected properly, just so I can add my 2 cents to the discussion. Anyway, I'll try not to add too much oil on the fire in the process. I've also tried to keep it brief, but as usual that didn't really work out - better grab yourselves a brew...

So to start off with, I guess we've all had different ideas about what the sparse format should be, and on top of that about how fixel data should be stored. I remember arguing that the sparse format should be sparse in all dimensions, not just the 4th. But then that wouldn't really impact the current discussion... The point is, we've all got our own ideas, and they've all got their own merits. This would be a lot easier to deal with if these discussions had been taken to completion before all this codebase evolved around the current implementation, but at the same time we all have to realise that sometimes the best (and possibly only) way to assessing how good a solution is is simply to try it and see. MRtrix is where it is now by just such a process (I think I'm probably at iteration 5 or 6 by now...) - just witness the current updated_syntax branch if you need any further demonstration of the ongoing evolution of MRtrix... Anyway, at least we have a solution in place now that fill most of our current needs, although there clearly is a lot of scope for improvement on many levels.

Right, onto the specifics. There's a lot of open questions, I'll try to address them in a sensible order...

What should a fixel image contain?

I have to admit, initially I was leaning towards the much simpler interpretation that @thijsdhollander https://github.com/thijsdhollander spelt out very clearly: a fixel image contains a value per fixel. What that fixel value represents is up to the user (or at least the software that generated the image), and the filename can be used to identify this information appropriately.

But having heard @Lestropie https://github.com/Lestropie's arguments for the fixel size having a special meaning as the weight of the fixel, I'm no longer so sure. When dealing with voxels, there is no ambiguity as to the volume each voxel represents. But now we're breaking each voxel up into corresponding fixels, it's clear that each fixel no longer represents comparable 'amounts' of stuff. As Rob says, these weights can be used in various contexts in addition to the value of interest (e.g. the fixel2voxel command). A large part of my initial reluctance to the having both a size and a value was about the vagueness of the terminology - it was not clear to me which field represented what, and why you'd need two of them. I still think it's not sufficiently clear which is which, and for new users, why both would be required. We've already debated whether size was the right term before, and I don't think we've settled the issue... But at least I can now properly understand the rationale for having an explicit size/weight/volume.

But then, as @draffelt https://github.com/draffelt pointed out, this is still nothing that can't be done by providing both an image of the weights, and an image of the values. So whether we give the 'weight' such a prominent role (as is currently the case) remains more about ideology (for want of a better word) than actual capabilities. We can do everything we need with a simplified fixel format, and I personally would agree that having each fixel image refer to a single quantity keeps everything much more intelligible for Joe User (myself included), since it would (or at least should) be clear which operations require the weights and why.

The obvious downside to having one value per file is the massive amount of redundancy this leads to, in terms of the storage of the fixel mapping. This is I guess @Lestropie https://github.com/Lestropie's main objection to this - why use a sparse format to then replicate most of its information across many different fixel files? I agree, but it's also clear that getting around this requires a little bit more complexity - we've already discussed that one to death before, but I guess we need to go for one more round...

What information needs to be stored in fixel images?

So let's go back to basics: what we actually need here is a 2-level relational database: first level maps each voxel to a list of corresponding fixels, second level maps each fixel to its corresponding values. We're essentially trying to set up our own database to store our fixel information... There is no simple fix for this, it's always going to be a bit more complex than we'd like. We have to try to find a solution we're all happy with that remains as brain-dead simple as possible, yet flexible enough for our current and future needs.

My first observation here is that there will be many values we want to store per fixel, and this number is going to vary between different studies and applications. From this point of view, storing 2 variables seems very odd - even if one of them (the size) can be argued to be more fundamental than the others. If we stored just one value per fixel, that wouldn't seem strange. If we could store any number of variables, that would be fine too (although that would introduce its own problems, see later). But two just seems plain odd to newcomers (and clearly to others too...), and even though the logic behind that decision does make sense once explained, I'm not sure I agree that it makes enough sense...

So the current argument about removing the size field is not just about keeping things simple, it's actually more about maintaining clarity - which file contains what information. Personally, I think that's a very good argument, and one I'd be in favour of.

The alternative approach is to store multiple variables per fixel (not just 2). The problem then is one of visibility into what data are actually stored in these files. We'd need to somehow make it clear which entry corresponds to what data, most likely through header entries, and design the UI in such a way as to expose this information to the user to allow them to choose which they'd like to use for what aspect of the display - quite a bit of work. Not to mention the fact that adding more entries (e.g. if you redo the stats using some other contrasts) would be ridiculously messy...

So I'd personally lean towards having one file per value, and rely on the filesystem to provide the naming and 'growable' aspects of our database system. But like @Lestropie https://github.com/Lestropie, I don't want this to be at the expense of negating all benefits of sparse storage. The obvious and only solution to this is to have a file that defines the fixel mapping (basically @Lestropie https://github.com/Lestropie's FixelIndex format), and store the values themselves in raw flat binary files (I don't see the point in CSV files here, more on that later). The fixel mapping would basically provide the offset into the data to retrieve the relevant data, much like it currently does. But making the mapping separate from the data allows each data file to maintain a reference to its mapping, and so keep redundancy down to a minimum. Again, this is something that we'd discussed before, but couldn't really come to an agreement about.

The sparse image format

So I think this prompts a re-examination of the sparse image format. I'm in the middle of updating the code for this to merge into the updated_syntax branch, so I think a discussion about this is very timely. There's a few points that I'd like to raise. The first is that currently, the sparse format is designed to take essentially a memory-dump of arbitrary structures (or objects). This means it is utterly un-interpretable by anything that doesn't already have knowledge of this structure. I appreciate the aim of keeping the format generic and very flexible, but ultimately I think it only make usage more obscure. In fact, I think this actually makes the format incomplete - a bit like vendors adding private tags to DICOM images. I would much rather see a format that restricts the data stored to a fixed number of pure types (int32, float32, etc), just the same as standard images. I'm not sure what applications there are currently for the sparse format that require storing data as distinct types for the same fixel, but I expect just about anything would be able to make use of a format whereby each fixel contains a fixed number of values of the same type. This could be a single float for the size or t-value, or a 3-vector of uint8 for an RGB triplet, or a full vector of floats for the SH coefficients of a representation of that FOD lobe. I suspect anything that needs to store more complex information than that would benefit from splitting that information across different files, so that each file can be named appropriately, typed appropriately, and used independently in different contexts. Complex applications that would have needed to use the full-blown class aspects of the current framework would just need more inputs.

How do we proceed from here

Clearly, the simplest thing to do is simply to carry on using the existing infrastructure, with all the frustrations that this would entail for all parties involved. Although @Lestropie https://github.com/Lestropie is currently arguing that the size field should stay for various reasons, I reckon he'd also agree that the current (necessary) practice of storing a whole new fixel image (i.e. including the full fixel mapping) for each output is not very satisfactory. Those arguing that the size field is more confusing than anything else will keep getting mildly annoyed by its remaining there, and newcomers will keep getting puzzled about this.

The FixelIndex idea is more in line with what I had in mind the last time we discussed this idea, and I think there are now very compelling reasons to try an approach of that type. Where I'd disagree with @Lestropie https://github.com/Lestropie is in the details: I don't think the size field should be given any special treatment, and I don't think we should support the kind of arbitrary data structures currently allowed by the sparse format. I also don't quite see the rationale for storing the data using CSV files - surely any application that realistically wants to use the information in those files also needs access to the fixel mapping? This would negate any benefit of making the data file readily parsed by 3rd party software, since a crucial part of the format would remain specific to MRtrix.

So I propose we consider an approach like this:

-

we have a fixel index map and several fixel value data sets, each linking explicitly to their corresponding (typically the same) fixel index map.

the fixel index map is a simple 4D, 2 volume image, storing uint32 types - first volume is the number of fixels in that voxel, second is the offset to the first of those in any of the corresponding data files. In this case, a zero in the first volume (number of fixels) is sufficient to denote no fixels, so there is no need for special handling in the case of no fixels. Note the offset is in terms of numbers of fixels, rather than a byte offset as such - this means any associated value file can store however many of whatever datatype it needs to, as long as those quantities are accessible in its header.

the data files use the standard MRtrix header format, with a distinct magic first line to identify its type as a set of fixel values. It stores a reference to its fixel index image by name, a count of entries per fixel, and a data type for each entry. This information is sufficient to convert an offset from the fixel map to a byte offset in the data file.

I think this has the advantage of simplicity, clarity, compactness, and being fully defined, which should tick most of the right boxes for most people in this discussion. The issues that are likely to arise if we do this would relate to the amount of coding effort required to do this, ongoing support for the current format, some concerns about problems should people delete or neglect to backup the fixel map image (and related issues), and some potential future applications @Lestropie https://github.com/Lestropie mentioned, namely handling mismatches between fixel mappings. So about these concerns:

-

I'm already putting in a lot of effort into the updated_syntax branch, and the amount of code that needs to change is unreal. The biggest contributor to this is the switch to Eigen, which affects every aspect of the codebase. As far as I'm concerned, I already need to put in a fair bit of effort updating the sparse format anyway, I don't really mind adding an extra layer on top of that...

There is no doubt we need ongoing support for the current format - there's too much data already generated using it. The simplest would be to add the new format as an alternative to the old one, but drop support for writing in the current format. Another option is to create a dedicated converter to port old data to the new format, which has the advantage of keeping the core of the codebase lean. I don't really mind either way, although the former is a safer bet, in case there is a nasty bug in the new codebase causing loss of data (we don't want to delete old data once converted only to find the newly converted data are corrupt).

I don't think we'll have too much trouble with users getting confused about the need to maintain their fixel mapping co-located with their fixel data files. A number of image formats already have similar requirements (NIfTI / Analyse, mih, for example), along with other formats we've come up with such as track scalar files or track weights files - I don't think this is any different in kind, really.

@Lestropie https://github.com/Lestropie mentioned using the FixelIndex idea to handle mismatches. I guess the big difference here is the FixelIndex's proposed use of unique ID's per fixel rather than just an offset, which would open the possibility of skipping or adding extra fixels in one data file compared to another, and have the same mapping handle both. I'm not sure I can see how the details would work exactly, and under what circumstances this would allow operations that would otherwise not be possible. The main thing is I don't see how you could use the same mapping with data generated from different mappings (e.g. native vs. template), since there would be no guarantee that the unique ID's would be comparable in any way, unless you'd somehow already matched them across mappings in the first place. I guess we'd need to discuss this, at face value this would seem like a lot of additional storage and runtime operations (due to the need f or a lookup table of some form) for some gain that I've yet to fully grasp...

Anyway, I'll stop now. By all means let's keep discussing this, but FFS let's please keep this civil - there's no need to get personal about this or lose any sleep over it. There's nothing wrong with healthy debate, and we'd all be shitty scientists if we didn't engage in it. I'm personally really proud of the fact that we're happy to have such a frank discussion about the very guts of our latest and greatest developments in such an open forum, and I think so should all of you.

— Reply to this email directly or view it on GitHub https://github.com/MRtrix3/mrtrix3/issues/317#issuecomment-128499728.

draffelt commented 9 years ago

Sorry, it's more than 4.5Mb. Obviously the offset needs to be stored too if we have everything in one file.

On Fri, Aug 7, 2015 at 12:17 PM, David Raffelt draffelt@gmail.com wrote:

The more I think about it, I'm still not a huge fan of having a separate index, directions, and value files. When I compute fixel AFD for all subjects in a population analysis, it means the folder will contain 3 files for each patient. It's not a big deal, and I'm happy to go with it if that's what everyone else wants. I just have memories of working with large datasets using the old analyse .hdr .img format. Yes, everyone knows they go together, but it was still annoying having multiple files when moving, sharing, listing large directories, and tab-completing file names. I'd happily sacrifice 4.5Mb to have the directions conveniently stored inside a typical upsampled fixel file.

On Fri, Aug 7, 2015 at 7:59 AM, David Raffelt draffelt@gmail.com wrote:

I think the solution you proposed is a good compromise. I'm happy as long as we can view a colour-mapped fixel file in the viewer by loading a single image. Hopefully this should be possibly if the data file references the index, and the index always as a reference to the directions. It would be a pain to have to load the directions first, then colour by having to load the data file. If we really want to support dynamic thresholding by one file, then colouring by another, we could still have an option to right click and 'colour by another fixel file'. Similar to what is done for TSF files.

For simplicity I'd be happy to ditch the existing fixel format for release and provide a conversion tool.

On Fri, Aug 7, 2015 at 6:35 AM, J-Donald Tournier < notifications@github.com> wrote:

Man, I'm trying to enjoy my beach holiday in Greece here, keep the noise down!

Seriously though, if I'd known my comment was going to cause such an outpouring of emotion, I might have delayed until I was back at my desk... Instead of which I've had to go to the phone shop to get myself connected properly, just so I can add my 2 cents to the discussion. Anyway, I'll try not to add too much oil on the fire in the process. I've also tried to keep it brief, but as usual that didn't really work out - better grab yourselves a brew...

So to start off with, I guess we've all had different ideas about what the sparse format should be, and on top of that about how fixel data should be stored. I remember arguing that the sparse format should be sparse in all dimensions, not just the 4th. But then that wouldn't really impact the current discussion... The point is, we've all got our own ideas, and they've all got their own merits. This would be a lot easier to deal with if these discussions had been taken to completion before all this codebase evolved around the current implementation, but at the same time we all have to realise that sometimes the best (and possibly only) way to assessing how good a solution is is simply to try it and see. MRtrix is where it is now by just such a process (I think I'm probably at iteration 5 or 6 by now...) - just witness the current updated_syntax branch if you need any further demonstration of the ongoing evolution of MRtrix... Anyway, at least we have a solution in place now that fill most of our current needs, although there clearly is a lot of scope for improvement on many levels.

Right, onto the specifics. There's a lot of open questions, I'll try to address them in a sensible order...

What should a fixel image contain?

I have to admit, initially I was leaning towards the much simpler interpretation that @thijsdhollander https://github.com/thijsdhollander spelt out very clearly: a fixel image contains a value per fixel. What that fixel value represents is up to the user (or at least the software that generated the image), and the filename can be used to identify this information appropriately.

But having heard @Lestropie https://github.com/Lestropie's arguments for the fixel size having a special meaning as the weight of the fixel, I'm no longer so sure. When dealing with voxels, there is no ambiguity as to the volume each voxel represents. But now we're breaking each voxel up into corresponding fixels, it's clear that each fixel no longer represents comparable 'amounts' of stuff. As Rob says, these weights can be used in various contexts in addition to the value of interest (e.g. the fixel2voxel command). A large part of my initial reluctance to the having both a size and a value was about the vagueness of the terminology - it was not clear to me which field represented what, and why you'd need two of them. I still think it's not sufficiently clear which is which, and for new users, why both would be required. We've already debated whether size was the right term before, and I don't think we've settled the issue... But at least I can now properly understand the rationale for having an explicit size/weight/volume.

But then, as @draffelt https://github.com/draffelt pointed out, this is still nothing that can't be done by providing both an image of the weights, and an image of the values. So whether we give the 'weight' such a prominent role (as is currently the case) remains more about ideology (for want of a better word) than actual capabilities. We can do everything we need with a simplified fixel format, and I personally would agree that having each fixel image refer to a single quantity keeps everything much more intelligible for Joe User (myself included), since it would (or at least should) be clear which operations require the weights and why.

The obvious downside to having one value per file is the massive amount of redundancy this leads to, in terms of the storage of the fixel mapping. This is I guess @Lestropie https://github.com/Lestropie's main objection to this - why use a sparse format to then replicate most of its information across many different fixel files? I agree, but it's also clear that getting around this requires a little bit more complexity - we've already discussed that one to death before, but I guess we need to go for one more round...

What information needs to be stored in fixel images?

So let's go back to basics: what we actually need here is a 2-level relational database: first level maps each voxel to a list of corresponding fixels, second level maps each fixel to its corresponding values. We're essentially trying to set up our own database to store our fixel information... There is no simple fix for this, it's always going to be a bit more complex than we'd like. We have to try to find a solution we're all happy with that remains as brain-dead simple as possible, yet flexible enough for our current and future needs.

My first observation here is that there will be many values we want to store per fixel, and this number is going to vary between different studies and applications. From this point of view, storing 2 variables seems very odd - even if one of them (the size) can be argued to be more fundamental than the others. If we stored just one value per fixel, that wouldn't seem strange. If we could store any number of variables, that would be fine too (although that would introduce its own problems, see later). But two just seems plain odd to newcomers (and clearly to others too...), and even though the logic behind that decision does make sense once explained, I'm not sure I agree that it makes enough sense...

So the current argument about removing the size field is not just about keeping things simple, it's actually more about maintaining clarity - which file contains what information. Personally, I think that's a very good argument, and one I'd be in favour of.

The alternative approach is to store multiple variables per fixel (not just 2). The problem then is one of visibility into what data are actually stored in these files. We'd need to somehow make it clear which entry corresponds to what data, most likely through header entries, and design the UI in such a way as to expose this information to the user to allow them to choose which they'd like to use for what aspect of the display - quite a bit of work. Not to mention the fact that adding more entries (e.g. if you redo the stats using some other contrasts) would be ridiculously messy...

So I'd personally lean towards having one file per value, and rely on the filesystem to provide the naming and 'growable' aspects of our database system. But like @Lestropie https://github.com/Lestropie, I don't want this to be at the expense of negating all benefits of sparse storage. The obvious and only solution to this is to have a file that defines the fixel mapping (basically @Lestropie https://github.com/Lestropie's FixelIndex format), and store the values themselves in raw flat binary files (I don't see the point in CSV files here, more on that later). The fixel mapping would basically provide the offset into the data to retrieve the relevant data, much like it currently does. But making the mapping separate from the data allows each data file to maintain a reference to its mapping, and so keep redundancy down to a minimum. Again, this is something that we'd discussed before, but couldn't really come to an agreement about.

The sparse image format

So I think this prompts a re-examination of the sparse image format. I'm in the middle of updating the code for this to merge into the updated_syntax branch, so I think a discussion about this is very timely. There's a few points that I'd like to raise. The first is that currently, the sparse format is designed to take essentially a memory-dump of arbitrary structures (or objects). This means it is utterly un-interpretable by anything that doesn't already have knowledge of this structure. I appreciate the aim of keeping the format generic and very flexible, but ultimately I think it only make usage more obscure. In fact, I think this actually makes the format incomplete - a bit like vendors adding private tags to DICOM images. I would much rather see a format that restricts the data stored to a fixed number of pure types (int32, float32, etc), just the same as standard images. I'm not sure what applications there are currently for the sparse format that require storing data as distinct types for the same fixel, but I expect just about anything would be able to make use of a format whereby each fixel contains a fixed number of values of the same type. This could be a single float for the size or t-value, or a 3-vector of uint8 for an RGB triplet, or a full vector of floats for the SH coefficients of a representation of that FOD lobe. I suspect anything that needs to store more complex information than that would benefit from splitting that information across different files, so that each file can be named appropriately, typed appropriately, and used independently in different contexts. Complex applications that would have needed to use the full-blown class aspects of the current framework would just need more inputs.

How do we proceed from here

Clearly, the simplest thing to do is simply to carry on using the existing infrastructure, with all the frustrations that this would entail for all parties involved. Although @Lestropie https://github.com/Lestropie is currently arguing that the size field should stay for various reasons, I reckon he'd also agree that the current (necessary) practice of storing a whole new fixel image (i.e. including the full fixel mapping) for each output is not very satisfactory. Those arguing that the size field is more confusing than anything else will keep getting mildly annoyed by its remaining there, and newcomers will keep getting puzzled about this.

The FixelIndex idea is more in line with what I had in mind the last time we discussed this idea, and I think there are now very compelling reasons to try an approach of that type. Where I'd disagree with @Lestropie https://github.com/Lestropie is in the details: I don't think the size field should be given any special treatment, and I don't think we should support the kind of arbitrary data structures currently allowed by the sparse format. I also don't quite see the rationale for storing the data using CSV files - surely any application that realistically wants to use the information in those files also needs access to the fixel mapping? This would negate any benefit of making the data file readily parsed by 3rd party software, since a crucial part of the format would remain specific to MRtrix.

So I propose we consider an approach like this:

-

we have a fixel index map and several fixel value data sets, each linking explicitly to their corresponding (typically the same) fixel index map.

the fixel index map is a simple 4D, 2 volume image, storing uint32 types - first volume is the number of fixels in that voxel, second is the offset to the first of those in any of the corresponding data files. In this case, a zero in the first volume (number of fixels) is sufficient to denote no fixels, so there is no need for special handling in the case of no fixels. Note the offset is in terms of numbers of fixels, rather than a byte offset as such - this means any associated value file can store however many of whatever datatype it needs to, as long as those quantities are accessible in its header.

the data files use the standard MRtrix header format, with a distinct magic first line to identify its type as a set of fixel values. It stores a reference to its fixel index image by name, a count of entries per fixel, and a data type for each entry. This information is sufficient to convert an offset from the fixel map to a byte offset in the data file.

I think this has the advantage of simplicity, clarity, compactness, and being fully defined, which should tick most of the right boxes for most people in this discussion. The issues that are likely to arise if we do this would relate to the amount of coding effort required to do this, ongoing support for the current format, some concerns about problems should people delete or neglect to backup the fixel map image (and related issues), and some potential future applications @Lestropie https://github.com/Lestropie mentioned, namely handling mismatches between fixel mappings. So about these concerns:

-

I'm already putting in a lot of effort into the updated_syntax branch, and the amount of code that needs to change is unreal. The biggest contributor to this is the switch to Eigen, which affects every aspect of the codebase. As far as I'm concerned, I already need to put in a fair bit of effort updating the sparse format anyway, I don't really mind adding an extra layer on top of that...

There is no doubt we need ongoing support for the current format - there's too much data already generated using it. The simplest would be to add the new format as an alternative to the old one, but drop support for writing in the current format. Another option is to create a dedicated converter to port old data to the new format, which has the advantage of keeping the core of the codebase lean. I don't really mind either way, although the former is a safer bet, in case there is a nasty bug in the new codebase causing loss of data (we don't want to delete old data once converted only to find the newly converted data are corrupt).

I don't think we'll have too much trouble with users getting confused about the need to maintain their fixel mapping co-located with their fixel data files. A number of image formats already have similar requirements (NIfTI / Analyse, mih, for example), along with other formats we've come up with such as track scalar files or track weights files - I don't think this is any different in kind, really.

@Lestropie https://github.com/Lestropie mentioned using the FixelIndex idea to handle mismatches. I guess the big difference here is the FixelIndex's proposed use of unique ID's per fixel rather than just an offset, which would open the possibility of skipping or adding extra fixels in one data file compared to another, and have the same mapping handle both. I'm not sure I can see how the details would work exactly, and under what circumstances this would allow operations that would otherwise not be possible. The main thing is I don't see how you could use the same mapping with data generated from different mappings (e.g. native vs. template), since there would be no guarantee that the unique ID's would be comparable in any way, unless you'd somehow already matched them across mappings in the first place. I guess we'd need to discuss this, at face value this would seem like a lot of additional storage and runtime operations (due to the need f or a lookup table of some form) for some gain that I've yet to fully grasp...

Anyway, I'll stop now. By all means let's keep discussing this, but FFS let's please keep this civil - there's no need to get personal about this or lose any sleep over it. There's nothing wrong with healthy debate, and we'd all be shitty scientists if we didn't engage in it. I'm personally really proud of the fact that we're happy to have such a frank discussion about the very guts of our latest and greatest developments in such an open forum, and I think so should all of you.

— Reply to this email directly or view it on GitHub https://github.com/MRtrix3/mrtrix3/issues/317#issuecomment-128499728.

Lestropie commented 9 years ago

So I'd personally lean towards having one file per value, and rely on the filesystem to provide the naming and 'growable' aspects of our database system.

I definitely didn't want everything to move to a FixelIndex style storage: Dave had already expressed concerns about needing to have multiple images per subject in a group analysis, and it was never my intention anyway. To me using FixelMetric for per-subject data and FixelIndex for stats output made more sense.

My tracking algorithm project requires more thorough FOD segmentation and parameterisation. Therefore, for each fixel, I will need: Mean direction, size, primary dispersion axis, dispersion multipliers for major and minor axes, and maybe some boolean flags. And that's not taking into account boot-strapping, which could provide measures of uncertainty into all of these measures. I very much don't want to have to input each of these data independently at the command-line.

I'm also not a great fan of having inter-file dependencies (in the strong sense of 'dependency') generally. I thought about this when deciding how to handle the weights from SIFT2, and ended up just using a .csv: As long as the number of elements is correct, it doesn't really matter where the data came from, it can be handled in a meaningful way. I think the same holds for fixel data: I have no problem having such data as a raw binary dump rather than a .csv, but an explicit link in such a file back to the sparse image would be required to enable this:

I'm happy as long as we can view a colour-mapped fixel file in the viewer by loading a single image.

Having said that, I think Donald has used a similar logic to arrive at these kind of ideas to what I did; but I might just make my original logic explicit here in case it brings up anything new:

When I was implementing the format, I had envisaged that the FixelMetric class would handle per-subject data, but the results of statistical analysis would be written to a single sparse image file (per test design) using a new class; let's call it FixelStats for now. This class would contain, for each fixel: Direction, size, T-statistic, enhanced T-statistic, corrected p-value, effect size, ... The viewer toolbar would then recognise such a file, and present to the user all relevant options for changing the visualisation based on these data, all encased within a single file. However hearing Dave speaking about the need for a fixelcalc command, I presumed that it would be dangerous to hard-code such a class, as an incremental code change could render the class incomplete. Therefore, I considered the case where the number of data fields to store for each fixel would be infinite, and access to the data would be necessary for mathematical manipulation, and concluded on the FixelIndex idea.

Just to be sure: Would this not work? @draffelt? Because if it would, you could have all your stats output in a single file (per design) ready to go, my existing and future code would be unaffected, there would be no need for change to the image format, .... Sure the directions would be common between each file derived from the same fixel template, but that redundancy would be nowhere near what we have now, and would be sensible in order to have all data necessary for assessing a statistical result bundled in one neat file.

I also don't quite see the rationale for storing the data using CSV files - surely any application that realistically wants to use the information in those files also needs access to the fixel mapping?

No. Dave's quintessential example is converting GLM betas to effect sizes: my understanding is that this can be done on a per-fixel basis without any knowledge of the position or other parameters of the fixel. So the processing can be done on raw vector data, using any software; it's only when combined with the fixel information that you get the context of each value.

The first is that currently, the sparse format is designed to take essentially a memory-dump of arbitrary structures (or objects). This means it is utterly un-interpretable by anything that doesn't already have knowledge of this structure.

I would much rather see a format that restricts the data stored to a fixed number of pure types (int32, float32, etc), just the same as standard images.

A series of floats in a file is just as un-interpreable, in a way. Sure you can read it, but if you don't know what each of those values actually represents, there's not much you can do with them. The class structure is open-source; downloading that file will tell you both what's in there and what each represents.

I'm not really convinced that that restriction would make the format any more 'standard': as long as you know the class template, and know the entry point, you know what data is what. It's also nice to be able to read a class such as a Point directly from the data (even if convenience functions could provide similar). I also wanted to be able to use uint8's as combined boolean flags in a future fixel class, and potentially integers in another; hence the class dump approach.

Lestropie mentioned using the FixelIndex idea to handle mismatches. I guess the big difference here is the FixelIndex's proposed use of unique ID's per fixel rather than just an offset, which would open the possibility of skipping or adding extra fixels in one data file compared to another, and have the same mapping handle both. I'm not sure I can see how the details would work exactly, and under what circumstances this would allow operations that would otherwise not be possible. The main thing is I don't see how you could use the same mapping with data generated from different mappings (e.g. native vs. template), since there would be no guarantee that the unique ID's would be comparable in any way, unless you'd somehow already matched them across mappings in the first place. I guess we'd need to discuss this, at face value this would seem like a lot of additional storage and runtime operations (due to the need for a lookup table of some form) for some gain that I've yet to fully grasp...

I think this might have been a mis-understanding. Could be one of two things:

Lestropie commented 9 years ago

Also, just because I can...

A large part of my initial reluctance to the having both a size and a value was about the vagueness of the terminology - it was not clear to me which field represented what, and why you'd need two of them. I still think it's not sufficiently clear which is which, and for new users, why both would be required. We've already debated whether size was the right term before, and I don't think we've settled the issue... But at least I can now properly understand the rationale for having an explicit size/weight/volume.

... How about 'density'? maniacal laugh

jdtournier commented 9 years ago

OK, I'm getting totally conflicting messages about what the problems are and what each person is after in this discussion. Let me try to recap to see if I can sort out the mess in my own head...

Currently we have a format that stores an image of 64-bit byte offsets into the same file, with the first element encountered being a 32-bit fixel count for that voxel, followed by that number of 20 byte entries for the fixel-specific information (3xdirection, 1xsize, 1xvalue, each as a 4-byte float). I'd forgotten about the direction being included in there, this really does mean the amount of redundancy is ludicrous if we only store one value (or even 2) per fixel, and then use different files for different values. As has been mentioned before, only 20% of the per-fixel information would actually change between files, which would equate to much less than that by the time you include the original mapping (i,e. the per-voxel byte offsets). It really is a monumental waste of space, regardless of whether the size is stored alongside the value or not...

But then we have conflicting wishes from different people. Some would rather keep each value in its own file - this is also my preferred option, it allows for dynamic growth, allows arbitrary naming of each value, keeps the UI simple, makes things clear and obvious as to what information is required for what command, etc. But as things stand this is an ugly solution, with the enormous amount of data duplication this entails.

Others might prefer that the various outputs were co-located in the same sparse file, which avoids duplication of information, but at the expense of visibility into what data are actually stored in the file, no dynamic growth of the data (can't easily add fields to an existing file), and reduced flexibility into the information stored (depending on how it's implemented). On that last point, what I mean by reduced flexibility is that either you have to use the same data type for all values (what I'd suggested, but for a split format), or you have to modify all the code that might interact with the data if you decide to add an entry to each fixel (the current template-based sparse format).

And we also now have intermediate proposals with the FixelMetric & FixelIndex classes - we keep the current format as-is for those cases where it can store all the information we need anyway, and use a split format for those cases where we anticipate many per-fixel values. Kind of the best of both worlds, I guess, but it still doesn't seem to tick everybody's boxes...

So let's point out the obvious: we can't have our cake and eat it. Something's going to jar somewhere. Which is why I'd argue for the most brain-dead simple approach we can come up with, but one that also allows maximum flexibility. I won't repeat my proposed approach to this, but I'll just add one more feature to it: all elements of the format will be enclosed within their own folder. So that would allow us to use the filesystem to name both the overall image itself (the folder name) and all elements within it (the individual data files), allowing each data file to have its own type and count, and providing a very transparent and flexible format for ourselves and 3rd party applications - something that is actually much closer to a database in fact, but without needing to implement our own backend or start using sqlite or something... And even If an application needs to have a enormous number of per-fixel inputs of different types, simply naming the folder on the command-line is sufficient to access the entire data set - all that's required would be consistent naming conventions for the data files. I think an approach like this would solve most of the issues people have come up with so far, and would provide all the flexibility we need for future applications.

Another reason I prefer this approach is it remains entirely generic, and completely independent of fixel-based stuff. The current solutions put forward all at least include the fixel direction (+ size) within the mapping itself, which ties the format very explicitly to fixel-based analysis. What if I wanted to use this format to store non-directional data, that nonetheless had a variable number of 'elements' per voxel - maybe tissue types for instance? I don't think it would be inconceivable to use such a format for such data, especially if we wanted to compute statistics per tissue type, with all the outputs that entails - the amount of sparsity would probably be similar to the current fixel-based approaches...

I'm also not a fan of using the template-based sparse format approach - as I said before, I think it leaves the format undefined. It's more a specification for a format than a format as such. Of course, you can always browse through the code to figure out what information each fixel contains, but then you could do that for the most obscure of data formats - by the time you've had to look through the source code, you can't argue that it's a simple, write-your-matlab-reader-once job to support the format. By sticking to standard data types, you do allow the format to be specified completely as a standalone specification. Each field is named externally (i.e. not within the source code only), so allowing interpretation of the data by users or 3rd party developers without resorting to browsing through source code.

As to the unique ID issue:

With regards to fixel mis-matching, I was speaking in the context of matching fixels between subject and template space. The second streamlines clustering approach relied on forming fixels based on a dixel TDI, then matching fixels between adjacent voxels: an almost identical problem. I know from my experience there that having the density of each fixel, independently of any other parameter, is useful; having a measure of dispersion for each would be even better, but was too narrow a use case for inclusion in the class.

But all this is about within application processing, not storage and subsequent analysis as a separate step? There's nothing to stop you from keeping all this information together while processing...

More to the point, matching dixels to fixels is actually quite different from matching fixels to fixels - you have a fixed number of dixels per voxel, and their directions are pre-specified. So you can have unique ID's per dixel that are guaranteed comparable across voxels or even across images, but I don't see how you can guarantee that for fixels - see second point below.

With my FixelIndex approach, the sparse image would store a direction, size, and integer index for each fixel; the index is just a row number for reading from external data files, whether it be .csv's or something else. So they're unique identifiers, but they're not randomly generated or anything like that. It sounds to me like you're suggesting having the sparse file provide the first index and fixel count for each voxel only; the fixel directions are then provided as the first of (potentially many) external data files, with 3 floats per fixel. This to me would feel like a regression, providing in separate files what was previously encapsulated in a single file. But like the FixelIndex approach, it depends on the relative priority for allowing an infinitely-expandable list of parameters for each fixel.

I'm still don't see how this would make any difference. If your unique ID is the row number, then these IDs won't match the minute one of your voxels has a different number of fixels within it - every ID after that would be offset between the two files. And even if you did account for that (let's say you use a two part unique ID, to specify the voxel and the fixel number within it), you still can't guarantee matching since the order of the directions might be completely unrelated - and that assumes the number of fixels matches. By the time you have a unique ID that genuinely allows to take into account voxel location and direction, so that two fixels in different files genuinely are guaranteed to match, you might as well just use the existing voxel index and direction information as-is to handle the matching problem... And there's no way you can guarantee that two fixels with different unique ID's don't match, they might still be very closely aligned with each other, but happen to have been assigned different IDs...

Also, if you were to use the row number as the unique ID, then this ID is completely specified by having just the ID of the first fixel within a given voxel (i.e. its offset), and the number of the fixel you're interested in within that voxel. Unless I'm missing something...?

Anyway, all for now. I'm guessing this discussion will be going on for some time, but let's all make sure we chip in with our points of view, it's worth getting this one right...

Lestropie commented 9 years ago

But all this is about within application processing, not storage and subsequent analysis as a separate step? There's nothing to stop you from keeping all this information together while processing...

Depends on whether the per-subject FOD images are fed to the command, or a pre-segmented fixel image per subject. If the latter, and the size field were not present in the fixel image format, I wouldn't have this information to work with; unless multiple input images per subject were provided in order to get that information, which comes back to the data redundancy argument.

As an alternative example, consider the case where two subject fixels match to a single template fixel, and your quantitative measure is mean axon diameter (as a demonstrate example... bear with me). You want to derive an appropriate value for mean axon diameter for that template fixel for this subject. The most physically appropriate thing to do would be to take a weighted average of the two subject fixel diameters, with weights according to the relative volumes of the two populations.

These examples are why I consider the fixel volume to be a fundamental parameter like the direction, rather than 'just another value'.

More to the point, matching dixels to fixels is actually quite different from matching fixels to fixels - you have a fixed number of dixels per voxel, and their directions are pre-specified. So you can have unique ID's per dixel that are guaranteed comparable across voxels or even across images, but I don't see how you can guarantee that for fixels - see second point below.

Mmm, still some misunderstanding I think. In that method, the dixel TDI is used to determine an appropriate set of fixels in each voxel; subsequently to that, the resulting fixels must be explicitly matched between adjacent voxels, even though the number of fixels in adjacent voxels may differ. This is very much akin to matching per-subject fixels, as segmented from the per-subject FOD, to a group average template set of fixels.

I'm still don't see how this would make any difference. If your unique ID is the row number, then these IDs won't match the minute one of your voxels has a different number of fixels within it - every ID after that would be offset between the two files. ...

Again, I think there's a mis-understanding. In the approach I had proposed, the sparse image format would still handle the number of elements stored in each voxel, and each element would be represented as a data dump of a class structure. However, in the specific case of a sparse image containing the FixelIndex class, the data for each element (fixel) would be a direction, a size, and an integer: that integer would correspond to a row index in any data file that may be associated. Therefore, if you have two data files originating from the same fixel image, it is guaranteed that the data stored in row 4191 in each file corresponds to the exact same fixel in the originating fixel image. However, you don't actually know which fixel that is, unless you load the fixel image and find the fixel that contains index 4191. In the case of CFE, the output of the command would be (for each contrast) one fixel image using this type, and a series of raw data files, each containing the value of some parameter for each fixel.

Note that this is a completely issue to the explicit fixel matching problem described in the previous paragraph.

Also, if you were to use the row number as the unique ID, then this ID is completely specified by having just the ID of the first fixel within a given voxel (i.e. its offset), and the number of the fixel you're interested in within that voxel. Unless I'm missing something...?

You could. But if you're already going to the effort of storing per-fixel data within the fixel image, you might as well make it fully explicit and store the required index for every fixel, rather than making assumptions about the ordering within each voxel: in the current implementation the user requests memory for some number of elements in a voxel, but they are not forced to write the data for those elements in sequential order.

Overall,

I agree that a format that allows dynamic growth with minimal assumptions about data content would be great. But if by trying to come up with a completely generalized solution, we end up with something relatively unwieldy when it comes to actually using it (which includes potentially being broken by a user renaming a file within the sparse contents directory), I think it's a decision we and other users would resent. In its current state, there is still no assumption in the format that the data involved are fixels: that comes from the template class used currently.

The problematic data redundancy comes from using the current fixel class as the output of fixel statistics; as per-subject data it serves its purpose just fine. So if there's a better solution for handling specifically the output of fixel statistics, that doesn't require a massive format change and still lets me do the things I designed the current format to allow me to do, that's still my preference. That could be the FixelIndex class, or (if possible) it could be embedding all data relevant to fixel statistics within a single large fixel class (waiting to hear from Dave to know whether or not this is actually possible).

Lestropie commented 9 years ago

Just a minor thought:

In fact, I think this actually makes the format incomplete - a bit like vendors adding private tags to DICOM images.

Would you consider the format more 'complete' if the image header included a printout of the elements of the class? Should be possible with a bit of macro trickery.

Edit: This would actually be a neat way to improve the sparse class type check: Currently it's based on a typeid().name() and a sizeof(); but the former is compiler-specific, so may not always provide file compatibility between systems.

thijsdhollander commented 9 years ago

Ok, there we go. I will post a few (constructive) replies. I'm again a bit limited in time; that's why some things are maybe a bit simplified (but I can always elaborate over time, the discussion is an open one any way). I'll first clarify a few things I initially posted way back around the beginning of this issue's thread, as it seems they provoked some strong emotions (and rest assured: they really shouldn't).

  1. It seems my argument against the argument of a "status quo" was given a bit more weight than it should've. To clarify: I don't mean to say that it invalidates any other arguments or something like that. I was only referring to the "status quo" argument by itself. By saying "status quo is not (very much of) an argument for me", I'm also not saying that is an argument in favour of changing just for the sake of changing. What I am saying is that "if there are better ways to do this, then at this specific point in time (before the big release, if you will) we should not be bothered with a status quo".
  2. A bit the same thing for what I was -admittedly in a silly manner- stating about "let's be brave scientists" (or whatever). It's not an argument to change things "just because"; it's an argument to not not-change things just because "status quo". That's all. :smile:
  3. There's been other elements in this thread now already indicating that we're in the middle of big changes as part of the syntax overhaul anyway. So this is a good opportunity to consider this whole issue at hand. That's again not shouting "change just because".
  4. Based on all the opinions that have been provided so far, I think it's clear that all people involved feel strongly about getting this right. Dealing with fixel data (or even more generally, sparse format data) is (going to be) a pretty central and important element of MRtrix. So it's also kind of our job to get this right. We have the unique opportunity now to do that, and part of that is discussing what we feel "right" actually is in this context.
  5. (I feel I'm making the same point here 5 times, but...) we'd be pretty lazy scientists if we can't just openly discuss this without getting personal emotions involved. Because it was already a pretty heated debate to begin with, I've put in that little disclaimer somewhere in my initial opinion as well: this is not a winner-loser debate, this is a winner-winner debate where MRtrix (and thus all of us as well) wins in the end. No pressure: we're all clever people; putting our minds and opinions together can only yield a better outcome. We're already winning a priori. :+1:
  6. All of the above implies this one, but still: just because more than 2 people have the same opinion, doesn't mean they're ganging up on anyone. It just means we might at some point be a small step closer to an outcome.

Now, on to actual content. I think I need to clarify a bit here as well. When I'm virtually shouting all the time about how "this is about fixels", and about all the (heavy all over the place) implications associated with it....

  1. I don't mean that this discussion affects the actual definition of fixels. That one is given, written, part of science as it stands, etc...
  2. I don't mean we shouldn't care about the implementation/coding side of things.
  3. ... or more generally, technical implications of choices we make. I understand memory and disk space are important resources, and that the sparse file format was actually specifically created to deal with these resources in a non-wasteful manner.

Now what I do mean is that we shouldn't be naive. We know that introducing new terminology is difficult. Fixels are no exception. As much as we'd like them to already be common diffusion slang, I think I can safely say the term/concept is not yet widely adopted. And as much as we see them as a logical thing, and as much as we ourselves can clearly separate the concept of fixels from the actual ways in which they happen to be implemented or are part of an implementation somewhere in this software we call MRtrix, I don't think we can assume that every user has these capabilities. But that's not a problem, since in between us as developers and the actual end user, there's this magical device called the user interface. It's not because we make complex (but efficient, well informed, etc...) choices on the implementation front, that the user should suffer. Now when I talk about users and user interfaces (and before I'm misunderstood), let me define what I mean by them (specifically) in the arguments that I made:

  1. When I just said "users", I'm referring to the far end of the spectrum; the end-end-end-user if you will. Not the one that knows C++ and will be using the library to implement their own stuff. Not even the one that will script up things that interact with the commands externally. Maybe the one that is just comfortable with command lines. Or maybe not even that yet. But just the one that wants to do his processing, and has heard about this fixel-based analysis somewhere (but didn't fully read the related papers yet).
  2. When I just said "user interface", I'm not just only referring to the viewer and the commands, but also to the files as the commands spit them out. Important here: that doesn't say anything about how classes-and-all-that-technical-whatever we play around with exactly works. It's everything that that user I just mentioned deals with. That's how we interface with that user.
  3. You could say a manual is also part of the user interface. That is true. But that doesn't mean it's a good idea to cram 50 pages of technical justification in there, as a requirement for the user to be able to interact with the software (or even appreciate its interface). A manual is important for reference, but it shouldn't become a burden up front. More importantly, you can not even expect that the manual will be each user's starting point. Some anarchistic users will start playing with the software right anyway, and happily ignore the manual. You could say that's their choice, and that's right. You could also realise however, that if because of that, they get another idea of something (let's say, the concept of fixels), we might actually end up losing. You'll see all sorts of weird interpretations of fixels everywhere, or even broken results (not even taking into account the amount of time we have to waste on "support", because the user interface wasn't self explanatory). No good publicity for the fixels, not doing a good job of our intention of spreading the understanding of fixels. So when I say "this is about fixels", I mean that the software is also sending out messages (as are papers, manuals, etc...). We can say it's "not our job", and "it's the users' responsibility to go and read the manual". But how silly is that, if we can actually fix some of these things up front by providing a sensible interface? If because of high entry requirements for MRtrix, they go with another software/concept and start comparing FA values and call them "white matter integrity" in their ISMRM presentations, we'll only end up frustrated in those sessions as well. :imp:

To conclude this: we can only gain (a lot) by having a good interfacing mechanism towards the user. One that is as self-explanatory as it can get. One that adheres to the understanding the end user has of the concepts that we offer. And thus, one that also externalizes these logical concepts. In this kind of reasoning, for example, DICOM sucks big time as a good format/"standard": it's a container for everything, yet nothing specifically. You might've just handed the user a blank piece of paper, and they could draw their images on there however they wish (and we'd all scan and print these images back and forth to each other or something).

I steer away from some of the technical aspects quite often in most of my arguments, because I see they're already covered more than enough in a lot of you guys' arguments. This is just my way of bringing in some other aspects that I think are important as well. :wink: I'm not waiving off the other arguments. As an example: yes, I also care about data storage size. :smile:

Ok, next I post, I will quickly post something with the intention of moving forward to an actual solution... just wait a sec...

thijsdhollander commented 9 years ago

So I guess I can keep this quite simple actually. To begin with, to answer @jdtournier 's

OK, I'm getting totally conflicting messages about what the problems are and what each person is after in this discussion.

@jdtournier : reading your first post (the one that pulled you off the beach :sob: ), I'm getting the strong feeling that our opinions and general ideas on this are highly correlated. You actually worded some of the things that I tried to word in better ways. That came in really helpful when I was rambling a lot without possibly saying much. :+1: I also think it's indeed crucial to realise that

So let's point out the obvious: we can't have our cake and eat it. Something's going to jar somewhere.

If file size / data storage wasn't an issue (but I know it is), I'd have one "fixel file" (that happens to be implemented in a sparse format, but that's just a technical behind-the-scenes aspect of it) per "value". That way each "value" could be interfaced with in the most straight forward and clear manner possible. Files could then optionally maybe also be thrown together in folders, for those use cases where you need a hefty amount of (kinds of) "values". But each file would also happily live independently. Obvious issue already pointed out by all of you: it would replicate the indexing mechanism as well as the directions per fixel, for each "value". That's a big no-no, I agree.

Just putting each of these things we don't want replicated in their own files, solves this. It's extra files though, so that's where it jars a bit. I have to agree, this also slightly annoys me; but I personally find it the least annoying of all the things that have the potential to "jar" in other solutions. Using folders as @jdtournier suggested actually mostly relieves this annoyance. It could actually open up possibilities as well: if a fixel-based analysis type of scenario could do its fixel-matching type of step(s) beforehand in certain scenarios (I'm being very general here on purpose; I'm not referring to any specific (or existing) scenario that might work in another way), you could actually have a single fixel-indexing-file and a single fixel-direction-file for that whole analysis. Maybe there'd then just be a fixel-value-file per subject, and everything would be together in a folder (because all these fixel-value-files related to the same indices and directions). It doesn't get more efficient data-size-wise than that. Warning once again: this is just a potential hypothetical use case or scenario; I know that there's also scenario's where you'd want unique "fixels" (indices/directions) e.g. per subject or per [add in other grouping concept here].

As a side note: this matches conceptually quite well with the "logic" we also have in place already for track files and track scalar files. The tracks (the structure / segmentation / ... if you will) are in their own file, as well as potentially different scalar files that need the tracks most of the time to be meaningful. We also don't put these things together, hidden in a single file. Interfacing makes sense. Writing matlab scripts as well.

So I actually find @jdtournier 's proposal an excellent basis to work from. There's definitely choices still involved in that one as well (e.g. maybe we can make the file-naming requirement a bit more flexible, by including something in each fixel "value" file's header; maybe files don't need to explicitly link to each other, but the link is "implicit" because applicable commands can expect to find only a single indexing/direction file in each folder and warn the user if there's none/multiple; etc... etc... warning: I didn't fully think these vague ideas through at this point). But rather than pinning completely different proposals against each other, we could try to at least agree on a common basis?

So just to be very clear again: in my personal opinion, @jdtournier 's proposal is one that ticks all the boxes for me to be a great basis for a solution/outcome to this issue. :+1:

thijsdhollander commented 9 years ago

Final side note: for the transition, I think a conversion tool would probably work well. Because the differences are quite externalized, it should probably have less potential to cause issues as compared to the SH basis switch (where it was potentially less obvious to users whether they were working with the right version or whether it was already / or needed any conversion).

jdtournier commented 9 years ago

There's so many different bits to this discussion, I'm thinking of putting in a request for GitHub to allow sub-threads...

Unique IDs

So I'll start with the unique ID discussion: I clearly had misunderstood things here, I'd assumed that the unique ID would also be stored in the value file, one for each row, to allow totally arbitrary ordering. I guess if it's just a row number, then that's essentially what I had in mind too...

So the only difference relates to storing the unique IDs explicitly versus implicitly as the simple combination of fixel count and index of first fixel for each voxel. I can see the slight increase in flexibility the former would bring in how the fixel values are stored, but I don't see that benefit being meaningful in practice. To expand:

  1. the order of the per-voxel fixels in the original file is fixed: all you get is the offset to where they are in the file. If you don't need that flexibility for the main values, why would you need it for any additional associated values?
  2. once the indices are set in the main FixelIndex file, they're fixed across all associated value files. So if you need the flexibility to reorder for one type of file, you can't then change that for any other associated file. And if you generate your FixelIndex map without any particular regard for ordering (i.e. that particular application just used the original order), you can't then change it for subsequent commands where the order might matter.
  3. we discussed recently how sparse storage would be used in practice, and it was stated that applications would do all the processing in RAM-only dedicated classes, with the data written out as a final separate step. If this is the case, then you'd need to actually go out of your way to write your results in a different order than what was in the original index file.

To me, these points combined argue strongly against storing IDs explicitly, given that the alternative approach is actually simpler, more compact, and likely to lead to better computational and storage performance (at least at read time). I'll expand on the performance issues here, by looking at how various operations would be handled in both cases. For the first two points here, bear in mind that every pointer dereference brings with it the possibility of a cache miss, which can cost a few hundred lost CPU cycles due to latency in the fetch from main memory. So it's worth trying to limit the number of dereferences as much as possible.

  1. reading fixel count - one dereference into index image, then:
    • current format: one branch point (to check for null pointer), and pointer dereference if non-null
    • proposed format: nothing further
  2. accessing fixel value - as above, then:
    • current format: one pointer dereference to get fixel data, a further deference once index for fixel has been read.
    • proposed format: a single pointer dereference directly into associated value file (since offset is computed directly from index image).
  3. storage requirement for index file only:
    • current format: 1 byte offset (8 bytes) per voxel + 1 count (4 bytes) per populated voxel + 3 directions (3x4 bytes) and 2 values (2x4 bytes) per fixel. 64-bit byte offset is required to access potentially large files, and introduces no restriction on file size (a good thing). However, use of a byte offset means it is not immediately compatible with the concept of an associated file (hence the need for a FixelIndex class).
    • proposed format: 1 index (4 bytes) and 1 count (4 bytes) per voxel in a totally standard 4D, 2 volume image. Directions and values stored separately, but it would be possible to link these to each fixel by reference to their corresponding index file (whether determined implicitly through folder layout or explicitly in the header). The use of a 32 bit index restrict maximum number of fixels (but not of the actual image, or of the amount of information per fixel) to ~4 billion, which is a good 3 orders of magnitude greater than currently needed even for an oversampled analysis. I think by the time we manage to achieve the ~10-fold increase in resolution along all spatial axes needed to get to the stage where the number of fixels exceeds that limit, MRtrix will have outgrown this format - or simply switched to 64 bit values for these quantities. To put it another way, if we reach this limit while storing a single 32-bit float value per fixel, we'd be talking about a 16GB value file - a lot bigger than the ~10MB we're currently talking about.

All this to say, if the discussion is about what the best approach is regardless of what has been done so far, I'd say my proposal for the index image (i.e. for each voxel, fixel count + index to first fixel) is preferable. However, it's clear that the current sparse format wasn't designed to do this, so the only sensible way to manage associated files assuming we have to stick with the current sparse format, is to store explicit unique IDs / row numbers - i.e. the FixelIndex approach.

Another (minor) advantage to storing using my proposed format is that loading the image in the viewer would show a map of fixel counts per voxel - the nuFO map, as Flavio would call it. A neat benefit, given #276...

The sparse format

So I've mentioned before that I consider the current sparse format incomplete because it doesn't have a single clear specification that can be relied on by 3rd party tools (including our own for that matter) without inspecting the MRtrix source code. While including the per-fixel storage layout in the header would at least allow each file to document how its data are stored, I don't think it gets over the bigger obstacles that I'd mentioned before. There is no simple way to design a generic sparse format reader for this type of image without essentially designing some type of bytecode representation of the data - which is getting us pretty close to a full-blown interpreted language.

The complexity needed for applications to simply read or display the values, and the associated performance penalty, would in my opinion be too great. One option is to process all the data using the bytecode representation I'd mentioned above, which is bound to be more computationally involved. The alternative is to design every possible command that accepts sparse data to have knowledge of every class that has been stored as sparse data so far, but that means the amount of code branching would become larger and larger as the number of classes stored using the sparse representation grows. And we'd have to maintain support for at least some legacy classes too, even if we decide to deprecate them - as soon as the code is released, we have to assume some people have generated data in this format. Imagine having to design the viewer to support all the known sparse classes. Then imagine adding such a class to the codebase - you'd have to add an additional branch point in the viewer just to display your results. And then imagine you modify this class slightly: now you need to add another branch point for that, but you can't get rid of the old one because the previous code was already public. This is simply unworkable, and would rapidly become a maintenance nightmare. OK, the viewer is one specific case where a bytecode representation might work - it wouldn't necessarily be performance-limiting. But any other command intended to operate generically on sparse data would be subject to this problem.

And I really don't think there's any great need for this. Of the example use cases discussed so far, just about every single one of them has involved storing a vector of floats per fixel - the only exception to this was a couple of boolean flags that might be needed for an application yet to be developed. There would be nothing stopping someone from taking one of the float entries and using it as a bitfield - it would be a bit unorthodox, but not particularly surprising. More to the point, it would mean that we can have a simple format for sparse storage, that allows direct mapping if needed (either as a C array, or using Eigen's mapping API, or by casting the pointer to a Point, or indeed any other class, etc.), and allows other applications to simply use the data in a very generic way. This means displaying these data in the viewer would 'just work' without needing any modifications. Obviously the values would be a bit meaningless when displaying floats that actually correspond to bitfields, but all the other values would display just fine. I think there are massive advantages to being able to do this - at the cost of some flexibility that isn't currently needed, as far as I can tell.

Self-contained or split file format

The last remaining aspect to this discussion is whether we allow split format, or stick to a single self-contained format. If we allow a vector of values to be stored per fixel within a single sparse image, then this would allow applications such as the fixel stats to store a long vector of results alongside the directions and everything else - all that would be needed is a description of what each volume corresponds to, which could easily be included in the header (as good idea, since that would make it much easier to inspect the results in the viewer, as you could select the desired value by label rather than index). This would at least reduce the amount of redundancy we currently have, and ensure each file remains self-contained and independent.

The downsides to this are, as discussed before:

The alternative is to store everything in separate files, with either implicit dependencies (i.e. using a known folder structure and filenames), or explicit dependencies (using header entries). I'd personally favour the former, since it provides more clarity as to what belongs with what (it's all on the filesystem, no need to inspect headers), and allows for a richer set of interactions (can load the whole folder at once, or a specific value file, or indeed the mapping itself). I appreciate people's aversion to allowing such inter-dependencies between files (this includes myself, by the way), but I think on this occasion the benefits far outweight the minor risks that this might introduce. This is after all how HTML pages reference each other, how code gets compiled, and how PowerPoint (used to) include movies... ;) And anyway, the minute the user tries to use their data, they'll probably twig that renaming the index map from index.mif (or whatever) was a Bad Idea, when the error message is "can't find sparse index image index.mif". And of course, we can (should) always document all this on the wiki...

Anyway, I'm not sold on using a single file or a split file format at this stage. I'd prefer the split format since it will definitely prove more flexible in the long run, but I can understand people's reluctance on that one. I do feel however that given half a chance, we'd actually find it quite easy and intuitive to use. But that's just my 2 cents - really this decision will come down to the long-term usage patterns this leads to for end-users (e.g. on the viewer), and the actual storage gains it affords in practice.

All for now - hopefully I've covered all the issues I wanted to discuss...

draffelt commented 9 years ago

OK, so I've had a nice little holiday from this thread. Although my weekend in resplendent rainy Rosanna was probably not as nice as the Greek islands. Btw @jdtournier , @rtabbara wants to know if you have bought an island? If so, can we use it for an biannual MRtrix conference? It would be preferable to have these long discussions in a bar by the beach.

It's taken me close to 1 hour of reading to catch up, so I'm not going to say much. Now that I've got a better understanding of what Donald has proposed, I'm definitely in favour for the following reasons:

draffelt commented 9 years ago

So I'm coming around to the idea of the viewer being able to detect and load all the data files in the folder, which can then be used to selectivity to colour, scale and threshold dynamically. I guess we could just make sure that what ever data file is selected in the file chooser, is the one that colours the fixels upon load. For example, since it's unintuitive to use the file chooser to select a folder, we would need to select a file. If we chose the fd_stats/directions.msf file it would then colour by direction. It would still detect the rest of the files (tvalue, pvalue etc) and load them, which would be settable via a dropdown for each of colour, length, and threshold. If we chose the load the pvalue file, then it would colour by pvalue upload load (but still present the other data files for colouring).