NeurodataWithoutBorders / nwb-schema

Data format specification schema for the NWB neurophysiology data format
http://nwb-schema.readthedocs.io
Other
53 stars 16 forks source link

possible to have multiple valid dtypes? #209

Closed NileGraddis closed 5 years ago

NileGraddis commented 6 years ago

This is an issue that @JFPerkins and I ran into with imagemasks. The dtype is specified as float, which makes sense for representing a weighted mask. However, if we have an unweighted mask or one with a discrete enumeration of possible weights it would be more space efficient to store the roi masks as an integer type.

This leads to a general question: can we have multiple valid dtypes for a dataset in the spec? In the schema docs it appears that valid dtypes must be drawn from a list of primitive types and we can't find any references to union types over those.

ajtritt commented 6 years ago

As of now, that is not possible. You would probably need to modify DtypeSpec and/or the classmethod build_const_args.

The use case you describe might be better addressed with a different neurodata_type though. Do weighted masks get treated differently than an unweighted mask or a mask with an enumerable set of values? Do they represent different results? Encoding these different mask types would be better handled with different neurodata_types than primitive data types.

bendichter commented 6 years ago

closely related to #594

NileGraddis commented 6 years ago

@bendichter good find - that feature would be pretty helpful for this case.

@ajtritt I can see both within- and across-neurodata type cases. For instance:

  1. The image mask represents the output of a multilabel segmentation (with no natural order defined over the categories) - this only ever wants to be an integer type, since there is a large finite set of possibilities.
  2. The image mask represents the output of a binary classification. This has a weighted case, which ought to be floating point numbers and an unweighted case, which ought to be boolean. However, these represent the same kind of information; there is just more detail in the weighted case.

I think the changes proposed in #594, which if I understand it correctly involves:

would cover our needs pretty well. There is an additional topic about allowing multiple underlying representations of equivalent data (for instance: sparse vs. dense), but that might be for another time.

bendichter commented 6 years ago

@NileGraddis

There is an additional topic about allowing multiple underlying representations of equivalent data (for instance: sparse vs. dense), but that might be for another time.

We have the H5DataIO class, which wraps data and instructs HDF5 how to store it. These instructions do not change the value of the data or the form- things like compression, chunking, etc. It seems like sparse vs. dense would fit well in there.

ajtritt commented 6 years ago

@NileGraddis thanks for the clarification.

From what you say, I think modifying the specification language is the wrong route for solving your problem. By using dtype alone, you would be saying that ImageMasks.dtype=int implies that this is the output of multilabel segmentation, while ImageMasks.dtype=bool implies this is the output binary classification. We want to avoid encoding special meaning through the use of primitive type.

My suggestion is to make an AbstractImageMask with dtype=null, and then subclasses that represent the different types of ImageMasks.

bendichter commented 6 years ago

@ajtritt That sounds like a good solution to me. Do we currently support dtype=bool though?

https://schema-language.readthedocs.io/en/latest/specification_language_description.html#dtype doesn't list it and we have an outstanding issue for it

https://github.com/NeurodataWithoutBorders/nwb-schema/issues/175

ajtritt commented 6 years ago

@bendichter no we don't, but I created #658 for the necessary changes to PyNWB

NileGraddis commented 6 years ago

@ajtritt Thanks for the fast response. I think we are on the same page about using (rather, not using) data type to encode semantic information.

Accepting multiple types where it makes sense is a pretty handy feature, though. We commonly run into situations where actual data occupies only a subset of the possible values for data of that kind, but those values have the same semantics as they would if the full range were used. A couple examples:

  1. We are doing binary classification on pixels. Our algorithm outputs a value in [0, 1] for each pixel, where fractional values indicate uncertainty about whether the pixel is is in the first or second category. We then switch to an algorithm which does not output confidences, but simply a binary call. The values {0, 1} have the same interpretation in both of these cases, but in the latter we would be wasting 31 bits per pixel by storing the data as float vs. bool.
  2. We are doing multilabel classification with two labels. If there were more labels, we would need ints to store the data, but in this case we can use bools. In each case our users get one unique value per label, but we save on storage.

Anyways, I think the addition of a numeric dtype (and maybe generic int, uint) would do the trick for us. We're a bit focused on file size as we work to switch over to NWB 2.0 fully. This involves writing thousands of files and them serving them out to users on demand. The work that you guys have done to enable compression options is really handy for us and we're always on the lookout for additional optimizations.

oruebel commented 6 years ago

@NileGraddis these specific examples are in my opinion best covered by different neurodata_types as the different dtypes in this case really imply different kinds of masks:

bendichter commented 6 years ago

@NileGraddis FYI I moved this to the nwb-schema repo using the new github feature to move issues. I'm trying to move issues that involve changes to the schema here since they are relevant to other groups e.g. matnwb.

oruebel commented 5 years ago

https://github.com/NeurodataWithoutBorders/pynwb/pull/782 addresses the issue of allowing numierc dtype. However, it does not address the issue raised with actually adding dedicated types for different kind of mask, i.e., FuzzyMask, BinaryMask, and MultiLableMask