Closed NileGraddis closed 5 years ago
As of now, that is not possible. You would probably need to modify DtypeSpec and/or the classmethod build_const_args.
The use case you describe might be better addressed with a different neurodata_type though. Do weighted masks get treated differently than an unweighted mask or a mask with an enumerable set of values? Do they represent different results? Encoding these different mask types would be better handled with different neurodata_types than primitive data types.
closely related to #594
@bendichter good find - that feature would be pretty helpful for this case.
@ajtritt I can see both within- and across-neurodata type cases. For instance:
I think the changes proposed in #594, which, if I understand them correctly, involve:
would cover our needs pretty well. There is an additional topic about allowing multiple underlying representations of equivalent data (for instance: sparse vs. dense), but that might be for another time.
@NileGraddis
> There is an additional topic about allowing multiple underlying representations of equivalent data (for instance: sparse vs. dense), but that might be for another time.
We have the H5DataIO class, which wraps data and instructs HDF5 how to store it. These instructions (compression, chunking, etc.) do not change the value or the form of the data. It seems like sparse vs. dense would fit well in there.
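To illustrate the point that a storage instruction such as compression leaves the values untouched, here is a minimal sketch using only the Python standard library. It does not use H5DataIO or HDF5 at all: `zlib` stands in for an HDF5 gzip filter, and the 64x64 mask with an 8x8 ROI is made-up data.

```python
import zlib
from array import array

# Made-up dense mask: 64x64 pixels stored as 8-byte floats,
# with an 8x8 ROI in the top-left corner set to 1.0.
width, height = 64, 64
dense = array("d", [0.0] * (width * height))
for row in range(8):
    for col in range(8):
        dense[row * width + col] = 1.0

raw = dense.tobytes()
# zlib is only a stand-in here for a lossless storage filter.
compressed = zlib.compress(raw, 4)

# Decompressing gives back the same bytes, hence the same values.
restored = array("d")
restored.frombytes(zlib.decompress(compressed))

print(len(raw), len(compressed), restored == dense)
```

A mostly-zero mask compresses well, which is part of why a sparse vs. dense choice looks like a storage-level concern rather than a change to the data itself.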
@NileGraddis thanks for the clarification.
From what you say, I think modifying the specification language is the wrong route for solving your problem. By using dtype alone, you would be saying that ImageMasks.dtype=int implies that this is the output of multilabel segmentation, while ImageMasks.dtype=bool implies that this is the output of binary classification. We want to avoid encoding special meaning through the use of primitive types.
My suggestion is to make an AbstractImageMask with dtype=null, and then subclasses that represent the different types of ImageMasks.
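The idea can be sketched in plain Python as follows. This is not PyNWB or schema code: the subclass names (BinaryMask, FuzzyMask, MultiLabelMask), the `allowed_typecodes` attribute, and the use of `array` typecodes as a stand-in for dtypes are all assumptions made for illustration.

```python
from array import array


class AbstractImageMask:
    """Hypothetical base type that fixes no element type itself."""

    # None stands in for dtype=null: the base type accepts anything.
    allowed_typecodes = None

    def __init__(self, values):
        allowed = self.allowed_typecodes
        if allowed is not None and values.typecode not in allowed:
            raise TypeError(
                f"{type(self).__name__} does not accept typecode {values.typecode!r}"
            )
        self.values = values


class BinaryMask(AbstractImageMask):
    """Pixel is in the ROI or not; one unsigned byte stands in for bool."""

    allowed_typecodes = ("B",)


class FuzzyMask(AbstractImageMask):
    """Weighted mask; floating-point weights."""

    allowed_typecodes = ("f", "d")


class MultiLabelMask(AbstractImageMask):
    """Each pixel carries an integer label; unsigned integer types."""

    allowed_typecodes = ("B", "H", "I", "L", "Q")


binary = BinaryMask(array("B", [0, 1, 1, 0]))
fuzzy = FuzzyMask(array("f", [0.0, 0.25, 1.0]))
```

With this layout the meaning of a mask comes from its type name, and the element type is only a constraint that follows from that meaning.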
@ajtritt That sounds like a good solution to me. Do we currently support dtype=bool though?
https://schema-language.readthedocs.io/en/latest/specification_language_description.html#dtype doesn't list it and we have an outstanding issue for it
https://github.com/NeurodataWithoutBorders/nwb-schema/issues/175
@bendichter no we don't, but I created #658 for the necessary changes to PyNWB
@ajtritt Thanks for the fast response. I think we are on the same page about using (rather, not using) data type to encode semantic information.
Accepting multiple types where it makes sense is a pretty handy feature, though. We commonly run into situations where actual data occupies only a subset of the possible values for data of that kind, but those values have the same semantics as they would if the full range were used. A couple examples:
Anyways, I think the addition of a numeric dtype (and maybe generic int, uint) would do the trick for us. We're a bit focused on file size as we work to switch over to NWB 2.0 fully. This involves writing thousands of files and then serving them out to users on demand. The work that you guys have done to enable compression options is really handy for us and we're always on the lookout for additional optimizations.
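As a rough sketch of what a generic dtype check could look like: the names `generic_int` and `generic_uint`, the `GENERIC_DTYPES` table, and the `dtype_matches` helper below are hypothetical and are not taken from the schema language or from PyNWB. Only `numeric` is a name used in this thread.

```python
# Hypothetical mapping from generic spec dtypes to concrete dtype names.
GENERIC_DTYPES = {
    "generic_int": {"int8", "int16", "int32", "int64"},
    "generic_uint": {"uint8", "uint16", "uint32", "uint64"},
    "generic_float": {"float32", "float64"},
}
# "numeric" accepts any of the integer or floating-point types above.
GENERIC_DTYPES["numeric"] = (
    GENERIC_DTYPES["generic_int"]
    | GENERIC_DTYPES["generic_uint"]
    | GENERIC_DTYPES["generic_float"]
)


def dtype_matches(spec_dtype, actual_dtype):
    """Return True if actual_dtype satisfies the dtype named in the spec."""
    allowed = GENERIC_DTYPES.get(spec_dtype)
    if allowed is None:
        # A concrete spec dtype must match exactly.
        return spec_dtype == actual_dtype
    return actual_dtype in allowed


print(dtype_matches("numeric", "uint8"))
print(dtype_matches("float32", "uint8"))
```

Under such a scheme the writer could pick the smallest concrete type that holds the data, while the spec still states that the values are numbers with the same semantics.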
@NileGraddis these specific examples are in my opinion best covered by different neurodata_types as the different dtypes in this case really imply different kinds of masks:
@NileGraddis FYI I moved this to the nwb-schema repo using the new github feature to move issues. I'm trying to move issues that involve changes to the schema here since they are relevant to other groups e.g. matnwb.
https://github.com/NeurodataWithoutBorders/pynwb/pull/782 addresses the issue of allowing a numeric dtype. However, it does not address the issue raised about actually adding dedicated types for the different kinds of masks, i.e., FuzzyMask, BinaryMask, and MultiLabelMask.
This is an issue that @JFPerkins and I ran into with image masks. The dtype is specified as float, which makes sense for representing a weighted mask. However, if we have an unweighted mask or one with a discrete enumeration of possible weights, it would be more space efficient to store the ROI masks as an integer type.
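For a sense of scale, here is a small standard-library sketch comparing the uncompressed size of one mask stored as floats versus as single-byte integers. The 512x512 image size is an assumption, as is treating float as a 4-byte value; compression and HDF5 overhead are ignored.

```python
from array import array

# Assumed image size; one mask covers the whole field of view.
n_pixels = 512 * 512

# The same unweighted mask stored with two different element types.
as_float32 = array("f", [0.0] * n_pixels)  # 4 bytes per pixel
as_uint8 = array("B", [0] * n_pixels)      # 1 byte per pixel

float_bytes = as_float32.itemsize * len(as_float32)
uint8_bytes = as_uint8.itemsize * len(as_uint8)

print(float_bytes, uint8_bytes, float_bytes // uint8_bytes)
```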
This leads to a general question: can we have multiple valid dtypes for a dataset in the spec? In the schema docs it appears that valid dtypes must be drawn from a list of primitive types and we can't find any references to union types over those.