NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

check shape in docval #619

Closed bendichter closed 6 years ago

bendichter commented 6 years ago

pynwb does not enforce that the shape of datasets matches the shape specification in the schema. matnwb expects datasets to follow the size rules specified by the schema, so this causes a compatibility issue for files written in Python and read in MATLAB. For example, it is possible to assign a 1-D dataset to the variable ecephys.WaveformClusters.waveform_mean even though the schema requires that the dataset be 2-D. The dataset will then be written without complaint by pynwb, but will cause a read error in matnwb.

I suggest we add an optional dataset shape checker to docval. The syntax would be:

{'name': 'waveform_mean', 'type': Iterable, 'doc': 'the mean waveform for each cluster', 'shape': (None, None)}

docval would accept either a list of lists or an np.ndarray, and would enforce that the dataset is 2-D.
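A minimal sketch of what such a check might look like. The helper name check_shape and its exact semantics are assumptions for illustration, not pynwb's actual API; the idea is that a shape tuple with None entries constrains only the number of dimensions.

```python
import numpy as np

def check_shape(value, expected):
    # Hypothetical helper: True if `value` matches the `expected` shape
    # tuple, where None matches any length along that dimension.
    # Accepts nested lists or anything with a .shape attribute.
    shape = getattr(value, 'shape', None)
    if shape is None:
        shape = np.asarray(value).shape
    if len(shape) != len(expected):
        return False
    return all(e is None or e == s for e, s in zip(expected, shape))

print(check_shape([[1.0, 2.0], [3.0, 4.0]], (None, None)))  # True: 2-D
print(check_shape([1.0, 2.0, 3.0], (None, None)))           # False: 1-D
```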

@ajtritt, what do you think about this?

t-b commented 6 years ago

Yes, in general a shape checker is a good thing to have. In your case here, wouldn't you even need to enforce a specific 2-D array size? If I read the spec correctly, it says that you need num_clusters rows and num_samples columns.

oruebel commented 6 years ago

need num_clusters rows and num_samples columns.

From the specification language perspective, the dims field currently just names/labels dimensions, i.e., we don't yet have a way to express constraints that apply across datasets, e.g., to require that waveform_sd and waveform_mean have the same shape. We have had discussions about how to express advanced constraints like this, but so far this is still in the discussion phase.

I suggest we add an optional dataset shape checker to docval.

In addition to shape, I think you'll also want a way to restrict the data type of the values in the array. Generally speaking, the idea of having docval validate general properties of arrays sounds reasonable to me.

The docval would take either a list of lists or an np.ndarray ...

The shorthand for valid array types for docval is here: https://github.com/NeurodataWithoutBorders/pynwb/blob/91676cc29bf22c284db63379a676136d64827cab/src/pynwb/form/utils.py#L10

There is also a ShapeValidator class in FORM that might be helpful, or at least would be a good place to put the functionality needed to inspect and validate shapes of arrays.

https://github.com/NeurodataWithoutBorders/pynwb/blob/91676cc29bf22c284db63379a676136d64827cab/src/pynwb/form/data_utils.py#L288

bendichter commented 6 years ago

@oruebel cool, thanks for pointing this out. However, importing ShapeValidator in form/utils causes a circular import. Any suggestions on how to resolve that?

oruebel commented 6 years ago

I assume the circular import is due to docval. I think the cleanest solution here is to: 1) not use docval in ShapeValidator and simply move that information to the docstrings, and 2) transplant ShapeValidator, ShapeValidatorResult, __get_shape_helper, and get_shape into a new .py file; pynwb/form/validate/shapevalidator.py seems like a good place. @ajtritt does that sound reasonable?

bendichter commented 6 years ago

OK. While I'm at it, I'm de-classing ShapeValidator: it has no __init__ and only static methods, so I see no reason for it to be a class.
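The de-classing amounts to the pattern below. The method body here is a simplified stand-in, not the actual FORM code; only the before/after structure is the point.

```python
import numpy as np

# Before: a class used purely as a namespace for static methods
class ShapeValidator:
    @staticmethod
    def get_data_shape(data):
        # Simplified stand-in body for illustration
        return getattr(data, 'shape', None)

# After: a plain module-level function with the same behavior,
# called without the class-name prefix
def get_data_shape(data):
    return getattr(data, 'shape', None)

arr = np.zeros((2, 3))
print(ShapeValidator.get_data_shape(arr))  # (2, 3)
print(get_data_shape(arr))                 # (2, 3)
```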

bendichter commented 6 years ago

I separated ShapeValidator but I'm running into more import errors:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-88365b6446cb> in <module>()
----> 1 from pynwb import NWBHDF5IO

~/dev/pynwb/src/pynwb/__init__.py in <module>()
      8 CORE_NAMESPACE = 'core'
      9 
---> 10 from .form.spec import NamespaceCatalog  # noqa: E402
     11 from .form.utils import docval, getargs, popargs, call_docval_func  # noqa: E402
     12 from .form.backends.io import FORMIO  # noqa: E402

~/dev/pynwb/src/pynwb/form/__init__.py in <module>()
      1 # flake8: noqa: F401
----> 2 from .container import Container, Data, DataRegion
      3 from .utils import docval, getargs
      4 from .data_utils import ListSlicer
      5 from .backends.hdf5.h5_utils import H5RegionSlicer, H5Dataset

~/dev/pynwb/src/pynwb/form/container.py in <module>()
      1 import abc
      2 from six import with_metaclass
----> 3 from .utils import docval, getargs
      4 
      5 

~/dev/pynwb/src/pynwb/form/utils.py in <module>()
      8 from six import raise_from
      9 
---> 10 from .validate.shapevalidator import get_data_shape
     11 
     12 __macros = {

~/dev/pynwb/src/pynwb/form/validate/__init__.py in <module>()
      1 # flake8: noqa: F401
----> 2 from . import errors
      3 
      4 from .validator import ValidatorMap, Validator, AttributeValidator, DatasetValidator, GroupValidator
      5 from .errors import *

~/dev/pynwb/src/pynwb/form/validate/errors.py in <module>()
      1 
----> 2 from ..spec.spec import DtypeHelper
      3 from numpy import dtype
      4 
      5 

~/dev/pynwb/src/pynwb/form/spec/__init__.py in <module>()
      2 import os.path
      3 
----> 4 from .spec import NAME_WILDCARD
      5 from .spec import Spec
      6 from .spec import AttributeSpec

~/dev/pynwb/src/pynwb/form/spec/spec.py in <module>()
      5 from warnings import warn
      6 
----> 7 from ..utils import docval, getargs, popargs, get_docval, fmt_docval_args
      8 
      9 NAME_WILDCARD = None

ImportError: cannot import name 'docval'

I'm not sure of the most efficient way to reorganize in order to fix this circular import. docval is used quite extensively in spec.spec.

oruebel commented 6 years ago

Did you remove docval from the shape validator? Are you making changes in spec.spec, or is this a result of the refactor?

bendichter commented 6 years ago

I did remove docval from the shape validator. I have not made changes to spec.spec yet. I could if that's the best approach, but this is turning into a lot of docval-to-docstring conversion, and I didn't want the PR to come as a shock.

https://github.com/NeurodataWithoutBorders/pynwb/tree/shape_checker

oruebel commented 6 years ago

Thanks for the pointer to the branch; it just wasn't clear to me what changes you had made, i.e., I was not suggesting changing spec.spec. My schedule is crazy today so I don't have a chance to investigate in detail, but from a quick look I'd try turning the from ..data_utils import DataChunkIterator in src/pynwb/form/validate/shapevalidator.py into a function-level import in get_data_shape, which seems to be the only place it is used. My guess is that this is where you're picking up the circular import.
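The deferred-import pattern being suggested looks like this. numpy is used as a self-contained stand-in for the cycle-prone DataChunkIterator import, and the function body is simplified for illustration.

```python
def get_data_shape(data):
    # Deferred, function-level import: the module is only loaded when the
    # function is first called, after all modules have finished importing,
    # which sidesteps import-time cycles. In pynwb this would be
    # `from ..data_utils import DataChunkIterator`; numpy stands in here.
    import numpy as np
    if isinstance(data, np.ndarray):
        return data.shape
    return np.asarray(data).shape

print(get_data_shape([[1, 2, 3], [4, 5, 6]]))  # (2, 3)
```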

bendichter commented 6 years ago

Actually, putting shapevalidator.py in src/pynwb/form/validate might itself be causing an issue: importing it runs src/pynwb/form/validate/__init__.py, which attempts to import form.utils. I might be able to resolve this by moving shapevalidator.py somewhere else.


oruebel commented 6 years ago

@ajtritt do you think pynwb/form/utils.py would be the right place for the shape validator?

ajtritt commented 6 years ago

If ShapeValidator is going to be used with docval, it should go in form.utils.

bendichter commented 6 years ago

@ajtritt

get_data_shape depends on DataChunkIterator and DataChunkIterator depends on get_data_shape, so moving just get_data_shape will cause an import error. Should I move DataChunkIterator as well?

oruebel commented 6 years ago

I don't think DataChunkIterator should move. get_data_shape really just uses DataChunkIterator for a type check. In get_data_shape, I think we can simply change the following code:

    if isinstance(data, DataChunkIterator):
        return data.maxshape
    if hasattr(data, 'shape'):
        return data.shape

to

    if hasattr(data, 'shape'):
        return data.shape
    if hasattr(data, 'maxshape'):
        return data.maxshape

That should remove the dependence of the function on DataChunkIterator.
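A self-contained sketch of the duck-typed version, including a simplified nested-list fallback. FakeChunkIterator is an invented stand-in used only to exercise the maxshape branch; it is not part of pynwb.

```python
import numpy as np

def get_data_shape(data):
    # Duck typing instead of isinstance: prefer .shape, fall back to
    # .maxshape, so DataChunkIterator never needs to be imported here.
    if hasattr(data, 'shape'):
        return data.shape
    if hasattr(data, 'maxshape'):
        return data.maxshape
    # Simplified nested-list fallback (the real helper recurses fully)
    if isinstance(data, (list, tuple)):
        shape = [len(data)]
        if data and isinstance(data[0], (list, tuple)):
            shape.extend(get_data_shape(data[0]))
        return tuple(shape)
    return None

class FakeChunkIterator:
    # Invented stand-in: exposes maxshape but not shape
    maxshape = (None, 12)

print(get_data_shape(np.zeros((3, 4))))     # (3, 4)
print(get_data_shape(FakeChunkIterator()))  # (None, 12)
print(get_data_shape([[1, 2], [3, 4]]))     # (2, 2)
```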

bendichter commented 6 years ago

ok, good idea.


bendichter commented 6 years ago

@ajtritt mentioned that having the validator run on write would solve this problem as well. That's a simpler solution, since it would not require manually adding shape arguments to every dataset type. I still think shape checking in docval is valuable, but I agree with Andrew that the validator would solve this problem more quickly.