NeurodataWithoutBorders / matnwb

A Matlab interface for reading and writing NWB files
BSD 2-Clause "Simplified" License
49 stars 32 forks source link

[Bug]: Required NWBFile fields are not required by MatNWB #588

Open bendichter opened 1 week ago

bendichter commented 1 week ago

What happened?

identifier, session_description and session_start_time are required by the NWB schema but this is not being enforced by MatNWB.

The official source of truth here is the NWB schema. All neuro-specific types are in /core, and generic types are in the submodule /hdmf-common-schema. These yaml files define all neurodata types used in NWB using the NWB schema language defined here. The information you want is here, where the fields you mentioned are defined in the schema within the NWBFile neurodata type. We are looking for a quantity field of these datasets, which is not defined for any of them, so we go with the default, 1, as defined here in the schema language documentation. This means that all three fields are required.

Steps to Reproduce

Here, session_start_time is missing

nwb = NwbFile( ...
'session_description', 'mouse in open exploration',...
'identifier', 'Mouse5_Day3', ...
'general_experimenter', 'Last, First', ... % optional
'general_session_id', 'session_1234', ... % optional
'general_institution', 'University of My Institution', ... % optional
'general_related_publications', {'DOI:10.1016/j.neuron.2016.12.011'}); % optional
nwbExport(nwb, 'test.nwb')

MatNWB will happily instantiate this object and write it even though this is not a valid NWB file and cannot be opened by PyNWB

Error Message

No response

Operating System

macOS

Matlab Version

2024a

Code of Conduct

ehennestad commented 3 days ago

Add methods to deal with missing required properties: d7eba0db0e11ba3ca60f62ff15f227acb15869b5

ehennestad commented 1 day ago

Here are a few edge cases I could not quite figure out how to deal with:

For TimeSeries: timestamps and starting_time are optional datasets. However the rate attribute of starting_time is required, but only if timestamps are missing?

For ImageSeries: data is required, but not when external_files are specified, ref this: https://github.com/NeurodataWithoutBorders/nwb-schema/pull/462

ehennestad commented 1 day ago

Failing tests after enforcing required properties on export

bendichter commented 1 day ago

For TimeSeries: timestamps and starting_time are optional datasets. However the rate attribute of starting_time is required, but only if timestamps are missing?

The rule is: TimeStamps must have (timestamps) xor (starting_time & rate). If starting_time exists then rate must also exist, which is why it is a required property of starting_time. The schema language does not have a way of expressing an xor condition, so we don't have a way to formally indicate that you must have EITHER timestamps OR starting_time. That constraint must be specified manually by the API.

For ImageSeries: data is required, but not when external_files are specified, ref this: https://github.com/NeurodataWithoutBorders/nwb-schema/pull/462

Yes, this is a tricky edge-case. Since ImageSeries isa child of TimeSeries, the data field is required (children cannot un-require a field). However, when you are using external mode it does not make sense to add data. This would ideally be fixed by reorganizing the parent classes in a way where you wouldn't require data when you don't want to, but the current work-around is to create an empty or near-empty dataset.