NeurodataWithoutBorders / nwb-schema

Data format specification schema for the NWB neurophysiology data format
http://nwb-schema.readthedocs.io

About the was_generated_by field #598

Open ehennestad opened 2 days ago

ehennestad commented 2 days ago

When I saw that there was a new field, was_generated_by, I initially thought that it was meant for storing information about which package was used to create an NWB file, e.g. pynwb, matnwb or NWB Guide, which I thought was great.

Only after reading the field description and this issue: https://github.com/NeurodataWithoutBorders/nwb-schema/issues/258, I realised that it is meant for storing information about software used to generate actual datasets / datatypes.

In my opinion, it would be great to have a field in the file dedicated to storing information about which software was used to generate the file (as I first interpreted it).

I also think it would make more sense to attach information about which software generated a dataset to the dataset itself (similar to how you can add more detailed metadata to a device). Having a list on the file object is a slight improvement, but the user of the file still has to work out which software applies to which dataset/datatype, which is not ideal.

stephprince commented 15 hours ago

> When I saw that there was a new field, was_generated_by, I initially thought that it was meant for storing information about which package was used to create an NWB file, e.g. pynwb, matnwb or NWB Guide, which I thought was great.
>
> Only after reading the field description and this issue: #258, I realised that it is meant for storing information about software used to generate actual datasets / datatypes.
>
> In my opinion, it would be great to have a field in the file dedicated to storing information about which software was used to generate the file (as I first interpreted it).

I think the current iteration of was_generated_by is intended to be a catch-all for both types of information that you listed: software used to generate the NWBFile, and software used to acquire/generate data (at least until we determine how we want to attach the latter to the actual data).

I think we could clarify the description in the schema and maybe add an example to make this clearer?
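As a rough illustration of the catch-all reading, here is a minimal plain-Python sketch (it does not call pynwb or any other NWB API). It assumes was_generated_by is a list of [name, version] pairs; that layout, the placeholder software names, and the version strings are all assumptions made for illustration, not values taken from the schema.

```python
# Hypothetical file-level provenance list: each row is [software name, version].
# Every name and version below is a made-up placeholder.
was_generated_by = [
    ["pynwb", "0.0.0"],                   # software that wrote the NWB file
    ["ExampleAcquisitionSuite", "1.2"],   # software that acquired raw data
    ["ExampleSpikeSorter", "3.4"],        # software that produced derived data
]


def validate_rows(rows):
    """Check that every row is a [name, version] pair of non-empty strings."""
    for row in rows:
        if len(row) != 2 or not all(isinstance(x, str) and x for x in row):
            raise ValueError(f"expected [name, version], got {row!r}")
    return True


validate_rows(was_generated_by)
software_names = [name for name, _ in was_generated_by]
```

Nothing in a flat list like this says which entry wrote the file and which entry produced a given dataset, which is the ambiguity raised in the original post.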

> I also think it would make more sense to attach information about which software generated a dataset to the dataset itself (similar to how you can add more detailed metadata to a device). Having a list on the file object is a slight improvement, but the user of the file still has to work out which software applies to which dataset/datatype, which is not ideal.

I agree that attaching the information about which software generated a particular dataset to the dataset itself is a better way to help users understand which software was used to generate what data.

One potential approach is to add was_generated_by as an optional dataset to the Container data type in hdmf-common-schema so that it is possible to add this optional dataset to all the NWB data types that inherit from Container. Any thoughts on that?
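To make the inheritance idea concrete, here is a minimal sketch using plain Python dataclasses. The Container and TimeSeries classes below are hypothetical stand-ins, not the real hdmf or pynwb classes, and the (name, version) tuple layout and the software names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Provenance = List[Tuple[str, str]]  # (software name, version) pairs


@dataclass
class Container:
    """Stand-in for a base container type; NOT the real hdmf class."""
    name: str
    was_generated_by: Optional[Provenance] = None  # optional, may be omitted


@dataclass
class TimeSeries(Container):
    """Hypothetical subtype; it inherits the optional field unchanged."""
    data: List[float] = field(default_factory=list)


def describe(container: Container) -> str:
    """Report which software generated a container, if that was recorded."""
    if not container.was_generated_by:
        return f"{container.name}: no provenance recorded"
    tools = ", ".join(f"{n} {v}" for n, v in container.was_generated_by)
    return f"{container.name}: generated by {tools}"


raw = TimeSeries(
    name="raw_ephys",
    data=[0.1, 0.2],
    was_generated_by=[("ExampleAcquisitionSuite", "1.2")],
)
unlabeled = TimeSeries(name="behavior")
```

Because the field is optional on the base type, existing subtypes would not need to change, and a reader of the file could ask each object directly which software produced it instead of cross-referencing a file-level list.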

This comment also has a more thorough summary of the provenance information and support we might want to add based on other discussions.