proposal to add metadata to parallel /nirs(i)/stim(j)/data

dboas commented 4 years ago

Some users would like to have metadata attached to each stimulus trial that they can then utilize to include or exclude trials from averaging. This is quite useful, for instance, with infant looking time during each trial.

To permit this, we would need to add an optional matrix of values something like /nirs(i)/stim(j)/metadata This would have to have the same number of rows as /nirs(i)/stim(j)/data But the number of columns is unconstrained

/nirs(i)/stim(j)/metadataNames This would be optional and provide a useful descriptor name for each column of the metadata.

@fangq @sstucker @jayd1860 and others Please comment on any issues with adding these optional fields to SNIRF.

sstucker commented 4 years ago

Alternatively: /nirs(i)/stim(j)/metadata/data /nirs(i)/stim(j)/metadata/names

fangq commented 4 years ago

hi @dboas, I agree it could be useful to add metadata to stim. what type of metadata do we expect? specifically, do you think this metadata field should be used to describe each stim channel? or just general channel-independent metadata?

basically, channel independent metadata can be handled similarly to /nirs/metaDataTags; channel-dependent metadata can be handled similarly to /nirs/data/measurementList, as long as these metadata can be categorized into a set of standardized fields. We should probably ask for some examples to assess which path to go.

dboas commented 4 years ago

@fangq sorry for not being clear. We presently have /nirs(i)/stim(j)/data From the spec this is described as: This is a three-column array specifying the stimulus time course for the jth condition. Each row corresponds with a specific stimulus trial. The three columns indicate [starttime duration value]. The starttime, in seconds, is the time relative to the time origin when the stimulus takes on a value; the duration is the time in seconds that the stimulus value continues, and value is the stimulus amplitude. The number of rows is not constrained. (see examples in the appendix).

Basically, there are use cases where users want additional columns to provide metadata values for each stimulus trial (i.e. each row of /nirs(i)/stim(j)/data). The most likely use cases provide behavioral data associated with each stimulus trial as values in these entries. We don't want to expand the number of columns for /nirs(i)/stim(j)/data, so we thought it better to add an optional matrix of values /nirs(i)/stim(j)/metadata which would have the same number of rows as /nirs(i)/stim(j)/data and have an arbitrary number of columns. In addition, we can have an optional /nirs(i)/stim(j)/metadataNames that would provide a useful descriptor name for each column of /nirs(i)/stim(j)/metadata

Is this more clear?

It would be nice to have @huppertt thoughts as well.

fangq commented 4 years ago

@dboas, what was the concern to extend the columns of /stim/data? in my opinion, changing the language to accept more than 3 column in /stim/data is the easiest way to accommodate such data, it is also the most efficient way to read and write.

if one also intends to annotate these additional data columns, such as defining the meaning of each extra column, we can add such lightweight metadata in the form of dataset attribute, supported in hdf5. My easyh5 parser already supports reading attribute data if exist in the file.

https://github.com/fangq/easyh5/blob/master/loadh5.m#L287-L293

dboas commented 4 years ago

My initial reason for separating it into another field is because our /stim/data format currently exactly matches what is ubiquitous in the fMRI community. I don’t really want to change that.

I do hope that Ted can provide some guidance on this.

From: Qianqian Fang notifications@github.com Reply-To: fNIRS/snirf reply@reply.github.com Date: Wednesday, August 12, 2020 at 2:17 PM To: fNIRS/snirf snirf@noreply.github.com Cc: "Boas, David" dboas@bu.edu, Mention mention@noreply.github.com Subject: Re: [fNIRS/snirf] proposal to add metadata to parallel /nirs(i)/stim(j)/data (#47)

@dboashttps://github.com/dboas, what was the concern to extend the columns of /stim/data? in my opinion, changing the language to accept more than 3 column in /stim/data is the easiest way to accommodate such data, it is also the most efficient way to read and write.

if one also intends to annotate these additional data columns, such as defining the meaning of each extra column, we can add such lightweight metadata in the form of dataset attribute, supported in hdf5. My easyh5 parser already supports reading attribute data if exist in the file.

https://github.com/fangq/easyh5/blob/master/loadh5.m#L287-L293

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/fNIRS/snirf/issues/47#issuecomment-673032969, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHFCP5BC27T6HFFQOJJI34LSALMDRANCNFSM4P3QWKDQ.

sstucker commented 3 years ago

A Homer3 user brought up a case in which filtering or otherwise organizing stims based on some scalar metadata value they each have would be useful.

@jayd1860 I'm also realizing that it would be great to move the status of the stims (enabled/disabled) in Homer to one of these metadata columns. How do you feel about Qianqian's proposal (below) we make stim/data an indefinite number of columns as opposed to a new stim/metadata array?

@dboas, what was the concern to extend the columns of /stim/data? in my opinion, changing the language to accept more than 3 column in /stim/data is the easiest way to accommodate such data, it is also the most efficient way to read and write.

if one also intends to annotate these additional data columns, such as defining the meaning of each extra column, we can add such lightweight metadata in the form of dataset attribute, supported in hdf5. My easyh5 parser already supports reading attribute data if exist in the file.

https://github.com/fangq/easyh5/blob/master/loadh5.m#L287-L293

dboas commented 3 years ago

I am fine with extending /stim/data to have more than 3 columns as long as the first 3 columns maintain their same format that matches what is ubiquitously used in fMRI as @fangq suggested above.

We also need to annotate these extra columns, i.e. give them descriptions. @fangq suggested the follow. How would this work in practice?

if one also intends to annotate these additional data columns, such as defining the meaning of each extra column, we can add such lightweight metadata in the form of dataset attribute, supported in hdf5. My easyh5 parser already supports reading attribute data if exist in the file.

sstucker commented 3 years ago

Would extending the number of columns of stim/data break existing implementations?

sstucker commented 3 years ago

I have a working implementation of the stim metadata which expands the stim data array to have more columns. This implementation isn't incompatible with anything with Homer3 so far.

This approach allows a list of metadata IDs to be specified with additional strings, i.e. for an Nx4 data array, the metadata ID list would have one entry (such as 'col_str_1') and for an Nx5 data array the metadata ID list would have two entries... the columns of the data array from L to R being for [onset, duration, amplitude, col_str_1, col_str_2]

I have stored the metadata column id as an attribute as @fangq suggested:

Unfortunately MATLAB does not support the use of string arrays for HDF5 attributes, so saving a true C style array of strings would be difficult. I've formatted it as csv for now.

I guess I prefer storing the metadata column IDs as an array of strings next to data, anyway (/nirs(i)/stim(j)/metadataIds or something) as it seems more consistent with the rest of the spec.

fangq commented 3 years ago

@sstucker and @dboas, sorry for late reply, was working on a proposal deadline before Thanksgiving.

Would extending the number of columns of stim/data break existing implementations?

it should work out of box for Easyh5 because it does not particularly validate data against SNIRF (but reading/writing them as generic data records). I do expect that the validation script @huppertt wrote need to be updated to allow this.

Unfortunately MATLAB does not support the use of string arrays for HDF5 attributes, so saving a true C style array of strings would be difficult. I've formatted it as csv for now.

can you try the low-level API (H5A.*) to see if you can save such data? see example code here

https://github.com/fangq/easyh5/blob/master/saveh5.m#L342-L345

sstucker commented 3 years ago

In my experience the MATLAB H5A API allows you to save a cell array of strings as data but not as an attribute. This was confirmed here. I am sure it could be hacked but it is perhaps not worth it when we can just add it as a new field rather than an attribute.

dboas commented 3 years ago

@fangq can we just add it as a new field?

fangq commented 3 years ago

Please see my above commit

https://github.com/fNIRS/snirf/commit/2ab9e99d2876d0dcaca353a5ea842957f659c99a

because the attribute is directly associated with the data, so I don't see the needs to use long attribute names, so I picked names for simplicity. Let me know if this makes sense.

dboas commented 3 years ago

@fangq , this is the only use of attributes in the SNIRF spec. Seems inconsistent to use it here when we never used it before.We could have used it for stim.name. We could use it for probe.sourceLabels, provbe.detectorLabels, and aux.name.

Further, in HDF5, as far as we understand, an attribute is not intended as data for the user. But in this case, stim/data/name is defined by the user and thus very much intended for the users benefit. Another way of saying this is that the value of 'name' in no way impacts how we load the data.

We thus argue that we NOT use attribute for stim/data/name but instead make it a separate field very much like stim/name We suggest calling it stim/dataLabels since it is a string array and this makes it consistent with probe.sourceLabels and probe.detectorLabels

sstucker commented 3 years ago

See the proposed implementation in the BUNPC fork of this repository

fangq commented 3 years ago

@dboas, I agree there is a bit inconsistency - we could have used attributes for those fields in the first place. To make things simpler, I update the language and added stim.dataLabels, similar to probe.sourceLabels. Let me know if this looks better.

@sstucker, I just noticed, after committing my edits, that you have already committed something similar in BUNPC fork - although in the commit cited above, I saw you had only the sync of my previous commit.

I see our changes are similar, and your dataLabels are labels for all columns after 3. @dboas, let me know which version you prefer. In either case, I would like you to create a pull request so we can manually merge.

Also, @sstucker, please use pull-request feature to sync your version with upstream (fNIRS/snirf). Do not manually create identical commits in your fork, it can make merge difficult.

sstucker commented 3 years ago

@dboas and I agree that dataLabels should have the same number of elements as data has columns so that indexing remains the same. The first three elements of dataLabels should really be constrained to some version of Onset, Duration, Amplitude but it doesn't really matter as it is just for bookkeeping purposes.

This looks good!

fangq commented 3 years ago

@sstucker, my bad, did not see your PR #49. I patched the spec with the minor edits you have. if you see anything else still missing, feel free to create a new PR

fNIRS / snirf

proposal to add metadata to parallel /nirs(i)/stim(j)/data #47