Incorrect data dimensions when saving data by snirf_homer3

fangq commented 4 years ago

hi @jayd1860 and @dboas, as more groups are working on adding supports to snirf (see https://github.com/mne-tools/mne-python/issues/7057, https://github.com/brainstorm-tools/brainstorm3/pull/283, https://github.com/fNIRS/snirf-samples/issues/6) we need to make sure the sample files are consistent, not only across our own matlab parsers, but also parsers written with other languages.

@jayd1860, according to our previous conversation, I know you transpose data in the loading phase of a .snirf file instead of the saving phase. This is not an issue if one uses snirf_homer3 for both saving and loading, but can cause an issue for other parsers.

I just did a quick test using python, and the snirf_homer3 generated sample data shows incorrect data dimensions, here are the script

first, installing python hdf5 support (and numpy) via

sudo apt-get install python-h5py python-numpy

then start python, and load the neuro_run01.snirf file that I previously created using homer3's snirf parser

import h5py
import numpy as np

dat=h5py.File('neuro_run01.snirf','r')
d1=np.array(dat.get('/nirs/data1/dataTimeSeries'));
d1.shape
    (18, 8000)

this is inconsistent with our specification, where the dimensions is defined as #time points x #channels.

The transposed data is a result of directly saving a column-major matlab array to a row-majored HDF5 dataset.

I expect that @huppertt's C# parser will also experience similar transpose issues.

In comparison, my easyh5 toolbox transposes the data before saving to hdf5, so it is consistent if the file is opened with other programming languages.

Please let me know if you are able to fix this, and I will be happy to recreate the sample file in the snirf-samples repo so that others can test.

dboas commented 4 years ago

@fangq , @jayd1860 will take a look at this this week and probably make the change you suggest. He will let us know.

jayd1860 commented 4 years ago

@fangq, @dboas, Here's my thinking on this issue. Depending on the language and application the data is stored row or column-major:

Row-major: C/C++,  Python numpy,  HDF5
Column-major: Matlab, Fortran
Neither Row or Column-major: C#, JAVA

The last option - "Neither" - means that the language does not store 2D (or >) arrays in contiguous memory. Only 1D arrays are contiguous and the other dimensions are references to other 1D arrays.

Whichever type of language or application you use - row or column major storage, the HDF libraries will most likely compensate for a difference if there is one and so for instance you can save and load 2D arrays to and from an HDF5 file and get the right answer as long as you're using the same application or applications that agree in their storage order for multidimensional contiguous arrays. The problem happens as you says when the saver and loader of the file don't agree in their storage order.

To me it's a bit arbitrary what scheme you use to fix these disagreements. Common sense might tell you that you should make sure to use row-major storage when saving because it agrees with HDF5 which is row major ... but there are other ways to address this interoperability issue besides what looks best. And in my opinion it is arbitrary which one you choose. Unless you have a formal rule you will get disagreements and interoperability confusion in the future around this issue. So I think what is needed is to add a formal specification about storage order to the SNIRF spec to take the ambiguity out of this issue. For instance you might have this:

option 1: your application must save multi-dimensional arrays in row major order to agree with HDF5 storage order. For example if the application is written in a column-major language, your application must transpose a 2D array when saving it AND transpose it again when loading it, to get back the original array.

or more generally:

option 2: your application must save multi-dimensional arrays using the same storage order as the underlying file format. For example if the file format stores in row-major and the application is written in a column-major language your application must transpose a 2D array when saving it AND transpose it again when loading, to get back the original array.

fangq commented 4 years ago

@jayd1860, in a way, our specification defines the data structure in the hdf5 data structures, not in matlab array forms, nor C/python data strutures, thus, the hdf5 data constructs stored on disk need to meet our dimension/row/column requirements. if a snirf hdf5 file does not meet the spec, then it is not compliant.

also, matlab's h5 api also suggest to transpose the data (and fliplr(size(data))) before saving https://www.mathworks.com/help/matlab/ref/h5s.create_simple.html

fangq commented 4 years ago

@jayd1860, let me do a more indepth test tomorrow. I anticipate one of our parsers will produce the correct dimensions (because snirf-homer3 transpose when reading and easyh5 does the transpose when saving), but for some reason, both outputs seem to have the incorrect dimensions when loading in python. I will need to look at how python's h5py load data.

jayd1860 commented 4 years ago

@fangq, I am doing this now in agreement with you that a column major language should convert to the file storage order (that is to row-major).

However I am finding as perhaps you are as well that the Matlab rules to do this aren't as simple as transpose when saving and transpose again when loading. It looks a bit trickier than that.

I think I found the rule and am testing it now. Will let you know if it works.

Jay

fangq commented 4 years ago

hi @jayd1860, I am about to post my findings - easyh5 also had similar issues, but I just fixed it, and the fix was easy, see my own ticket here

https://github.com/fangq/easyh5/issues/10

and involved changes can be found here

essentially, you call

data=permute(data,ndims(data):-1:1);

before saving or after loading. Now the hdf5 data order and matlab element order agree to each other.

If you don't want to manually permute, you can in theory write a flag in the array header telling hdf5 that the array order is in fortran/matlab, see

https://support.hdfgroup.org/HDF5/doc1.8/H5.intro.html#Intro-PMCreateArray

unfortunately, this flag can only be used in C library, and matlab's array_create function does not support this flag, see

https://www.mathworks.com/help/matlab/ref/h5t.array_create.html

In short, add a permute call before saving, and another one right after loading. that should be all set.

once you update, we can recreate the sample data files, and notify other developers that these files have changed the array layout.

fangq commented 4 years ago

@jayd1860, I recreated the sample data files using my updated easyh5 toolbox

https://github.com/fNIRS/snirf-samples/commit/253e533fe9118a0fea0fac0e6fc5506d737e647d

in this set of files, the dataset dimensions stored in the hdf5 matches the specification (and also agrees the data loaded to matlab using easyh5 and python using h5py).

Please feel free to test your parser using these new sample data.

jayd1860 commented 4 years ago

Hi @fangq,

My SNIRF reader can't read these samples because it expects the data and aux to have index suffixes and it doesn't look like they are there. Whereas I see that stim does have the array subscripts. All 3 are "HDF5 group with one or multiple sub-groups (i.e. an indexed-group)" according to the spec.

Below are excerpts first from current snirf-samples vs what snirf-homer3 generates. I think the snirf-samples .snirf files should have array subscripts appensded to the group name.

=============================== current snirf-samples:

>> h5disp('./snirf-samples/basic/Simple_Probe.snirf','/','min')
HDF5 Simple_Probe.snirf 
Group '/' 
    Group '/nirs' 
        Group '/nirs/aux' 
            Dataset 'dataTimeSeries' 
            Dataset 'name' 
            Dataset 'time' 
            Dataset 'timeOffset' 
        Group '/nirs/data' 
            Dataset 'dataTimeSeries' 
            Dataset 'time' 
            Group '/nirs/data/measurementList1' 
                Dataset 'dataType' 
                Dataset 'dataTypeIndex'                 
                . . . . . 
        Group '/nirs/stim1' 
            Dataset 'data' 
            Dataset 'name' 
        Group '/nirs/stim2' 
            Dataset 'data' 
            Dataset 'name' 
        Group '/nirs/stim3' 
            Dataset 'data' 
            Dataset 'name'

snirf_homer3:

>> h5disp('./Homer3/DataTree/AcquiredData/Snirf/Examples/Simple_Probe.snirf','/','min')
HDF5 Simple_Probe.snirf 
Group '/' 
    Dataset 'formatVersion' 
    Group '/nirs' 
        Group '/nirs/aux1' 
            Dataset 'dataTimeSeries' 
            Dataset 'name' 
            Dataset 'time' 
            Dataset 'timeOffset' 
        Group '/nirs/data1' 
            Dataset 'dataTimeSeries' 
            Dataset 'time' 
            Group '/nirs/data1/measurementList1' 
                Dataset 'dataType' 
                Dataset 'dataTypeIndex' 
                Dataset 'detectorGain' 
                . . . . . 
        Group '/nirs/stim1' 
            Dataset 'data' 
            Dataset 'name' 
        Group '/nirs/stim2' 
            Dataset 'data' 
            Dataset 'name' 
        Group '/nirs/stim3' 
            Dataset 'data' 
            Dataset 'name'

fangq commented 4 years ago

@jayd1860, this is what we decided for the indexed group numbering: https://github.com/fNIRS/snirf/blob/master/snirf_specification.md#snirf-file-specification

In the below sections, we use the notations "(i)" "(j)" or "(k)" inside the HDF5 location paths to denote the indices of sub-elements when multiplicity presents.

I highlighted the condition "when multiplicity presents". I remember this originally was proposed by Ted when we migrated the language from matlab struct to HDF5 structures, and the rule was when there is a singular subfield, then, its name does not have to add 1 at the end, i.e. fieldname is the same as fieldname1. This is similar to the conventions for /nirs, see

All paths must use /nirs or /nirs# (indexed group array)

let me know if this is what you have in mind.

jayd1860 commented 4 years ago

@fangq, @dboas. That's strange, I remember us deciding this only regarding the top level group /nirs, not any group.

Also here's the whole line in question: "In the below sections, we use the notations "(i)" "(j)" or "(k)" inside the HDF5 location paths to denote the indices of sub-elements when multiplicity presents."

It kind of sounds to me like it's not at all talking about having different naming conventions for indexed groups. Rather what it seems to be saying is that indexed groups (vs non-indexed such as probe) are referred to in the text using the above-mentioned index notation. I would say it's not completely clear, but that's my interpretation of it.

Lastly it seems like it would be adding a needless exception to a simple rule and a needless extra step in the code to satisfy that exception. Why do that here?

Maybe Ted can weigh in

fangq commented 4 years ago

It kind of sounds to me like it's not at all talking about having different naming conventions for indexed groups. Rather what it seems to be saying is that indexed groups (vs non-indexed such as probe) are referred to in the text using the above-mentioned index notation.

it kind if says both - indexed groups will be referred to later using these notations, and, the numbering shall be appended when multiple records are included. At least that I had in mind when I added this language in this commit: https://github.com/fNIRS/snirf/commit/25bfac931f7dbe1cdecacbfbf292cbdf030bf424#diff-ee33e2b4d0875712a7909355275c703aR79

Lastly it seems like it would be adding a needless exception to a simple rule and a needless extra step in the code to satisfy that exception. Why do that here?

no, instead, it is meant to be the only rule for all indexed arrays - instead of letting /nirs to be the only one that has two forms.

I like to hear what @huppertt thinks of this.

On a side note, I have to say that appending a number in the group names is probably the most fragile feature in this whole spec. For example, it prevents us from using any group names that ends with a digit, and requires us to check for number continuity when reading&saving. We previously did not have this issue because it was defined in MATLAB struct, and matlab data support struct-arrays and cell-arrays. However, when we decided to store this in an HDF5 structure, we found that struct arrays or cells arrays do not really exist in HDF5, here, everything is named (does not support indexing). I think this is a limitation of HDF5, but right now, I don't see a solution except that the index can be ignored when there is a singular record.

huppertt commented 4 years ago

I thought we had agreed that all "array" fields including data and aux would have a numeric index. So /nirs/data1 is allowed but /nirs/data would not be. So all arrays (even if they had a single entry) still had to be indexed.

The exception to this was the /nirs field itself, which we said could be /nirs or /nirs1. The reason for this exception was that all data except hyperscanning would probably only have a single /nirs entry. Thus, in most cases this would not be an array.

An advantage of hdf5 is that you can add fields without having to rewrite existing ones. Eg you can add a /data2 without needing to rebuild the whole file. If you allow /data to not have a number, then the write code would need to change indices on existing fields to add.

With that said, I think my code handles the case of a missing index (eg /nirs/data) in the read code (because it seemed nicer to catch this and issue a warning but continue than to ab-end with an exception).

Ted

Theodore Huppert, PhD University of Pittsburgh Department of Electrical and Computer Engineering

Email: Huppert1@pitt.edu Phone: 1-412-647-8459 Website: www.huppertlab.net

From: Qianqian Fang notifications@github.com Sent: Tuesday, May 19, 2020 7:39:39 PM To: fNIRS/snirf_homer3 snirf_homer3@noreply.github.com Cc: huppertt dr.huppert@gmail.com; Mention mention@noreply.github.com Subject: Re: [fNIRS/snirf_homer3] Incorrect data dimensions when saving data by snirf_homer3 (#4)

It kind of sounds to me like it's not at all talking about having different naming conventions for indexed groups. Rather what it seems to be saying is that indexed groups (vs non-indexed such as probe) are referred to in the text using the above-mentioned index notation.

it kind if says both - indexed groups will be referred to later using these notations, and, the numbering shall be appended when multiple records are included. At least that I had in mind when I added this language in this commit: fNIRS/snirf@25bfac9#diff-ee33e2b4d0875712a7909355275c703aR79https://github.com/fNIRS/snirf/commit/25bfac931f7dbe1cdecacbfbf292cbdf030bf424#diff-ee33e2b4d0875712a7909355275c703aR79

I like to hear what @huppertthttps://github.com/huppertt to say on this.

On a side note, I have to say that appending a number in the group names is probably the most fragile feature in this whole spec. For example, it prevents us from using any group names that ends with a digit, and requires us to check for number continuity when reading&saving. We previously did not have this issue because it was defined in MATLAB struct, and matlab data supports struct-arrays or cell-arrays. However, when we decided to store this in an HDF5 structure, we found that struct arrays or cells arrays do not really exist in HDF5, here, everything is named (does not support indexing). I think this is a limitation of HDF5, but right now, I don't see a solution except that the index can be ignored when there is a singular record.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/fNIRS/snirf_homer3/issues/4#issuecomment-631144411, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ALNU3OSK7UF35T24EQSPBFDRSMKDXANCNFSM4M66BIGA.

fangq commented 4 years ago

I thought we had agreed that all "array" fields including data and aux would have a numeric index. So /nirs/data1 is allowed but /nirs/data would not be. So all arrays (even if they had a single entry) still had to be indexed.

The exception to this was the /nirs field itself

I don't mind explicitly adding language to require all indexed groups to have the index, but this will make two rules - /nirs and others will be handled differently - as opposed to a single rule that we currently have - that both /nirs and others (data, aux, measurementList etc) allow to be is a single group without the index. I personally like the latter case.

if we do want to change this, we will also have to use some new notations, such as converting to /nirs(i)/data(j)/measurementList(k)/sourceIndex to /nirs[i]/data(j)/measurementList(k)/sourceIndex and describe [i] and (j,k) in separate ways. It is cumbersome, but certainly doable.

With that said, I think my code handles the case of a missing index (eg /nirs/data)

my code can also handle missing indices as well. I also wrote a dedicated function, regrouph5 to collapse an index-attached structure field names into a struct/cell array so that one can use the fieldname(index) instead of sprintf('fieldname%d',index) to access these fields.

jayd1860 commented 4 years ago

What Ted is saying about "array fields including data and aux would have a numeric index" is what I remember us saying back then.

dboas commented 4 years ago

So, do we have convergence on the solution? Can someone summarize it?

From: jayd1860 notifications@github.com Reply-To: fNIRS/snirf_homer3 reply@reply.github.com Date: Wednesday, May 20, 2020 at 6:50 AM To: fNIRS/snirf_homer3 snirf_homer3@noreply.github.com Cc: "Boas, David" dboas@bu.edu, Mention mention@noreply.github.com Subject: Re: [fNIRS/snirf_homer3] Incorrect data dimensions when saving data by snirf_homer3 (#4)

What Ted is saying about "array fields including data and aux would have a numeric index" is what I remember us saying back then.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/fNIRS/snirf_homer3/issues/4#issuecomment-631399657, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHFCP5DUFEC7543ZINNX3FTRSOYWJANCNFSM4M66BIGA.

fangq commented 4 years ago

So, do we have convergence on the solution? Can someone summarize it?

in the current spec draft, we use a single rule to describe all indices in a group name - the group index can be ignored when there is a single dataset under this group, no matter it is nirs or data or measurementList.

I suppose what Ted and Jay remembered was that our previous decision was that all indexed group names must followed by an index, even there is only a single dateset beneath; however, /nirs is a exception - when there is a single measurement set, it can be written as /nirs instead of /nirs1.

I personally like the first (the current) naming approach, but am happy to change it to the 2nd if we all agree. The issue is that the notation will be a bit cumbersome because we will have to use separate notations for /nirs and other group names.

huppertt commented 4 years ago

I am in favor of the "that all indexed group names must followed by an index, even there is only a single dateset beneath; however, /nirs is a exception - when there is a single measurement set, it can be written as /nirs instead of /nirs1" version. There is an ambiguity if you allow it both ways for many of the fields when the file (inappropriately) has both entries. In cases like the measurementList, the index needs to match the index of the data. Eg if there were a /measlist and /measlist1, which one describes the first entry of the data? The correct answer (following spec) is that this is never allowed but it would be up to the file reader to handle this as an error exception. At least the code ignoring and not loading unnumbered entries would throw obvious errors when the fields were missing, but the ambiguous issue could result in the data still reading but being wrong (eg nearly impossible for the end user to debug without digging in the snirf read function(

I don't think /nirs has this ambiguity, but if having this one exception causes concern, I would be ok with the "numbered even if there is only one entry" applying to the /nirs1 level too.

Theodore Huppert, PhD University of Pittsburgh Department of Electrical and Computer Engineering

Email: Huppert1@pitt.edu Phone: 1-412-647-8459 Website: www.huppertlab.net

From: Qianqian Fang notifications@github.com Sent: Wednesday, May 20, 2020 9:18:14 AM To: fNIRS/snirf_homer3 snirf_homer3@noreply.github.com Cc: huppertt dr.huppert@gmail.com; Mention mention@noreply.github.com Subject: Re: [fNIRS/snirf_homer3] Incorrect data dimensions when saving data by snirf_homer3 (#4)

So, do we have convergence on the solution? Can someone summarize it?

in the current spec draft, we use a single rule to describe all indices in a group name - the group index can be ignored when there is a single dataset under this group, no matter it is nirs or data or measurementList.

I suppose what Ted and Jay remembered was that our previous decision was that all indexed group names must followed by an index, even there is only a single dateset beneath; however, /nirs is a exception - when there is a single measurement set, it can be written as /nirs instead of /nirs1.

I personally like the first (the current) naming approach, but am happy to change it to the 2nd if we all agree. The issue is that the notation will be a bit cumbersome because we will have to use separate notations for /nirs and other group names.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/fNIRS/snirf_homer3/issues/4#issuecomment-631467847, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ALNU3OWNZ3WEI5GLYWBKI7TRSPKBNANCNFSM4M66BIGA.

jayd1860 commented 4 years ago

Again agreeing with what Ted is saying on the naming convention. I don't mind having top level /nirs be the only exception unless someone objects to this.

fangq commented 4 years ago

alright, I will regenerate the sample files and clarify the language.

fangq commented 4 years ago

@jayd1860, sample files are regenerated, please test again.

fangq@taote:~/space/git/Project/github/snirf-samples/basic$ h5dump neuro_run01.snirf  | grep 'GROUP' 
GROUP "/" {
   GROUP "nirs" {
      GROUP "aux1" {
      GROUP "data1" {
         GROUP "measurementList1" {
         GROUP "measurementList10" {
         GROUP "measurementList11" {
         GROUP "measurementList12" {
         GROUP "measurementList13" {
         GROUP "measurementList14" {
         GROUP "measurementList15" {
         GROUP "measurementList16" {
         GROUP "measurementList17" {
         GROUP "measurementList18" {
         GROUP "measurementList2" {
         GROUP "measurementList3" {
         GROUP "measurementList4" {
         GROUP "measurementList5" {
         GROUP "measurementList6" {
         GROUP "measurementList7" {
         GROUP "measurementList8" {
         GROUP "measurementList9" {
      GROUP "metaDataTags" {
      GROUP "probe" {
      GROUP "stim1" {
      GROUP "stim2" {

fNIRS / snirf_homer3

Incorrect data dimensions when saving data by snirf_homer3 #4