fNIRS / snirf

SNIRF Format Specification
http://fnirs.org/resources/software/snirf/

Overhead of channel descriptor groups #103

Closed samuelpowell closed 2 weeks ago

samuelpowell commented 2 years ago

We have been internally evaluating the use of SNIRF as a native output format for Gowerlabs' Lumo system.

Lumo is a high density system, and our full head adult caps contain 54 modules, each with 3 dual-wavelength sources and 4 detectors. We are able to provide a dense output, which results in (54 x 4 detectors) x (54 x 3 sources x 2 wavelengths) = circa 70k channels.

The use of an HDF5 group per channel descriptor (e.g. /data1/measurementList{i}) appears to incur significant overhead. For example, a SNIRF file containing only metadata (no channel data) for a full head system amounts to ~200MiB, or ~3KiB per channel. The actual information content of each descriptor (containing only the required fields plus module indices) amounts to only (7 fields x 4 bytes = ) 28 bytes, so this is an overhead of approximately 99%.

Our results appear vaguely consistent with this analysis:

The overhead involved just in representing the group structure is enough that it doesn't make sense to store small arrays, or to have many groups, each containing only a small amount of data. There does not seem to be any way to reduce the overhead per group, which I measured at about 2.2 kB.

Evidently the size of the metadata grows linearly with the number of channels, as does the data rate of the channel time series, and hence for longer recordings the size of the metadata becomes proportionally smaller. However in absolute terms we find that (with appropriate chunking and online compression) the metadata corresponds to around four minutes of compressed raw channel data. Given the length of a typical measurement session, the overhead remains significant.

I appreciate that the majority of systems (such as those of the manufacturers listed on the SNIRF specification page) are of a much lower density than Lumo, and that even high density systems often produce sparse data, but evidently the trend is towards increasing density and increasing numbers of wavelengths. Our future products would, based on the current SNIRF specification, generate over 0.5GiB of metadata.

sstucker commented 2 years ago

This seems to be a fundamental problem with HDF5 itself (there are many).

I see the following possible resolutions to the huge file size:

  1. Change measurementList into a matrix, as you suggest in your second bullet
  2. Drop the requirement for probe representation across files, so that the probe structure would not need to be repeated.
  3. Drop the requirement for probe altogether if some other tabular probe format, such as the forthcoming BIDS/fNIRS format, is used.

dboas commented 2 years ago

@samuelpowell , excellent issue you are raising. This is very important to resolve. Can you take a look at the channel descriptor tsv file specification in the BIDS fNIRS specification? If you put the probe information in this tsv file and not the snirf file, would that help?

I suspect you are going to say no because this is a text file and so it will still be much larger than 28 bytes per channel. But it should be much less than the 3 KiB per channel that you say SNIRF has. I suspect this tsv file should be less than 100 bytes per channel. That would be more than 30x smaller, or about 7 MiB for you. Is that reasonable as an initial solution? Note that the BIDS spec has the property of inheritance, such that you wouldn't have to repeat this tsv file for every run; you could just specify it once per subject, or even once per group, if all runs/files in that subject/group use the same geometry.

Of course you also need the optode description tsv file, but that is only number of optodes long and not number of channels.

If this works, then it allows us to use the SNIRF and BIDS standards without having to change the SNIRF spec, other than noting in the SNIRF spec that if the probe information is missing then one should look for it in the tsv files. This would be much, much cleaner than the other options.

Note that option 1 mentioned by @sstucker was debated quite a bit early on in the SNIRF specification by a number of people, and it was felt that the detailed descriptor was better than the matrix approach previously used in the .nirs file, for instance. BUT, we didn't appreciate the HDF5 overhead of such things.

samuelpowell commented 2 years ago

@sstucker @dboas thank you for your prompt responses.

Array of a compound type

Switching from a group per descriptor to an array of a compound type containing the same information reduces the file size in the above scenario from ~200MiB to ~2MiB. It also reduces the time taken to construct the enumeration from ~24 seconds down to ~1s (take the timings with a pinch of salt: the disparity could simply be that the large amount of data generated in the former case causes HDF5 to flush to disk; I haven't done any detailed profiling, so I don't know).
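
For reference, a minimal sketch of this approach using the HDF5 C API: one dataset of a compound type, with one element per channel. The struct below holds the required measurementList fields plus two illustrative module-index fields; names, types and the function signature are illustrative rather than a spec proposal.

#include <hdf5.h>
#include <stdint.h>

typedef struct {
    int32_t sourceIndex;      /* required measurementList fields */
    int32_t detectorIndex;
    int32_t wavelengthIndex;
    int32_t dataType;
    int32_t dataTypeIndex;
    int32_t sourceModule;     /* illustrative module-index fields */
    int32_t detectorModule;
} channel_desc_t;

void write_measurement_list(hid_t data_group, const channel_desc_t *ch, hsize_t nchan)
{
    /* Build a compound datatype mirroring the struct layout */
    hid_t ctype = H5Tcreate(H5T_COMPOUND, sizeof(channel_desc_t));
    H5Tinsert(ctype, "sourceIndex",     HOFFSET(channel_desc_t, sourceIndex),     H5T_NATIVE_INT32);
    H5Tinsert(ctype, "detectorIndex",   HOFFSET(channel_desc_t, detectorIndex),   H5T_NATIVE_INT32);
    H5Tinsert(ctype, "wavelengthIndex", HOFFSET(channel_desc_t, wavelengthIndex), H5T_NATIVE_INT32);
    H5Tinsert(ctype, "dataType",        HOFFSET(channel_desc_t, dataType),        H5T_NATIVE_INT32);
    H5Tinsert(ctype, "dataTypeIndex",   HOFFSET(channel_desc_t, dataTypeIndex),   H5T_NATIVE_INT32);
    H5Tinsert(ctype, "sourceModule",    HOFFSET(channel_desc_t, sourceModule),    H5T_NATIVE_INT32);
    H5Tinsert(ctype, "detectorModule",  HOFFSET(channel_desc_t, detectorModule),  H5T_NATIVE_INT32);

    /* One dataset with one element per channel, instead of one group per channel */
    hid_t space = H5Screate_simple(1, &nchan, NULL);
    hid_t dset  = H5Dcreate2(data_group, "measurementList", ctype, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, ctype, H5S_ALL, H5S_ALL, H5P_DEFAULT, ch);

    H5Dclose(dset); H5Sclose(space); H5Tclose(ctype);
}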

If this approach to storing channel metadata is viable, I'd be pleased. Of course, I'm not party to the earlier discussions that you mentioned David, so perhaps there are technical reasons of which I am unaware which preclude this approach.

Use of BIDS channel descriptor

Our motivation for evaluating SNIRF as a native output format (by which I mean written during acquisition by the instrument) is to make things simpler for our end users.

Considerations include:

As soon as we start including additional metadata files, things become complex again, and it makes less sense.

I make the distinction of a native format because the other option is simply to provide a conversion script such that users can employ SNIRF as an archival format, should they choose. In this scenario I don't particularly care about the metadata overhead - storage is cheap. But as a native format we have to consider its usage not just on a workstation or in the cloud, but also in embedded systems where processing, power, and storage are limited.

fangq commented 2 years ago

@samuelpowell, I am surprised that the channel descriptor could add such a big overhead in HDF5. Maybe try out the JSON/binary JSON wrapper (JSNIRF) of the SNIRF format - see this spec and sample file.

JSNIRF is meant to map losslessly to all SNIRF data fields, can be used as the "source-code format" of SNIRF (because JSON is human-readable), and can serve as a bridge to NoSQL (MongoDB/CouchDB) databases in the future. JSNIRF depends on SNIRF, but provides a much more portable/lightweight way to read/write, especially suited for lightweight hierarchical data. Check out this poster for details.

Depending on the programming language of your code, I can provide examples of reading/writing such data. We have libraries for Python, MATLAB, C, C++ and JavaScript.

samuelpowell commented 2 years ago

Hi @fangq, your use of a structure of arrays rather than an array of structures would certainly keep the size of the metadata manageable (this is basically equivalent to the suggestion explored in the previous post), and I do see the value of the human readable metadata.

However as noted we are looking at formats from the perspective of online recording, where the channel data (and other auxiliary fields) are updated in real time. I appreciate that in some applications HDF5 is used as a sledgehammer to crack a nut, but for this particular case its use of a chunked data model, backed by a B-tree, is spot on.
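
For concreteness, a minimal sketch (HDF5 C API) of the kind of incremental write we have in mind: an extendible, chunked dataset that is grown and appended to as samples arrive. The dataset name, the channels-by-time orientation, the chunk shape and the compression level are illustrative only.

#include <hdf5.h>

/* Create an extendible (channels x time) dataset; the time dimension is unlimited */
hid_t create_timeseries(hid_t data_group, hsize_t nchan)
{
    hsize_t dims[2]    = { nchan, 0 };
    hsize_t maxdims[2] = { nchan, H5S_UNLIMITED };
    hsize_t chunk[2]   = { nchan, 64 };            /* illustrative chunk shape */

    hid_t space = H5Screate_simple(2, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);                  /* chunking is required for extendible datasets */
    H5Pset_deflate(dcpl, 4);                       /* optional online compression */

    hid_t dset = H5Dcreate2(data_group, "dataTimeSeries", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl); H5Sclose(space);
    return dset;
}

/* Append a (nchan x t_new) block of samples starting at time index t_old */
void append_samples(hid_t dset, const double *block, hsize_t nchan, hsize_t t_old, hsize_t t_new)
{
    hsize_t newdims[2] = { nchan, t_old + t_new };
    H5Dset_extent(dset, newdims);                  /* grow the dataset along the time axis */

    hid_t fspace = H5Dget_space(dset);
    hsize_t start[2] = { 0, t_old }, count[2] = { nchan, t_new };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(2, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, block);
    H5Sclose(mspace); H5Sclose(fspace);
}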

Please correct me if I'm wrong, but I don't see how incremental writing of data could work in the JSNIRF specification as it stands, so whilst we'd certainly look at its use as an archival format I'm not sure it fits the bill for what we want right now.

(Another consideration for commercial usage is that I would want to see your BJData format gaining traction in a couple of high quality and performant C/C++ libraries - I can see you are making good progress through e.g. https://github.com/nlohmann/json/pull/3336.)

I don't want to stray too much from the main purpose of this issue - but always happy to discuss further elsewhere.

dboas commented 2 years ago

@samuelpowell , I was arguing to keep the array representation of the measurement list information, but several others were in support of the specification now adopted by SNIRF. The rationale was simply that storage was not an issue and the adopted specification was more explicit. This was a discussion at the very beginning of the SNIRF spec, circa 2016, with many from the fNIRS field and we didn’t have the HDF5 experience and weren’t thinking of the overhead of large measurement lists… Well, I might say I was since I was arguing for the array representation :)

I am fine trying to get the array representation into SNIRF as an either / or proposition to what is presently there. I’d prefer that myself. And I suspect we can do it in a way that keeps backward compatibility. But I think we need to get buy-in from others who are now using SNIRF.

Also, I think we need to better understand @fangq’s tools as they are likely quite useful given his history of such contributions and his present funding to develop this further.

samuelpowell commented 2 years ago

@dboas I understand, thank you.

To clarify, the approach I tested above was to have an array of a compound type, where the compound type contains each of the fields (i.e. an array of structs). I don't know if this was the original proposal, or if you were considering a single group with arrays for each field (i.e. a struct of arrays). Either approach solves the problem and could be optional as far as I can see.

I'll leave this issue open for now, and in the meantime am discussing JSNIRF further with @fangq on that repo.

fangq commented 2 years ago

@samuelpowell, if you can share a sample snirf file with a large inflation ratio (the file itself does not have to be big), I think it is worthwhile to use h5dump or other tools to find out what had caused the overhead. It might be just unnecessary padding. I will be happy to take a look.

On the other hand, see my comment in another thread regarding the use of metaDataTag.{AccessionNumber,InstanceNumber} as a workaround for data streaming. It is DICOM inspired, but is more portable (both HDF5 and JSON compatible) and may not have such an overhead issue (though post-processing is needed to recognize the data sequence and concatenate the segments once the recording is done).

samuelpowell commented 2 years ago

@fangq thank you for offering to take a look at the files, please see attached. Any insights welcome!

group_enumeration.zip contains SNIRF compliant metadata (~200MiB).
array_enumeration.zip contains the same enumeration data, but it is written as an array of a compound datatype (~2MiB).

samuelpowell commented 2 years ago

Brief note that creating the HDF5 file with compatibility limited to more recent versions of the library, e.g.,

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);                             /* file access property list */
H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);    /* restrict to the latest file format */
hid_t fid = H5Fcreate("file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);  /* pass the fapl at creation */

enables compact storage for the group data. This reduces the file size from ~200MiB to ~160MiB, which is not particularly significant.

dboas commented 2 years ago

@sstucker and I were discussing that one possible solution to the overhead issue is to have the option that each element of /nirs(i)/data(j)/measurementList(k) could be comprised of arrays for /nirs(i)/data(j)/measurementList(k)/sourceIndex, /nirs(i)/data(j)/measurementList(k)/detectorIndex, etc., all with the same length. This suggestion is based on the observation that each element of /nirs(i)/data(j)/measurementList(k) is treated as a new block of data (do I have that right?) and that this creates a lot of overhead. So instead of having thousands of /nirs(i)/data(j)/measurementList(k) elements, we can have only one /nirs(i)/data(j)/measurementList(1) element and put the thousands of measurements in arrays for the sub-fields, which themselves are NOT blocks.
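
A minimal sketch of what this might look like with the HDF5 C API: a single measurementList group whose sub-fields are length-N arrays rather than N indexed groups. The group and field names, and the subset of fields shown, are illustrative only, not proposed spec wording.

#include <hdf5.h>
#include <stdint.h>

void write_soa_measurement_list(hid_t data_group, const int32_t *sourceIndex,
                                const int32_t *detectorIndex, const int32_t *wavelengthIndex,
                                hsize_t nchan)
{
    /* One group, instead of one group per channel */
    hid_t grp   = H5Gcreate2(data_group, "measurementList1", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, &nchan, NULL);   /* one element per channel */

    const char    *names[]   = { "sourceIndex", "detectorIndex", "wavelengthIndex" };
    const int32_t *columns[] = { sourceIndex, detectorIndex, wavelengthIndex };

    for (int i = 0; i < 3; i++) {                      /* one plain dataset per sub-field */
        hid_t dset = H5Dcreate2(grp, names[i], H5T_NATIVE_INT32, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_INT32, H5S_ALL, H5S_ALL, H5P_DEFAULT, columns[i]);
        H5Dclose(dset);
    }
    H5Sclose(space);
    H5Gclose(grp);
}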

It seems to me that we can allow for the addition of this usage while keeping backward compatibility, i.e. people can still write snirf files with many elements for /nirs(i)/data(j)/measurementList(k). We just have to change the readers and writers to recognize this potential usage.

Thoughts?

samuelpowell commented 2 years ago

@dboas this would certainly fix the problem I have described, and is probably the simplest / least intrusive way to do so.

One thought. Perhaps when using the structure of arrays style you suggest, the index (k) could be dropped from /nirs(i)/data(j)/measurementList(k), such that the presence of /nirs(i)/data(j)/measurementList implies the nested arrays. This reflects the way that the /nirs(i) entry is permitted to be unindexed iff there is one entry. This might make it easier to parse this special case and to catch invalid use.

But in any case, I’d be very pleased with this change.

sstucker commented 2 years ago

This change sounds good to me. It would be a bit annoying programmatically (to recognize which format is in use), but no more strained than the notion of Indexed Groups already is.

huppertt commented 2 years ago

The reason that we coded the snirf file like this (despite knowing it would be more overhead) is that I saw file I/O and RAM usage as more of a bottleneck than disk storage. Specifically, I was thinking in the context of two scenarios that will likely come up in future data mining methods:

1) If you put the data into one large chunk (e.g. a multidimensional data array/matrix instead of a series of single channels), then you need to unpack the whole thing to access the data. Since the data fields can be compressed (filtered, in HDF5 notation), this would prevent extraction of channel subsets. I was specifically thinking of the 70k channel case in this context, envisioning data mining/machine learning methods.

2) The self-contained single-channel notation allows you to add/remove data inside the HDF5 file without having to rewrite the file. So you can dynamically prune channels from the file without having to unpack it.

Putting the metadata into a complex array and having the per-channel measurement list entry refer to an index in the array might be a solution. I wonder if you can use symbolic links to do this (since HDF5 has symbolic links/memory maps as a class). It's possible that if you did it with symbolic links, you wouldn't even need to worry about different formats because the HDF5 backend would take care of it.

I think if you added compression to the actual data (per channel), you would probably be able to reduce the file size by more than the 500 MB, and you might appreciate not having to unpack/decompress the whole 70k-by-timepoints matrix just to access one channel.
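
A minimal sketch of that per-channel access (HDF5 C API), assuming the time series has been chunked one channel per chunk so that a hyperslab read decompresses only that channel's chunks; the dataset name and the channels-by-time orientation follow the discussion above and are illustrative only.

#include <hdf5.h>
#include <stdlib.h>

/* Read one channel from a (channels x time) dataset via a hyperslab selection */
double *read_one_channel(hid_t data_group, hsize_t chan, hsize_t *ntime_out)
{
    hid_t dset   = H5Dopen2(data_group, "dataTimeSeries", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t dims[2];
    H5Sget_simple_extent_dims(fspace, dims, NULL);     /* dims = {channels, time points} */
    *ntime_out = dims[1];

    hsize_t start[2] = { chan, 0 }, count[2] = { 1, dims[1] };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(1, &dims[1], NULL);
    double *buf  = malloc((size_t)dims[1] * sizeof(double));
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
    return buf;                                        /* caller frees the channel's samples */
}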

samuelpowell commented 2 years ago

@huppertt are you saying that the choice of the current design is such that one might choose to prune a particular channel k by deleting its entry in the metadata, e.g., /nirs(i)/data(j)/measurementList(k)?

I understand that this is certainly a low cost operation in the current specification.

What I don’t understand is that as soon as you actually want to use the data, you still have to access the time series which the specification states is a 2D (channels x time points) matrix. This cannot necessarily be done efficiently in terms of I/O (e.g. through memory mapping the array) because the SNIRF specification doesn’t prohibit the use of chunked and/or compressed datasets (which preclude this approach).

sstucker commented 2 years ago

Going ahead and drafting an implementation of our solution. We can benchmark the proposed format and compare to your experience @samuelpowell

samuelpowell commented 2 years ago

Splendid, thank you @sstucker . We're about to release a MATLAB package for conversion of LUMO data - I'll make sure we include the proposed solution as an option and keep it up to date with whatever happens. Ping me here if you need anything.

sstucker commented 2 years ago

See above for an early draft fix

huppertt commented 1 year ago

It’s been a while since I looked at the spec. I could have sworn we had put the data vector as a sub field to each channel rather than as a matrix entry. I think I am just losing my mind because that is not how we did it. Everything I said was based on the assumption that the data for each channel was stored as a child of that channel’s definition. But, since I am clearly forgetting how we actually did it in the end and that the data is not separated by channel, just forget everything I wrote before 🤦‍♂️🤦‍♂️🤦‍♂️
