Photon-HDF5 / photon-hdf5

Photon-HDF5 Reference Documentation
http://photon-hdf5.readthedocs.io/
3 stars 3 forks source link

User group, tcspc offset and markers #46

Open harripd opened 2 months ago

harripd commented 2 months ago

Summary of changes

This PR introduces a small number of actual additions to the format, while also clarifying a number of points without changing the actual format.

Additions to the format

  1. Markers: limited support for markers is not included with the addition of the /photon_data/measurement_specs/markersX groups (X being a positive integer). These denote ids in the /photon_data/detectors arrays that are assigned not to photons but any arbitrary "event" marker. The meaning of markers is not defined anywhere in the spec, but should be specified by the file creator inside the user-defined /user/ group (see point number 3). As no additional support for markers is specified, non-custom software should ignore photons designated as markers. (Note future versions may provide for defining markers for certain types of FLIM etc., but these are in future versions, not 0.5)
  2. /setup/detectors/tcspc_offset` array has been added to specify any offset that should be applied to different detectors as a result of differing distances between detectors.
  3. /user/ group- this is not a change per-se, but still a major clarification. The /user/ group is now on official part of the documentation, and is specifically designated as being open for the file creator (the user) to define as he/she wishes. Although there are some suggested ways of using it:
    • Use the /user/[vendor] group to store raw metadata loaded from the header of the file, this ensures preservation of data
    • Use /user/experimental_settings to store various experimentally determined values like FRET correction factors etc. as well as an explanation of any marker ids in the data

Clarifications

  1. Reworded text on measurement_type adding emphasis that "generic" is now the preferred type and while we may still add a new types , there will need to be a very compelling reason to do so.
  2. Reworded some of the text (without changing definitions) on ALEX related fields under /photon_data measurement_specs to hopefully make the spec more clear/human readable.
  3. Improved (hopefully nothing missed) to documentation syntax
    • Made explicit that certain fields should have the same number of values in others (e.g. timestamps and detectors) or match the value of another field (e.g. num_pixel and setup/detectors/id)
    • More consistent naming of fields:
      • if a referenced field is a group, end with a /
      • root level groups will always start with a /
      • non-group fields will not end with a /
      • always treat as monospaced when a field is being referenced and not defined.
  4. Renamed "Detector IDs" to "Record IDs" to reflect inclusion of marker records
    • Terminology renamed throughout document (hopefully all instances caught)
      • Before terminology was split between "detectors", "detector IDs" and "pixel IDs" now it should be more consistent, and precise
    • New terminology:
      • "record ID" is inclusive of both real photons and markers
      • "detector ID" is a real photon, arising from a physical detector
      • "marker ID" is any other sort of record, like a sync photon or marker for FLIM measurements, currently v0.5 has no official way to interpret these, and so will be ignored
    • Subsection of Marker IDs added for added clarification
smXplorer commented 2 months ago

About the interpretation of "detector ID": not all counts from a detector is a real photon, therefore it might be preferable to talk about detector pulse (or count).

Regarding the /user/ group: the openendedness of the specification seems like an invitation to utter confusion. In a very extreme case, I can imagine someone dumping the raw data file as a byte array "for the record". This is of course unlikely to happen, but you get the drift. I would suggest holding off on making this a part of 0.5.

Consider replacing "he/she" by "she/they/he".

harripd commented 2 months ago

About the interpretation of "detector ID": not all counts from a detector is a real photon, therefore it might be preferable to talk about detector pulse (or count).

How about rewording this section to:

"Simply put a detector ID corresponds to an event that results from a pulse (real photon or background/afterpulse) at one of the detectors (pixels)."

This way we don't ensure that it is a real photon, but also makes clear that if it is not, the instrument is "interpreting" it as a photon, as opposed to markers, which are most definitely not photons.

Regarding the /user/ group: the openendedness of the specification seems like an invitation to utter confusion. In a very extreme case, I can imagine someone dumping the raw data file as a byte array "for the record". This is of course unlikely to happen, but you get the drift. I would suggest holding off on making this a part of 0.5.

If you look at the phconvert notebooks that Antonio published, they actually use the /user group to store the header metadata. Further, again using phconvert as a guide, the assert_valid_photon_hdf5 function specifically ignores the /user group. So making /user a part of 0.5 is really just putting it in the documentation, when the practical implementation already does this.

Would additional text discouraging "data dumping" be helpful? Or additional emphasis that the core interpretation of data should not require the /user group? It should more be a semi-structured place to store metadata that currently has no official place in the spec.

smXplorer commented 2 months ago

Agreed with the wording, although I have reservations with introducing markers in 0.5.

My bad, I did not dig into phconvert and missed Antonio's usage of an undocumented feature (I couldn't find any mention of that group in the docs or in the multispot photon-hdf5 files we released with the Methods paper for that matter).

There probably should be guidelines in such a "no rule" section, but my preliminary thinking was, if anything, to impose some kind of json or similar string-based structure. I know that Matlab will read an hdf5 file and intelligently create the appropriate data structure internally, so a limitation to strings has no real bearing here, but other languages might have a harder time (python converts json to dictionaries easily). My point is that since it was so embryonic in Antonio's implementation, it might be a good thing to mull over it a bit more, before regretting having opened this Pandora box in the future. Remember, you can't undo anything in a backward compatible file format...

If I understand what you are saying, Antonio just dumped the header of whatever file he was converting into this /user/ group (presumably as a string?). But what would be the use of this metadata without the raw data? There might a few bits of information to glean from it (firmware version, TAC or TDC settings, etc.), but I would argue they will have no practical use for data analysis and are therefore not needed. The point is that photon-hdf5 data is processed (photons are dispatched into detector buckets, ordered), and having the info that was used to perform this processing will generally not allow to get back to the raw file. If people want the raw file, let them get the raw file. Photon-hdf5's purpose is not to allow getting back to it.

harripd commented 2 months ago

The way Antonio stored the metadata was not as a string, but rather in a series of separate arrays.

Basically both Picoquant and B&H headers are dictionaries, and so Antonio just made fields names the same as the dictionary keys, and values equivalent to the values of the dictionary (so basically /user/picoquant or /user/becker_hickl would have a bunch of arrays, most of them single element, named after the keys in the header files).

The main purpose of storing this metadata is that these headers often contain information about the offsets, powers etc used for the detectors lasers etc. None of this data is used by FRETBursts, but I think it is good to still store it as reference for an intrepid person to investigate just in case. For instance some B&H cards (like the one I'm using) can specify the TAC window in ways that phconvert still doesn't quite parse correctly (I've been studying, and still can't perfectly match what I know is the TAC window with the various fields in the B&H .set file) so sometimes I wind up with files where the nanotimes unit is wrong. Having the original metadata there is useful in case someone wants to check that the conversion processes went correctly. It shouldn't be the case, but it is a good check.

smXplorer commented 2 months ago

I understand the points and I am not trying to be difficult, but just to instill the sense of caution that should be present in the mind of someone trying to develop a useful tool.

You have an _excitation_inputpowers parameter in the setup group that would address the laser power point. The offset is a newly introduced parameter. What else is really needed that should then be added to the format?

My take on the other problem you are describing, is that if a photon-hdf5 file does not produce the results that the original creator is trying to achieve, the file is incorrect and should not be released by the original creator. I would argue that one should carry out one's analyses with the photon-hdf5 file they are going to release, since their Jupyter Notebook (or Matlab script, or whatever code is shared for reproducibility purposes) will use that file. Converting the file properly is internal voodoo that can be described if needed, but is way upstream. I am not sure how providing the parameters you are mentioning would help in any way figuring out that the photon-hdf5 is actually bogus and how this would help in correcting it. It might if it is indeed a simple rescaling but if things are bit more involved - nonlinear - then this won't help (I am only using binned std files from B&H, and have only dealt with ``fake" ptu files from Leica, so I am just interpolating from this).

Note that I am not against dumping header files from vendors but then my suggestion would be to call this that, for instance as a source_file_header field in the provenance group, not in a /user group open to all abuses :-).

Note that since there is no guarantee vendors will stick to one format (assuming it can easily be converted for a given version), my preference would be to dump the raw header as a string, leaving it up to the motivated user to do whatever forensic analysis they wish with it.

harripd commented 18 hours ago

@smXplorer @marktsuchida I've updated the text with these main changes:

  1. Changed markers to non-photon IDs, and clarified the language throughout
  2. Added a note that while supported, non-photon IDs will break backwards compatibility, and thus should only be used if absolutely necessary.

I encourage everyone to go over this carefully,