Photon-HDF5 / photon-hdf5

Photon-HDF5 Reference Documentation
http://photon-hdf5.readthedocs.io/

User group, tcspc offset and markers #46

Open harripd opened 3 months ago

harripd commented 3 months ago

Summary of changes

This PR introduces a small number of actual additions to the format, while also clarifying a number of points without changing the actual format.

Additions to the format

  1. Markers: limited support for markers is now included with the addition of the /photon_data/measurement_specs/markersX groups (X being a positive integer). These denote IDs in the /photon_data/detectors array that are assigned not to photons but to arbitrary "event" markers. The meaning of markers is not defined anywhere in the spec, but should be specified by the file creator inside the user-defined /user/ group (see point 3). As no additional support for markers is specified, non-custom software should ignore records designated as markers. (Note: future versions may provide for defining markers for certain types of FLIM etc., but these belong to future versions, not 0.5.)
  2. The /setup/detectors/tcspc_offset array has been added to specify any offset that should be applied to different detectors as a result of differing distances between detectors.
  3. /user/ group: this is not a change per se, but still a major clarification. The /user/ group is now an official part of the documentation, and is specifically designated as being open for the file creator (the user) to define as he/she wishes. There are, however, some suggested ways of using it:
    • Use the /user/[vendor] group to store raw metadata loaded from the header of the original file; this ensures preservation of data
    • Use /user/experimental_settings to store various experimentally determined values like FRET correction factors etc. as well as an explanation of any marker ids in the data
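As a sketch, point 2 might be applied by a reader like this (the values, units, and sign convention below are made up for illustration; the spec only defines the array itself):

```python
import numpy as np

# Per-detector timing offsets from the proposed /setup/detectors/tcspc_offset
# array, assumed here to be indexed in the same order as /setup/detectors/id.
tcspc_offset = np.array([0, 25])            # in TCSPC bins; invented values

detectors = np.array([0, 1, 1, 0])          # /photon_data/detectors
nanotimes = np.array([100, 200, 300, 400])  # /photon_data/nanotimes

# Correct each photon's TCSPC time by the offset of its detector
nanotimes_corrected = nanotimes - tcspc_offset[detectors]
# -> [100, 175, 275, 400]
```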

Clarifications

  1. Reworded text on measurement_type, adding emphasis that "generic" is now the preferred type and that while we may still add new types, there will need to be a very compelling reason to do so.
  2. Reworded some of the text (without changing definitions) on ALEX-related fields under /photon_data/measurement_specs to hopefully make the spec more clear/human readable.
  3. Improved documentation syntax (hopefully nothing missed)
    • Made explicit that certain fields should have the same number of values as others (e.g. timestamps and detectors) or match the value of another field (e.g. num_pixel and setup/detectors/id)
    • More consistent naming of fields:
      • if a referenced field is a group, end with a /
      • root level groups will always start with a /
      • non-group fields will not end with a /
      • always treat as monospaced when a field is being referenced and not defined.
  4. Renamed "Detector IDs" to "Record IDs" to reflect inclusion of marker records
    • Terminology renamed throughout document (hopefully all instances caught)
      • Before, terminology was split between "detectors", "detector IDs" and "pixel IDs"; now it should be more consistent and precise
    • New terminology:
      • "record ID" is inclusive of both real photons and markers
      • "detector ID" is a real photon, arising from a physical detector
      • "marker ID" is any other sort of record, like a sync photon or marker for FLIM measurements, currently v0.5 has no official way to interpret these, and so will be ignored
    • A subsection on Marker IDs was added for further clarification
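To make the record-ID terminology concrete: since v0.5 gives no official interpretation of marker records, a non-custom reader would simply drop them. A minimal sketch (the markers1 value and all IDs below are invented):

```python
import numpy as np

# Illustrative record stream: IDs 0 and 1 are detector IDs (real photons),
# while ID 15 is declared a marker ID via a hypothetical
# /photon_data/measurement_specs/markers1 = 15 field.
detectors = np.array([0, 1, 15, 0, 15, 1], dtype=np.uint8)   # record IDs
timestamps = np.array([10, 20, 25, 30, 40, 50], dtype=np.int64)
marker_ids = [15]

# Non-custom software ignores records designated as markers:
photon_mask = ~np.isin(detectors, marker_ids)
photon_timestamps = timestamps[photon_mask]
# -> [10, 20, 30, 50]
```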
smXplorer commented 3 months ago

About the interpretation of "detector ID": not all counts from a detector are real photons, therefore it might be preferable to talk about detector pulses (or counts).

Regarding the /user/ group: the openendedness of the specification seems like an invitation to utter confusion. In a very extreme case, I can imagine someone dumping the raw data file as a byte array "for the record". This is of course unlikely to happen, but you get the drift. I would suggest holding off on making this a part of 0.5.

Consider replacing "he/she" by "she/they/he".

harripd commented 3 months ago

> About the interpretation of "detector ID": not all counts from a detector are real photons, therefore it might be preferable to talk about detector pulses (or counts).

How about rewording this section to:

"Simply put a detector ID corresponds to an event that results from a pulse (real photon or background/afterpulse) at one of the detectors (pixels)."

This way we don't guarantee that it is a real photon, but we make clear that, if it is not, the instrument is "interpreting" it as a photon, as opposed to markers, which are most definitely not photons.

> Regarding the /user/ group: the openendedness of the specification seems like an invitation to utter confusion. In a very extreme case, I can imagine someone dumping the raw data file as a byte array "for the record". This is of course unlikely to happen, but you get the drift. I would suggest holding off on making this a part of 0.5.

If you look at the phconvert notebooks that Antonio published, they actually use the /user group to store the header metadata. Further, again using phconvert as a guide, the assert_valid_photon_hdf5 function specifically ignores the /user group. So making /user a part of 0.5 really just puts into the documentation what the practical implementation already does.

Would additional text discouraging "data dumping" be helpful? Or additional emphasis that the core interpretation of data should not require the /user group? It should more be a semi-structured place to store metadata that currently has no official place in the spec.

smXplorer commented 3 months ago

Agreed with the wording, although I have reservations with introducing markers in 0.5.

My bad, I did not dig into phconvert and missed Antonio's usage of an undocumented feature (I couldn't find any mention of that group in the docs or in the multispot photon-hdf5 files we released with the Methods paper for that matter).

There probably should be guidelines in such a "no rule" section, but my preliminary thinking was, if anything, to impose some kind of JSON or similar string-based structure. I know that Matlab will read an hdf5 file and intelligently create the appropriate data structure internally, so a limitation to strings has no real bearing there, but other languages might have a harder time (Python converts JSON to dictionaries easily). My point is that since it was so embryonic in Antonio's implementation, it might be a good thing to mull over it a bit more, before regretting having opened this Pandora's box in the future. Remember, you can't undo anything in a backward compatible file format...
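A string-based structure like the one suggested could be as simple as one JSON string per vendor header (a sketch; the header keys and the /user/picoquant path are invented):

```python
import json

# Hypothetical vendor header, as parsed from the raw file
# (key names are invented, not from any real format)
header = {"HW_Serial": "1020468", "BitsPerRecord": 32, "SyncDivider": 1}

# Serialize to a single JSON string, e.g. for a /user/picoquant/header dataset
header_json = json.dumps(header)

# Python recovers a dict; Matlab and other languages can parse the same string
recovered = json.loads(header_json)
```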

If I understand what you are saying, Antonio just dumped the header of whatever file he was converting into this /user/ group (presumably as a string?). But what would be the use of this metadata without the raw data? There might be a few bits of information to glean from it (firmware version, TAC or TDC settings, etc.), but I would argue they will have no practical use for data analysis and are therefore not needed. The point is that photon-hdf5 data is processed (photons are dispatched into detector buckets, ordered), and having the info that was used to perform this processing will generally not allow getting back to the raw file. If people want the raw file, let them get the raw file. Photon-hdf5's purpose is not to allow getting back to it.

harripd commented 3 months ago

The way Antonio stored the metadata was not as a string, but rather in a series of separate arrays.

Basically both Picoquant and B&H headers are dictionaries, and so Antonio just made fields names the same as the dictionary keys, and values equivalent to the values of the dictionary (so basically /user/picoquant or /user/becker_hickl would have a bunch of arrays, most of them single element, named after the keys in the header files).
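That layout can be sketched in a few lines of h5py (the group name and header keys here are invented, not actual phconvert output):

```python
import h5py
import numpy as np

# Illustrative header dictionary (key names are invented, not real B&H fields)
header = {"SyncRate": 20000000, "Resolution": 4e-12}

with h5py.File("converted.h5", "w") as f:
    for key, value in header.items():
        # one (usually single-element) array per header key, named after the key
        f.create_dataset("user/becker_hickl/" + key, data=np.atleast_1d(value))

with h5py.File("converted.h5", "r") as f:
    sync_rate = f["user/becker_hickl/SyncRate"][0]
```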

The main purpose of storing this metadata is that these headers often contain information about the offsets, powers, etc. used for the detectors, lasers, and so on. None of this data is used by FRETBursts, but I think it is still good to store it as a reference for an intrepid person to investigate, just in case. For instance, some B&H cards (like the one I'm using) can specify the TAC window in ways that phconvert still doesn't quite parse correctly (I've been studying it, and still can't perfectly match what I know is the TAC window with the various fields in the B&H .set file), so sometimes I wind up with files where the nanotimes unit is wrong. Having the original metadata there is useful in case someone wants to check that the conversion process went correctly. It shouldn't happen, but it is a good check.

smXplorer commented 3 months ago

I understand the points and I am not trying to be difficult, but just to instill the sense of caution that should be present in the mind of someone trying to develop a useful tool.

You have an excitation_input_powers parameter in the setup group that would address the laser power point. The offset is a newly introduced parameter. What else is really needed that should then be added to the format?

My take on the other problem you are describing is that if a photon-hdf5 file does not produce the results that the original creator is trying to achieve, the file is incorrect and should not be released by the original creator. I would argue that one should carry out one's analyses with the photon-hdf5 file one is going to release, since the Jupyter Notebook (or Matlab script, or whatever code is shared for reproducibility purposes) will use that file. Converting the file properly is internal voodoo that can be described if needed, but is way upstream. I am not sure how providing the parameters you are mentioning would help in figuring out that the photon-hdf5 file is actually bogus, or how this would help in correcting it. It might if it is indeed a simple rescaling, but if things are a bit more involved - nonlinear - then this won't help (I am only using binned std files from B&H, and have only dealt with "fake" ptu files from Leica, so I am just interpolating from this).

Note that I am not against dumping header files from vendors, but then my suggestion would be to call it that, for instance as a source_file_header field in the provenance group, not in a /user group open to all abuses :-).

Note that since there is no guarantee vendors will stick to one format (assuming it can easily be converted for a given version), my preference would be to dump the raw header as a string, leaving it up to the motivated user to do whatever forensic analysis they wish with it.

harripd commented 1 month ago

@smXplorer @marktsuchida I've updated the text with these main changes:

  1. Changed markers to non-photon IDs, and clarified the language throughout
  2. Added a note that, while supported, non-photon IDs will break backwards compatibility and thus should only be used if absolutely necessary.

I encourage everyone to go over this carefully.

marktsuchida commented 3 weeks ago

@harripd and @smXplorer,

Regarding the notes (and points raised in email) about backward compatibility: it might help to define what exactly we mean by backward compatibility (even if in an unofficial way to begin with).

There are broadly two possible policies:

  1. "Backward compatible" means that readers of a newer version can read files written in any older version of the format.
  2. "Backward compatible" means that readers of an older version can read files written in newer versions of the format (at least when the file does not use the newer features).

There could also be a mix of the two policies, such as readers rejecting files if the major version number is newer than known (I think this might be reasonable). Also, validators (as opposed to mere readers) will necessarily have to reject files of a newer version, or at least warn that they are not performing a full validation.

Without at least some stated policy, I suspect it will be hard to agree on what constitutes backward compatibility.
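One possible stated policy, the "mixed" one mentioned above, could be sketched as follows (the version numbers and the warn/reject behavior are illustrative, not an agreed rule):

```python
KNOWN_MAJOR, KNOWN_MINOR = 0, 5  # newest format version this reader understands

def check_version(format_version: str) -> str:
    """Mixed policy: reject newer majors, warn on newer minors, accept the rest."""
    major, minor = (int(x) for x in format_version.split(".")[:2])
    if major > KNOWN_MAJOR:
        return "reject"          # incompatible changes are allowed across majors
    if (major, minor) > (KNOWN_MAJOR, KNOWN_MINOR):
        return "warn"            # read, but newer optional fields may be ignored
    return "accept"

assert check_version("0.4") == "accept"
assert check_version("0.6") == "warn"
assert check_version("1.0") == "reject"
```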


For the specific case of adding the non-photon IDs, I do not think that the addition is backward incompatible, even under the second interpretation above. Prior to the addition of non-photon IDs, data from each detector ID was already only interpretable if it was specified somewhere what the detectors are (such as with spectral_ch1), with the possible exception of when there is only one detector ID. So in the 0.4 format, it was valid to have non-photon channels in the file -- except that there was no way for a reader to interpret them.

With the addition of the non-photon designation, old (0.4, or earlier 0.5dev) readers should still be able to read files containing non-photon channels, as long as they ignore the newly added fields. I think @harripd made a similar point specifically for FRETBursts in an email.

I do not have any issue with discouraging the use of non-photon channels in general, though. It remains more a feature for internal intermediate datastores than for final ("publication-quality") sharing, at least until we standardize ways to attach semantics (such as related to raster scanning) to these channels.

harripd commented 3 weeks ago

@marktsuchida I agree with your main points. Most importantly, the term backwards compatible was not precisely defined, and we need a concrete definition.


Upon giving it some thought, I think what we often call "backwards compatibility" is actually "forwards compatibility", so let me define them below:

Backward compatibility means that you can read and interpret files of previous versions using newer versions of the software. This is fairly easy, and more closely aligns with Mark's first principle.

Forward compatibility means that you can read and interpret files of newer versions using older versions of the software. This is much more difficult. In fact, full forward compatibility is impossible, as any new addition will add something that, unless it is entirely redundant (in which case, what is the point?), previous versions of the software will not know about and thus be unable to handle. Therefore forward compatibility inevitably means that new features are implemented in a "minimally breaking" way; in other words, as long as a particular file written in a new version doesn't require a new feature, it will be readable by older versions of the software. This is basically Mark's second definition.


Ideally we strive for both, insisting on backward compatibility and implementing forward compatibility as much as possible. Below I outline some more specific principles for how we can achieve this inside of photon-HDF5:

  1. New fields should always be optional/conditionally mandatory (i.e. mandatory only when a new feature is used in the particular experiment) in minor version updates; major version updates may make a new field mandatory.
  2. The data type (options) of a field will not change from version to version.
  3. Fields will not be removed.
  4. Whether or not a field is required must be implemented in a way consistent with previous versions. Again, this means:
    1. Any field introduced as mandatory will necessarily be mandatory in all future versions.
    2. For conditionally mandatory fields, if a set of conditions requires a field to be mandatory in a previous version, it will also be mandatory under those conditions in future versions.
    3. In cases where new features (usually in another field) are added, the new implementation should keep the field mandatory in all cases where it was mandatory when the new feature is ignored.
  5. Validators should be considered version specific; they cannot validate photon-HDF5 files of versions newer than what they were designed for. However, we can implement a "permissive/strict" option (or other similar name) determining whether or not the validator checks for unknown fields, and whether to throw an error or simply warn.
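Principle 5's permissive/strict switch could be sketched as follows (the field list is an illustrative subset, not the actual spec):

```python
import warnings

KNOWN_FIELDS = {"timestamps", "detectors", "nanotimes"}  # illustrative subset

def validate_fields(present, strict=False):
    """Check a set of field names against what this validator's version knows.

    strict=True  -> unknown fields raise an error
    strict=False -> unknown fields only produce a warning (permissive)
    """
    unknown = set(present) - KNOWN_FIELDS
    if unknown:
        msg = "unknown fields (newer format version?): %s" % sorted(unknown)
        if strict:
            raise ValueError(msg)
        warnings.warn(msg)
    return sorted(unknown)
```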

I also like Mark's point about major and minor versions. I suggest we follow how many software packages (including Python and NumPy) and file formats handle their versions: major versions (the number before the .) can introduce changes that break backwards compatibility (although we should try to minimize this), while minor versions should only add new features, i.e. they must be fully backwards compatible and as forwards compatible as possible.


Regarding non-photon IDs, as written they fully follow the above principles, and discouraging their inclusion further makes clear that they are internal, and intermediate, preparing for when we standardize more specific types of non-photon IDs like markers and sync signals etc.

Regarding Mark's point that they do not break either forwards or backwards compatibility, it is a bit of a gray area. Basically, according to the strict interpretation of v0.4, any detector ID not defined in spectral/polarization/split_chX was simply illegal. But in the way most code was implemented, as Mark said, the meaning of that ID is simply unknown, and so readers should simply disregard it. The latter is how most software was implemented. The biggest issue I could see happening is if someone wrote a "lazy" reader which assumed, say, that just spectral_ch1 and spectral_ch2 were present, and so made a simple boolean table by mask = detectors == spectral_ch1, in which case both spectral_ch2 and the undefined non-photon IDs become False. But I would point out that this could better be considered a bug or improper implementation of the v0.4 spec anyway.
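The "lazy reader" failure mode can be made concrete (all IDs below are invented):

```python
import numpy as np

detectors = np.array([0, 1, 0, 15, 1], dtype=np.uint8)  # record IDs
spectral_ch1, spectral_ch2 = 0, 1   # defined in measurement_specs
# ID 15 is an undefined/non-photon record in this hypothetical file

# "Lazy" v0.4 reader: a single boolean split.
# Everything that is not ch1 -- ch2 AND the non-photon record -- lands together:
lazy_ch2_mask = detectors != spectral_ch1

# Proper reader: one explicit mask per defined channel,
# disregarding IDs that match no definition:
ch2_mask = detectors == spectral_ch2
```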