Photon-HDF5 / photon-hdf5

Photon-HDF5 Reference Documentation
http://photon-hdf5.readthedocs.io/

Working around String Arrays #19

Closed smXplorer closed 9 years ago

smXplorer commented 9 years ago

Currently, the format has a single field of type string array, namely the dye_names field in the /sample group. This causes problems for LabVIEW users using the h5labview binding, since it currently has a bug which prevents reading arrays of strings. A possible workaround would be to store the information in a single string, with the convention that dye names are separated by commas (or semicolons). This convention could also be used for other string fields which are currently very general, such as sample_name or buffer_name (in the same /sample group). A buffer could conceivably be composed of different components, which could be specified as strings separated by commas (or semicolons). The same goes for the sample.
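
A minimal sketch of what this convention would mean for a writer and a reader, in Python. The helper names (join_names, split_names), the comma separator, and the dye names are assumptions made for illustration; none of them is part of the format.

```python
# Hypothetical helpers illustrating the single-string convention.
# The names, the separator and the sample dye names are assumptions.
SEPARATOR = ","

def join_names(names, sep=SEPARATOR):
    """Pack a list of names into one string; reject names containing sep."""
    for name in names:
        if sep in name:
            raise ValueError(f"name {name!r} contains the separator {sep!r}")
    return sep.join(names)

def split_names(packed, sep=SEPARATOR):
    """Unpack a single string into a list of names, trimming whitespace."""
    if isinstance(packed, bytes):  # HDF5 bindings often hand back bytes
        packed = packed.decode("utf-8")
    return [part.strip() for part in packed.split(sep) if part.strip()]

packed = join_names(["ATTO550", "ATTO647N"])
print(packed)
print(split_names(packed))
```

The writer has to refuse (or escape) names that contain the separator, which is the price of packing a list into one string.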

tritemio commented 9 years ago

Let me start by noting that these are metadata fields, not used to analyze the data. A user wanting to check one of these fields can always use HDFView.

Now to the point. For dye_names it may work as a workaround: it "encodes" a list of uniform items (dye names) inside a single string. However, it will add a burden for all the other languages.

For example, in Python, this code:

dye_names = h5.root.setup.dye_names.read()

would become:

dye_names_string = h5.root.setup.dye_names.read()
dye_names = dye_names_string.split(',')

In MATLAB it is even worse: you have to resort to a regular expression, so this:

dye_names = h5read(filename, '/setup/dye_names');

becomes:

dye_names_string = h5read(filename, '/setup/dye_names');
dye_names = regexp(dye_names_string, ',', 'split');

But, thinking about it, it is not only the additional line of code. It is that, in this way, we are breaking the crucial assumption that the data type is self-describing. We are adding an additional level of encoding, which is not a good thing.
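
One way to see the extra level of encoding is the following sketch. The dye names are made up for illustration, and the second one contains a comma on purpose. With an array of strings the number of items is part of the data type; with a packed string it depends on a convention the file itself does not carry.

```python
# Illustrative names only; the second one deliberately contains a comma.
names = ["ATTO550", "Alexa Fluor 647, NHS ester"]

packed = ",".join(names)        # single-string convention
recovered = packed.split(",")   # what a reader following the convention does

print(len(names), len(recovered))
print(recovered == names)
```

Here the reader gets back three items instead of two, and nothing in the stored string signals that anything went wrong.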

So let's discuss whether adding an exception for dye_names is worth it. I would definitely not do it for the other fields you mention, which are not logically "enumerations" or lists of items.

My point is, isolate the exception, don't build upon it.

smXplorer commented 9 years ago

OK, this is all very nice philosophically, but the reason I am bringing this up is that I have a problem I cannot solve. What about this philosophical bit? HDF5 is multi-platform, multi-language, etc., but only as far as it is supported. The bug I pointed out will not be fixed in the immediate future, and therefore LabVIEW will not be able to read the format as is. And there is of course work to do in LabVIEW too to interpret a CSV string. Practically, I think a series of names separated by a symbol is just as self-explanatory as an array of such words.

tritemio commented 9 years ago

@smXplorer, this is a genuinely technical point, not a philosophical one. As I said I'm open to your workaround for dye_names, not to making it a rule.