Read old style groups as UTF-8 - What does the spec say?

Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.

MIT License


Read old style groups as UTF-8 - What does the spec say? #61

Closed. Apollo3zehn closed this issue 4 months ago.

Apollo3zehn commented 5 months ago

https://github.com/jamesmudd/jhdf/issues/539#issuecomment-1923308452

https://github.com/jamesmudd/jhdf/pull/544/commits/14e939d697b2a90c8bdf392089859e29e3579654

jamesmudd commented 5 months ago

Yes, I thought about this as well. I couldn't see anything definitive in the spec. This is an example file where the names are clearly UTF-8 encoded (but I don't know how it was created). Pragmatically, I decided to change to UTF-8: since it's compatible with ASCII, I don't see any downside. It actually made me consider just switching to UTF-8 everywhere, even where the encoding is specifically defined in the spec and the file. I would be interested in what you think about this.
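To illustrate the compatibility argument, here is a minimal Python sketch (not code from jHDF or PureHDF; the example names are made up). It shows that pure ASCII bytes decode identically under both encodings, while a UTF-8 encoded name cannot be decoded as ASCII at all:

```python
# Hypothetical group names, chosen only for illustration.
ascii_bytes = "group_1".encode("ascii")
utf8_bytes = "größe".encode("utf-8")

# Every ASCII byte sequence is also valid UTF-8 with the same meaning.
same_result = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# The reverse does not hold: non-ASCII names break a strict ASCII decoder.
try:
    utf8_bytes.decode("ascii")
    ascii_decoder_ok = True
except UnicodeDecodeError:
    ascii_decoder_ok = False

print(same_result, ascii_decoder_ok)
```

So a reader that switches from ASCII to UTF-8 should lose nothing for files that really are ASCII, and gains the ability to read files like the one in the linked issue.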

Apollo3zehn commented 5 months ago

I am not sure why I originally thought that this could be an ASCII string :shrug:

I also can't find anything in the spec that says it's ASCII only, and I checked other parts of my code that rely on the "get local heap object name" function, and none of them assume ASCII. For example, in the spec, the External File List Slot has a field called Name Offset in Local Heap, and its field description makes no reference to ASCII or UTF-8 either. Another structure that stores a name on the local heap is the Symbolic Link Scratch-pad, and again, nothing is specified.

Additionally, the HDFView software can display that specific group name without problems. So to summarize, I think we are safe to assume that decoding the local heap content as UTF-8 is ok here.
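As a rough sketch of what "decoding the local heap content as UTF-8" means in practice, the following Python snippet reads a null-terminated name from a byte buffer at a given offset. The function name `read_local_heap_name` and the sample buffer are hypothetical and are not taken from PureHDF or jHDF; the only assumption about the format is that names in the local heap data segment are null-terminated.

```python
def read_local_heap_name(heap_data: bytes, offset: int) -> str:
    """Return the null-terminated name starting at `offset`, decoded as UTF-8."""
    end = heap_data.find(b"\x00", offset)
    if end == -1:
        # No terminator found: fall back to the end of the buffer.
        end = len(heap_data)
    return heap_data[offset:end].decode("utf-8")


# Made-up heap data segment: 8 padding bytes, then two padded names.
heap = (
    b"\x00" * 8
    + "data".encode("utf-8") + b"\x00" * 4       # offset 8, padded to 8 bytes
    + "größe".encode("utf-8") + b"\x00"          # offset 16, non-ASCII name
)

first = read_local_heap_name(heap, 8)
second = read_local_heap_name(heap, 16)
```

Because the decoder is UTF-8, the same function handles both the plain ASCII name and the non-ASCII one.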

[Image]

Apollo3zehn commented 5 months ago

And none of my tests fail with the change.

Apollo3zehn commented 5 months ago

> ... actually made me consider just switching to UTF8 everywhere even where the encoding is specifically defined in the spec and the file. Would be interested what you think on this?

I also thought about this but have not dared to do it yet. It would need some investigation first to detect possible problems. One might be that writing such files could lead to incompatibilities with the C library. However, just reading UTF-8 everywhere should be fine, I would say.
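The asymmetry between reading and writing could look roughly like the sketch below. It is only an illustration of the idea, not how either library is implemented: `decode_name` and `encode_name` are hypothetical helpers, and the constants assume the character set values used by the HDF5 format (0 for ASCII, 1 for UTF-8).

```python
# Assumed character set identifiers (0 = ASCII, 1 = UTF-8).
CHARSET_ASCII = 0
CHARSET_UTF8 = 1


def decode_name(raw: bytes) -> str:
    """Lenient read: always decode as UTF-8, whatever charset the file declares."""
    return raw.decode("utf-8")


def encode_name(name: str, charset: int) -> bytes:
    """Strict write: honor the declared charset so other readers are not surprised."""
    if charset == CHARSET_ASCII:
        # Raises UnicodeEncodeError if the name contains non-ASCII characters.
        return name.encode("ascii")
    if charset == CHARSET_UTF8:
        return name.encode("utf-8")
    raise ValueError(f"unknown character set: {charset}")


try:
    encode_name("größe", CHARSET_ASCII)
    strict_write_rejected = False
except UnicodeEncodeError:
    strict_write_rejected = True
```

Reading stays tolerant, while writing refuses to put non-ASCII bytes into a field that is declared as ASCII, which is the kind of output that might otherwise cause trouble for the C library.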