UTF8 names + \0 termination

HDFGroup / HDF.PInvoke

Raw HDF5 Power for .NET

http://www.hdfgroup.org/HDF5

UTF8 names + \0 termination #90

Closed hokb closed 8 years ago

hokb commented 8 years ago

Do all functions handling names as UTF8 byte[] sequences really expect the encoded name to end with \0?

Apparently something like this is working as well:

string Path = "寿司"; 
objID = H5O.open(File.ID, Encoding.UTF8.GetBytes(Path), H5P.DEFAULT);

But I am scratching my head as to why this works. Maybe, by chance, there is a 0 at the next byte in memory? Shouldn't one terminate the string manually?

objID = H5O.open(File.ID, Encoding.UTF8.GetBytes(Path + '\0'), H5P.DEFAULT);

Just out of interest: HDF5 path names do not have to be prefix-free, right? It is not as if an object is matched character by character until a first matching link is found, or something like that? If it were, we could get away without \0 termination. But I suspect the \0 is actually explicitly needed?

I couldn't find it in the documentation / specification.

gheber commented 8 years ago

Yes. I'm not sure if this is just dumb luck or if Encoding.UTF8.GetBytes or the Marshaler throw in a gratuitous \0, but it's expected.

I don't understand the second question. What do you mean by "prefix-free"?

hokb commented 8 years ago

This still puzzles me. 'Prefix-free' was just a shot in the dark, trying to explain why the missing 0 termination does not seem to have any negative effect. I am pretty sure it does not apply to HDF5 path / link names. But given that 0 termination is actually needed ... well, I don't know why all paths seem to work fine via the const char* name parameters, regardless of whether we apply the termination, Encoding.UTF8.GetBytes(name + '\0'), or not, Encoding.UTF8.GetBytes(name). Even without the \0 I never encountered an error. Somehow it is hard to believe in such great luck... :|

hokb commented 8 years ago

Encoding.UTF8.GetBytes does not introduce any 0. From the point of view of the encoder, 0 is a valid character and is encoded in the regular way when found. The encoder will not add any new characters to the encoded byte sequence; that would be just plain wrong. Nor is the marshaller involved: we are only passing a pointer to a byte[] array along. I suppose that on our managed heaps there is a good chance of arrays being created in a zeroed area. However, in order to be sure, we need to add the terminating 0 manually everywhere.

[Image: randomheapusedbygetbytes]
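To make "add the terminating 0 manually everywhere" less error-prone, the encoding step could be wrapped once. This is only a minimal sketch: Utf8Names and ToNullTerminatedUtf8 are hypothetical names and not part of HDF.PInvoke. It relies on the fact that .NET zero-initializes newly allocated arrays, so allocating one extra byte is enough to guarantee the terminator.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of HDF.PInvoke.
static class Utf8Names
{
    // Returns the UTF-8 encoding of 'name' followed by a single 0 byte.
    public static byte[] ToNullTerminatedUtf8(string name)
    {
        if (name == null) throw new ArgumentNullException(nameof(name));

        int count = Encoding.UTF8.GetByteCount(name);

        // New arrays are zero-initialized, so the extra last byte is already 0.
        byte[] buffer = new byte[count + 1];

        // Encode directly into the buffer; the final byte is left untouched.
        Encoding.UTF8.GetBytes(name, 0, name.Length, buffer, 0);
        return buffer;
    }
}
```

With such a helper, the call from the first comment would read H5O.open(File.ID, Utf8Names.ToNullTerminatedUtf8(Path), H5P.DEFAULT). For "寿司" the plain Encoding.UTF8.GetBytes result has 6 bytes, while the helper returns 7, the last one being 0.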

DSanchen commented 8 years ago

Just a guess: maybe the Marshaler is smart enough to zero out a large enough area (+1) for the required number of bytes before it hands out the pointer to that area? But adding our own terminating 0 would of course be safer...

hokb commented 8 years ago

The strange thing is that byte is blittable. A byte[] array will not be copied by the marshaller unless you explicitly specify that it should be. It is basically the same as having byte* as the parameter; the two are in fact interchangeable. Without blittable types we wouldn't be able to interface with unmanaged libraries efficiently (see JNI).
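The "no copy" point can be illustrated without calling into HDF5 at all. The sketch below pins a byte[] by hand with GCHandle, which is assumed here to be comparable to what the runtime does for a blittable array during a P/Invoke call, and writes through the raw pointer. BlittableDemo and its method are hypothetical names used only for illustration.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical demo class, not part of HDF.PInvoke.
static class BlittableDemo
{
    // Returns true if a write through the pinned pointer is visible in the
    // managed array, i.e. the pointer addresses the array's own memory.
    public static bool NativeViewSharesManagedMemory()
    {
        byte[] name = Encoding.UTF8.GetBytes("abc");

        // Pin the array so the GC cannot move it while we hold a raw pointer.
        GCHandle handle = GCHandle.Alloc(name, GCHandleType.Pinned);
        try
        {
            IntPtr p = handle.AddrOfPinnedObject();

            // Write through the unmanaged view of the array.
            Marshal.WriteByte(p, 0, (byte)'X');

            // The managed array sees the change: no intermediate copy exists.
            return name[0] == (byte)'X';
        }
        finally
        {
            handle.Free();
        }
    }
}
```

Because the native side sees exactly the bytes of the array and nothing more, whatever follows the last element is ordinary heap content, which is why the terminator has to be part of the array itself.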