bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

[hdf5] Reading with BytePointer stops on 0 values (zeros) #1311

Closed calvertdw closed 9 months ago

calvertdw commented 1 year ago

When 0s are in the data, I find I must use a ShortPointer to get the full data, and I'm not sure why. The first test case passes (native bytes can be written and read correctly); the second test fails.

import org.bytedeco.hdf5.*;
import org.bytedeco.hdf5.global.hdf5;
import org.bytedeco.javacpp.*;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

import java.nio.ByteOrder;
import java.util.Arrays;

public class HDF5TestForReport
{
   static
   {
      System.out.println(String.format("Native byte order: %s", ByteOrder.nativeOrder()));
   }

   @Test // Passes
   public void testBytesWithoutZeros()
   {
      String filePath = "NativeBytesWithoutZeros.hdf5";
      H5File h5File = new H5File(filePath, hdf5.H5F_ACC_TRUNC);
      String datasetId = "bytes";

      byte[] writeData = new byte[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'f',
                                      -2, -1, 1, 2, 3, 4, 5, Byte.MIN_VALUE, Byte.MAX_VALUE };
      BytePointer dataPointer = new BytePointer(writeData);

      int rank = 1;
      long[] dimensions = { dataPointer.limit() };
      DataSpace fileSpace = new DataSpace(rank, dimensions);
      DataType dataType = new DataType(PredType.NATIVE_B8());
      DataSet dataSet = h5File.createDataSet(datasetId, dataType, fileSpace);

      dataSet.write(dataPointer, dataType);

      dataSet.close();
      fileSpace.close();
      h5File.close();

      h5File = new H5File(filePath, hdf5.H5F_ACC_RDONLY);
      dataSet = h5File.openDataSet(datasetId);

      dataPointer = new BytePointer(writeData.length);
      dataSet.read(dataPointer, dataType);

      byte[] readData = new byte[(int) dataPointer.limit()];
      dataPointer.get(readData);

      dataSet.close();
      h5File.close();

      System.out.printf("Wrote: %s%n", Arrays.toString(writeData));
      System.out.printf("Read:  %s%n", Arrays.toString(readData));
      Assertions.assertArrayEquals(writeData, readData);
   }

   @Test
   public void testBytesWithZeros()
   {
      String filePath = "NativeBytesWithZeros.hdf5";
      H5File h5File = new H5File(filePath, hdf5.H5F_ACC_TRUNC);
      String datasetId = "bytes";

      // There is a 0 in here that causes the problem
      byte[] writeData = new byte[] { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'f',
                                      0, -2, -1, 1, 0, 2, 3, 4, Byte.MIN_VALUE, Byte.MAX_VALUE };
      BytePointer dataPointer = new BytePointer(writeData);

      byte[] beforeWriteData = new byte[(int) dataPointer.limit()];
      dataPointer.get(beforeWriteData);
      System.out.printf("Before: %s%n", Arrays.toString(beforeWriteData));
      System.out.printf("Writing %s bytes%n", writeData.length);

      int rank = 1;
      long[] dimensions = { dataPointer.limit() };
      DataSpace fileSpace = new DataSpace(rank, dimensions);
      DataType dataType = new DataType(PredType.NATIVE_B8());
      DataSet dataSet = h5File.createDataSet(datasetId, dataType, fileSpace);

      dataSet.write(dataPointer, dataType);
      System.out.printf("Wrote:  %s%n", Arrays.toString(writeData));

      dataSet.close();
      fileSpace.close();
      h5File.close();

      h5File = new H5File(filePath, hdf5.H5F_ACC_RDONLY);
      dataSet = h5File.openDataSet(datasetId);

      // Workaround wrapping with ShortPointer
      dataPointer = new BytePointer(writeData.length);
      dataSet.read(new ShortPointer(dataPointer), dataType);
      System.out.printf("Read    %s bytes%n", dataPointer.limit());
      byte[] workaroundReadData = new byte[(int) dataPointer.limit()];
      dataPointer.get(workaroundReadData);
      System.out.printf("Read:   %s%n", Arrays.toString(workaroundReadData));

      // I would expect this to work, but it stops before the 0 value
      dataPointer = new BytePointer(writeData.length);
      dataSet.read(dataPointer, dataType);                            // <-- Problem line
      System.out.printf("Read    %s bytes%n", dataPointer.limit());
      byte[] readData = new byte[(int) dataPointer.limit()];
      dataPointer.get(readData);
      System.out.printf("Read:   %s%n", Arrays.toString(readData));

      dataSet.close();
      h5File.close();

      Assertions.assertArrayEquals(writeData, workaroundReadData); // Passes
      Assertions.assertArrayEquals(writeData, readData);                    // Fails
   }
}
saudet commented 1 year ago

That sounds like an issue with the API of HDF5? I mean, a different function gets called, right?

calvertdw commented 1 year ago

I'm having trouble tracking it down. That read call is:

public native void read(Pointer buf, @Const @ByRef DataType mem_type);

So, they are calling the same method since they both extend Pointer. I'm not sure what lives on the other side of that native call.

calvertdw commented 1 year ago

Oh, actually, there is also a

public native void read(@StdString @ByRef BytePointer buf, @Const @ByRef DataType mem_type);

So that one is a different method. All other pointers would be calling the Pointer one.

calvertdw commented 1 year ago

Would the @StdString cause some kind of null termination issue? Maybe we should remove that.

calvertdw commented 1 year ago

Ah, yeah. In hdf5's H5DataSet.cpp, there are two functions:

void DataSet::read(void *buf, const DataType &mem_type, const DataSpace &mem_space,
              const DataSpace &file_space, const DSetMemXferPropList &xfer_plist) const

and

void DataSet::read(H5std_string &strg, const DataType &mem_type, const DataSpace &mem_space,
              const DataSpace &file_space, const DSetMemXferPropList &xfer_plist) const

So when we use a BytePointer, we are accidentally calling the string one.

Note this would also be an issue for the corresponding write methods, which somehow work, but we should still fix it.
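
To illustrate the mechanism, here is a minimal standalone sketch (with hypothetical read stubs, not the actual bindings): Java overload resolution prefers the more specific BytePointer parameter over Pointer, so the @StdString variant wins unless the argument is upcast.

import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.Pointer;

public class OverloadResolutionDemo
{
   // Hypothetical stand-ins for the two generated read(...) overloads above.
   static void read(Pointer buf)     { System.out.println("raw void* overload"); }
   static void read(BytePointer buf) { System.out.println("@StdString overload"); }

   public static void main(String[] args)
   {
      BytePointer data = new BytePointer(16);
      read(data);           // resolves to read(BytePointer): the string overload
      read((Pointer) data); // upcast forces read(Pointer): the raw data overload
   }
}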

calvertdw commented 1 year ago

Btw, until this gets fixed, for writing/reading all you have to do is cast to Pointer to get it to call the correct method, like this:

dataSet.read((Pointer) dataPointer, dataType); 

No need to muck around with ShortPointer or anything.
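
The same cast works on the write side too; a minimal sketch based on the tests above:

dataSet.write((Pointer) dataPointer, dataType); // dispatches to write(Pointer, DataType), not the @StdString overload
dataSet.read((Pointer) dataPointer, dataType);  // likewise for reading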

saudet commented 1 year ago

Yes, that's how we need to do it. What kind of fix do you have in mind?

calvertdw commented 1 year ago

So I was thinking of re-mapping the read(H5std_string &strg, ...) and write(H5std_string &strg, ...) C++ signatures to readString(String, ...) and writeString(String, ...) on the Java side. Is there a way to do that? (A Java-side readString(BytePointer, ...) would be just fine as well.)

saudet commented 1 year ago

Yeah, it should be possible, see https://github.com/bytedeco/javacpp/wiki/Mapping-Recipes#specifying-names-to-use-in-java, but that's a problem with the C++ API. If you call read() or write() with a char* in C++, you're probably going to end up calling the "wrong" function as well.

saudet commented 1 year ago

In my opinion, that's something you should be reporting upstream: https://www.hdfgroup.org/community/community-development/

calvertdw commented 1 year ago

Good thing we have a good workaround; I could see that one taking a while to get changed :laughing:.

calvertdw commented 1 year ago

I reported a bug there: https://jira.hdfgroup.org/browse/HDFVIEW-297

mkitti commented 1 year ago

@calvertdw why not create a Github issue here: https://github.com/HDFGroup/hdf5/issues ?

calvertdw commented 1 year ago

Great point. The other was getting no activity. Done!

calvertdw commented 1 year ago

@saudet It seems that this won't be addressed upstream unfortunately.

Would the re-mapping solution mentioned in the comment above work?

saudet commented 1 year ago

We can add those as helper methods if you want. Sounds good?

mkitti commented 1 year ago

Are the strings encoded according to Java's modified UTF-8 encoding? https://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

If so, isn't the correct way to read this via the JNI function NewStringUTF? https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#NewStringUTF

That's how hdf.hdf5lib.H5Dread_string is implemented. So a simple implementation should just be passing the id field of the C++ H5DataSet to hdf.hdf5lib.H5Dread_string, right? https://github.com/HDFGroup/hdf5/blob/81bc34ac4c3f42532edee32095651f6bcd5e55a2/java/src/jni/h5dImp.c#L993-L1054

calvertdw commented 1 year ago

@mkitti You are misunderstanding the issue I was having. I was trying to read raw data (encoded JPEG images) from an HDF5 file. I had a BytePointer JavaCPP object and called the HDF5 read method. What I got was only a handful of bytes, up to the first '0' byte. This was a major head scratcher, and I had to dig in to find out I was accidentally calling the String function.

The problem is that HDF5 makes reading/writing literally just raw data risky by inviting the opportunity to accidentally call a version of read that truncates at the first '0'. As that is the most basic of data storage functions, I am a bit disappointed that HDF5 is not super clear about what those functions do, nor about the data truncation in the string case. There is not even documentation on those most basic functions in the code, which is what I would read as a user of the C++ API.

mkitti commented 1 year ago

I think https://github.com/HDFGroup/hdf5/pull/3083 still fixes the situation. Even if you read it as a std::string by accident that fix will read the entire JPEG into the std::string. It will not stop at the first null byte since we no longer use the C string convention.

saudet commented 1 year ago

@mkitti Sounds like a fix to me. When does it get released?

mkitti commented 1 year ago

Well it got merged into the main development branch, so now it is just a matter of the HDF5 release schedule in their README. https://github.com/HDFGroup/hdf5/blob/develop/doc/img/release-schedule.png

calvertdw commented 1 year ago

It's still not the fix I was hoping for, because the correct way to read/write bytes is to use the standard read and write functions. The C++ code does seem reasonable; it seems like the issue is now that JavaCPP loses the type difference between a String and a ByteBuffer by using BytePointer for both. So any C++ library with signatures where the only difference is a void* vs. a std::string would have the same issue.

calvertdw commented 1 year ago

Even when continuing to use the string version, I think users would still sometimes get errors. From the changelog of https://github.com/HDFGroup/hdf5/pull/3083:

Fixed length datasets are now read into H5std_string as a fixed length string of the appropriate size. Variable length datasets will still be truncated at the first null byte.

A JPEG image is variable length, after all.

saudet commented 1 year ago

It still sounds like an issue with HDF5 to me. The read() function has no idea how much memory was allocated in the pointer it was given, so that's a safety issue. There should be a size parameter somewhere to indicate the maximum size. That's not an issue with std::string since that can allocate as much memory as we need.

mkitti commented 1 year ago

A JPEG image is variable length, after all.

How so? Its length is not determined by some token in the byte sequence.

When you write the bytes of the image, you tell HDF5 how many bytes you want to write.

mkitti commented 1 year ago

The read() function has no idea how much memory was allocated in the pointer it was given, so that's a safety issue.

The memory space argument is meant to indicate how many bytes have been allocated.

See the docs for the C function H5Dread.

https://docs.hdfgroup.org/hdf5/v1_14/group___h5_d.html#ga8287d5a7be7b8e55ffeff68f7d26811c

saudet commented 1 year ago

Ok, so inversely, why do we need that memory space argument for std::string?

mkitti commented 1 year ago

Ok, so inversely, why do we need that memory space argument for std::string?

You do not need it in C++. It defaults to DataSpace::ALL.

https://github.com/hdfgroup/hdf5/blob/02c71bd65f3b9d62b2be167eda3adf80668f0ee4/c%2B%2B/src/H5DataSet.h#L79-L84

That appears to get translated to this via JavaCPP:

public void read(@StdString @ByRef BytePointer buf, @Const @ByRef DataType mem_type)

http://bytedeco.org/javacpp-presets/hdf5/apidocs/org/bytedeco/hdf5/DataSet.html#read-org.bytedeco.javacpp.BytePointer-org.bytedeco.hdf5.DataType-

However, you can specify a file dataspace if you only want to read some of the bytes, and copy them into a specific location in a buffer via the memory space.
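
As a hedged sketch (assuming the generated read(Pointer, DataType, DataSpace, DataSpace) overload, the C++-style selectHyperslab(op, count, start) signature, and the hdf5.H5S_SELECT_SET constant are exposed as shown), reading only the first 8 bytes of the dataset from the tests above could look like this:

long[] count = { 8 };
long[] offset = { 0 };

DataSpace memorySpace = new DataSpace(1, count);                // describes the 8-element destination buffer
DataSpace fileSpace = dataSet.getSpace();                       // full extent of the dataset on disk
fileSpace.selectHyperslab(hdf5.H5S_SELECT_SET, count, offset);  // select the first 8 elements in the file

BytePointer buffer = new BytePointer(count[0]);
dataSet.read((Pointer) buffer, new DataType(PredType.NATIVE_B8()), memorySpace, fileSpace);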

saudet commented 11 months ago

BTW, does it make sense to map H5std_string to BytePointer at all? Or should that be reserved for char*?

mkitti commented 11 months ago

BytePointer does not make very much sense to me as a peer to std::string.

Perhaps java.lang.StringBuffer or StringBuilder would be better?

saudet commented 11 months ago

There isn't anything interesting about StringBuilder from a JNI point of view; we might as well just pass the internal char[] in that case, which already works with std::u16string, but I don't think HDF5 supports that. I think it only supports std::string, which doesn't support UTF-16, so that's probably not an interesting idea either.

calvertdw commented 11 months ago

It seems like JavaCPP should have a new type to match with a std::string, then? I know that'd be quite some work.

mkitti commented 11 months ago

Java has String, StringBuffer, and StringBuilder.

std::string is mutable and not synchronized.

saudet commented 11 months ago

It seems like JavaCPP should have a new type to match with a std::string, then? I know that'd be quite some work.

No, it's not difficult. I've added support for that in commit https://github.com/bytedeco/javacpp/commit/7227ec6846cfba24eb4ccf18a7945f6f8cd45ccd, so we can add something like that to HDF5's presets:

infoMap.put(new Info("std::basic_string<char>", "std::string", "H5std_string").pointerTypes("H5std_string").define());

But it's not exactly a nice (efficient) interface for a string and that replaces all instances of std::string... Probably not what you want.

Java has String, StringBuffer, and StringBuilder.

We can access String efficiently from JNI, not so with StringBuffer or StringBuilder.

calvertdw commented 10 months ago

That looks great! That sounds like it will absolutely solve this issue, since this is about saving raw data with HDF5 and avoiding accidental calls to the string functions.

saudet commented 10 months ago

Looking at this again, it seems like everything works the way we want when casting explicitly to Pointer. If we don't cast, it just goes through a std::string, which might be less efficient, but it still works correctly with the latest fixes in HDF5, so I don't see any problems with the way things are now. Can we close this? @calvertdw

calvertdw commented 9 months ago

Yes. Thank you!!