deshaw / versioned-hdf5

Versioned HDF5 provides a versioned abstraction on top of h5py
https://deshaw.github.io/versioned-hdf5/

Fix chunk reuse verification for string dtype arrays #348

Closed peytondmurray closed 3 months ago

peytondmurray commented 3 months ago

This PR fixes an issue with string datasets where reused chunks were not correctly verified.

Previously, chunks that had already been written to the dataset and were then reused contained bytes elements, whereas chunks that were still pending a write but being reused (e.g. by another chunk in the same pending write operation) could contain str elements, which broke the array comparison. With this change, both the chunk that the user is trying to write and the chunk to be reused are coerced to object dtype arrays of bytes before the comparison is performed.
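The coercion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: `as_bytes_array` and `chunks_match` are hypothetical names, and UTF-8 is assumed as the encoding.

```python
import numpy as np


def as_bytes_array(arr, encoding="utf-8"):
    """Hypothetical helper: object-dtype copy of arr with every str encoded to bytes."""
    arr = np.asarray(arr, dtype=object)
    out = np.empty(arr.shape, dtype=object)
    # ndenumerate visits every element regardless of the number of dimensions.
    for idx, value in np.ndenumerate(arr):
        out[idx] = value.encode(encoding) if isinstance(value, str) else value
    return out


def chunks_match(new_chunk, reused_chunk):
    """Hypothetical check: compare two chunks after coercing both to bytes."""
    a = as_bytes_array(new_chunk)
    b = as_bytes_array(reused_chunk)
    return a.shape == b.shape and bool(np.all(a == b))


# A pending chunk may hold str, while a chunk read back from the file holds bytes.
pending = np.array(["foo", "bar"], dtype=object)
stored = np.array([b"foo", b"bar"], dtype=object)

print(bool(np.all(pending == stored)))  # False: str and bytes never compare equal
print(chunks_match(pending, stored))    # True once both sides are bytes
```

Coercing both sides, rather than only the pending chunk, means the comparison does not depend on which of the two happens to contain str elements.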

Additionally, multidimensional string datasets are now verified correctly. Closes #339 and closes #338.

peytondmurray commented 3 months ago

No, I think the call to vectorize should broadcast across all dimensions. I think the exception in that issue happens because of the way that we detect whether we need to cast each element of the array as a bytes object. We can't use the dtype of the array because string arrays are read out of the file as object dtype arrays, so instead we do this by looking at the type of the first element of the array:

if len(arr) > 0 and isinstance(arr.flatten()[0], bytes):
#                                     ^
#                            multidimensional datasets need to be flattened first!

Previously we just weren't flattening the multidimensional arrays, so the first element was a sub-array rather than a bytes object and the check came out false. That meant we ended up trying to call bytes on a bytes object, which fails.
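A small, self-contained illustration of why the flatten matters. This is a sketch, not the library's code: `first_element_is_bytes` is a hypothetical name, the `bytes(x, "utf-8")` call is an assumption about how the conversion might look, and `arr.size` is used here in place of `len(arr)`.

```python
import numpy as np


def first_element_is_bytes(arr):
    """Hypothetical check: True if arr is non-empty and its first scalar is bytes."""
    arr = np.asarray(arr, dtype=object)
    return arr.size > 0 and isinstance(arr.flatten()[0], bytes)


arr_2d = np.array([[b"a", b"b"], [b"c", b"d"]], dtype=object)

# Without flattening, indexing a 2-D array yields a row (an ndarray), not a scalar.
print(isinstance(arr_2d[0], bytes))            # False
print(isinstance(arr_2d.flatten()[0], bytes))  # True

# If the check wrongly reports "not bytes", a conversion such as bytes(x, "utf-8")
# (assumed form) is applied to elements that are already bytes and raises.
try:
    bytes(arr_2d[0, 0], "utf-8")
    conversion_failed = False
except TypeError:
    conversion_failed = True

# np.vectorize applies the function elementwise and keeps the input shape,
# so the encoding step itself handles multidimensional arrays.
str_2d = np.array([["a", "b"], ["c", "d"]], dtype=object)
to_bytes = np.vectorize(lambda s: s.encode("utf-8"), otypes=[object])
encoded = to_bytes(str_2d)
print(encoded.shape)  # (2, 2)
```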