Open aldanor opened 4 years ago
Test file:
HDF5 "test.h5" {
GROUP "/" {
DATASET "a1" {
DATATYPE H5T_STRING {
STRSIZE 26;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 5 ) / ( 5 ) }
DATA {
(0): "abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
(2): "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
(3): "123\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
(4): "a\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
}
}
DATASET "a2" {
DATATYPE H5T_STRING {
STRSIZE 3;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
DATA {
(0): "abc", "1\000\000", "\000\000\000", "23\000"
}
}
}
}
Prototype:
use std::alloc;
use std::error::Error;
use std::mem;
use std::slice;
use std::str;
use libc::{c_void, size_t};
use ndarray::Array1;
use hdf5_sys::h5::herr_t;
use hdf5_sys::h5d::H5Dread;
use hdf5_sys::h5i::hid_t;
use hdf5_sys::*;
const STRING_SIZE: usize = mem::size_of::<String>();
extern "C" fn conv_func(
    src_id: hid_t, dst_id: hid_t, cdata: *mut h5t::H5T_cdata_t, nelmts: size_t, buf_stride: size_t,
    _bkg_stride: size_t, buf: *mut c_void, _bkg: *mut c_void, _dset_xfer_plist: hid_t,
) -> herr_t {
    // TODO: the accepted function pointer should be unsafe by default
    unsafe {
        // check examples in H5Tconv.c, e.g. H5T__conv_s_s
        match (*cdata).command {
            h5t::H5T_CONV_INIT => {
                // initialization, checks, etc - ignore for now
                (*cdata).need_bkg = h5t::H5T_BKG_NO;
            }
            h5t::H5T_CONV_FREE => {
                // nothing to do here
            }
            h5t::H5T_CONV_CONV => {
                // nothing to convert; also avoids `nelmts - 1` underflowing below
                if nelmts == 0 {
                    return 0;
                }
                let buf = buf as *mut u8;
                let nullterm = match h5t::H5Tget_strpad(src_id) {
                    h5t::H5T_STR_NULLTERM | h5t::H5T_STR_NULLPAD => true,
                    h5t::H5T_STR_SPACEPAD => false,
                    // report failure instead of panicking across the FFI boundary
                    _ => return -1,
                };
                let src_size = h5t::H5Tget_size(src_id) as usize;
                let dst_size = h5t::H5Tget_size(dst_id) as usize;
                // walk forward when elements shrink, backward when they grow, so that
                // writing a destination never overwrites a source that is still unread
                let (dir, mut src_buf, mut dst_buf) = if src_size >= dst_size {
                    (1, buf, buf)
                } else {
                    let k = nelmts - 1;
                    (-1, buf.offset((k * src_size) as _), buf.offset((k * dst_size) as _))
                };
                let src_stride =
                    dir * (if buf_stride == 0 { src_size } else { buf_stride }) as isize;
                let dst_stride =
                    dir * (if buf_stride == 0 { dst_size } else { buf_stride }) as isize;
                for _ in 0..nelmts {
                    let mut len = src_size;
                    if nullterm {
                        // technically, nullpad has to be handled differently, but that's
                        // how it's done in the HDF5 library itself (H5T__conv_s_s in H5Tconv.c)
                        for i in 0..src_size {
                            if *src_buf.offset(i as _) == b'\0' {
                                len = i;
                                break;
                            }
                        }
                    } else {
                        // a field holding nothing but padding is an empty string
                        len = 0;
                        for i in (0..src_size).rev() {
                            if *src_buf.offset(i as _) != b' ' {
                                len = i + 1;
                                break;
                            }
                        }
                    }
                    // alternatively, could use str::from_utf8_unchecked()?
                    // NOTE: unwrap() panics on non-UTF-8 input; a real implementation
                    // should report an error rather than panic inside an extern "C" fn
                    let s =
                        str::from_utf8(slice::from_raw_parts(src_buf, len)).unwrap().to_string();
                    // copy the String body (ptr, capacity, length) into the buffer and
                    // forget the original so the heap data is not freed here
                    libc::memcpy(dst_buf as _, &s as *const _ as _, STRING_SIZE);
                    mem::forget(s);
                    src_buf = src_buf.offset(src_stride);
                    dst_buf = dst_buf.offset(dst_stride);
                }
            }
        }
    }
    0
}
fn main() -> Result<(), Box<dyn Error>> {
    unsafe {
        assert!(h5::H5open() >= 0);
        let type_id = h5t::H5Tcreate(h5t::H5T_OPAQUE, STRING_SIZE as _);
        assert!(type_id >= 0);
        assert!(h5t::H5Tset_tag(type_id, "rust::String\0".as_ptr() as *const _) >= 0);
        // let h5_type_id = h5t::H5Tcreate(h5t::H5T_STRING, 1);
        let h5_type_id = *hdf5::globals::H5T_C_S1;
        assert!(h5_type_id >= 0);
        assert!(
            h5t::H5Tregister(
                h5t::H5T_PERS_SOFT,
                "H5T_C_S1->rust::String\0".as_ptr() as _,
                h5_type_id,
                type_id,
                Some(conv_func),
            ) >= 0
        );
        let file =
            h5f::H5Fopen("test.h5\0".as_ptr() as *const _, h5f::H5F_ACC_RDONLY, h5p::H5P_DEFAULT);
        assert!(file >= 0);
        for name in &["a1\0", "a2\0"] {
            // the names carry a trailing NUL for the C API; strip it when printing
            println!("{}:", name.trim_end_matches('\0'));
            let ds = h5d::H5Dopen2(file, name.as_ptr() as *const _, h5p::H5P_DEFAULT);
            assert!(ds >= 0);
            let space = h5d::H5Dget_space(ds);
            assert!(space >= 0);
            let npoints = h5s::H5Sget_simple_extent_npoints(space);
            assert!(npoints >= 0);
            let npoints = npoints as usize;
            // alloc::alloc must not be called with a zero-sized layout
            assert!(npoints > 0);
            let layout =
                alloc::Layout::from_size_align(npoints * STRING_SIZE, mem::align_of::<String>())?;
            let buf = alloc::alloc(layout);
            assert!(!buf.is_null());
            assert!(
                H5Dread(ds, type_id, h5s::H5S_ALL, h5s::H5S_ALL, h5p::H5P_DEFAULT, buf as _,) >= 0
            );
            // the buffer now holds `npoints` String bodies; hand ownership to a Vec
            let vec = Vec::<String>::from_raw_parts(buf as _, npoints, npoints);
            println!("{:#?}", vec);
            let arr: Array1<String> = Array1::from_shape_vec_unchecked(npoints, vec);
            println!("{:#?}", arr);
        }
    }
    Ok(())
}
Output:
a1:
[
"abcdefghijklmnopqrstuvwxyz",
"ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"",
"123",
"a",
]
["abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "", "123", "a"], shape=[5], strides=[1], layout=C | F (0x3), const ndim=1
a2:
[
"abc",
"1",
"",
"23",
]
["abc", "1", "", "23"], shape=[4], strides=[1], layout=C | F (0x3), const ndim=1
TLDR: we have an HDF5 dataset with type |S26 and we read it directly into a Vec<String> and it sort of seems to work.
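The padding rule that the prototype's callback applies before building each String can be shown on its own in safe Rust. This is only a sketch: the Pad enum and fixed_str_len are hypothetical names standing in for the H5T_STR_* constants and the inline loops, and it mirrors the prototype in treating null-padded data like null-terminated data (stop at the first NUL).

```rust
/// Hypothetical stand-in for the H5T_STR_* padding constants.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Pad {
    NullTerm,
    NullPad,
    SpacePad,
}

/// Logical length of one fixed-size string field.
fn fixed_str_len(bytes: &[u8], pad: Pad) -> usize {
    match pad {
        // Stop at the first NUL; a field without any NUL uses its full width.
        Pad::NullTerm | Pad::NullPad => {
            bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len())
        }
        // Drop trailing spaces; a field of only spaces is an empty string.
        Pad::SpacePad => bytes
            .iter()
            .rposition(|&b| b != b' ')
            .map_or(0, |i| i + 1),
    }
}
```

For the a2 dataset above, fixed_str_len(b"23\0", Pad::NullPad) gives 2 and an all-NUL field gives 0, which matches the "23" and "" entries in the output.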
@magnusuMET There you go as promised ^ 😄
Just verified, the conversion routine indeed runs chunk by chunk. So, if you're converting a dataset with 1K strings but the chunk size is 100, you will allocate memory for at most 1100 strings at a time (this is the advantage over a "read all, then convert" approach).
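The arithmetic behind that figure, as a tiny back-of-the-envelope model. It is an assumption about how to count (the full output buffer plus one chunk's worth of conversion scratch space), not a description of HDF5's actual allocator, and both function names are made up for illustration.

```rust
/// Rough model: the destination holds every element, and the conversion
/// works on at most one chunk of extra elements at a time.
fn peak_elements_chunked(total: usize, chunk: usize) -> usize {
    total + chunk.min(total)
}

/// Rough model of "read all, then convert": raw data and converted data
/// both exist in full at the same moment.
fn peak_elements_read_all(total: usize) -> usize {
    2 * total
}
```

With 1000 strings and a chunk size of 100 this gives 1100 for the chunked path versus 2000 for read-all-then-convert.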
@aldanor That is some really great stuff! So it sort of acts as an in-place conversion? Nasty trick of copying the layout of the String :+1:
Yeah, it is in-place in the sense that the String body (pretty hefty, 24 B) is generated in place; obviously not the heap data it points to.
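To make the "in place" part concrete without HDF5, here is a minimal sketch of the same idea. Packed NUL-padded fixed-size strings sitting at the front of a byte buffer are overwritten with String bodies, walking backward when the String body is wider than the source element so that unread input is never clobbered. decode_fixed_strings is a hypothetical helper, and from_utf8_lossy is swapped in for the prototype's from_utf8(..).unwrap() to keep the sketch panic-free.

```rust
use std::mem;
use std::ptr;

/// Decode packed, NUL-padded strings of `src_size` bytes each.
fn decode_fixed_strings(packed: &[u8], src_size: usize) -> Vec<String> {
    assert!(src_size > 0 && packed.len() % src_size == 0);
    let n = packed.len() / src_size;
    let dst_size = mem::size_of::<String>();

    // One buffer, big enough for whichever representation is wider.
    let mut buf = vec![0u8; n * src_size.max(dst_size)];
    buf[..packed.len()].copy_from_slice(packed);

    // Forward when elements shrink (or stay the same), backward when they grow.
    let forward = src_size >= dst_size;
    for k in 0..n {
        let i = if forward { k } else { n - 1 - k };
        let src = &buf[i * src_size..(i + 1) * src_size];
        let len = src.iter().position(|&b| b == 0).unwrap_or(src_size);
        let s = String::from_utf8_lossy(&src[..len]).into_owned();
        // SAFETY: slot `i` lies inside `buf`; the byte buffer has no alignment
        // guarantee, hence the unaligned write. Ownership of `s` moves into the slot.
        unsafe {
            ptr::write_unaligned(buf.as_mut_ptr().add(i * dst_size) as *mut String, s);
        }
    }

    // SAFETY: each slot holds exactly one String body and is read exactly once;
    // dropping `buf` afterwards frees only the raw bytes.
    (0..n)
        .map(|i| unsafe { ptr::read_unaligned(buf.as_ptr().add(i * dst_size) as *const String) })
        .collect()
}
```

Feeding it the 12 bytes of the a2 dataset with src_size = 3 should give back "abc", "1", "", "23". Only the bodies live in the buffer; the character data is on the heap, exactly as in the prototype.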
One could argue it's not the most efficient way of doing things etc, but given that it allows you to map directly to Rust types, I think convenience outweighs everything else. Typically, if you want performance, you won't be using strings at all in the first place :)
Note also that this would automatically work for structs as well: any String field wrapped in a struct or array would automatically be decoded in place.
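A rough illustration of what "decoded in place" means for a struct field, again without HDF5. Record and build_record are hypothetical; the point is only that a String field can be written straight into its slot inside a not-yet-initialized struct, the same way the conversion callback would fill that part of a compound element.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Hypothetical compound type with a String member.
#[repr(C)]
#[derive(Debug)]
struct Record {
    id: u32,
    name: String,
}

/// Fill a Record field by field; `raw_name` is a NUL-padded fixed-size string.
fn build_record(id: u32, raw_name: &[u8]) -> Record {
    let mut slot = MaybeUninit::<Record>::uninit();
    let p = slot.as_mut_ptr();
    let len = raw_name.iter().position(|&b| b == 0).unwrap_or(raw_name.len());
    let name = String::from_utf8_lossy(&raw_name[..len]).into_owned();
    // SAFETY: both fields are written exactly once before assume_init, and
    // addr_of_mut! avoids creating references to uninitialized memory.
    unsafe {
        ptr::addr_of_mut!((*p).id).write(id);
        ptr::addr_of_mut!((*p).name).write(name);
        slot.assume_init()
    }
}
```

For example, build_record(7, b"ab\0\0") yields a Record with id 7 and name "ab".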
Just to add to the above so I don't forget: we could totally do something like that (I could probably take that up once the dust settles over the current blockers), BUT this will require splitting H5Type into H5Read and H5Write. I.e., you can write &str or String but you can only read String.
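A sketch of what such a split could look like. Only the trait names H5Read and H5Write come from the comment above; the method names and signatures are invented here for illustration and are not the crate's API.

```rust
/// Hypothetical write-side trait: anything that can expose its bytes.
trait H5Write {
    fn as_h5_bytes(&self) -> &[u8];
}

/// Hypothetical read-side trait: only types that can own the decoded data.
trait H5Read: Sized {
    fn from_h5_bytes(bytes: &[u8]) -> Option<Self>;
}

impl<'a> H5Write for &'a str {
    fn as_h5_bytes(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl H5Write for String {
    fn as_h5_bytes(&self) -> &[u8] {
        self.as_bytes()
    }
}

// Deliberately no `impl H5Read for &str`: a borrowed string has nothing
// to own the bytes that come out of the file.
impl H5Read for String {
    fn from_h5_bytes(bytes: &[u8]) -> Option<Self> {
        String::from_utf8(bytes.to_vec()).ok()
    }
}

/// Write any H5Write value and read it back as any H5Read type.
fn roundtrip<W: H5Write, R: H5Read>(value: W) -> Option<R> {
    R::from_h5_bytes(value.as_h5_bytes())
}
```

With this shape, roundtrip::<&str, String>("abc") compiles and returns Some("abc".to_string()), while asking for a &str on the read side is rejected at compile time.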
I've spent some time digging into the HDF5 conversion API and it seems like it actually works! As in, we can force it to "understand" Rust string types and convert back and forth. Given the painful experience with strings and arrays (#86, #47, #85), this could be a huge win in usability.
The same can be done with varlen/fixed arrays/strings (direct conversions to/from &[T], Vec<T>, String, &str, etc).
Price to pay: extra memory allocation. If the dataset is not chunked, it will (at some point in the conversion path) use double the required memory. If it is chunked, I think it will process it chunk by chunk so the cost could be negligible.
There are many details to consider and discuss; this is just a start and an experiment. Details below.