Open sabasehrish opened 4 years ago
I don't think it's so much the API that is restricting here; it is the intrinsic properties of HDF5, which stores ND-arrays (with a fixed number of columns per row, and likewise for higher dimensions).
Personally, I would store the data as a sparse matrix (e.g. as compressed sparse row, see wiki). In particular, you could store the data (v) as the dataset, and add the row pointers (row_ptr) and column indices (col_ind) as attributes.
I would not convert to strings: it will cost you space, you might lose accuracy, and it will surely be much slower.
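The compressed sparse row layout suggested above can be sketched in plain Python, independent of HDF5. This is only an illustration: the helper names `dense_to_csr` and `csr_row` are hypothetical, and the three lists it produces correspond to v, col_ind and row_ptr from the comment.

```python
def dense_to_csr(matrix):
    """Convert a list of equal-length rows into CSR arrays (v, col_ind, row_ptr)."""
    values, col_ind, row_ptr = [], [], [0]
    for row in matrix:
        for col, x in enumerate(row):
            if x != 0:
                values.append(x)    # non-zero entry
                col_ind.append(col)  # column it came from
        # row_ptr[i + 1] marks where row i ends inside `values`
        row_ptr.append(len(values))
    return values, col_ind, row_ptr


def csr_row(values, col_ind, row_ptr, i, n_cols):
    """Rebuild dense row i from the CSR arrays."""
    row = [0] * n_cols
    for k in range(row_ptr[i], row_ptr[i + 1]):
        row[col_ind[k]] = values[k]
    return row


matrix = [[1, 0, 2], [0, 0, 0], [0, 3, 0]]
v, col_ind, row_ptr = dense_to_csr(matrix)
```

Only `v` grows with the number of stored values; `row_ptr` always has one entry more than the number of rows, which is why it is small enough to be kept next to the dataset.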
I will look into sparse matrices. I am also okay with the first approach I am using, provided that assembling the vector after reading it back is not too costly.
HDF5 supports variable length arrays. It should be no problem to store a vector of vectors of variable length.
Here is how you do it with h5py (which uses Cython to wrap the C API); something similar is possible with PyTables (which also uses Cython to work with the C API).
import h5py
import numpy as np
dt = h5py.vlen_dtype(np.dtype('float64'))
N_ROWS = 100
with h5py.File('vlen.hdf5', 'w') as f:
    dset = f.create_dataset('test', (N_ROWS, ), dtype=dt)
    for i in range(N_ROWS):
        N = np.random.poisson(25)
        dset[i] = np.random.normal(size=N)
One of the limitations of both PyTables and h5py is that they do not support variable length arrays in compound datatypes, because they rely deeply on numpy for handling those.
I am interested in lifting those limitations and am currently exploring the best way to do that. Since HighFive offers an API that is considerably easier to wrap (using pybind11) than the official C++ API, it would be great if HighFive supported variable length dtypes everywhere.
This is how you write variable length using the official C++ API:
#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>
const hsize_t n_dims = 1;
const hsize_t n_rows = 100;
const std::string dataset_name = "test";
int main () {
    H5::H5File file("vlen_cpp.hdf5", H5F_ACC_TRUNC);
    H5::DataSpace dataspace(n_dims, &n_rows);
    // target dtype for the file
    H5::FloatType item_type(H5::PredType::IEEE_F64LE);
    H5::VarLenType file_type(item_type);
    // dtype of the generated data
    H5::FloatType mem_item_type(H5::PredType::NATIVE_DOUBLE);
    H5::VarLenType mem_type(mem_item_type);
    H5::DataSet dataset = file.createDataSet(dataset_name, file_type, dataspace);
    std::vector<std::vector<double>> data;
    data.reserve(n_rows);
    // this structure stores length of each varlen row and a pointer to
    // the actual data
    std::vector<hvl_t> varlen_spec(n_rows);
    std::mt19937 gen;
    std::normal_distribution<double> normal(0.0, 1.0);
    std::poisson_distribution<hsize_t> poisson(20);
    for (hsize_t idx = 0; idx < n_rows; idx++) {
        data.emplace_back();
        hsize_t size = poisson(gen);
        data.at(idx).reserve(size);
        for (hsize_t i = 0; i < size; i++) {
            data.at(idx).push_back(normal(gen));
        }
        // take the pointer only after the row is filled;
        // data() is also valid for an empty row, unlike front()
        varlen_spec.at(idx).len = size;
        varlen_spec.at(idx).p = data.at(idx).data();
    }
    dataset.write(varlen_spec.data(), mem_type);
    return 0;
}
and here is how you'd write using a compound data type that has a variable length part:
#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>
const hsize_t n_dims = 1;
const hsize_t n_rows = 100;
const std::string dataset_name = "test";
struct CompoundData {
    double x;
    double y;
    hvl_t values;
    CompoundData(double x, double y) : x(x), y(y) {}
};
int main () {
    H5::H5File file("vlen_compound.hdf5", H5F_ACC_TRUNC);
    H5::DataSpace dataspace(n_dims, &n_rows);
    // target dtype for the file
    H5::CompType data_type(sizeof(CompoundData));
    data_type.insertMember("x", HOFFSET(CompoundData, x), H5::PredType::NATIVE_DOUBLE);
    data_type.insertMember("y", HOFFSET(CompoundData, y), H5::PredType::NATIVE_DOUBLE);
    data_type.insertMember("values", HOFFSET(CompoundData, values), H5::VarLenType(H5::PredType::NATIVE_DOUBLE));
    H5::DataSet dataset = file.createDataSet(dataset_name, data_type, dataspace);
    // one vector holding the actual data
    std::vector<std::vector<double>> values;
    values.reserve(n_rows);
    // and one holding the hdf5 description and the "simple" columns
    std::vector<CompoundData> data;
    data.reserve(n_rows);
    std::mt19937 gen;
    std::normal_distribution<double> normal(0.0, 1.0);
    std::poisson_distribution<hsize_t> poisson(20);
    for (hsize_t idx = 0; idx < n_rows; idx++) {
        hsize_t size = poisson(gen);
        values.emplace_back();
        values.at(idx).reserve(size);
        for (hsize_t i = 0; i < size; i++) {
            values.at(idx).push_back(normal(gen));
        }
        // set len and pointer for the variable length descriptor;
        // data() is also valid for an empty row, unlike front()
        data.emplace_back(normal(gen), normal(gen));
        data.at(idx).values.len = size;
        data.at(idx).values.p = values.at(idx).data();
    }
    dataset.write(data.data(), data_type);
    return 0;
}
@maxnoe What is your HDF5 version? I want to read H5T_VLEN data from an HDF5 file, but I can't find H5Cpp.h when I try your demo. Alternatively, if you have a demo that uses HighFive to read H5T_VLEN data, that would be even better.
@WanXinTao This issue is about adding support for vlen to HighFive, so I cannot give you a solution using HighFive.
H5Cpp.h is the header of the C++ interface, which might require an additional package to be installed.
I understand, thanks for your answer
@WanXinTao If you are on Ubuntu and have libhdf5-dev installed, the header is here: /usr/include/hdf5/serial/H5Cpp.h.
Which is also advertised when using pkg-config:
$ pkg-config hdf5 --cflags --libs
-I/usr/include/hdf5/serial -L/usr/lib/x86_64-linux-gnu/hdf5/serial -lhdf5
This is what I had to do to get my example to compile on Ubuntu:
$ g++ write_compound_varlen.cxx -o write_compound_varlen `pkg-config hdf5 --cflags --libs` -lhdf5_cpp
@tdegeus Could you maybe change the tag from "question" to "enhancement" and adjust the title? Maybe to "Support variable length data types"?
Or should I open a new issue to track this feature request?
Done @maxnoe !
A first step would be to support simple variable length arrays. This is how you write one using h5py:
import h5py
import numpy as np
data = np.array([
    np.array(row, dtype=np.int32)
    for row in [[1, 2, 3], [1, 2], [1, 2, 3, 4], [1, 2, 3, 4, 5]]
], dtype=object)
with h5py.File('varlen_array.h5', 'w') as f:
    f.create_dataset(
        'test',
        dtype=h5py.vlen_dtype(np.int32),
        data=data,
    )
and this is how I'd like to read it using HighFive:
#include <highfive/H5File.hpp>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
int main() {
    HighFive::File file("varlen_array.h5", HighFive::File::ReadOnly);
    std::vector<std::vector<int32_t>> data;
    HighFive::DataSet dataset = file.getDataSet("test");
    dataset.read(data);
    for (const auto& row: data) {
        for (auto val: row) {
            std::cout << val << " ";
        }
    }
    std::cout << '\n';
    return 0;
}
Which compiles but then errors with this:
HighFive WARNING: data and hdf5 dataset have different types: Integer32 -> Varlen128
terminate called after throwing an instance of 'HighFive::DataSpaceException'
what(): Impossible to read DataSet of dimensions 1 into arrays of dimensions 2
[3] 80485 abort (core dumped) ./Debug/read_varlen_array
I am a bit lost where I could start with looking into this. If you'd give me a couple of pointers which parts of the code would need to be adapted, I'd be happy to give it a try.
Hi @maxnoe. The change you propose is significant, both in potential and in amount of work :) Varlen arrays are a special kind of data which, like strings, basically store a pointer to the actual data arrays. Therefore we can't simply read the data: we have to take care of the indirection and subtract 1 from the dimensionality (hence the error you get!).
At BlueBrain we haven't really had use cases for VarLen arrays, so we can't dedicate much time to this. However if you are feeling brave to implement the change I will be happy to review it.
As mentioned, varlen (traditional) strings are a good source of inspiration. See how they are handled in include/highfive/bits/H5Converter_misc.hpp :L349
Cheers
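The indirection described above can be illustrated without HDF5. The sketch below mimics the layout of HDF5's hvl_t (a length plus a pointer) with ctypes; the class `VarLen` and the helpers `pack_rows` and `unpack_rows` are hypothetical names used only for this illustration and are not part of HighFive or h5py.

```python
import ctypes


class VarLen(ctypes.Structure):
    """Stand-in for hvl_t: element count and address of the row buffer."""
    _fields_ = [("len", ctypes.c_size_t), ("p", ctypes.c_void_p)]


def pack_rows(rows):
    """Build a 1-D array of descriptors, one per variable length row."""
    # The buffers must stay alive as long as the descriptors point at them.
    buffers = [(ctypes.c_double * len(row))(*row) for row in rows]
    specs = (VarLen * len(rows))()
    for spec, buf in zip(specs, buffers):
        spec.len = len(buf)
        spec.p = ctypes.addressof(buf)
    return specs, buffers


def unpack_rows(specs):
    """Follow each pointer and copy `len` doubles into a Python list."""
    rows = []
    for spec in specs:
        ptr = ctypes.cast(spec.p, ctypes.POINTER(ctypes.c_double))
        rows.append([ptr[i] for i in range(spec.len)])
    return rows


rows = [[1.0, 2.0, 3.0], [4.0], []]
specs, buffers = pack_rows(rows)
restored = unpack_rows(specs)
```

The array of descriptors is one-dimensional even though the decoded result is a list of lists, which matches the "dimensions 1 into arrays of dimensions 2" mismatch in the error message.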
The change you propose is significant, both in potential and amount of work :)
I suspected as much. Still, I suspect it will be less work for me to contribute here than to basically duplicate all the work again to get a C++ interface that is nicely wrappable with pybind11 and supports variable length arrays and compound data comprising variable length arrays.
So I'll try with the simple variable length arrays first and see how far I can go.
Hello @maxnoe,
Thank you very much for your example on how to write variable length data using the C++ API. It works great, and I've used it to create some datasets I need for work. Would it be possible for you to provide a similar example on how to read that data back into std::vector?
Specifically, in python if I want to access one particular array in the dataset, say the sixth one, I would do
import h5py
vlenData = h5py.File("vlen_cpp.hdf5", "r")
sixthArray = vlenData["test"][5]
But I don't know how this works in c++. Any advice would be very much appreciated.
Thank you, Nate
Hello again @maxnoe,
To follow up a bit, here is what I tried. After writing your "vlen_cpp.hdf5", which works just fine, I want to read the hdf5 file and load the data back into some kind of container (doesn't really matter what for now). I tried reading the first row of the hdf5 file into various containers (array, arma::vec, Eigen::VectorXd), none of which work. The program below happily executes but what is read into the containers is just garbage.
If you have any ideas, that would be wonderful.
Thank you! Nate
#include <H5Cpp.h>
#include <string>
#include <vector>
#include <stdio.h>
#include <Eigen/Dense>
#include <Eigen/Core>
#include <armadillo>
int main(int argc, char **argv) {
    std::string filename = argv[1];
    // memtype of the file
    auto itemType = H5::PredType::NATIVE_DOUBLE;
    auto memType = H5::VarLenType(&itemType);
    // get dataspace
    H5::H5File file(filename, H5F_ACC_RDONLY);
    H5::DataSet dataset = file.openDataSet("test");
    H5::DataSpace dataspace = dataset.getSpace();
    // get the size of the dataset
    hsize_t rank;
    hsize_t dims[1];
    rank = dataspace.getSimpleExtentDims(dims); // rank = 1
    std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
    std::cout << "Data rank: "<< rank << std::endl; // this is the correct rank
    // create memspace
    hsize_t memDims[1] = {1};
    H5::DataSpace memspace(rank, memDims);
    // Select hyperslabs
    hsize_t dataCount[1] = {1};
    hsize_t dataOffset[1] = {0}; // this would be i if reading in a loop
    hsize_t memCount[1] = {1};
    hsize_t memOffset[1] = {0};
    dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
    memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);
    // Create storage to hold read data
    int i;
    int NX = 20;
    double data_out[NX];
    for (i = 0; i < NX; i++)
        data_out[i] = 0;
    arma::vec temp(20);
    Eigen::VectorXd temp2(20);
    // Read data into data_out (array)
    dataset.read(data_out, memType, memspace, dataspace);
    std::cout << "data_out: " << "\n";
    for (i = 0; i < NX; i++)
        std::cout << data_out[i] << " ";
    std::cout << std::endl;
    // Read data into temp (arma vec)
    dataset.read(temp.memptr(), memType, memspace, dataspace);
    std::cout << "arma vec: " << "\n";
    std::cout << temp << std::endl;
    // Read data into temp (eigen vec)
    dataset.read(temp2.data(), memType, memspace, dataspace);
    std::cout << "eigen vec: " << "\n";
    std::cout << temp2 << std::endl;
    return 0;
}
Oddly, in python this gives me no issues whatsoever:
import h5py
import numpy as np
data = h5py.File("vlen_cpp.hdf5", "r")
i = 0 # This is the row I would want to read
arr = data["test"][i] # <-- This is the simplest way.
# Now trying to mimic something closer to C++
did = data["test"].id
dataspace = did.get_space()
dataspace.select_hyperslab(start=(i, ), count=(1, ))
memspace = h5py.h5s.create_simple(dims_tpl=(1, ))
memspace.select_hyperslab(start=(0, ), count=(1, ))
arr = np.zeros((1, ), dtype=object)
did.read(memspace, dataspace, arr)
print(arr) # This gives back the correct data
Hello again @maxnoe,
I finally figured out a way to do it:
#include <H5Cpp.h>
#include <iostream>
#include <string>
#include <vector>
int main(int argc, char **argv) {
    std::string filename = argv[1];
    // memtype of the file
    auto itemType = H5::PredType::NATIVE_DOUBLE;
    auto memType = H5::VarLenType(&itemType);
    // get dataspace
    H5::H5File file(filename, H5F_ACC_RDONLY);
    H5::DataSet dataset = file.openDataSet("test");
    H5::DataSpace dataspace = dataset.getSpace();
    // get the size of the dataset
    hsize_t dims[1];
    const int rank = dataspace.getSimpleExtentDims(dims); // rank = 1
    std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
    std::cout << "Data rank: "<< rank << std::endl; // this is the correct rank
    // create memspace
    hsize_t memDims[1] = {1};
    H5::DataSpace memspace(rank, memDims);
    // Initialize hyperslabs
    hsize_t dataCount[1];
    hsize_t dataOffset[1];
    hsize_t memCount[1];
    hsize_t memOffset[1];
    // Create storage to hold read data
    std::vector<std::vector<double>> dataOut;
    for (hsize_t i = 0; i < dims[0]; i++) {
        // Select hyperslabs
        dataCount[0] = 1;
        dataOffset[0] = i;
        memCount[0] = 1;
        memOffset[0] = 0;
        dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
        memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);
        // HDF5 allocates the row buffer and stores its length and address in rdata
        hvl_t rdata;
        dataset.read(&rdata, memType, memspace, dataspace);
        const double* ptr = static_cast<const double*>(rdata.p);
        std::vector<double> thisRow;
        for (size_t j = 0; j < rdata.len; j++) {
            thisRow.push_back(ptr[j]);
        }
        dataOut.push_back(thisRow);
        // release the buffer that HDF5 allocated for this row
        H5free_memory(rdata.p);
    }
    for (size_t i = 0; i < dataOut.size(); i++) {
        std::cout << "Row " << i << ":\n";
        for (size_t j = 0; j < dataOut[i].size(); j++) {
            std::cout << dataOut[i][j] << " ";
        }
        std::cout << "\n";
    }
    return 0;
}
If you know of a more efficient way that can pull out an entire row in one go (as you can see I'm looping over the elements), that would be so helpful. But I'm happy enough with this for now.
Thanks, Nate
Hi @motiv-ncb, if you want to just read everything into a std::vector<std::vector<double>>, I came up with this:
#include <iostream>
#include <stdexcept>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>
int main () {
    H5::H5File file("vlen_cpp.hdf5", H5F_ACC_RDONLY);
    H5::DataSet dataset {file.openDataSet("test")};
    H5::DataSpace dataspace = dataset.getSpace();
    const int n_dims = dataspace.getSimpleExtentNdims();
    std::vector<hsize_t> dims(n_dims);
    dataspace.getSimpleExtentDims(dims.data());
    std::cout << "n_dims: " << dims.size() << '\n';
    std::cout << "shape: (";
    for (hsize_t dim: dims) {
        std::cout << dim << ", ";
    }
    std::cout << ")\n";
    if (dims.size() != 1) {
        throw std::runtime_error("Unexpected dimensions");
    }
    const hsize_t n_rows = dims[0];
    // one (len, pointer) descriptor per row, filled in by HDF5 on read
    std::vector<hvl_t> varlen_specs(n_rows);
    std::vector<std::vector<double>> data;
    data.reserve(n_rows);
    H5::FloatType mem_item_type(H5::PredType::NATIVE_DOUBLE);
    H5::VarLenType mem_type(mem_item_type);
    dataset.read(varlen_specs.data(), mem_type);
    for (const auto& varlen_spec: varlen_specs) {
        // copy the whole row in one go, then release the HDF5-allocated buffer
        auto data_ptr = static_cast<double*>(varlen_spec.p);
        data.emplace_back(data_ptr, data_ptr + varlen_spec.len);
        H5free_memory(varlen_spec.p);
    }
    return 0;
}
Thank you very much @maxnoe!
Hey, due to the new type system, I'm looking back at this issue.
What will be the HighFive public API to write such vlen vector?
Regards
I'd prefer not blocking v3 for this and I don't think we have the time to implement it soon.
Hello, thank you for writing to me. I have received your email. Thanks
For an application code I am working with, I have data represented as std::vector<std::vector>, where each std::vector has a different length. Here are three ways I have used so far. I am interested in knowing a better or more efficient approach using the HighFive API, maybe using variable length datatypes, and whether there are any examples.
Flatten the structure and use a std::vector representation for one 1D HDF5 dataset, where each element is a char, and then have another dataset that captures the size of each inner vector of chars, so that I know how many elements give me back the original vector upon reading.
Keep the vector of vectors structure and use a 2D HDF5 dataset of chars, resizing the dataset to adjust the second dimension before writing. This seems very inefficient, since I am calling resize and write for each inner vector.
Use std::vector, which seemed to be the most straightforward, but I am unable to make it work: after writing a string, all I see is a single char written to the file per element instead of the complete string.
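The first of the three approaches (one flat dataset plus a second dataset of sizes) can be sketched in plain Python to show how cheap the reassembly is. The helper names `flatten` and `unflatten` are hypothetical, and the two lists stand in for the two HDF5 datasets; no HDF5 calls are made here.

```python
from itertools import accumulate


def flatten(rows):
    """Return all elements in one flat list plus the length of each row."""
    values = [v for row in rows for v in row]
    sizes = [len(row) for row in rows]
    return values, sizes


def unflatten(values, sizes):
    """Rebuild the rows by slicing the flat list at the cumulative sizes."""
    offsets = [0, *accumulate(sizes)]
    return [values[start:stop] for start, stop in zip(offsets, offsets[1:])]


rows = [list(b"ab"), [], list(b"cde")]
values, sizes = flatten(rows)
restored = unflatten(values, sizes)
```

Reassembly is a single pass over the sizes followed by one slice per row, so its cost grows linearly with the total number of elements.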