BlueBrain / HighFive

HighFive - Header-only C++ HDF5 interface
https://bluebrain.github.io/HighFive/
Boost Software License 1.0

Support variable length data types #369

Open sabasehrish opened 3 years ago

sabasehrish commented 3 years ago

For an application I am working on, my data is represented as std::vector<std::vector>, where each inner std::vector has a different length. Here are three ways I have used so far. I am interested in a better or more efficient approach using the HighFive API, perhaps with variable-length datatypes, and in any examples that exist.

tdegeus commented 3 years ago

I don't think the API is the restriction here. It is an intrinsic property of HDF5, which stores N-D arrays (with a fixed number of columns per row, and so on for higher dimensions).

Personally I would store as a sparse matrix (e.g. as compressed sparse row, see wiki). In particular, you could store the data (v) as the dataset, and add the row pointers (row_ptr) and column index (col_ind) as attributes.

I would not convert to strings: it will cost you space, you might lose accuracy, and it will surely be much slower.
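To make the flat-storage idea concrete, here is a minimal sketch in plain C++, assuming the rows are dense so that only the values and the row offsets are needed (no column indices). The names FlatRows, flatten and unflatten are hypothetical helpers, not part of HighFive or HDF5. Both resulting arrays are ordinary fixed-size 1-D arrays, so they could be stored without any variable-length datatype.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a ragged array stored as two flat 1-D arrays.
struct FlatRows {
    std::vector<double> values;        // all rows concatenated
    std::vector<std::size_t> row_ptr;  // row i is values[row_ptr[i] .. row_ptr[i + 1])
};

// Concatenate the rows and remember where each one starts and ends.
FlatRows flatten(const std::vector<std::vector<double>>& rows) {
    FlatRows flat;
    flat.row_ptr.reserve(rows.size() + 1);
    flat.row_ptr.push_back(0);
    for (const auto& row : rows) {
        flat.values.insert(flat.values.end(), row.begin(), row.end());
        flat.row_ptr.push_back(flat.values.size());
    }
    return flat;
}

// Rebuild the nested vector; each row is copied in one range construction.
std::vector<std::vector<double>> unflatten(const FlatRows& flat) {
    assert(!flat.row_ptr.empty() && flat.row_ptr.back() == flat.values.size());
    std::vector<std::vector<double>> rows;
    rows.reserve(flat.row_ptr.size() - 1);
    for (std::size_t i = 0; i + 1 < flat.row_ptr.size(); ++i) {
        const auto first = flat.values.begin() + static_cast<std::ptrdiff_t>(flat.row_ptr[i]);
        const auto last = flat.values.begin() + static_cast<std::ptrdiff_t>(flat.row_ptr[i + 1]);
        rows.emplace_back(first, last);
    }
    return rows;
}
```

Reassembly after reading back is a single pass over row_ptr with one contiguous copy per row, so its cost is linear in the total number of values.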

sabasehrish commented 3 years ago

I will look into sparse matrices. I am also okay with the first approach I am using, provided that reassembling the vector after reading it back is not too costly.

maxnoe commented 3 years ago

HDF5 supports variable length arrays. It should be no problem to store a vector of vectors of variable length.

maxnoe commented 3 years ago

Here is how you do it with h5py (which uses Cython to wrap the C API); something similar is possible with PyTables (which also uses Cython to work with the C API).

import h5py
import numpy as np

dt = h5py.vlen_dtype(np.dtype('float64'))

N_ROWS = 100

with h5py.File('vlen.hdf5', 'w') as f:
    dset = f.create_dataset('test', (N_ROWS, ), dtype=dt)

    for i in range(N_ROWS):
        N = np.random.poisson(25)
        dset[i] = np.random.normal(size=N)

One limitation of both PyTables and h5py is that they do not support variable-length arrays in compound datatypes, because they rely deeply on NumPy for handling those.

I am interested in lifting those limitations and am currently exploring the best way to do that. Since HighFive offers an API that is considerably easier to wrap (using pybind11) than the official C++ API, it would be great if HighFive supported variable-length dtypes everywhere.

This is how you write variable length using the official C++ API:

#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>

const hsize_t n_dims = 1;
const hsize_t n_rows = 100;
const std::string dataset_name = "test";

int main () {
    H5::H5File file("vlen_cpp.hdf5", H5F_ACC_TRUNC);

    H5::DataSpace dataspace(n_dims, &n_rows);

    // target dtype for the file
    H5::FloatType item_type(H5::PredType::IEEE_F64LE);
    H5::VarLenType file_type(item_type);

    // dtype of the generated data
    H5::FloatType mem_item_type(H5::PredType::NATIVE_DOUBLE);
    H5::VarLenType mem_type(mem_item_type);

    H5::DataSet dataset = file.createDataSet(dataset_name, file_type, dataspace);

    std::vector<std::vector<double>> data;
    data.reserve(n_rows);

    // this structure stores length of each varlen row and a pointer to
    // the actual data
    std::vector<hvl_t> varlen_spec(n_rows);

    std::mt19937 gen;
    std::normal_distribution<double> normal(0.0, 1.0);
    std::poisson_distribution<hsize_t> poisson(20);

    for (hsize_t idx=0; idx < n_rows; idx++) {
        data.emplace_back();

        hsize_t size = poisson(gen);
        data.at(idx).reserve(size);

        // fill the row first: calling front() on an empty vector is undefined
        for (hsize_t i = 0; i < size; i++) {
            data.at(idx).push_back(normal(gen));
        }

        varlen_spec.at(idx).len = size;
        varlen_spec.at(idx).p = data.at(idx).data();
    }

    dataset.write(&varlen_spec.front(), mem_type);

    return 0;
}

and here is how you'd write using a compound data type that has a variable length part:

#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <random>

const hsize_t n_dims = 1;
const hsize_t n_rows = 100;
const std::string dataset_name = "test";

struct CompoundData {
    double x;
    double y;
    hvl_t  values;

    CompoundData(double x, double y) : x(x), y(y) {};
};

int main () {
    H5::H5File file("vlen_compound.hdf5", H5F_ACC_TRUNC);

    H5::DataSpace dataspace(n_dims, &n_rows);

    // target dtype for the file
    H5::CompType data_type(sizeof(CompoundData));
    data_type.insertMember("x", HOFFSET(CompoundData, x), H5::PredType::NATIVE_DOUBLE);
    data_type.insertMember("y", HOFFSET(CompoundData, y), H5::PredType::NATIVE_DOUBLE);
    data_type.insertMember("values", HOFFSET(CompoundData, values), H5::VarLenType(H5::PredType::NATIVE_DOUBLE));

    H5::DataSet dataset = file.createDataSet(dataset_name, data_type, dataspace);

    // one vector holding the actual data
    std::vector<std::vector<double>> values;
    values.reserve(n_rows);

    // and one holding the hdf5 description and the "simple" columns
    std::vector<CompoundData> data;
    data.reserve(n_rows);

    std::mt19937 gen;
    std::normal_distribution<double> normal(0.0, 1.0);
    std::poisson_distribution<hsize_t> poisson(20);

    for (hsize_t idx = 0; idx < n_rows; idx++) {
        hsize_t size = poisson(gen);
        values.emplace_back();
        values.at(idx).reserve(size);

        for (hsize_t i = 0; i < size; i++) {
            values.at(idx).push_back(normal(gen));
        }

        // set len and pointer for the variable length descriptor
        data.emplace_back(normal(gen), normal(gen));
        data.at(idx).values.len = size;
        data.at(idx).values.p = values.at(idx).data();  // data() is also valid for an empty row
    }

    dataset.write(&data.front(), data_type);

    return 0;
}
WanXinTao commented 3 years ago

@maxnoe What is your HDF5 version? I want to read H5T_VLEN data from an H5 file, but I can't find H5Cpp.h when I use your demo. Or, if you have a demo that uses HighFive to read H5T_VLEN data, that would be even better.

maxnoe commented 3 years ago

@WanXinTao This issue is about adding support for vlen to HighFive, so I cannot give you a solution using HighFive.

H5Cpp.h is the C++ interface, which might need an additional package to be installed.

WanXinTao commented 3 years ago

I understand, thanks for your answer

maxnoe commented 3 years ago

@WanXinTao If you are on Ubuntu and have libhdf5-dev installed, the header is here: /usr/include/hdf5/serial/H5Cpp.h.

Which is also advertised when using pkg-config:

$ pkg-config hdf5 --cflags --libs
-I/usr/include/hdf5/serial -L/usr/lib/x86_64-linux-gnu/hdf5/serial -lhdf5

This is what I had to do to get my example to compile on ubuntu:

$ g++ write_compound_varlen.cxx -o write_compound_varlen  `pkg-config hdf5 --cflags --libs` -lhdf5_cpp
maxnoe commented 3 years ago

@tdegeus Could you maybe change the tag from "question" to "enhancement" and adjust the title? Maybe to "Support variable length data types"?

Or should I open a new issue to track this feature request?

tdegeus commented 3 years ago

Done @maxnoe !

maxnoe commented 3 years ago

A first step would be to support simple variable-length arrays. This is how you write one using h5py:

import h5py
import numpy as np

data = np.array([
    np.array(row, dtype=np.int32)
    for row in [[1, 2, 3], [1, 2], [1, 2, 3, 4], [1, 2, 3, 4, 5]]
], dtype=object)

with h5py.File('varlen_array.h5', 'w') as f:
    f.create_dataset(
        'test',
        dtype=h5py.vlen_dtype(np.int32),
        data=data,
    )

and this is how I'd like to read it using HighFive:

#include <highfive/H5File.hpp>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
    HighFive::File file("varlen_array.h5", HighFive::File::ReadOnly);

    std::vector<std::vector<int32_t>> data;
    HighFive::DataSet dataset = file.getDataSet("test");
    dataset.read(data);

    for (const auto& row: data) {
        for (auto val: row) {
            std::cout << val << " ";
        }
    }
    std::cout << '\n';
    return 0;
}

Which compiles but then errors with this:

HighFive WARNING: data and hdf5 dataset have different types: Integer32 -> Varlen128
terminate called after throwing an instance of 'HighFive::DataSpaceException'
  what():  Impossible to read DataSet of dimensions 1 into arrays of dimensions 2
[3]    80485 abort (core dumped)  ./Debug/read_varlen_array
maxnoe commented 3 years ago

I am a bit lost where I could start with looking into this. If you'd give me a couple of pointers which parts of the code would need to be adapted, I'd be happy to give it a try.

ferdonline commented 3 years ago

Hi @maxnoe. The change you propose is significant, both in potential and amount of work :) Varlen arrays are a special kind which, like strings, basically store a pointer to the actual data arrays. Therefore we can't simply read the data; we have to take care of the indirection and subtract 1 from the dimensionality (hence the error you get!).

At BlueBrain we haven't really had use cases for VarLen arrays, so we can't dedicate much time to this. However, if you are feeling brave enough to implement the change, I will be happy to review it. As mentioned, varlen (traditional) strings are a good source of inspiration. See how they are handled in include/highfive/bits/H5Converter_misc.hpp :L349 Cheers
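The indirection described above can be sketched without HDF5 at all. This is only an illustration under stated assumptions: VlenSpec is a hypothetical stand-in that mirrors the len and p fields of hvl_t, and describe and gather are made-up helper names, not HighFive internals. It shows why a nested two-level container corresponds to a one-dimensional array of descriptors.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for HDF5's hvl_t: a length plus a pointer to the elements.
struct VlenSpec {
    std::size_t len;
    void* p;
};

// Writing direction: one descriptor per row. The descriptors only borrow the
// row storage, so `rows` must outlive them and must not be resized meanwhile.
std::vector<VlenSpec> describe(std::vector<std::vector<std::int32_t>>& rows) {
    std::vector<VlenSpec> specs;
    specs.reserve(rows.size());
    for (auto& row : rows) {
        specs.push_back({row.size(), row.data()});
    }
    return specs;
}

// Reading direction: follow each pointer and copy the elements, turning the
// 1-D array of descriptors back into a nested vector.
std::vector<std::vector<std::int32_t>> gather(const std::vector<VlenSpec>& specs) {
    std::vector<std::vector<std::int32_t>> rows;
    rows.reserve(specs.size());
    for (const auto& spec : specs) {
        const auto* first = static_cast<const std::int32_t*>(spec.p);
        rows.emplace_back(first, first + spec.len);
    }
    return rows;
}
```

In a real reader the descriptors would be filled by the HDF5 read call, and the buffers behind each pointer would then have to be released after copying.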

maxnoe commented 3 years ago

The change you propose is significant, both in potential and amount of work :)

I suspected as much. Still, I suspect it will be less work for me to contribute here than to basically duplicate all the work again to get a nicely pybind11-wrappable C++ interface supporting variable-length arrays and compound data comprising variable-length arrays.

So I'll try with the simple variable length arrays first and see how far I can go.

motiv-ncb commented 2 years ago

Hello @maxnoe,

Thank you very much for your example of how to write variable-length data using the C++ API. It works great, and I've used it to create some datasets I need for work. Would it be possible for you to provide a similar example of how to read that data back into std::vector using the C++ API? In Python it's very simple, but I'm having a lot of trouble finding an example of how to do it in C++.

Specifically, in python if I want to access one particular array in the dataset, say the sixth one, I would do

import h5py
vlenData = h5py.File("vlen_cpp.hdf5", "r")
sixthArray = vlenData["test"][5]

But I don't know how this works in c++. Any advice would be very much appreciated.

Thank you, Nate

motiv-ncb commented 2 years ago

Hello again @maxnoe,

To follow up a bit, here is what I tried. After writing your "vlen_cpp.hdf5", which works just fine, I want to read the hdf5 file and load the data back into some kind of container (doesn't really matter what for now). I tried reading the first row of the hdf5 file into various containers (array, arma::vec, Eigen::VectorXd), none of which work. The program below happily executes but what is read into the containers is just garbage.

If you have any ideas, that would be wonderful.

Thank you! Nate

#include <H5Cpp.h>
#include <string>
#include <vector>
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/Core>
#include <armadillo>

int main(int argc, char **argv) {

    std::string filename = argv[1];

    // memtype of the file
    auto itemType = H5::PredType::NATIVE_DOUBLE;
    auto memType = H5::VarLenType(&itemType);

    // get dataspace
    H5::H5File file(filename, H5F_ACC_RDONLY);
    H5::DataSet dataset = file.openDataSet("test");
    H5::DataSpace dataspace = dataset.getSpace();

    // get the size of the dataset
    hsize_t rank;
    hsize_t dims[1];
    rank = dataspace.getSimpleExtentDims(dims); // rank = 1
    std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
    std::cout << "Data rank: "<< rank << std::endl; // this is the correct rank

    // create memspace
    hsize_t memDims[1] = {1};
    H5::DataSpace memspace(rank, memDims);

    // Select hyperslabs
    hsize_t dataCount[1] = {1};
    hsize_t dataOffset[1] = {0};  // this would be i if reading in a loop
    hsize_t memCount[1] = {1};
    hsize_t memOffset[1] = {0};

    dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
    memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);

    // Create storage to hold read data
    int i;
    int NX = 20;
    double data_out[NX];
    for (i = 0; i < NX; i++)
        data_out[i] = 0;
    arma::vec temp(20);
    Eigen::VectorXd temp2(20);

    // Read data into data_out (array)
    dataset.read(data_out, memType, memspace, dataspace);

    std::cout << "data_out: " << "\n";
    for (i = 0; i < NX; i++)
        std::cout << data_out[i] << " ";
    std::cout << std::endl;

    // Read data into temp (arma vec)
    dataset.read(temp.memptr(), memType, memspace, dataspace);

    std::cout << "arma vec: " << "\n";
    std::cout << temp << std::endl;

    // Read data into temp (eigen vec)
    dataset.read(temp2.data(), memType, memspace, dataspace);

    std::cout << "eigen vec: " << "\n";
    std::cout << temp2 << std::endl;

    return 0;
}

Oddly, in python this gives me no issues whatsoever:

import h5py
import numpy as np

data = h5py.File("vlen_cpp.hdf5", "r")
i = 0  # This is the row I would want to read
arr = data["test"][i]  # <-- This is the simplest way.    

# Now trying to mimic something closer to C++
did = data["test"].id
dataspace = did.get_space()
dataspace.select_hyperslab(start=(i, ), count=(1, ))
memspace = h5py.h5s.create_simple(dims_tpl=(1, ))
memspace.select_hyperslab(start=(0, ), count=(1, ))
arr = np.zeros((1, ), dtype=object)
did.read(memspace, dataspace, arr)
print(arr)  # This gives back the correct data
motiv-ncb commented 2 years ago

Hello again @maxnoe,

I finally figured out a way to do it:

#include <H5Cpp.h>
#include <string>
#include <vector>
#include <iostream>

int main(int argc, char **argv) {

    std::string filename = argv[1];

    // memtype of the file
    auto itemType = H5::PredType::NATIVE_DOUBLE;
    auto memType = H5::VarLenType(&itemType);

    // get dataspace
    H5::H5File file(filename, H5F_ACC_RDONLY);
    H5::DataSet dataset = file.openDataSet("test");
    H5::DataSpace dataspace = dataset.getSpace();

    // get the size of the dataset
    hsize_t rank;
    hsize_t dims[1];
    rank = dataspace.getSimpleExtentDims(dims); // rank = 1
    std::cout << "Data size: "<< dims[0] << std::endl; // this is the correct number of values
    std::cout << "Data rank: "<< rank << std::endl; // this is the correct rank

    // create memspace
    hsize_t memDims[1] = {1};
    H5::DataSpace memspace(rank, memDims);

    // Initialize hyperslabs
    hsize_t dataCount[1];
    hsize_t dataOffset[1];
    hsize_t memCount[1];
    hsize_t memOffset[1];

    // Create storage to hold read data
    std::vector<std::vector<double>> dataOut;

    for (hsize_t i = 0; i < dims[0]; i++) {

        // Select hyperslabs
        dataCount[0] = 1;
        dataOffset[0] = i;
        memCount[0] = 1;
        memOffset[0] = 0;

        dataspace.selectHyperslab(H5S_SELECT_SET, dataCount, dataOffset);
        memspace.selectHyperslab(H5S_SELECT_SET, memCount, memOffset);

        // a stack array avoids leaking one heap allocation per iteration
        hvl_t rdata[1];
        dataset.read(rdata, memType, memspace, dataspace);

        double* ptr = (double*)rdata[0].p;
        std::vector<double> thisRow;

        for (int j = 0; j < rdata[0].len; j++) {
            double* val = (double*)&ptr[j];
            thisRow.push_back(*val);
        }

        dataOut.push_back(thisRow);
    }

    for (int i = 0; i < dataOut.size(); i++) {
        std::cout << "Row " << i << ":\n";
        for (int j = 0; j < dataOut[i].size(); j++) {
            std::cout << dataOut[i][j] << " ";
        }
        std::cout << "\n";
    }

    return 0;
}

If you know of a more efficient way that can pull out an entire row in one go (as you can see I'm looping over the elements), that would be so helpful. But I'm happy enough with this for now.

Thanks, Nate

maxnoe commented 2 years ago

Hi @motiv-ncb, if you want to just read everything into a std::vector<std::vector<double>>, I came up with this:

#include <iostream>
#include <string>
#include <H5Cpp.h>
#include <vector>
#include <stdexcept>

int main () {
    H5::H5File file("vlen_cpp.hdf5", H5F_ACC_RDONLY);
    H5::DataSet dataset {file.openDataSet("test")};

    H5::DataSpace dataspace = dataset.getSpace();
    const int n_dims = dataspace.getSimpleExtentNdims();
    std::vector<hsize_t> dims(n_dims);
    dataspace.getSimpleExtentDims(dims.data());

    std::cout << "n_dims: " << dims.size() << '\n';

    std::cout << "shape: (";
    for (hsize_t dim: dims) {
        std::cout << dim << ", ";
    }
    std::cout << ")\n";

    if (dims.size() != 1) {
        throw std::runtime_error("Unexpected dimensions");
    }

    const hsize_t n_rows = dims[0];
    std::vector<hvl_t> varlen_specs(n_rows);
    std::vector<std::vector<double>> data;
    data.reserve(n_rows);

    H5::FloatType mem_item_type(H5::PredType::NATIVE_DOUBLE);
    H5::VarLenType mem_type(mem_item_type);
    dataset.read(varlen_specs.data(), mem_type);

    for (const auto& varlen_spec: varlen_specs) {
        auto data_ptr = static_cast<double*>(varlen_spec.p);
        data.emplace_back(data_ptr, data_ptr + varlen_spec.len);
        H5free_memory(varlen_spec.p);
    }

    return 0;
}
motiv-ncb commented 2 years ago

Thank you very much @maxnoe!

alkino commented 2 years ago

Hey, because of the new type system, I'm looking at this issue again.

What should the HighFive public API for writing such a vlen vector look like?

Regards

1uc commented 2 months ago

I'd prefer not blocking v3 for this and I don't think we have the time to implement it soon.

WanXinTao commented 2 months ago

Hello, thank you for your message. I have received your email. Thanks