Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

[Question] Reading arrays in compound dataset? #10

Closed nprezant closed 1 year ago

nprezant commented 1 year ago

Hi! I'm wondering if there are any examples or support for reading arrays in a compound dataset?

Something like this:

internal struct QUAD_CN
{
    public int ID;
    public string TERM; // length: 8
    public int[] GRID; // length: 5
    public float[] FD1; // length: 5
}
Apollo3zehn commented 1 year ago

Hi, if you enable unsafe code, you can define your struct like a C struct with inline arrays of fixed length. Something like this (untested; a StructLayout attribute is probably also required to pin down the byte offsets):

unsafe struct QUAD_CN
{
    public int ID;
    public fixed byte TERM[8];
    public fixed int GRID[5];
    public fixed float FD1[5];
}

But I am not sure if this will work. If you can attach an example file, I will try to give you a more accurate answer.

nprezant commented 1 year ago

Thank you! I've created and attached a small test file that should be in the right format.

quad_cn.zip (Wouldn't let me upload the .h5 file, so I zipped it up)

For reference, the example h5 file was generated with the following C code:

// write_quad_cn.c

#include "hdf5.h"
#include <stdlib.h>

int ex();

int main(void) {
    return ex();
}

#define FILE       "quad_cn.h5"
#define TABLE_NAME "QUAD_CN"
#define RANK 1
#define LENGTH 10

typedef struct QUAD_CN {
    long long eid;
    char term[8];
    long long grid[5];
    double fd1[5];
} QUAD_CN;

int ex () {
    herr_t status = 0;

    QUAD_CN* buf = NULL;
    buf = (QUAD_CN*)malloc(LENGTH * sizeof(QUAD_CN));
    for (int i=0; i<LENGTH; ++i) {
        buf[i].eid = 1000 + i;
        buf[i].term[0] = 't';
        buf[i].term[1] = 'e';
        buf[i].term[2] = 'r';
        buf[i].term[3] = 'm';
        buf[i].term[4] = i + '0';
        buf[i].term[5] = '\0';
        buf[i].term[6] = '\0';
        buf[i].term[7] = '\0';
        buf[i].grid[0] = 100 + i + 0;
        buf[i].grid[1] = 100 + i + 1;
        buf[i].grid[2] = 100 + i + 2;
        buf[i].grid[3] = 100 + i + 3;
        buf[i].grid[4] = 100 + i + 4;
        buf[i].fd1[0] = 200 + i + 0 * 0.1;
        buf[i].fd1[1] = 200 + i + 1 * 0.1;
        buf[i].fd1[2] = 200 + i + 2 * 0.1;
        buf[i].fd1[3] = 200 + i + 3 * 0.1;
        buf[i].fd1[4] = 200 + i + 4 * 0.1;
    }

    hsize_t array_dims[] = {5}; /* H5Tarray_create expects hsize_t dims */

    hid_t  str_tid = H5Tcopy(H5T_C_S1);
    size_t size = 8 * sizeof(char);
    status = H5Tset_size(str_tid, size);

    hsize_t dim[] = {LENGTH}; /* Dataspace dimensions */

    hid_t space = H5Screate_simple(RANK, dim, NULL);

    hid_t file = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    // Memory data type
    hid_t quad_cn_tid = H5Tcreate(H5T_COMPOUND, sizeof(QUAD_CN));
    H5Tinsert(quad_cn_tid, "eid", HOFFSET(QUAD_CN, eid), H5T_NATIVE_LLONG);
    H5Tinsert(quad_cn_tid, "term", HOFFSET(QUAD_CN, term), str_tid);
    H5Tinsert(quad_cn_tid, "grid", HOFFSET(QUAD_CN, grid), H5Tarray_create(H5T_NATIVE_LLONG, 1, array_dims));
    H5Tinsert(quad_cn_tid, "fd1", HOFFSET(QUAD_CN, fd1), H5Tarray_create(H5T_NATIVE_DOUBLE, 1, array_dims));

    hid_t dataset = H5Dcreate(file, TABLE_NAME, quad_cn_tid, space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // Write data to the data set
    status = H5Dwrite(dataset, quad_cn_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    if (buf != NULL) free(buf);
    H5Tclose(str_tid);
    H5Tclose(quad_cn_tid);
    H5Dclose(dataset);
    H5Sclose(space);
    H5Fclose(file);

    return 0;
}
Apollo3zehn commented 1 year ago

Perfect, I will look at it tomorrow.

Apollo3zehn commented 1 year ago

I found two ways to read nested arrays of fixed length:

Method 1

This method copies the raw data directly from the H5 file into the memory. This works when the C# struct layout matches the H5 struct layout exactly:

[StructLayout(LayoutKind.Explicit)]
unsafe struct QUAD_CN_FAST
{
    [FieldOffset(0)]
    public long eid; // the file stores eid as a 64-bit integer (H5T_NATIVE_LLONG)

    [FieldOffset(8)]
    public fixed byte term[8];

    [FieldOffset(16)]
    public fixed long grid[5];

    [FieldOffset(56)]
    public fixed double fd1[5];
}

I got the offsets by debugging HDF5.NET (dataset.InternalDataType). If you use the NuGet version, there is currently no easy way to get the offsets.

You can then read the array and print its contents:

using var root = H5File.OpenRead(filePath);
var dataset = root.Dataset("QUAD_CN");

// QUAD_CN_FAST
var data1 = dataset.Read<QUAD_CN_FAST>();

unsafe 
{
    foreach (var quad_cn in data1.Take(2))
    {
        var term = Encoding.ASCII.GetString(quad_cn.term, 8).TrimEnd((Char)0);
        var grid = new Span<long>(quad_cn.grid, 5);
        var fd1 = new Span<double>(quad_cn.fd1, 5);

        // print data
        Debug.WriteLine(quad_cn.eid);
        Debug.WriteLine(term);
        Debug.WriteLine(string.Join('|', grid.ToArray().Select(value => value.ToString())));
        Debug.WriteLine(string.Join('|', fd1.ToArray().Select(value => value.ToString())));
        Debug.WriteLine($"========");
    }
}

As you can see, you need to convert the pointers (e.g. quad_cn.grid) to a Span<T> before you can work with the data conveniently.

Method 2

An easier approach is the slightly slower method, where the C# struct field names must match those in the H5 file. Here the data is first copied from the H5 file into an internal array and then distributed to the individual C# struct fields. This method already supported reading strings, and now it also supports reading nested arrays of fixed length. To give an array a fixed length, specify the MarshalAs attribute on the C# struct field:

struct QUAD_CN_SLOW
{
    public long eid; // matches the 64-bit integer stored in the file

    public string term;

    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 5)]
    public long[] grid;

    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 5)]
    public double[] fd1;
}

Reading the data is then very easy, since no additional conversions are required on the user side (using dataset.ReadCompound<T>()):

// QUAD_CN_SLOW
var data2 = dataset.ReadCompound<QUAD_CN_SLOW>();

// print data
foreach (var quad_cn in data2.Take(2))
{
    Debug.WriteLine(quad_cn.eid);
    Debug.WriteLine(quad_cn.term);
    Debug.WriteLine(string.Join('|', quad_cn.grid.ToArray().Select(value => value.ToString())));
    Debug.WriteLine(string.Join('|', quad_cn.fd1.ToArray().Select(value => value.ToString())));
    Debug.WriteLine($"========");
}

Note: This method requires the newest version of HDF5.NET (https://www.nuget.org/packages/HDF5.NET/1.0.0-alpha.18)

I hope this helps!

nprezant commented 1 year ago

This is great! Both methods work. In a quick and dirty test with hyperfine, reading about 500,000 records, the "fast" method is about 10 times faster than the "slow" one. In that test I left the structs alone, though -- the fast method would probably lose some of its edge if I included the new Span<>().ToArray() conversions on the arrays.

Just for reference: I was also able to get the offsets from a small C program that printed the HOFFSET value for each struct member (offsetof, referenced in the HDF5 Compound Types documentation).

I think this resolves my issue. Thank you!

Apollo3zehn commented 1 year ago

This is a message I post to all recent issues: I have just renamed the project from HDF5.NET to PureHDF in preparation for an upcoming beta release. Please note that the NuGet package name has also changed; it can now be found here: https://www.nuget.org/packages/PureHDF.