HDF-NI / hdf5.node

A node module for reading/writing the HDF5 file format.
MIT License

Error with handling variable length data (H5T_VLEN) #99

Open janblumenkamp opened 5 years ago

janblumenkamp commented 5 years ago

I have a dataset that was created with h5py and which contains variable length data (utilizing the H5T_VLEN type). The python script I used to generate it:

import numpy as np
import h5py

with h5py.File('testdata.hdf5', 'a') as hdf:
  if 'real' in hdf:
    del hdf['real']

  hdf_group = hdf.create_group('real')
  hdf_labels = hdf_group.create_dataset('labels', (3,), h5py.special_dtype(vlen = np.uint8))

  for i in range(3):
    labels = np.empty(i + 1, np.uint8)
    for j in range(i + 1):
      labels[j] = j
    hdf_labels[i] = labels

The output of h5dump:

HDF5 "testdata.hdf5" {
GROUP "/" {
   GROUP "real" {
      DATASET "labels" {
         DATATYPE  H5T_VLEN { H5T_STD_U8LE}
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): (0), (0, 1), (0, 1, 2)
         }
      }
   }
}
}
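
As a cross-check on the dump above, the same ragged rows can be rebuilt in plain Python, without h5py or NumPy. This is only a sketch of the values any reader should return for this file; `expected_rows` is a hypothetical helper name, not part of h5py or hdf5.node.

```python
def expected_rows(n_rows=3):
    """Rebuild the ragged rows the generator script writes.

    Row i holds the integers 0..i, mirroring the nested loops in the
    script above (labels[j] = j for j in range(i + 1)).
    """
    return [list(range(i + 1)) for i in range(n_rows)]


# Print each row next to its index, in the same order h5dump lists them.
for index, row in enumerate(expected_rows()):
    print(index, row)
```

Each row has a different length, which is exactly what H5T_VLEN allows and what a fixed-shape read cannot represent.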

Reading the generated HDF file in JS:

const hdf5 = require('hdf5').hdf5;
const h5tb = require('hdf5').h5tb;

var Access = require('hdf5/lib/globals').Access;
var file = new hdf5.File('testdata.hdf5', Access.ACC_READ);
var group = file.openGroup('real');
var readBuffer=h5tb.getTableInfo(group.id, 'labels');
console.log(readBuffer);

And the output:

HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 0:
  #000: H5Tfields.c line 63 in H5Tget_nmembers(): cannot return member number
    major: Invalid arguments to routine
    minor: Inappropriate type
  #001: H5Tfields.c line 104 in H5T_get_nmembers(): operation not supported for type class
    major: Invalid arguments to routine
    minor: Inappropriate type
{ nfields: 1213911376, nrecords: -781860838 }

where the numbers in the last line are different on every run. What is the problem? I would be surprised if H5T_VLEN were not implemented, since it should also be used for strings.

rimmartin commented 5 years ago

Hi, h5tb.getTableInfo is for tables created with the HDF5 Table (H5TB) API, see https://support.hdfgroup.org/HDF5/doc/HL/RM_H5TB.html. The Python script above created a plain dataset, which probably doesn't carry the table metadata that call expects.

Do you want the dimensions of your dataset before reading it? Maybe try

const dims = group.getDatasetDimensions('labels');

http://hdf-ni.github.io/hdf5.node/ref/groups.html

janblumenkamp commented 5 years ago

Hi, getDatasetDimensions correctly outputs [3], but readDataset outputs a similar error:

var data = h5lt.readDataset(group.id, 'labels');
                ^

SyntaxError: unsupported data type
    at Object.<anonymous> (hdfGenerator.js:23:17)
    at Module._compile (internal/modules/cjs/loader.js:722:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:733:10)
    at Module.load (internal/modules/cjs/loader.js:620:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:560:12)
    at Function.Module._load (internal/modules/cjs/loader.js:552:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:775:12)
    at startup (internal/bootstrap/node.js:300:19)
    at bootstrapNodeJSCore (internal/bootstrap/node.js:826:3)

rimmartin commented 5 years ago

Ah, OK; I'll add support for reading :-) Thank you. I have a Python environment to build the test.

rimmartin commented 5 years ago

And also writing...

rimmartin commented 5 years ago

Hi, do you have other types you want vlen'ed?

Also, which overall dimensions and rank do you need covered? I want to support them all.

janblumenkamp commented 5 years ago

Perfect, thanks! It would be great if any kind of table could also be used with VLEN. Demo generation script:

import numpy as np
import h5py

label_dtype = np.dtype(
  [('type1', np.float64),  # np.float was removed in NumPy 1.24; float64 is the same type
   ('type2', np.float64),
   ('type3', np.uint8),
   ('type4', np.uint16),
   ('type5', np.uint16)])

with h5py.File('testdata.hdf5', 'a') as hdf:
  if 'real' in hdf:
    del hdf['real']

  hdf_group = hdf.create_group('real')
  hdf_labels = hdf_group.create_dataset('labels', (3,), h5py.special_dtype(vlen = label_dtype))

  for i in range(3):
    labels = np.empty(i + 1, label_dtype)
    for j in range(i + 1):
      labels[j]['type1'] = j
      labels[j]['type2'] = j + 1
      labels[j]['type3'] = j + 2
      labels[j]['type4'] = j + 3
      labels[j]['type5'] = j + 4
    hdf_labels[i] = labels

h5dump output:

HDF5 "testdata.hdf5" {
GROUP "/" {
   GROUP "real" {
      DATASET "labels" {
         DATATYPE  H5T_VLEN { H5T_COMPOUND {
            H5T_IEEE_F64LE "type1";
            H5T_IEEE_F64LE "type2";
            H5T_STD_U8LE "type3";
            H5T_STD_U16LE "type4";
            H5T_STD_U16LE "type5";
         }}
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): ({
                  0,
                  1,
                  2,
                  3,
                  4
               }),
         (1): ({
                  0,
                  1,
                  2,
                  3,
                  4
               }, {
                  1,
                  2,
                  3,
                  4,
                  5
               }),
         (2): ({
                  0,
                  1,
                  2,
                  3,
                  4
               }, {
                  1,
                  2,
                  3,
                  4,
                  5
               }, {
                  2,
                  3,
                  4,
                  5,
                  6
               })
         }
      }
   }
}
}

But I think I will use another approach for now. Maybe I will get back to this really nice module in the future. So no hurry for me, but this will probably also be very useful for other users who want to use the module with hdf files generated with h5py :)

Regarding the rank, it would be helpful if any rank could be used (if I understand correctly, the maximum rank you currently support is 4?), but that would probably be a separate issue.
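
For the compound VLEN demo above, the expected records can likewise be rebuilt in plain Python as a reference when testing a reader. This is a sketch under assumptions: `FIELDS` and `expected_records` are hypothetical names, and the values are kept as Python integers even though type1 and type2 are stored as 64-bit floats in the file.

```python
FIELDS = ("type1", "type2", "type3", "type4", "type5")


def expected_records(n_rows=3):
    """Rebuild the variable-length compound rows from the demo script.

    Row i holds i + 1 records, and record j stores j + k in its k-th
    field, matching labels[j]['type1'] = j ... labels[j]['type5'] = j + 4.
    """
    rows = []
    for i in range(n_rows):
        row = [
            {name: j + k for k, name in enumerate(FIELDS)}
            for j in range(i + 1)
        ]
        rows.append(row)
    return rows


# Show the last record of the longest row, i.e. the final entry in the dump.
print(expected_records()[2][2])
```

Comparing a reader's output against this structure checks both the per-row lengths and the field values in one pass.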