HDF-NI / hdf5.node

A node module for reading/writing the HDF5 file format.
MIT License

Cannot write string dataset w/ many points #32

Closed: jacoscaz closed this issue 7 years ago

jacoscaz commented 7 years ago

This works:

var dataset = new Array(10000);
h5lt.makeDataset(file.id, '/dataset', dataset);

This doesn't:

var dataset = new Array(100000);
h5lt.makeDataset(file.id, '/dataset', dataset);

Stack trace:

HDF5-DIAG: Error detected in HDF5 (1.8.17) thread 0:
  #000: H5Dio.c line 271 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 352 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 769 in H5D__write(): unable to initialize storage
    major: Dataset
    minor: Unable to initialize object
  #003: H5Dint.c line 1836 in H5D__alloc_storage(): unable to initialize dataset with fill value
    major: Dataset
    minor: Unable to initialize object
  #004: H5Dint.c line 1898 in H5D__init_storage(): unable to allocate all chunks of dataset
    major: Dataset
    minor: Unable to initialize object
  #005: H5Dcontig.c line 315 in H5D__contig_fill(): unable to write fill value to dataset
    major: Dataset
    minor: Unable to initialize object
  #006: H5Dcontig.c line 618 in H5D__contig_write_one(): vector write failed
    major: Low-level I/O
    minor: Write failed
  #007: H5Dcontig.c line 1206 in H5D__contig_writevv(): can't perform vectorized sieve buffer write
    major: Dataset
    minor: Can't operate on object
  #008: H5VM.c line 1457 in H5VM_opvv(): can't perform operation
    major: Internal error (too specific to document in detail)
    minor: Can't operate on object
  #009: H5Dcontig.c line 952 in H5D__contig_writevv_sieve_cb(): block write failed
    major: Dataset
    minor: Write failed
  #010: H5Fio.c line 171 in H5F_block_write(): write through metadata accumulator failed
    major: Low-level I/O
    minor: Write failed
  #011: H5Faccum.c line 825 in H5F__accum_write(): file write failed
    major: Low-level I/O
    minor: Write failed
  #012: H5FDint.c line 260 in H5FD_write(): driver write request failed
    major: Virtual File Layer
    minor: Write failed
  #013: H5FDsec2.c line 802 in H5FD_sec2_write(): file write failed: time = Sat Oct  8 00:58:40 2016
, filename = '/home/jacopo/data-backend/products/gistemp/gistemp.h5', file descriptor = 12, errno = 14, error message = 'Bad address', buf = 0x55c61fcac378, total write size = 422496, bytes this sub-write = 422496, bytes actually written = 18446744073709551615, offset = 1179648
    major: Low-level I/O
    minor: Write failed
Unhandled rejection SyntaxError: failed to make var len dataset
    at SyntaxError (native)
    at createHDF5Product (/home/jacopo/data-backend/scripts/gistemp-process.js:111:8)
    at tryCatcher (/home/jacopo/data-backend/node_modules/bluebird/js/release/util.js:16:23)

I've tried to experiment a bit to no avail. Any ideas? It almost looks like it's running out of memory, even though there's plenty of disk space.

rimmartin commented 7 years ago

I'll experiment. I have to see if it works with the slash on 'dataset'.

rimmartin commented 7 years ago

It should work without the slash; just the name.

jacoscaz commented 7 years ago

Slash or no slash, I keep getting the same error when I increase the array's length from 10000 to 100000. I'll try to bisect until I find the exact length that triggers this behaviour.
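
A minimal sketch of how that bisection could be automated (writes(n) is a hypothetical helper that tries to write an n-element string dataset to a fresh file and returns false when makeDataset throws):

// Hypothetical bisection: find the first array length that fails to write.
// Precondition: writes(lo) succeeds and writes(hi) fails.
function findBreakingLength(lo, hi, writes) {
  while (lo + 1 < hi) {
    var mid = Math.floor((lo + hi) / 2);
    if (writes(mid)) {
      lo = mid; // mid still works, raise the lower bound
    } else {
      hi = mid; // mid fails, lower the upper bound
    }
  }
  return hi; // smallest failing length
}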

jacoscaz commented 7 years ago

It breaks when going from a length of 73901 to 73902.

Also, when I examine the file with h5dump -d /datasetName, I'm getting the JSON representation of the whole array as the first point in the dataset.

EDIT: I was wrong, apologies. It looks like JSON but it's not JSON.

This is the header for a Uint16 dataset within the same file:

DATASET "/station_id" {
   DATATYPE  H5T_STD_U16LE
   DATASPACE  SIMPLE { ( 100 ) / ( 100 ) }
   ATTRIBUTE "type" {
      DATATYPE  H5T_STD_U32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
}

This is the header for the string dataset:

DATASET "/station_name" {
   DATATYPE  H5T_ARRAY { [100] H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   } }
   DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
}

My code is based on the tutorial for variable length strings here: http://hdf-ni.github.io/hdf5.node/tut/dataset-tutorial.html .

rimmartin commented 7 years ago

What are you filling the Array entries with? I suppose for a test a random string generator could be used. Or find a text document with over 80,000 lines... Testing.

jacoscaz commented 7 years ago

The following code

var fs = require('fs');
var hdf5 = require('../common/hdf5').hdf5;
var h5lt = require('../common/hdf5').h5lt;
var h5gl = require('../common/hdf5').h5gl;
var path = require('path');
var shortid = require('shortid');
var filePath = path.join(__dirname, 'test-hdf5.h5');
var file = new hdf5.File(filePath, h5gl.Access.ACC_TRUNC);
var length = 10;
var dataset = new Array(length);
for (var i = 0; i < length; i++) {
  dataset[i] = shortid.generate();
}
h5lt.makeDataset(file.id, 'test', dataset);
file.close();

produces a file that, when examined through h5dump -d /test --stride 1 --start 0 --count 1 products/test-hdf5.h5, shows the following:

HDF5 "products/test-hdf5.h5" {
DATASET "/test" {
   DATATYPE  H5T_ARRAY { [10] H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   } }
   DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   SUBSET {
      START ( 0 );
      STRIDE ( 1 );
      COUNT ( 1 );
      BLOCK ( 1 );
      DATA {
      (0): [ "r1_oscv0", "SkeOjocDR", "Sy-doo5DC", "SJfujjcwR", "Bkmuio9PA", "BJNuoo9D0", "rkBdsjqPA", "ry8OssqvR", "ryv_ii5wR", "ryd_ojqw0" ]
      }
   }
}
}

This is what I was referring to before - it looks like the entire array of strings is being stored as the first point of the dataset rather than each string being treated as a separate point.

rimmartin commented 7 years ago

I got a test case set up by reading in a PDB of the rat liver molecule from https://pdb101.rcsb.org/motm/114. It's close to a million lines and cuts out between 70000 and 80000.

So I'm able to repeat and test.

rimmartin commented 7 years ago

It might have to do with some handle limit on Linux.

rimmartin commented 7 years ago

For example, on my Ubuntu:

cat /proc/sys/fs/file-max
808097
jacoscaz commented 7 years ago

I guess there are two sides to this - the cut out and the array of strings vs. strings dataset. Happy to contribute in any way I can. Feel free to send tests my way. I'll check the fs limit as soon as I get back home.

rimmartin commented 7 years ago
filename = '/home/jacopo/data-backend/products/gistemp/gistemp.h5', file descriptor = 12, errno = 14, error message = 'Bad address', buf = 0x55c61fcac378, total write size = 422496, bytes this sub-write = 422496, bytes actually written = 18446744073709551615, offset = 1179648
filename = './roothaan.h5', file descriptor = 9, errno = 14, error message = 'Bad address', buf = 0x487f858, total write size = 98400, bytes this sub-write = 98400, bytes actually written = 18446744073709551615, offset = 1183744

The 'bytes actually written' value is crazy in both your test and mine (18446744073709551615 is -1 interpreted as an unsigned 64-bit integer), but both report 'Bad address' as the error message.

jacoscaz commented 7 years ago

With my test as-is, i.e. using shortid.generate(), I can go up to a length of 73862. A length of 73863 breaks roughly one in every two runs and 73864 always breaks.

However, switching to the following filler loop only got me up to 73820, breaking on all runs from 73821 going upward.

for (var i = 0; i < length; i++) {
  dataset[i] = 'hello ' + i;
}

Lengthening the string to 'helloworldhelloworld ' + i still got me up to 73820. Curiously enough, inverting the order to i + ' hello' got me to a different number, 73746.

There must be a pattern but I can't see it ATM. Perhaps we're hitting some kind of limit on how big an array of strings can be within an array of strings-typed dataset (even though we shouldn't be getting an array of strings-typed dataset in the first place).
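
One rough way to check (just a sketch, nothing verified) would be to sum the string lengths at each breaking point and see whether the limit is byte-based rather than element-based:

// Hypothetical check: total bytes of string data at the last working length
// for each filler, to compare against the element counts above.
function totalBytes(makeString, length) {
  var bytes = 0;
  for (var i = 0; i < length; i++) {
    bytes += makeString(i).length;
  }
  return bytes;
}

console.log(totalBytes(function (i) { return 'hello ' + i; }, 73820));
console.log(totalBytes(function (i) { return i + ' hello'; }, 73746));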

PS: My file-max is 200676.

PPS: Can I store fixed-length strings using hdf5.node?

rimmartin commented 7 years ago

Yeah, I was testing with

   dataset[i] = 'hello ' + '\0';

It feels like some limit is being hit; a heap or a stack, something. I may put the question to the HDF Group after I search their mailing list.

Yes, fixed-length was done for table columns. Let me test some; to make it clean I may add an option:

h5lt.makeDataset(file.id, '/dataset', dataset, {fixed_width: 7});

for example

I'll continue to look at large sizes of everything to look for breaks in the system.

jacoscaz commented 7 years ago

That'd be lovely. Happy to test any solution you come up with.

rimmartin commented 7 years ago

Fixed width is coming. I need to test and work on reading back to JavaScript.

For writing there is no need to fix the length of the strings; you just need to know the maximum length of them all. If this is too short for one string entry in the Array, an exception will be thrown from the native side to ensure data doesn't get messed up:

h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
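
As a usage sketch (the writeFixedWidth helper is hypothetical; the exception behaviour is as described above):

// Hypothetical helper: write an array of strings as a fixed-width dataset,
// picking fixed_width from the longest entry so nothing gets rejected.
function writeFixedWidth(h5lt, locId, name, strings) {
  var maxLength = 0;
  for (var i = 0; i < strings.length; i++) {
    if (strings[i].length > maxLength) maxLength = strings[i].length;
  }
  // A fixed_width shorter than the longest entry makes the native side
  // throw instead of silently truncating the data.
  h5lt.makeDataset(locId, name, strings, {fixed_width: maxLength});
  return maxLength;
}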

I should commit this evening.

jacoscaz commented 7 years ago

Wonderful, wonderful, wonderful.

rimmartin commented 7 years ago

Hi, sorry for the delay.

    h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});

now saves nearly 1 million lines from a text file for the rat liver pdb chemistry model. The fixed width is 80 in this case.

I still need to test reading back to JavaScript.

rimmartin commented 7 years ago

I'm building their C examples and extending them to work with large data. Otherwise I've mirrored these examples in this project. Their docs don't say chunking is necessary, but it may be needed.

rimmartin commented 7 years ago

Fixed width is now working. Tested on about a million entries and a ~74 MB h5 file:

    h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
    var readArray=h5lt.readDataset(group.id, "Rat Liver");

where the array is filled from a text file read and split on "\n":

    const lineArr = ratLiver.trim().split("\n");
    var lines = new Array(lineArr.length);
    var index = 0;
    var maxLength = 0;
    /* Loop over every line, copying it and tracking the longest string. */
    lineArr.forEach(function (line) {
        if (index < lines.length) {
            lines[index] = line;
            if (maxLength < line.length) maxLength = line.length;
        }
        index++;
    });
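
For a quick round-trip sanity check, something like this could be run on the same group and lines (a sketch only; fixed-width storage may pad entries, hence the trim):

    // Compare what was written with what readDataset returns.
    var readArray = h5lt.readDataset(group.id, "Rat Liver");
    var mismatches = 0;
    for (var i = 0; i < lines.length; i++) {
        if (String(readArray[i]).trim() !== lines[i].trim()) mismatches++;
    }
    console.log('mismatched rows: ' + mismatches + ' of ' + lines.length);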

Relooking at variable length

rimmartin commented 7 years ago

Variable-length I/O is now working.