I'll experiment. Have to see if it works with the slash on 'dataset'.
Should work without the slash; just the name
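Either form should end up addressing the same object, since a name passed with file.id is resolved against the root group. A minimal sketch, assuming file is an open hdf5.File and dataset is the array being written:
// 'file' is an open hdf5.File and 'dataset' is the array being written.
// Names passed with file.id are resolved against the root group, so
// both calls below address the same object, '/dataset'.
h5lt.makeDataset(file.id, 'dataset', dataset);
// h5lt.makeDataset(file.id, '/dataset', dataset);  // equivalent absolute form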
Slash or no slash, I keep getting the same error when I increase the array's length from 10000 to 100000. I'll try to bisect until I find the exact length that triggers this behaviour.
It breaks when going from a length of 73901 to 73902.
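For the bisection, this harness is roughly what I have in mind; writer.js is a hypothetical script that takes the array length as its first argument and exits non-zero when the write fails:
var execFileSync = require('child_process').execFileSync;

// writer.js (hypothetical) builds an array of the given length and calls
// h5lt.makeDataset; a crash or thrown error makes execFileSync throw here.
function writeSucceeds(length) {
    try {
        execFileSync('node', ['writer.js', String(length)]);
        return true;
    } catch (err) {
        return false;
    }
}

var lo = 10000;   // known-good length
var hi = 100000;  // known-bad length
while (hi - lo > 1) {
    var mid = Math.floor((lo + hi) / 2);
    if (writeSucceeds(mid)) { lo = mid; } else { hi = mid; }
}
console.log('last good length:', lo, '/ first bad length:', hi);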
Also, when I examine the file with h5dump -d /datasetName, I'm getting the JSON representation of the whole array as the first point in the dataset.
EDIT: I was wrong, apologies. It looks like JSON but it's not JSON.
This is the header for a Uint16 dataset within the same file:
DATASET "/station_id" {
DATATYPE H5T_STD_U16LE
DATASPACE SIMPLE { ( 100 ) / ( 100 ) }
ATTRIBUTE "type" {
DATATYPE H5T_STD_U32LE
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
}
This is the header for the string dataset:
DATASET "/station_name" {
DATATYPE H5T_ARRAY { [100] H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} }
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
My code is based on the tutorial for variable length strings here: http://hdf-ni.github.io/hdf5.node/tut/dataset-tutorial.html.
What are you filling the Array entries with? I suppose for a test a random string generator could be used. Or find a text document with over 80,000 lines... Testing
The following code
var fs = require('fs');
var hdf5 = require('../common/hdf5').hdf5;
var h5lt = require('../common/hdf5').h5lt;
var h5gl = require('../common/hdf5').h5gl;
var path = require('path');
var shortid = require('shortid');
var filePath = path.join(__dirname, 'test-hdf5.h5');
var file = new hdf5.File(filePath, h5gl.Access.ACC_TRUNC);
var length = 10;
var dataset = new Array(length);
for (var i = 0; i < length; i++) {
    dataset[i] = shortid.generate();
}
h5lt.makeDataset(file.id, 'test', dataset);
file.close();
produces a file that, when examined with h5dump -d /test --stride 1 --start 0 --count 1 products/test-hdf5.h5, shows the following:
HDF5 "products/test-hdf5.h5" {
DATASET "/test" {
DATATYPE H5T_ARRAY { [10] H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
} }
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
SUBSET {
START ( 0 );
STRIDE ( 1 );
COUNT ( 1 );
BLOCK ( 1 );
DATA {
(0): [ "r1_oscv0", "SkeOjocDR", "Sy-doo5DC", "SJfujjcwR", "Bkmuio9PA", "BJNuoo9D0", "rkBdsjqPA", "ry8OssqvR", "ryv_ii5wR", "ryd_ojqw0" ]
}
}
}
}
This is what I was referring to before - it looks like the entire array of strings is being stored as the first point in the dataset, rather than each string being treated as a separate point.
I got a test case set up by reading in a PDB of the rat liver molecule from https://pdb101.rcsb.org/motm/114. It's close to a million lines and cuts out between 70000 and 80000, so I'm able to repeat and test.
It might have to do with some handle limit on Linux. For example, on my Ubuntu:
cat /proc/sys/fs/file-max
808097
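That's the system-wide cap; the per-process soft limit can be read from Node itself on Linux, a quick sketch parsing /proc/self/limits:
// Print this process's open-file limit on Linux by parsing
// /proc/self/limits; the file-max value above is the system-wide cap.
var fs = require('fs');
var line = fs.readFileSync('/proc/self/limits', 'utf8')
    .split('\n')
    .filter(function (l) { return l.indexOf('Max open files') === 0; })[0];
console.log(line);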
I guess there are two sides to this - the cut out and the array-of-strings vs. strings dataset. Happy to contribute in any way I can. Feel free to send tests my way. I'll check the fs limit as soon as I get back home.
filename = '/home/jacopo/data-backend/products/gistemp/gistemp.h5', file descriptor = 12, errno = 14, error message = 'Bad address', buf = 0x55c61fcac378, total write size = 422496, bytes this sub-write = 422496, bytes actually written = 18446744073709551615, offset = 1179648
filename = './roothaan.h5', file descriptor = 9, errno = 14, error message = 'Bad address', buf = 0x487f858, total write size = 98400, bytes this sub-write = 98400, bytes actually written = 18446744073709551615, offset = 1183744
The 'bytes actually written' value is crazy in both your test and mine (18446744073709551615 is 2^64 - 1, i.e. -1 cast to an unsigned 64-bit integer), but the error message itself is 'Bad address' (errno 14, EFAULT).
With my test as-is, i.e. using shortid.generate(), I can go up to a length of 73862. A length of 73863 breaks one every two runs (more or less) and 73864 always breaks.
However, switching to the following filler loop only got me up to 73820, breaking on all runs from 73821 going upward.
for (var i = 0; i < length; i++) {
    dataset[i] = 'hello ' + i;
}
Lengthening the string to 'helloworldhelloworld ' + i still got me up to 73820. Curiously enough, inverting the order to i + ' hello' got me to a different number, 73746.
There must be a pattern but I can't see it ATM. Perhaps we're hitting some kind of limit on how big an array of strings can be within an array-of-strings-typed dataset (even though we shouldn't be getting an array-of-strings-typed dataset in the first place).
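If the limit is on total bytes rather than on element count, a quick sketch like this should show similar byte totals at the different breaking lengths (assuming dataset was built with at least 73821 entries of the same filler):
// Sum the byte lengths of the first n strings; if the breaking points for
// different fillers correspond to similar totals, the limit is on bytes.
function totalBytes(strings, n) {
    return strings.slice(0, n).reduce(function (sum, s) {
        return sum + Buffer.byteLength(s, 'ascii');
    }, 0);
}
console.log('bytes at last good length:', totalBytes(dataset, 73820));
console.log('bytes at first bad length:', totalBytes(dataset, 73821));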
PS: My file-max is 200676.
PS: Can I store fixed-length strings using node.hdf5?
Yeah, I was testing with
dataset[i] = 'hello ' + '\0';
It feels like some limit is being hit; a heap or a stack. Something. I may put the question to the HDF Group after I search their mailing list.
Yes, fixed-length was done for table columns. Let me test some; to make it clean I may add an option, for example:
h5lt.makeDataset(file.id, '/dataset', dataset, {fixed_width: 7});
Will continue to look at large sizes of everything, watching for breaks in the system.
That'd be lovely. Happy to test any solution you come up with.
Fixed width is coming. Need to test and work on reading back to JavaScript.
For writing there is no need to fix the length of the strings; just know the maximum length of them all. If this is too short for one string entry in the Array, an exception will be thrown from the native side to ensure data doesn't get messed up:
h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
Should commit this evening.
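To illustrate the guard, a sketch of the failure mode (same lines and maxLength as above):
// fixed_width below the longest string: the native side should throw
// instead of silently truncating the data.
try {
    h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength - 1});
} catch (err) {
    console.error('write rejected:', err.message);
}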
Wonderful, wonderful, wonderful.
Hi, sorry for the delay.
h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
now saves nearly 1 million lines from a text file for the rat liver PDB chemistry model. The fixed width is 80 in this case.
Need to test reading back to javascript yet
I'm building their C examples and extending them to work with large data; otherwise I've mirrored those examples in this project. Their docs don't say chunking is necessary, but it may be needed.
Fixed width is now working. Tested on about a million entries and a ~74 MB h5 file:
h5lt.makeDataset(group.id, "Rat Liver", lines, {fixed_width: maxLength});
var readArray=h5lt.readDataset(group.id, "Rat Liver");
where the array is filled from a text file read and split on "\n":
const lineArr = ratLiver.trim().split("\n");
var lines = new Array(lineArr.length);
var index = 0;
var maxLength = 0;
/* Loop over every line. */
lineArr.forEach(function (line) {
    if (index < lines.length) {
        lines[index] = line;
        if (maxLength < line.length) maxLength = line.length;
    }
    index++;
});
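Putting it together, a minimal round-trip sketch under the same assumptions as the snippets above (the local wrapper module from the earlier test script, and a file-level id instead of a group id):
var path = require('path');
var hdf5 = require('../common/hdf5').hdf5;
var h5lt = require('../common/hdf5').h5lt;
var h5gl = require('../common/hdf5').h5gl;

// 'lines' and 'maxLength' come from the fill loop above.
var file = new hdf5.File(path.join(__dirname, 'rat-liver.h5'), h5gl.Access.ACC_TRUNC);
h5lt.makeDataset(file.id, 'Rat Liver', lines, {fixed_width: maxLength});
var readArray = h5lt.readDataset(file.id, 'Rat Liver');
console.log('round trip ok:', readArray.length === lines.length);
file.close();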
Relooking at variable length
Variable-length IO is now working.
This works:
This doesn't:
Stacktrace:
I've tried to experiment a bit to no avail. Any ideas? It almost looks like it's running out of memory, even though there's plenty of disk space.