LaurensRietveld closed this 4 years ago
Thanks for the feedback! Just for a baseline, do you happen to know how long serd takes?
I've also been experimenting with multicore N-Triples reading, and it scales well but requires random access to the input (e.g., fs.read), so reading from a stream is not applicable. I usually end up parallelizing at a higher layer.
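To illustrate why random access matters here, the following is a minimal sketch (not graphy's actual implementation) of how an uncompressed N-Triples file could be divided into byte ranges that each end on a line boundary, so that separate workers could read their own range with positioned reads. The helper name splitAtNewlines and the 64 KiB probe size are assumptions made for this example; only the Node.js fs calls are real APIs.

```javascript
const fs = require('fs');

// Hypothetical helper: split a file into at most `nParts` contiguous byte
// ranges whose boundaries fall just after a '\n', so that no line-based
// statement is cut in half. Relies on positioned reads (random access).
function splitAtNewlines(path, nParts) {
  const fd = fs.openSync(path, 'r');
  try {
    const size = fs.fstatSync(fd).size;
    const probe = Buffer.alloc(64 * 1024); // assumed probe size
    const cuts = [0];

    for (let i = 1; i < nParts; i++) {
      // start from the approximate boundary, never before the previous cut
      let pos = Math.max(Math.floor((size * i) / nParts), cuts[cuts.length - 1]);
      let cut = size;

      // scan forward until the next newline byte (0x0a)
      while (pos < size) {
        const nRead = fs.readSync(fd, probe, 0, probe.length, pos);
        if (nRead === 0) break;
        const idx = probe.subarray(0, nRead).indexOf(0x0a);
        if (idx !== -1) {
          cut = pos + idx + 1;
          break;
        }
        pos += nRead;
      }

      // skip boundaries that would produce an empty range
      if (cut > cuts[cuts.length - 1]) cuts.push(cut);
    }

    if (cuts[cuts.length - 1] < size) cuts.push(size);

    const ranges = [];
    for (let i = 0; i + 1 < cuts.length; i++) {
      ranges.push({ start: cuts[i], end: cuts[i + 1] });
    }
    return ranges;
  } finally {
    fs.closeSync(fd);
  }
}
```

Each returned range could then be handed to a worker that reads only bytes start through end. A gzip stream offers no comparable way to jump to an arbitrary offset, which is consistent with the point above that this approach does not apply when reading from a stream.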
I had to re-run the serd test a few times, as I was surprised by the outcome:
time zcat lodalod.nt.gz | serdi -l -q -i ntriples - > /dev/null
real 5653m53,032s
user 6168m25,191s
sys 142m22,200s
i.e., it's taking about 94 hours (this is with serd 0.30.2)
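As a quick check of the arithmetic, here is a small sketch that converts the reported wall-clock time into hours and an approximate throughput. The statement count of roughly 28 billion is the approximate figure given for this file elsewhere in the thread, so the throughput is only a ballpark; the function name parseTimeToSeconds is made up for this example.

```javascript
// Parse one value from `time` output, such as "5653m53,032s", into seconds.
// The comma is a locale-specific decimal separator.
function parseTimeToSeconds(text) {
  const match = /^(\d+)m(\d+(?:[.,]\d+)?)s$/.exec(text.trim());
  if (!match) throw new Error(`unrecognized time format: ${text}`);
  return Number(match[1]) * 60 + Number(match[2].replace(',', '.'));
}

const realSeconds = parseTimeToSeconds('5653m53,032s');
const hours = realSeconds / 3600;

// ASSUMPTION: approximate statement count, not an exact figure
const APPROX_STATEMENTS = 28e9;
const statementsPerSecond = APPROX_STATEMENTS / realSeconds;

console.log(`${hours.toFixed(1)} hours`); // prints "94.2 hours"
console.log(`~${Math.round(statementsPerSecond)} statements/sec`);
```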
@LaurensRietveld: I'd be curious to know how well the new multithreaded N-Triples parser performs for you. In testing, I was able to get 4 million quads/sec on a 28-CPU machine.
Using your example, you could run a CLI command like so:
$ time zcat lodalod.nt.gz | graphy scan -c nt / count > /dev/null
Or, if you prefer the API:
const fs = require('fs');
const nt_scan = require('@graphy/content.nt.scan');
const zlib = require('zlib');

let start = Date.now();

let ds_input = fs.createReadStream('lodalod.nt.gz')
  .pipe(zlib.createUnzip())
  .on('error', console.error);

nt_scan(ds_input, {
  run: /* syntax: js */ `
    (read, err, update, submit) => {
      let c_stmts = 0;

      return read({
        relax: true,  // or false

        data() {
          c_stmts += 1;
        },

        error(e_read) {
          err(e_read);
        },

        eof() {
          submit(c_stmts);
        },
      });
    }
  `,

  error: console.error,

  report(count) {
    console.log('done!');
    console.log('count', count);
    console.log('duration', Date.now() - start);
  },
});
Just an FYI related to parsing performance and quality.
I ran graphy (in relax mode) on a gzipped LOD-a-lot file (see https://api.krr.triply.cc/krr/lod-a-lot/, and https://api.krr.triply.cc/krr/lod-a-lot/download.nt for the download). This file is gzipped, contains about 28 billion statements, and comes from the LOD Laundromat crawl (i.e., it includes some syntactically atypical statements).
The statements in this crawl already went through a few other parsers (including N3 and serd), so this is a nice test for atypical but valid linked data. As a consequence, it does not test the error handling of graphy.
The test script is as follows:
Results: