[fyi] Parsing performance

LaurensRietveld commented 4 years ago

Just an fyi related to parsing performance and quality.

I ran graphy (in relax mode) on the gzipped LOD-a-lot file (see https://api.krr.triply.cc/krr/lod-a-lot/; the download is at https://api.krr.triply.cc/krr/lod-a-lot/download.nt).

This file is gzipped, contains about 28 billion statements, and comes from the LOD Laundromat crawl (i.e., it includes some syntactically atypical statements).

The statements in this crawl already went through a few other parsers (including N3 and serd), so this is a nice test for atypical but valid linked data. As a consequence, it does not test the error handling of graphy.

The test script is as follows:

const fs = require('fs');
const nt_read = require('@graphy/content.nt.read');
const zlib = require('zlib');

let start = Date.now();
let count = 0;
fs.createReadStream('lodalod.nt.gz')
    .pipe(zlib.createUnzip())
    .on('error', console.error)
    .pipe(nt_read({relax:true}))
    .on('data', (y_quad) => {
        count++;
    })
    .on('error', console.error)
    .on('eof', () => {
        console.log('done!');
        console.log('count', count);
        console.log('duration', Date.now() - start);
    });

Results:

No parse errors :+1:
Performance: 1345 minutes (N3 took 1940 minutes).

blake-regalia commented 4 years ago

Thanks for the feedback! Just as a baseline, do you happen to know how long serd takes?

I've also been experimenting with multicore N-Triples reading. It scales well but requires random access to the input (e.g., fs.read), so it does not work when reading from a stream. I usually end up parallelizing at a higher layer.
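To illustrate why random access matters here, the following is a minimal sketch of one way to divide an N-Triples file among workers. It is not graphy's implementation. The helper name `newlineAlignedRanges` is hypothetical, and the sketch assumes an uncompressed file on disk, since a gzip stream cannot be read from an arbitrary offset. Each boundary is moved forward to just past the next newline so that no statement straddles two ranges.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of graphy): split a file into at most
// `n_chunks` byte ranges whose boundaries fall just after a newline.
function newlineAlignedRanges(p_file, n_chunks) {
    const fd = fs.openSync(p_file, 'r');
    try {
        const nb_file = fs.fstatSync(fd).size;
        const at_probe = Buffer.alloc(64 * 1024);
        const a_bounds = [0];

        for (let i_chunk = 1; i_chunk < n_chunks; i_chunk++) {
            const ib_prev = a_bounds[a_bounds.length - 1];

            // start at the evenly spaced guess, but never move backwards
            let ib_pos = Math.max(Math.floor(nb_file * i_chunk / n_chunks), ib_prev);
            let b_found = false;

            // random-access reads: scan forward until the next newline byte
            while (!b_found && ib_pos < nb_file) {
                const nb_read = fs.readSync(fd, at_probe, 0, at_probe.length, ib_pos);
                if (nb_read === 0) break;
                const i_newline = at_probe.subarray(0, nb_read).indexOf(0x0a);
                if (i_newline >= 0) {
                    ib_pos += i_newline + 1;
                    b_found = true;
                }
                else {
                    ib_pos += nb_read;
                }
            }

            // skip boundaries that would create an empty range
            if (b_found && ib_pos > ib_prev && ib_pos < nb_file) a_bounds.push(ib_pos);
        }

        a_bounds.push(nb_file);

        // convert boundaries into [start, end) ranges
        const a_ranges = [];
        for (let i_bound = 0; i_bound < a_bounds.length - 1; i_bound++) {
            a_ranges.push({start: a_bounds[i_bound], end: a_bounds[i_bound + 1]});
        }
        return a_ranges;
    }
    finally {
        fs.closeSync(fd);
    }
}
```

Each range could then be handed to a separate worker that opens the file itself and reads only its own bytes, for example with `fs.createReadStream(p_file, {start, end: end - 1})`, whose `end` option is inclusive.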

LaurensRietveld commented 4 years ago

I had to re-run the serd test a few times, as I was surprised by the outcome:

time zcat lodalod.nt.gz | serdi -l -q -i ntriples - > /dev/null

real    5653m53,032s
user    6168m25,191s
sys     142m22,200s

i.e., it's taking about 94 hours of wall-clock time (this is with serd 0.30.2).
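For a rough side-by-side comparison, the wall-clock times reported in this thread can be converted to throughput. This is only a back-of-the-envelope sketch: it assumes the "about 28 billion statements" figure is exactly 28e9, and it uses the `real` time for serd. The function name `statementsPerSecond` is made up for illustration.

```javascript
// Assumption: the statement count is taken as exactly 28e9 ("about 28 billion").
const N_STATEMENTS = 28e9;

// Wall-clock durations in minutes, as reported in this thread.
const H_RUNS = {
    graphy: 1345,
    n3: 1940,
    serd: 5653 + 53.032 / 60,  // real 5653m53,032s
};

// Average throughput over the whole run, in statements per second.
function statementsPerSecond(n_statements, x_minutes) {
    return n_statements / (x_minutes * 60);
}

for (const [s_parser, x_minutes] of Object.entries(H_RUNS)) {
    const x_rate = statementsPerSecond(N_STATEMENTS, x_minutes);
    console.log(
        s_parser,
        Math.round(x_rate / 1e3) + 'k statements/sec',
        (x_minutes / 60).toFixed(1) + ' hours'
    );
}
```

Under these assumptions, graphy comes out at roughly 347k statements per second, N3 at roughly 241k, and serd at roughly 83k. All three runs include gzip decompression, so these are pipeline rates, not pure parser rates.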

blake-regalia commented 4 years ago

@LaurensRietveld: I'd be curious to know how well the new multithreaded N-Triples parser performs for you. In testing, I was able to get 4 million quads/sec on a 28-CPU machine.

Using your example, you could run a CLI command like so:

$ time zcat lodalod.nt.gz | graphy scan -c nt / count > /dev/null

Or, if you prefer the API:

const fs = require('fs');
const nt_scan = require('@graphy/content.nt.scan');
const zlib = require('zlib');

let start = Date.now();
let ds_input = fs.createReadStream('lodalod.nt.gz')
    .pipe(zlib.createUnzip())
    .on('error', console.error);

nt_scan(ds_input, {
    run: /* syntax: js */ `
        (read, err, update, submit) => {
            let c_stmts = 0;

            return read({
                relax: true,  // or false

                data() {
                    c_stmts += 1;
                },

                error(e_read) {
                    err(e_read);
                },

                eof() {
                    submit(c_stmts);
                },
            });
        }
    `,
    error: console.error,
    report(count) {
        console.log('done!');
        console.log('count', count);
        console.log('duration', Date.now() - start);
    },
});

See 4.2.0 release details.

blake-regalia / graphy.js

[fyi] Parsing performance #23