Open RubenVerborgh opened 4 years ago
> how is code size compared to the state of the art?
I'm also interested in this to know whether or not we could/should make graphy's parsers the new defaults in Comunica.
> make graphy's parsers the new defaults in Comunica.
We definitely should! And suppose graphy.js turns out to be too large, or there's another reason why we can't use it in the browser: then we just ship N3.js to browsers (but do our benchmarking on Node.js, of course). Thanks to RDF/JS, yay 🎉
Interesting proposal! Let me answer a few points:
> the options `{ validate: true, maxStringLength: Infinity }`.
>
> Could it be made the default?
Absolutely. This is actually in the upcoming major release: a breaking change removes the `validate` option and enables validation by default (the performance tradeoff is marginal), and in its place is a `relax` option (much like serd's `lax` option).
Also in `next`, `maxStringLength` is set to `Infinity` by default. The whole idea of this flag was to protect against invalid STRING_LITERAL_LONG_QUOTE tokens within (endless) streams, but I think this is a rare use case.
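To make the purpose of that flag concrete, here is a hypothetical sketch (not graphy's actual implementation) of what such a guard protects against: a streaming lexer buffering an unterminated long-quoted literal can cap how much it will accumulate before giving up, instead of buffering an endless stream.

```javascript
// Illustrative only: a guard for a streaming lexer reading a """..."""
// literal. It returns the literal body once the closing quotes arrive,
// returns null while still waiting, and throws once the cap is exceeded.
function makeLongStringGuard(maxStringLength) {
  let buffered = '';
  return {
    feed(chunk) {
      buffered += chunk;
      const end = buffered.indexOf('"""');
      if (end >= 0) return buffered.slice(0, end);
      if (buffered.length > maxStringLength) {
        throw new Error('unterminated literal exceeds maxStringLength');
      }
      return null; // keep waiting for more input
    },
  };
}
```

With `maxStringLength: Infinity`, the cap never triggers, which is fine for well-formed input but means a stream whose long literal never terminates would be buffered forever.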
> sticky regexes cannot be set to start at a certain point, whereas with chopped-off strings, you can force to start at the beginning and fail fast.
The real strength of sticky regexes comes from the fact that you can set the `.lastIndex` property to start matching a string at a given index (which also acts as an implicit start anchor), rather than creating a new string on every match with `String#slice` and going from there (this has huge cost savings).
For an example of the sticky anchored substring technique:

```js
let r_world = /world/y;
r_world.lastIndex = 2;  // remaining input: 'llo world!'
r_world.exec('hello world!');  // null
```

and

```js
let r_world = /world/y;
r_world.lastIndex = 6;  // remaining input: 'world!'
r_world.exec('hello world!');  // ["world", index: 6, input: "hello world!"]
```
> Could you imagine any other downsides to sticky?
Setting `.lastIndex` on a sticky regex is equivalent to slicing the input string before matching.
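That equivalence can be seen directly: both calls below attempt the same anchored match at position 6 of the same input, but the sticky version avoids allocating a new substring.

```javascript
const input = 'hello world!';

// chopped-string approach: allocate a new slice, anchor the pattern with ^
const viaSlice = /^world/.exec(input.slice(6));

// sticky approach: reuse the original string, start matching at index 6
const r_world = /world/y;
r_world.lastIndex = 6;
const viaSticky = r_world.exec(input);

// both find the same token, 'world'; only the sticky version keeps
// the original string intact and reports the absolute match index (6)
```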
Here's a basic overview of how it reads `"a", "here come a million characters"@en`:

- `"`: goto string_literal
- `a"`: lookahead: `,`, goto post_object
- `,`: consumed: whitespace, goto object_list
- `"`: goto string_literal
- `here come a million characters"`: lookahead: `@`, goto: datatype_or_lang
- `@en`: consumed: whitespace

> Has graphy.js been well-tested for arbitrary stream buffer boundaries?
Very good criterion. Yes, all positive test cases are run through one character at a time in order to test all stream boundaries: https://github.com/blake-regalia/graphy.js/blob/master/test/helper/reader.js#L102
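As a simplified sketch of that testing strategy (a toy tokenizer, not graphy's actual helper): feeding the input one character at a time forces every possible chunk boundary, and the result should match parsing the input in one piece.

```javascript
// toy streaming tokenizer that splits on commas, buffering across chunks
function makeCommaTokenizer() {
  let buffer = '';
  const tokens = [];
  return {
    write(chunk) {
      buffer += chunk;
      let i;
      while ((i = buffer.indexOf(',')) >= 0) {
        tokens.push(buffer.slice(0, i));
        buffer = buffer.slice(i + 1);
      }
    },
    end() {
      if (buffer) tokens.push(buffer);
      return tokens;
    },
  };
}

// feed one character at a time so every stream boundary is exercised
function tokenizeCharByChar(input) {
  const tokenizer = makeCommaTokenizer();
  for (const ch of input) tokenizer.write(ch);
  return tokenizer.end();
}
```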
Graphy is actually a collection of libraries; for instance, `require('@graphy/content.ttl.read')` is the 'standalone' Turtle reader (it also depends on `@graphy/core.data.factory`). The main `graphy` package simply includes all these libraries for convenience and ships with a CLI tool.
However, I definitely see code size as a priority for browsers and agree that load time is significant to performance. Not much effort has been put into testing this yet.
> Absolutely. This is actually in the upcoming major release with a breaking change that removes the `validate` option and enables validation by default (the performance tradeoff is marginal) and in its place is a `relax` option (much like serd's `lax` option).
Perfect; agree that this is how it should be.
> Also in `next`, `maxStringLength` is set to `Infinity` by default. The whole idea of this flag was to protect against invalid STRING_LITERAL_LONG_QUOTE tokens within (endless) streams, but I think this is a rare use-case.
Let's not pretend you haven't met @LaurensRietveld 😄
Related question: even with `Infinity`, do you guard against `"bla \n\r` (where the backslashed letters are the actual characters, not the escaped versions), i.e., a literal that can never properly terminate anymore, so there's no point waiting for the final quote?
> The real strength of sticky regexes comes from the fact that you can set the `.lastIndex` property to start matching a string at a given index (which also acts as an implicit start anchor),
Oh gosh, yes. My misunderstanding; thanks for clearing that up.
> Graphy is actually a collection of libraries, so for instance, `require('@graphy/content.ttl.read')` is the 'standalone' Turtle reader (it also depends on `@graphy/core.data.factory`).
Thanks, I somehow missed that.
> However, I definitely see code size as a priority for browsers and agree that load time is significant to performance. Not much effort has been put into testing this yet.
Then @rubensworks and I will find out within Comunica. The code-sharing path should be an interesting one though, for both performance and size.
No further concerns from me; happy to support this as default RDF/JS implementation, pending the size comment (that should not be a blocker).
> do you guard against `"bla \n\r`
Exactly. Yes, this is currently handled in the master branch, but `3.2.2` actually did not check for this. It corresponds to this step in the example from above:

> check unparsed text for invalid characters (e.g., newlines not allowed in STRING_LITERAL_QUOTE -- throw parsing error if invalid start of string token)
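An illustrative version of that check (hypothetical code, not graphy's): before waiting for more input to terminate a STRING_LITERAL_QUOTE, scan the unparsed text for characters that can never appear unescaped inside one. A raw line break means the literal can never terminate validly, so the reader can throw immediately instead of buffering the rest of the stream.

```javascript
// Raw (unescaped) line breaks are forbidden inside "..." literals in
// Turtle, so finding one in the pending text means we can fail fast.
function assertStringChunkValid(unparsed) {
  if (/[\n\r]/.test(unparsed)) {
    throw new Error(
      'invalid character in STRING_LITERAL_QUOTE: literal can never terminate');
  }
}
```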
I should probably retire N3.js 🙂 Except for the Notation3 parsing; is that planned at some point? The other component left is `N3Store`. Does graphy have an alternative?
> I should probably retire N3.js
Maybe someday in the distant future but for now I hope it remains maintained. Graphy has not seen a whole lot of mainstream usage yet.
Notation3 has not been planned yet. As for Store, `DatasetTree` is the graphy alternative, although I'd be curious what the differences in capabilities and performance are.
Now that graphy.js is fully spec-compliant (and has been for some time), I think we should strongly consider making it the default parser of RDF/JS. There was definitely a point to an arms race many years ago, when I created N3.js, and it was able to pull off a performance difference of two orders of magnitude compared to the state of the art. Not only has that gap disappeared nowadays (thanks to a much faster V8, it seems), but also, graphy.js is simply much faster. In fact, I think the best I could achieve with N3.js is to be as fast as graphy.js; I see few options to get much better. And frankly, I wouldn't have the time anymore either.
That said, here are a couple of questions to understand the implications of switching to graphy.js.
I noticed that spec-compatible parsing is not the default, in the sense that one would have to pass the options `{ validate: true, maxStringLength: Infinity }`.

- Are there severe performance consequences of doing so? (I haven't seen any.)
- Is `maxStringLength` nothing more than a safety guard?
- Could it be made the default? It's good for consumers to know that a `NamedNode` has a valid `value` rather than having to check it.

One of the reasons for the performance difference seems to be the use of sticky regular expressions. Back in the day (🦕), I remember having to make that decision with N3.js as well, but there was insufficient support for it (only Firefox, if I remember correctly). Now we have it, and it seems to be faster than always chopping off the beginning of the string, like N3.js does. However, what has also stopped me in the past is the fact that sticky regexes cannot be set to start at a certain point, whereas with chopped-off strings, you can force them to start at the beginning and fail fast.
For instance, with `"a", "here come a million characters"@en`, would it search very far to find the language tag after the first string? I understand such examples are probably artificial; I just want to understand if there are any downsides to the sticky bit.

Are there in general any disadvantages of the graphy.js parsers that you know about?
Has graphy.js been well-tested for arbitrary stream buffer boundaries? Basically all tests in https://github.com/rdfjs/N3.js/blob/b2ff96d35ed586fce1a02c567fb3ba9c10272598/test/N3Lexer-test.js#L155 that match `/streamOf/`. I would adapt them to graphy.js, but it doesn't have a separate lexer (likely another source of performance gains). Some of those tests are only the result of crazy usage, like with LOD Laundromat files, i.e., very hard to imagine all special cases and possible splits. This one stands out in particular.

Graphy is not an ES6 module, and thus does not support tree shaking, so code would still be unnecessarily large in several cases, which would especially hurt browsers. Would you consider implementing that? (Or otherwise partitioning the code, for example through specific include paths?)
I think that's all for now, some more questions might pop up. Thanks in advance for your insights.