blake-regalia / graphy.js

A collection of RDF libraries for JavaScript
https://graphy.link/
ISC License

Make graphy.js parsers the default in RDF.JS / rdf-ext? #14

Open RubenVerborgh opened 4 years ago

RubenVerborgh commented 4 years ago

Now that graphy.js is fully spec-compliant (and has been for some time), I think we should strongly consider making it the default parser of RDF/JS. There was definitely a point to an arms race many years ago, when I created N3.js, and it was able to pull off a performance difference of two orders of magnitude compared to the state of the art. Not only has that gap disappeared nowadays (thanks to a much faster V8, it seems), but also, graphy.js is simply much faster. In fact, I think the best I could achieve with N3.js is to be as fast as graphy.js; I see few options to get much better. And frankly, I wouldn't have the time anymore either.

That said, a couple of questions perhaps to understand the implications of switching to graphy.js.

I think that's all for now, some more questions might pop up. Thanks in advance for your insights.

rubensworks commented 4 years ago

how is code size compared to the state of the art?

I'm also interested in this to know whether or not we could/should make graphy's parsers the new defaults in Comunica.

RubenVerborgh commented 4 years ago

make graphy's parsers the new defaults in Comunica.

We definitely should! And suppose that graphy.js is too large, or there's another reason why we can't use it in the browser, we just ship N3.js to browsers (but do our benchmarking on Node.js of course). Thanks to RDF/JS, yay 🎉

blake-regalia commented 4 years ago

Interesting proposal! Let me answer a few points:

Validation and maxStringLength

the options { validate: true, maxStringLength: Infinity }.

  • Could it be made the default?

Absolutely. This is actually in the upcoming major release with a breaking change that removes the validate option and enables validation by default (the performance tradeoff is marginal) and in its place is a relax option (much like serd's lax option).

Also in next, maxStringLength is set to Infinity by default. The whole idea of this flag was to protect against invalid STRING_LITERAL_LONG_QUOTE tokens within (endless) streams, but I think this is a rare use-case.
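To illustrate what such a limit protects against, here is a minimal sketch of a length guard on an unterminated long-quoted literal. It is not graphy's implementation: `readLongQuoted` and `endless` are hypothetical names, the input is assumed to start at the opening `"""`, and escape sequences are ignored for brevity.

```javascript
// Hypothetical sketch, not graphy's code. Assumes the chunks start at the
// opening `"""` of a long-quoted literal and ignores escape sequences.
function readLongQuoted(chunks, maxStringLength = Infinity) {
  const DELIM = '"""';
  let buffered = '';
  for (const chunk of chunks) {
    buffered += chunk;
    // Search for the closing delimiter after the opening one.
    const close = buffered.indexOf(DELIM, DELIM.length);
    if (close !== -1) return buffered.slice(DELIM.length, close);
    // No terminator yet: refuse to buffer beyond the configured limit.
    if (buffered.length - DELIM.length > maxStringLength) {
      throw new Error('long-quoted literal exceeds maxStringLength');
    }
  }
  throw new Error('input ended inside a long-quoted literal');
}

// A literal split across arbitrary chunk boundaries still terminates.
console.log(readLongQuoted(['"""he', 'llo""', '" .'])); // hello

// An endless stream that never closes the literal is cut off by the limit.
function* endless() {
  yield '"""';
  for (;;) yield 'xxxx';
}
try {
  readLongQuoted(endless(), 10);
} catch (err) {
  console.log(err.message);
}
```

With the limit at `Infinity`, the second call would keep buffering forever, which is the case the flag was meant to cover.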

Sticky regexes

sticky regexes cannot be set to start at a certain point, whereas with chopped-off strings, you can force to start at the beginning and fail fast.

The real strength of sticky regexes comes from the fact that you can set the .lastIndex property to start matching a string at a given index (which also acts as an implicit start anchor), rather than creating a new string every match with String#slice and going from there (this has huge cost savings).

For an example of the sticky anchored substring technique:

let r_world = /world/y;
r_world.lastIndex = 2;  // 'llo world!'
r_world.exec('hello world!');  // null

and

let r_world = /world/y;
r_world.lastIndex = 6;  // 'world!'
r_world.exec('hello world!');  // ["world", index: 6, input: "hello world!"]

Could you imagine any other downsides to sticky?

Setting .lastIndex on a sticky regex is functionally similar to slicing the input string and matching with a start anchor, but it avoids creating a new string for every match.
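For comparison, the two approaches can be put side by side. This is a minimal sketch using only built-in `RegExp` and `String` methods; the helper names `matchSticky` and `matchSliced` are made up for illustration and do not come from graphy or N3.js.

```javascript
const input = 'hello world!';

// Approach 1: sticky regex. No substring is created; lastIndex acts
// as an implicit start anchor at the given position.
function matchSticky(pattern, text, index) {
  const sticky = new RegExp(pattern, 'y');
  sticky.lastIndex = index;
  const m = sticky.exec(text);
  return m ? { value: m[0], end: sticky.lastIndex } : null;
}

// Approach 2: slice the input first, then match with an explicit ^ anchor.
function matchSliced(pattern, text, index) {
  const anchored = new RegExp('^(?:' + pattern + ')');
  const m = anchored.exec(text.slice(index)); // new string per attempt
  return m ? { value: m[0], end: index + m[0].length } : null;
}

console.log(matchSticky('world', input, 6)); // { value: 'world', end: 11 }
console.log(matchSliced('world', input, 6)); // { value: 'world', end: 11 }
console.log(matchSticky('world', input, 2)); // null
console.log(matchSliced('world', input, 2)); // null
```

Both fail fast when the pattern does not match at the requested position. The results can differ for patterns that look at preceding context (for example `\b` or lookbehind), since the sliced string no longer contains the characters before the index.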

Here's a basic overview of how it reads "a", "here come a million characters"@en:

Stream boundaries

Has graphy.js been well-tested for arbitrary stream buffer boundaries?

Very good criterion. Yes, all positive test cases are run through one character at a time in order to test all stream boundaries. https://github.com/blake-regalia/graphy.js/blob/master/test/helper/reader.js#L102
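The idea behind that kind of test can be sketched as follows. This is not the linked test helper: `createLineReader` is a toy stand-in for a streaming parser and `readWithChunks` is a hypothetical harness, both invented here to show the technique of comparing whole-input output against one-character-at-a-time output.

```javascript
// Toy stand-in for a streaming parser: emits complete lines and keeps
// any trailing partial line until more input arrives.
function createLineReader(onLine) {
  let pending = '';
  return {
    write(chunk) {
      pending += chunk;
      let nl;
      while ((nl = pending.indexOf('\n')) !== -1) {
        onLine(pending.slice(0, nl));
        pending = pending.slice(nl + 1);
      }
    },
    end() {
      if (pending.length > 0) onLine(pending);
      pending = '';
    },
  };
}

// Feed the given chunks through a fresh reader and collect its output.
function readWithChunks(chunks) {
  const lines = [];
  const reader = createLineReader((line) => lines.push(line));
  for (const chunk of chunks) reader.write(chunk);
  reader.end();
  return lines;
}

const doc = '<a> <b> <c> .\n<d> <e> "f" .\n';
const whole = readWithChunks([doc]);
const perChar = readWithChunks(Array.from(doc)); // one character per chunk

// The output must not depend on where the chunk boundaries fall.
console.log(JSON.stringify(whole) === JSON.stringify(perChar)); // true
```

Feeding single characters places a chunk boundary at every possible position, so any token that is mishandled when split across chunks shows up as a difference between the two runs.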

Code size

Graphy is actually a collection of libraries, so for instance, require('@graphy/content.ttl.read') is the 'standalone' Turtle reader (it also depends on @graphy/core.data.factory). The main package graphy simply includes all these libraries for convenience and ships with a CLI tool.

However, I definitely see code size as a priority for browsers and agree that load time is significant to performance. Not much effort has been put into testing this yet.

RubenVerborgh commented 4 years ago

Absolutely. This is actually in the upcoming major release with a breaking change that removes the validate option and enables validation by default (the performance tradeoff is marginal) and in its place is a relax option (much like serd's lax option).

Perfect; agree that this is how it should be.

Also in next, maxStringLength is set to Infinity by default. The whole idea of this flag was to protect against invalid STRING_LITERAL_LONG_QUOTE tokens within (endless) streams, but I think this is a rare use-case.

Let's not pretend you haven't met @LaurensRietveld 😄

Related question: even with Infinity, do you guard against "bla \n\r (where the backslashed letters are the actual characters, not the escaped versions), i.e., a literal that can never properly terminate anymore, so no point waiting for the final quote?

The real strength of sticky regexes comes from the fact that you can set the .lastIndex property to start matching a string at a given index (which also acts as an implicit start anchor),

Oh gosh, yes. Misunderstanding, thanks for clearing that up.

Graphy is actually a collection of libraries, so for instance, require('@graphy/content.ttl.read') is the 'standalone' Turtle reader (it also depends on @graphy/core.data.factory).

Thanks, I somehow missed that.

However, I definitely see code size as a priority for browsers and agree that load time is significant to performance. Not much effort has been put into testing this yet.

Then @rubensworks and myself will find out within Comunica. The sharing code path should be an interesting one though, for both performance and size.

No further concerns from me; happy to support this as default RDF/JS implementation, pending the size comment (that should not be a blocker).

blake-regalia commented 4 years ago

do you guard against "bla \n\r

Exactly. Yes, this is currently handled on the master branch, but 3.2.2 did not check for this. It corresponds to this step in the example from above:

check unparsed text for invalid characters (e.g., newlines not allowed in STRING_LITERAL_QUOTE -- throw parsing error if invalid start of string token)
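A fail-fast check of this kind might look like the sketch below. It is an illustration under stated assumptions, not graphy's code: `checkShortString` is a hypothetical function, `unparsed` is assumed to begin at the opening quote, and escape sequences are skipped without being validated.

```javascript
// Hypothetical check, not graphy's implementation. Returns 'complete' when a
// closing quote is found, 'incomplete' when more input is needed, and throws
// as soon as the literal can no longer terminate validly.
function checkShortString(unparsed) {
  if (unparsed[0] !== '"') throw new SyntaxError('expected opening quote');
  for (let i = 1; i < unparsed.length; i++) {
    const c = unparsed[i];
    if (c === '\\') { i++; continue; }   // skip the escaped character
    if (c === '"') return 'complete';
    if (c === '\n' || c === '\r') {
      // Raw line breaks are not allowed in STRING_LITERAL_QUOTE.
      throw new SyntaxError('raw line break in string literal at offset ' + i);
    }
  }
  return 'incomplete'; // no terminator yet, but nothing invalid either
}

console.log(checkShortString('"bla \\n" .')); // complete (escaped \n is fine)
console.log(checkShortString('"bla '));       // incomplete
try {
  checkShortString('"bla \n\r');              // actual newline characters
} catch (err) {
  console.log(err.message);
}
```

The point is that the reader does not wait for a closing quote that can never validly arrive, regardless of how large the length limit is.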

RubenVerborgh commented 4 years ago

I should probably retire N3.js 🙂 Except for the Notation3 parsing. Is that planned at some point? The other remaining component is N3Store; does graphy have an alternative?

blake-regalia commented 4 years ago

I should probably retire N3.js

Maybe someday in the distant future but for now I hope it remains maintained. Graphy has not seen a whole lot of mainstream usage yet.

Notation3 has not been planned yet. As for Store, DatasetTree is the graphy alternative, although I'd be curious what the differences in capabilities and performance are.