Open RubenVerborgh opened 4 years ago
> how is code size compared to the state of the art?
I'm also interested in this to know whether or not we could/should make graphy's parsers the new defaults in Comunica.
> make graphy's parsers the new defaults in Comunica.
We definitely should! And suppose graphy.js turns out to be too large, or there's another reason why we can't use it in the browser: then we just ship N3.js to browsers (but do our benchmarking on Node.js, of course). Thanks to RDF/JS, yay 🎉
Interesting proposal! Let me answer a few points:
> the options `{ validate: true, maxStringLength: Infinity }`.
>
> Could it be made the default?
Absolutely. This is actually in the upcoming major release: a breaking change removes the `validate` option and enables validation by default (the performance tradeoff is marginal), and in its place is a `relax` option (much like serd's `lax` option).
Also in `next`, `maxStringLength` is set to `Infinity` by default. The whole idea of this flag was to protect against invalid STRING_LITERAL_LONG_QUOTE tokens within (endless) streams, but I think this is a rare use case.
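To make the purpose of that flag concrete, here is a hypothetical sketch (not graphy's actual implementation) of what such a guard protects against: a streaming lexer buffering an unterminated long-quoted literal can cap how much it will accumulate before giving up, instead of buffering an endless stream.

```javascript
// Illustrative only: a guard for a streaming lexer reading a """..."""
// literal. It returns the literal body once the closing quotes arrive,
// returns null while still waiting, and throws once the cap is exceeded.
function makeLongStringGuard(maxStringLength) {
  let buffered = '';
  return {
    feed(chunk) {
      buffered += chunk;
      const end = buffered.indexOf('"""');
      if (end >= 0) return buffered.slice(0, end);
      if (buffered.length > maxStringLength) {
        throw new Error('unterminated literal exceeds maxStringLength');
      }
      return null; // keep waiting for more input
    },
  };
}
```

With `maxStringLength: Infinity`, the cap never triggers, which is fine for well-formed input but means a stream whose long literal never terminates would be buffered forever.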
> sticky regexes cannot be set to start at a certain point, whereas with chopped-off strings, you can force to start at the beginning and fail fast.
The real strength of sticky regexes comes from the fact that you can set the `.lastIndex` property to start matching a string at a given index (which also acts as an implicit start anchor), rather than creating a new string on every match with `String#slice` and going from there (this has huge cost savings).
For an example of the sticky anchored substring technique:

```js
let r_world = /world/y;
r_world.lastIndex = 2;  // remaining input: 'llo world!'
r_world.exec('hello world!');  // null
```

and

```js
let r_world = /world/y;
r_world.lastIndex = 6;  // remaining input: 'world!'
r_world.exec('hello world!');  // ["world", index: 6, input: "hello world!"]
```
> Could you imagine any other downsides to sticky?
Setting `.lastIndex` on a sticky regex is equivalent to slicing the input string before matching.
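That equivalence can be seen directly: both calls below attempt the same anchored match at position 6 of the same input, but the sticky version avoids allocating a new substring.

```javascript
const input = 'hello world!';

// chopped-string approach: allocate a new slice, anchor the pattern with ^
const viaSlice = /^world/.exec(input.slice(6));

// sticky approach: reuse the original string, start matching at index 6
const r_world = /world/y;
r_world.lastIndex = 6;
const viaSticky = r_world.exec(input);

// both find the same token, 'world'; only the sticky version keeps
// the original string intact and reports the absolute match index (6)
```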
Here's a basic overview of how it reads `"a", "here come a million characters"@en`:

- `"`: goto string_literal
- `a"`: lookahead: `,`, goto post_object
- `,`: consumed: whitespace, goto object_list
- `"`: goto string_literal
- `here come a million characters"`: lookahead: `@`, goto: datatype_or_lang
- `@en`: consumed: whitespace

> Has graphy.js been well-tested for arbitrary stream buffer boundaries?
Very good criterion. Yes, all positive test cases are run through one character at a time in order to test all stream boundaries: https://github.com/blake-regalia/graphy.js/blob/master/test/helper/reader.js#L102
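As a simplified sketch of that testing strategy (a toy tokenizer, not graphy's actual helper): feeding the input one character at a time forces every possible chunk boundary, and the result should match parsing the input in one piece.

```javascript
// toy streaming tokenizer that splits on commas, buffering across chunks
function makeCommaTokenizer() {
  let buffer = '';
  const tokens = [];
  return {
    write(chunk) {
      buffer += chunk;
      let i;
      while ((i = buffer.indexOf(',')) >= 0) {
        tokens.push(buffer.slice(0, i));
        buffer = buffer.slice(i + 1);
      }
    },
    end() {
      if (buffer) tokens.push(buffer);
      return tokens;
    },
  };
}

// feed one character at a time so every stream boundary is exercised
function tokenizeCharByChar(input) {
  const tokenizer = makeCommaTokenizer();
  for (const ch of input) tokenizer.write(ch);
  return tokenizer.end();
}
```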
Graphy is actually a collection of libraries; for instance, `require('@graphy/content.ttl.read')` is the 'standalone' Turtle reader (it also depends on `@graphy/core.data.factory`). The main `graphy` package simply includes all these libraries for convenience and ships with a CLI tool.
However, I definitely see code size as a priority for browsers and agree that load time is significant to performance. Not much effort has been put into testing this yet.
> Absolutely. This is actually in the upcoming major release with a breaking change that removes the `validate` option and enables validation by default (the performance tradeoff is marginal) and in its place is a `relax` option (much like serd's `lax` option).
Perfect; agree that this is how it should be.
> Also in `next`, `maxStringLength` is set to `Infinity` by default. The whole idea of this flag was to protect against invalid STRING_LITERAL_LONG_QUOTE tokens within (endless) streams, but I think this is a rare use-case.
Let's not pretend you haven't met @LaurensRietveld 😄
Related question: even with `Infinity`, do you guard against `"bla \n\r` (where the backslashed letters are the actual characters, not the escaped versions), i.e., a literal that can never properly terminate anymore, so there's no point waiting for the final quote?
> The real strength of sticky regexes comes from the fact that you can set the `.lastIndex` property to start matching a string at a given index (which also acts as an implicit start anchor),
Oh gosh, yes. My misunderstanding; thanks for clearing that up.
> Graphy is actually a collection of libraries, so for instance, `require('@graphy/content.ttl.read')` is the 'standalone' Turtle reader (it also depends on `@graphy/core.data.factory`).
Thanks, I somehow missed that.
> However, I definitely see code size as a priority for browsers and agree that load time is significant to performance. Not much effort has been put into testing this yet.
Then @rubensworks and I will find out within Comunica. The code-sharing path should be an interesting one though, for both performance and size.
No further concerns from me; happy to support this as default RDF/JS implementation, pending the size comment (that should not be a blocker).
> do you guard against `"bla \n\r`
Exactly. Yes, this is currently handled in the master branch, but `3.2.2` actually did not check for this. It corresponds to this step in the example from above:

> check unparsed text for invalid characters (e.g., newlines not allowed in STRING_LITERAL_QUOTE -- throw parsing error if invalid start of string token)
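An illustrative version of that check (hypothetical code, not graphy's): before waiting for more input to terminate a STRING_LITERAL_QUOTE, scan the unparsed text for characters that can never appear unescaped inside one. A raw line break means the literal can never terminate validly, so the reader can throw immediately instead of buffering the rest of the stream.

```javascript
// Raw (unescaped) line breaks are forbidden inside "..." literals in
// Turtle, so finding one in the pending text means we can fail fast.
function assertStringChunkValid(unparsed) {
  if (/[\n\r]/.test(unparsed)) {
    throw new Error(
      'invalid character in STRING_LITERAL_QUOTE: literal can never terminate');
  }
}
```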
I should probably retire N3.js 🙂 Except for the Notation3 parsing; is that planned at some point? The other component left is `N3Store`. Does graphy have an alternative?
> I should probably retire N3.js
Maybe someday in the distant future but for now I hope it remains maintained. Graphy has not seen a whole lot of mainstream usage yet.
Notation3 has not been planned yet. As for Store, `DatasetTree` is the graphy alternative, although I'd be curious what the differences in capabilities and performance are.
Now that graphy.js is fully spec-compliant (and has been for some time), I think we should strongly consider making it the default parser of RDF/JS. There was definitely a point to an arms race many years ago, when I created N3.js, and it was able to pull off a performance difference of two orders of magnitude compared to the state of the art. Not only has that gap disappeared nowadays (thanks to a much faster V8, it seems), but also, graphy.js is simply much faster. In fact, I think the best I could achieve with N3.js is to be as fast as graphy.js; I see few options to get much better. And frankly, I wouldn't have the time anymore either.
That said, here are a couple of questions to understand the implications of switching to graphy.js.
I noticed that spec-compatible parsing is not the default, in the sense that one would have to pass the options `{ validate: true, maxStringLength: Infinity }`.

- Are there severe performance consequences of doing so? (I haven't seen any.)
- Is `maxStringLength` nothing more than a safety guard?
- Could it be made the default? It's good for consumers to know that a `NamedNode` has a valid `value` rather than having to check it.

One of the reasons for the performance difference seems to be the use of sticky regular expressions. Back in the day (🦕), I remember having to make that decision with N3.js as well, but there was insufficient support for it (only Firefox, if I remember correctly). Now we have it, and it seems to be faster than always chopping off the beginning of the string, like N3.js does. However, what has also stopped me in the past is the fact that sticky regexes cannot be set to start at a certain point, whereas with chopped-off strings, you can force them to start at the beginning and fail fast.
For instance, with `"a", "here come a million characters"@en`, would it search very far to find the language tag after the first string? I understand such examples are probably artificial; I just want to understand if there are any downsides to the sticky bit.

Are there in general any disadvantages of the graphy.js parsers that you know about?
Has graphy.js been well-tested for arbitrary stream buffer boundaries? Basically all tests in https://github.com/rdfjs/N3.js/blob/b2ff96d35ed586fce1a02c567fb3ba9c10272598/test/N3Lexer-test.js#L155 that match `/streamOf/`. I would adapt them to graphy.js, but it doesn't have a separate lexer (likely another source of performance gains). Some of those tests are only the result of crazy usage, like with LOD Laundromat files, i.e., very hard to imagine all special cases and possible splits. This one stands out in particular.

Graphy is not an ES6 module, and thus does not support tree shaking, so code would still be unnecessarily large in several cases, which would especially hurt browsers. Would you consider implementing that? (Or otherwise partitioning the code, for example through specific include paths?)
I think that's all for now, some more questions might pop up. Thanks in advance for your insights.