DavisDevelopment / hre

MIT License

Parser structure #1

Closed DavisDevelopment closed 6 years ago

DavisDevelopment commented 6 years ago

I'll read through the parser thoroughly when I get the chance, but in the meantime, @demurgos, as you've studied regular expressions enough to implement the entire ES spec, this seems like a query you'll make short work of.

In the process of parsing the regexp itself, is it strictly necessary (or beneficial) to have a separated tokenization/parsing flow, or could it be done in a single-pass manner? My guess is that single-pass would be faster, if possible, and it would also be easier to read imho. If you get the chance before I've read the code enough to answer this myself, I'd love to get your thoughts on it

demurgos commented 6 years ago

There is no separate lexer/tokeniser. My only "token" is Symbol, which is equivalent to an optional character.

https://github.com/DavisDevelopment/hre/blob/4cc38c049e98c15845d7f94dced89c833b7e6745/src/hre/tokens/Symbol.hx#L3-L6

It could well be a nullable String. I only went with this dedicated data type because at the time I wrote the parser, I focused on correctness more than speed so I preferred to have one type per "concept". Having something representing a token makes it closer to classical parsers. I also prefer match expressions to null checks.
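
For illustration, such a token can be sketched as a two-variant enum like this (a hypothetical definition; the real one is in the linked Symbol.hx):

enum Symbol {
  Char(char: String); // a single character of the input
  EndOfText;          // no more characters
}

class SymbolExample {
  static function main() {
    var s: Symbol = Char("a");
    // A match expression instead of a null check:
    switch (s) {
      case Char(c): trace(c);
      case EndOfText: trace("end of text");
    }
  }
}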

The parser itself is a top-down LL parser with a small lookahead. I don't remember the exact value of the lookahead. The worst lookahead may be in readQuantifierBlock where I see some backtracking. This should probably result in syntax errors instead. Removing backtracking there would turn the parser into an LL(2) (according to this comment).

The "lexer" corresponds to the 4 functions at the end of the parser: symbolAt, charAt, peek, peekChar. The difference between peek and peekChar is that the first one can be called at the end of the string while the other one fails if there are no more characters.

Just to clarify, the ES spec does not impose any representation for the tokens or AST. It provides grammar production rules but you can usually get away with fewer nodes in your AST.

I don't have specialized tokens to differentiate meaningful codepoints like left braces and dashes from simple literals. It would require lexer modes: inside or outside a character class. For example, a pipe outside a character class acts as an alternation separator, but inside a character class it is just a literal. So I only have this minimal Symbol token and handle escape sequences, decimal numbers, etc. at the parser level.
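
Concretely, using Haxe's built-in EReg just to illustrate the standard semantics:

class PipeExample {
  static function main() {
    trace(new EReg("^(a|b)$", "").match("b")); // true: '|' is an alternation
    trace(new EReg("^[a|b]$", "").match("|")); // true: '|' is a literal inside the class
    trace(new EReg("^(a|b)$", "").match("|")); // false
  }
}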

DavisDevelopment commented 6 years ago

Okay, thanks for the information. That clarifies quite a bit, and saves me a fair bit of time. If an alternate parser were written that had a dedicated Token type for differentiating meaningful codepoints, and used a lookahead model more similar to that found in the hscript library's parser, do you think that we might see some performance improvements? I've only dabbled with using hre just a bit so far, so I've certainly not had any issues with it yet, but I do think it's a neat library and I happen to simply love writing parsers from scratch and designing grammars. No room for designing grammar here, but my love for parsers could maybe prove useful.

PS. I also prefer match expressions to null checks. Sadly, my guess is that they're probably slower than null checks on more platforms than they're faster on :disappointed:

demurgos commented 6 years ago

As I said, the parser has a part with a bit of backtracking when matching quantifiers, and you could use null checks there; these are small gains but can be done easily. The structure itself is pretty good: since the syntax is well defined, it's easy to hand-write the parser (it does not change much). Having a dedicated token type with meaningful variants would force you to use more matching and would probably result in worse perf. Not sure, but probably. hscript is more complex, so a higher abstraction level is useful to them.

The bottleneck is the matcher, not the parser. It currently follows the spec literally, even though the spec clearly says that this leads to an inefficient matcher. For regexps only used to check values, a huge win would be to use an NDFA; I don't know how to support capture groups with one, though, as I never really looked into how capture groups are handled by efficient matchers. Even if we keep following the spec, refactoring to avoid the dozens of closures may help.

DavisDevelopment commented 6 years ago

I just did a quick google search for NDFA, and found this. Is that what you're talking about? If so, how would those concepts carry over to a parser? So sorry if that's a painfully stupid question

demurgos commented 6 years ago

Sorry for not being clear enough :s. Yes, I was referring to "Nondeterministic finite automaton". It was related to "the bottleneck is the matcher", not the parser.

You can convert any regex without capture groups or "complex assertions" (such as "followed by") to a finite automaton: a graph of states. You can then reduce duplication in this graph and represent it as a small table. Matching is then really efficient. Your set of active states only contains the initial state at the beginning. You consume the characters of the input and build the next set as the union of states reachable from the previous states using the current character. You stop when reaching an accepting state or exhausting the input.

I don't have any link at hand, but you can search for "regex to ndfa transformation".
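
To make it concrete, here is a rough sketch of that simulation (all names hypothetical, epsilon transitions assumed already eliminated, and stopping early on an accepting state as described above):

class Nfa {
  // transitions[state] maps an input character to the set of states
  // reachable from `state` on that character
  var transitions: Array<Map<String, Array<Int>>>;
  var accepting: Array<Bool>;

  public function new(transitions: Array<Map<String, Array<Int>>>, accepting: Array<Bool>) {
    this.transitions = transitions;
    this.accepting = accepting;
  }

  public function accepts(input: String): Bool {
    var active: Array<Int> = [0]; // only the initial state at the beginning
    for (i in 0...input.length) {
      if (Lambda.exists(active, s -> this.accepting[s])) return true; // stop early
      var c = input.charAt(i);
      var next: Array<Int> = [];
      for (state in active) {
        var targets = this.transitions[state].get(c);
        if (targets != null) {
          for (t in targets) if (next.indexOf(t) < 0) next.push(t);
        }
      }
      active = next; // the union of states reachable on `c`
      if (active.length == 0) return false; // no live states left
    }
    return Lambda.exists(active, s -> this.accepting[s]);
  }
}

class NfaDemo {
  static function main() {
    // NFA for a(b|c): 0 -a-> 1, 1 -b-> 2, 1 -c-> 2, state 2 accepting
    var nfa = new Nfa(
      [["a" => [1]], ["b" => [2], "c" => [2]], new Map<String, Array<Int>>()],
      [false, false, true]
    );
    trace(nfa.accepts("ab")); // true
    trace(nfa.accepts("ad")); // false
  }
}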


PS: FYI, I refactored the lib to remove the dependencies. MR!5

DavisDevelopment commented 6 years ago

No problem, man. And yeah, that was a (mental?) typo; I did understand that NDFA was referring to the matcher, not the parser. And, wow, that's a pretty damn clever optimization. If I can properly wrap my head around it, I'd love to try my hand at implementing it

demurgos commented 6 years ago

I'm still thinking about areas that could be improved. A low-priority improvement may be to change the representation of the AST. If we don't need the AST once the regex is compiled, then there's no sense in changing the representation.

The current representation is a tree of heap-allocated objects, which means that traversing it has a high probability of cache misses. It could instead be flattened into a SoA (structure of arrays):

class FlatTree {
  // Vector of pointers representing the tree structure
  var nodes: Array<Node>;
  // Data specific to atoms
  var atoms: Array<...>;
  // Data specific to characterClasses
  var characterClasses: Array<...>;
  // ...
}

class Node {
  // Enum representing the concrete type of the node
  var type: Int;
  // Index of the data associated with this node
  // for example for atoms it would be the index in flatTree.atoms
  var dataIdx: Int;
  // "Pointers" to parent and child nodes (index in `flatTree.nodes`)
  var parentIdx: Int;
  var firstChildIdx: Int;
  // The children are stored contiguously so you can get them with
  // flatTree.nodes[firstChildIdx...(firstChildIdx + childCount)]
  var childCount: Int;
}
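
For example, reading the children of a node in this layout would look something like this (forEachChild is a hypothetical helper, not part of hre):

function forEachChild(tree: FlatTree, nodeIdx: Int, visit: Int -> Void): Void {
  var node = tree.nodes[nodeIdx];
  // The children sit next to each other in tree.nodes: one contiguous scan,
  // no pointer chasing through heap-allocated objects.
  for (childIdx in node.firstChildIdx...(node.firstChildIdx + node.childCount)) {
    visit(childIdx);
  }
}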

This is just a rough sketch, but it's possible to get a more compact representation. With raw access to memory you could go even further and allocate static arrays in place instead of using vectors. I still fear that the overhead of Haxe's reflection data may thrash the cache anyway, so it may not even be worth the trouble to implement it. Do you know if there's a way to disable the emission of Haxe's metadata?


Other ideas regarding the matcher: also having a DFA (usually easier, may perform better), lazily computing the states, and determining literal prefixes to do a quick first pass.
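
For instance, the literal-prefix pass boils down to something like this (a hypothetical helper; e.g. /foo(bar|baz)/ always starts with "foo", so a cheap string search can skip most of the input before the real matcher runs):

function candidateStarts(input: String, literalPrefix: String): Array<Int> {
  var starts: Array<Int> = [];
  var i = input.indexOf(literalPrefix);
  while (i >= 0) {
    starts.push(i);
    i = input.indexOf(literalPrefix, i + 1);
  }
  return starts; // the matcher only needs to try these offsets
}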

DavisDevelopment commented 6 years ago

I don't know about globally disabling Haxe's metadata across all targets or anything, but I'm sure I can help figure out a way to at least minimize its impact on performance. What targets in particular did you have in mind with regards to the overhead? Also, I'm unfamiliar with the term "cache miss", could you perhaps elaborate? EDIT: I googled it, and I see what you mean. I never even thought about that. Thanks for mentioning it; there are so many systems in my various Haxe projects that can be optimized pretty heavily (probably) by implementing this kind of structure.

DavisDevelopment commented 6 years ago

I'd be strongly in favor of implementing a flattened, compacted AST representation (if only for fun), but only if the compacted format can still be practically converted back to the hierarchical format for analysis and/or transformation. It occurred to me that perhaps it might be possible to create a system for "JIT"ing a compiled Regexp into some representation that exposes the same API, but uses Haxe's EReg under the hood. Multiple ERegs would be necessary for consistency across platforms, I'd guess, but it sounds doable in my head at least. Does that seem practical / maybe useful to you?

demurgos commented 6 years ago

Yes, using EReg or native regexes directly is an excellent idea. Basically, it would translate regexes to their native format when available. I already implemented this kind of system in Python once (to convert shell patterns to Python regexes); somehow it did not occur to me to do it here too. :P As long as the results are the same, it should work fine. The issue is basically about ironing out the differences across platforms.
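
A rough sketch of the shape this could take (everything here is hypothetical except Haxe's EReg):

class CompiledPattern {
  var native: Null<EReg>; // set when the pattern is portable on this target
  // The pure-Haxe matcher would stay as the fallback, e.g.:
  // var fallback: RegExp;

  public function new(source: String, flags: String) {
    this.native = isPortable(source) ? new EReg(source, flags) : null;
  }

  public function test(input: String): Bool {
    if (this.native != null) return this.native.match(input);
    // return this.fallback.test(input);
    throw "pure-Haxe fallback not sketched here";
  }

  // Placeholder: a real check would reject constructs whose semantics
  // differ across targets.
  static function isPortable(source: String): Bool {
    return true;
  }
}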

Regarding the AST: yes, it would be an internal representation. The classes defined today would remain there and you should be able to convert from one representation to the other. Still, I feel it's low priority (but fun). (Cache miss was about the CPU cache: if you keep spatial locality of your data, you can use CPU caches and improve throughput.)

I released hre@0.2.0 (removing all dependencies).

DavisDevelopment commented 6 years ago

Nice. I'll have to perform the standard blood-sacrifice to cajole git into letting me merge in your changes. My fork is an "import"ed clone of the fork I made on GitLab. So, as far as I can tell, GitHub doesn't even know that the repo is a fork :p

PS. Maybe email me or something sometime, man. Seems like we've got a handful of shared interests with regards to dev stuff. I figure if we chat a bit, we'll probably find more stuff to collaborate on and brainstorm about than just hre

demurgos commented 6 years ago

Sorry for the delay, I'm having a busy period at the moment (with a few deadlines coming together) so I did not have time to come back to this. I'll get in touch with you once the situation settles down a bit. I should have more time for open source in roughly 2-3 weeks.

DavisDevelopment commented 6 years ago

No problem, man. I wound up taking a bit of a hiatus myself, and am replying to you mere minutes after noticing your reply, 17 days after you sent it. Most of my open-source time and effort has been focused on my embedded database engine, which is thankfully beginning to approach a working prototype. Here's hoping we both find some time to work on hre a bit soon; it's an interesting project, and I think that with some performance improvements to the matcher (and an abstraction layer to merge its API with EReg's, where desirable) it could be quite useful to a lot of people. I know for sure it would be very useful to me personally, since I've found myself "needing" to convert an EReg to its String representation on many occasions over the years. I always found some workaround, but with the underlying RegExp type implemented in pure Haxe, it would be trivial to add a 'toString' (and maybe also a 'toSource') method.