Validark / Accelerated-Zig-Parser

A high-throughput parser for the Zig programming language.

more crazy idea for benchmarking: compare against optimal AST structure (found by applying simplex) #3

Open matu3ba opened 1 year ago

matu3ba commented 1 year ago

A relatively optimal AST structure can be found by running simplex with the constraints derived from the first AST parse, then copying things over.

See also https://en.wikipedia.org/wiki/Simplex_algorithm. The general idea of non-recursive tree traversal is given here: https://stackoverflow.com/questions/28544980/data-oriented-tree-traversal-without-recursion/28616278#28616278.

I got the idea from user Prokop on Discord, who was asking "is the adjacency of ast nodes in the node list used somehow to store information?".

Validark commented 1 year ago

I'm not sure what your first paragraph means, sorry. I would be happy to consider what you are talking about, it's just that you are referring to things I don't know about. It sounds to me like Simplex is used to mathematically optimize a system of equations? I'm not sure how that is applicable to an AST.

Is non-recursive tree traversal better? I think it depends. Regardless, I'm mainly focusing on producing the AST as the next part of this project. How someone consumes it is up to them, they could use recursion or simulate it in a loop. I could provide an iterator helper if desired.

As far as data being encoded by layout goes, the answer is yes, definitely. The reason why this hasn't been done yet is because you have to be smart about how you optimize the construction of the AST. Generating an implicit or semi-implicit AST laid out in pre-order DFS is a little difficult to do efficiently.

The problem is mainly to do with how operator precedence determines the AST hierarchy and thus how it should be laid out in memory. E.g. the infix expression a+b*c-d/e in pre-order is -+a*bc/de. You can see that the - traveled to the front of the expression, which is a bit of a problem. See https://ts-ast-viewer.com/#code/IYagRgVAxgtAJgegKZA for an AST view of the aforementioned expression.
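
If it helps to visualize, here is my own quick sketch of that tree (not the ts-ast-viewer output):

                -
             /     \
            +       /
           / \     / \
          a   *   d   e
             / \
            b   c

Reading it root-first (pre-order: node, then left subtree, then right subtree) gives - + a * b c / d e, which is why the - ends up at the very front.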

The conventional algorithm to convert infix expressions to a more amenable form is called Shunting Yard when implemented with explicit stacks. The Zig compiler uses an equivalent technique called Pratt Parsing which uses the program stack via recursion instead. The standard forms of this algorithm can only get you a Postfix ordering, which is how the Zig AST is currently laid out in memory.
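
To make the postfix claim concrete, here is a minimal shunting-yard sketch in Zig (my own toy illustration, not the Zig compiler's Pratt parser and not this project's code): single-letter operands, the four binary operators, no parentheses.

    const std = @import("std");

    fn precedence(op: u8) u8 {
        return switch (op) {
            '+', '-' => 1,
            '*', '/' => 2,
            else => 0,
        };
    }

    pub fn main() void {
        const infix = "a+b*c-d/e";
        var ops: [32]u8 = undefined; // operator stack
        var ops_len: usize = 0;
        var out: [32]u8 = undefined; // postfix output
        var out_len: usize = 0;

        for (infix) |c| {
            if (c >= 'a' and c <= 'z') {
                // operands go straight to the output
                out[out_len] = c;
                out_len += 1;
            } else {
                // pop operators of greater or equal precedence, then push this one
                while (ops_len > 0 and precedence(ops[ops_len - 1]) >= precedence(c)) {
                    ops_len -= 1;
                    out[out_len] = ops[ops_len];
                    out_len += 1;
                }
                ops[ops_len] = c;
                ops_len += 1;
            }
        }
        // flush the remaining operators
        while (ops_len > 0) {
            ops_len -= 1;
            out[out_len] = ops[ops_len];
            out_len += 1;
        }

        // prints "abc*+de/-": the root operator comes out *last*, i.e. postfix
        std.debug.print("{s}\n", .{out[0..out_len]});
    }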

In order to get a prefix ordering, we have to do one of the following:

  1. Do the classic shunting yard/Pratt parse backwards by parsing the array of tokens backwards. This should be the most efficient technique in terms of getting the AST in the right order, however I'm not certain how efficient/feasible it is to parse Zig (tokens) backwards.
  2. Parse forwards, then go backwards somehow on each individual statement. To me this sounds like the worst of both worlds.
  3. Produce an AST for each statement (in the way it's currently done), then traverse it and re-order it to use less memory.
  4. Use an adapted version of Shunting Yard/Pratt Parsing which allows one to parse forwards by utilizing something equivalent to string concatenation. https://stackoverflow.com/a/2034863

If you think about doing string concatenation in terms of ropes or cords (data structures that represent strings as trees in order to avoid having to copy the string data somewhere else every time one does concatenation), I think if you squint you can see how methods 3 and 4 are very similar. However, method 4 has less rigid prescriptions for the implementation, and therefore more optimizations are available to us. I have a few tricks up my sleeve for method 4, although, again, method 1 could be better if we can pull it off.
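
For anyone who hasn't run into ropes before, the essential trick is just that concatenation allocates a small interior node instead of copying either operand. A toy sketch (names and layout purely illustrative, unrelated to this project's code):

    const std = @import("std");

    // A rope node is either a leaf holding bytes, or a concatenation of two ropes.
    const Rope = union(enum) {
        leaf: []const u8,
        concat: struct { left: *const Rope, right: *const Rope },
    };

    // Walking the tree left-to-right yields the full string without it ever
    // having been copied into one contiguous buffer.
    fn dump(r: *const Rope) void {
        switch (r.*) {
            .leaf => |s| std.debug.print("{s}", .{s}),
            .concat => |c| {
                dump(c.left);
                dump(c.right);
            },
        }
    }

    pub fn main() void {
        const hello = Rope{ .leaf = "hello " };
        const world = Rope{ .leaf = "world" };
        // "concatenation" is just building one new interior node
        const both = Rope{ .concat = .{ .left = &hello, .right = &world } };
        dump(&both); // prints "hello world"
        std.debug.print("\n", .{});
    }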

matu3ba commented 1 year ago

The provided stackoverflow example only speeds up tree traversal, i.e. DFS, say to compare two ASTs structurally by keeping an extra index into the next subtree. This can be helpful if one wants to quickly identify all children which have changed, but it does not construct an optimal tree structure.

I have no experience with how much bit-packing the "which parent does this node belong to" information into the AST hurts performance.

Assume a given AST:

      A
   B     F
 C  D
    E

The optimal representation, assuming optimal means each node has at most 4 children and no scope information is lost:

      A
   B    F
 C D E

One can pack the fact that E is a child of D as (10 | 30 bits of E) into E, which means:

Under the assumption that this does not affect performance (I suspect this is wrong, as speculative execution sounds simple to predict for traversal), one can maximize, in pre- or post-order traversal, the number of children of each node up to a predefined maximum (to make it linear).
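
Something like the following packed node layout is what I mean; the widths and field names are only for illustration, not something this project defines:

    const std = @import("std");

    // Purely illustrative: a node that carries a small "how far back is my
    // parent" hint packed next to a 30-bit payload (e.g. a token index).
    const PackedNode = packed struct(u40) {
        payload: u30,
        parent_offset: u10, // distance back in the flat node list to the parent
    };

    test "pack and unpack a node" {
        const e = PackedNode{ .payload = 12345, .parent_offset = 2 };
        const raw: u40 = @bitCast(e);
        const back: PackedNode = @bitCast(raw);
        try std.testing.expectEqual(@as(u10, 2), back.parent_offset);
        try std.testing.expectEqual(@as(u30, 12345), back.payload);
    }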

> In order to get a prefix ordering, we have to do one of the following:

Are there (more recent) benchmarks with performance numbers for SWAR and SIMD on these things in publications or does this mean tinkering with educated guesses?

Validark commented 1 year ago

I think the idea you talked about is rendered unnecessary by better AST representations. With an implicit AST, you don't "need" pointers at all, unless you wanted to jump around/skip over parts of the tree. I am planning on a compile-time boolean that allows one to generate an extra buffer where you can look up level-order pointers for this purpose, but in the compiler we should not need pointers in the AST, because, so far as I can tell, we have to look at the entire AST in the compiler every time, and we can just go in the normal order. I worked on roblox-ts heavily for 2 years, and we pretty much had no choice but to traverse the AST in the normal, DFS order no matter what happened. I haven't studied all of the ASTGen code but from what I have looked at, Zig just traverses the whole AST in the same order no matter what. Allowing for arbitrary skips and hops is only necessary for use-cases that exist outside of a compiler.

We can use an implicit representation because all nodes either 1. have fixed arity or 2. can be delimited by an end_block node. To point 1, a prefix representation of e.g. operators is completely unambiguous. This was the original motivation behind Polish notation. E.g. if I have *+123 I know exactly what the order of operations is, i.e. what nodes belong where, without any need for pointers, so long as I actually store the nodes in that prefix order. The only constraint is that you need separate tag/kind values for tokens which might be multiple different operators, e.g. - could be a unary negation or a binary subtraction, so you need separate enum values for those two.
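
Here is a tiny sketch of what I mean by that, with made-up tags (this is not the real node encoding): when every tag has a fixed arity, a single forward pass over the pre-order stream visits the whole tree with zero pointers.

    const std = @import("std");

    // Made-up tags for illustration only.
    const Tag = enum { mul, add, lit };

    fn arity(tag: Tag) u32 {
        return switch (tag) {
            .mul, .add => 2,
            .lit => 0,
        };
    }

    pub fn main() void {
        // *+123 stored in prefix order: every node's children follow it directly.
        const nodes = [_]Tag{ .mul, .add, .lit, .lit, .lit };
        var pending: u32 = 1; // subtrees we still expect to see
        for (nodes) |tag| {
            std.debug.print("{s} (arity {})\n", .{ @tagName(tag), arity(tag) });
            pending = pending - 1 + arity(tag);
        }
        std.debug.assert(pending == 0); // the stream encoded exactly one complete tree
    }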

An alternative to end_block nodes to consider is having a bitmap to tell us if we are looking at the next statement within a block or if the block ended. For more arbitrary traversals, one could use a succinct representation. However, I currently plan on using end_block nodes, I will give more justification for this once it's actually coded up.

> Under the assumption that this does not affect performance (I suspect this is wrong, as speculative execution sounds simple to predict for traversal), one can maximize, in pre- or post-order traversal, the number of children of each node up to a predefined maximum (to make it linear).

I do not know how speculative execution could make your idea worse, unless maybe you are afraid that increasing the number of necessary speculative loads would worsen cache coherency. However, in general, I would expect that this could be true even without speculation occurring. In practice, bitstrings are normally pretty good for cache coherency though. A 64-byte cache line is 512 bits! That's kind of a lot of bits, that is, if you can make efficient use of them.

> Are there (more recent) benchmarks with performance numbers for SWAR and SIMD on these things in publications or does this mean tinkering with educated guesses?

Assuming that we 1. don't want a post-order layout in memory, 2. don't want to traverse Zig tokens backwards to construct the AST (which might work, tbh, I need to write it to tell), and 3. don't want to traverse a post-ordered AST backwards, option 4 is obviously the best option, simply because performance tuning is available to us there in ways that it simply isn't for the other options. Yes, this involves "educated guesses" about what I think is going to be faster, but, come on, what doesn't involve "educated guesses" when it comes to optimization? I usually have a pretty good sense of what is faster for hardware to do and so I try to think about what will be best on average.

This project, for example, gets somewhere between a 2 to 2.5x speed improvement on real-world files, even though you could look at the code and make a file that specifically only has cases where one could maybe expect worse performance. E.g. I just tested a file filled entirely with a repeating sequence of |*.|*.|*.|*.|*.|*.|. My code will grab 4 characters at a time, do a shift by 0, do the hash function, look up the hash in a 128-bit table, use the popCount up to that position in the 128-bit table to index into a table with all the operators, realize it's not a real operator, grab 3 characters, do a shift by 8, do the same process as before and realize it's not a real operator, grab 2 characters, realize it's not a real operator, then go down to 1, do the same process again but this time it's a valid operator. I am doing roughly 4 times more work than is necessary per byte for a nonsensical file like this. And on top of that, I will still do all of the SIMD work for absolutely no benefit in this case, so the tokenizer is literally bloatware for a non-existent use-case like this. Granted, the table lookups should not be that slow in this nonsense benchmark because the table will be in cache.

       Read in files in 8.891ms (3.11 GB/s) and used 27.63338MB memory with 4 lines across 1 files
Legacy Tokenizing took 380.971ms (0.07 GB/s,  0.00M loc/s) and used 0.138166885GB memory
       Tokenizing took 298.228ms (0.09 GB/s,  0.00M loc/s) and used 55.266752MB memory
       That's 1.28x faster and 2.50x less memory!

Turns out we still have a performance win though, probably due to other things being optimized. The bloat that we have for extreme edge-cases in the real world benchmark where we do this maximum of 3 character backtracking is very minimal.
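
As an aside, for anyone unfamiliar with the kind of table lookup I keep referring to, this is the general shape of the bitmap + popCount indexing, with completely made-up contents (the real hash function and tables live in the repo and look different):

    const std = @import("std");

    // Dense table of operator strings, ordered by which bucket they hash to.
    const table = [_][]const u8{ "+", "=>", "|" };
    // 128-bit set: a 1 at bit N means "some operator hashes to bucket N".
    const bitmap: u128 = (1 << 5) | (1 << 40) | (1 << 99);

    fn lookup(bucket: u7) ?[]const u8 {
        if ((bitmap >> bucket) & 1 == 0) return null; // nothing hashes here
        const below = (@as(u128, 1) << bucket) - 1;
        // popCount of the lower bits turns "which bucket" into "which table slot"
        return table[@popCount(bitmap & below)];
    }

    pub fn main() void {
        // the caller still compares the candidate against the source bytes,
        // which is the "realize it's not a real operator" step above
        std.debug.print("{?s} {?s}\n", .{ lookup(40), lookup(41) }); // "=> null"
    }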

My point is, it is designed so that the typical case is faster, specifically, matching variable-length tokens faster. It's a balancing act, but it's typically not that difficult to guess the general thing you need to optimize for. For some problems, you probably should collect some data on the common case to figure out what to optimize, but for this problem it's obvious to me that parsing variable-length tokens is going to be the bottleneck, so I optimize for that.

In the case of option 4, we can use SWAR/SIMD for a few things, and it's relatively obvious how I can get performance wins over the naive implementation. It is highly improbable that option 3 is actually faster, as it is effectively one of the naive ways to do option 4. The other naive way is to do each concatenation by making a new allocation, copying all the data over, and deallocating the old string. (This would technically make expression parsing O(n²), by the way.) It's "obvious" to me that copying into a fixed-size buffer, and using a linked list in the case of intentionally awful file inputs, will be most efficient. I could be wrong, of course, but I will be pretty surprised if someone thinks of a better technique. (I have not fully spelled out my technique here, but you'll see once I publish it that I have a few tricks up my sleeve.)

There will be some tinkering involved with trying out different size buffers, but that's literally me just changing a variable in the code and re-running it and seeing what's faster (for my machine, in the future we could do comptime switches for different targets). I think primarily the optimization work would be in getting the compiler to actually emit the best instructions to do my algorithm, not thinking of a fundamentally different algorithm. Actually, I am expecting that the compiler will give me awful emit for certain operations and be unable to vectorize the code I want it to, and I might have to temporarily dip into assembly until the Zig compiler gets support for more vector operations (I actually already do this in the utf8 validator, mostly written by travisstaloch, ported from simdjson).

Validark commented 1 year ago

Hey @matu3ba and other interested parties, I just wanted to share that I did, in fact, make some progress this week. I mostly finished a lot of the SWAR stuff for the tokenizer (although the utf8 validator still could use some love), did a few more optimizations to the tokenizer, and now I am at the point where the tokenizer is mostly done in the sense that it works how it's going to work. I might try to make a slightly faster algorithm for the operator parsing at some point, but on the whole, it's pretty much as good as I can make it, at least until we get automatic profile-guided optimization. Obviously there is a little tidying up of the code related to comments and assertions that's in order too, but it's getting there.

Thank you so much for your interest, feedback, and suggestions! I am sorry things haven't been moving faster but I am plugging along, and within the next 2 weeks I should have something to show on the AST front. For now, there's some SWAR stuff to look at. I will recombine some of the functions I had separated during testing of different strategies but other than that it looks pretty close to how the end result will look.

One thing you might find interesting is a trick I learned in the course of studying SWAR. It turns out you can do a SWAR mov_mask operation using a single multiply and a shift! Here is my implementation which is based on this article.

For reference, in SWAR we read bytes into a normal register and then try to operate on each byte separately. We can mask out the highest bit of each byte and then add 0x7F7F7F7F7F7F7F7F to make each byte's most significant bit a 1 if the byte was not 0, and vice versa. I.e. we overflow into the top bit if our condition is (not) satisfied. E.g. 0b 0....... 1....... 0....... 1....... has a mixture of msb's set to 1 and 0 in its 4 bytes. With this knowledge, we can do range checks and equality checks pretty easily. For equality, we do an XOR by the bytes we are looking for, which makes a byte 0 if it holds what we are looking for, and then we use the aforementioned trick to differentiate between true and false. For ranges that include 0, we can compute these by simply adding 0x80 - x to each byte to determine if the byte was greater than or equal to x. We can use this strategy twice if the range does not include 0, or we can XOR with a value that moves the desired bytes into a range which includes 0.

The aforementioned mov_mask operation extracts the highest bit of each byte and concentrates them in a bitstring. For an 8-byte SWAR vector, we can produce an 8-bit bitstring of the answers we computed. If we still want to put together a 64-bit bitstring for use in our hot loop, we will have to combine a whopping 8 of these! Not sure if this is the optimal way to do it or not. I ordered a RISC-V single-board computer (one of the credit-card-sized ones) which I will use to test this idea. I also have a few ARM single-board computers that I can try this on. I am a little worried about the prospect of duplicating the instruction sequence 8 times. I might go down to 4? Or I could just use a loop and try not to unroll it. We'll see what happens in my tests.
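
Here is a rough, simplified sketch of the equality check and the multiply-and-shift mov_mask so you can play with them; the constants are the well-known ones and my actual code may differ:

    const std = @import("std");

    const ones: u64 = 0x0101010101010101;
    const msbs: u64 = 0x8080808080808080;

    // Returns a u64 whose byte MSBs are set exactly where `chunk` equals `c`.
    fn swarEq(chunk: u64, c: u8) u64 {
        const x = chunk ^ (ones *% @as(u64, c)); // matching bytes become zero
        // MSB ends up set for every *nonzero* byte (the overflow-into-the-top-bit
        // trick, with the original value OR'd back in to catch bytes >= 0x80)...
        const nonzero_msbs = (((x & ~msbs) +% ~msbs) | x) & msbs;
        return ~nonzero_msbs & msbs; // ...so invert to get "byte was equal"
    }

    // SWAR mov_mask: concentrate the 8 byte-MSBs into an 8-bit mask
    // (byte 0 -> bit 0) with one multiply and one shift.
    fn swarMovMask(x: u64) u8 {
        return @as(u8, @truncate(((x & msbs) *% 0x0002040810204081) >> 56));
    }

    pub fn main() void {
        // "ab\nzz\nq\n" read little-endian into a u64, byte 0 in the low byte
        const chunk: u64 = 0x0A710A7A7A0A6261;
        // prints 10100100: the '\n' bytes sit at byte offsets 2, 5 and 7
        std.debug.print("{b:0>8}\n", .{swarMovMask(swarEq(chunk, '\n'))});
    }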

Today I made a working shunting yard implementation that sort of proves that the strategy I plan to use for AST generation is feasible, and hopefully it will be efficient in the end. Admittedly, at this stage, it is kind of like a proof-of-concept of a proof-of-concept, but I am optimistic that everything can be implemented in an extremely efficient manner in the coming days.