Validark / Accelerated-Zig-Parser

A high-throughput parser for the Zig programming language.

embedded as fallback without simd (1) and parser design (2-4) #2

Open matu3ba opened 1 year ago

matu3ba commented 1 year ago
  1. As far as I understand, using the Zig compiler on embedded devices (100-300MB RAM) is a potential use case. Is compatibility without SIMD possible/planned, and what would the slowdown and/or drawbacks be?
  2. I guess you plan to make the parser iterative to support backtracking: can the same be applied to zig fmt, which as of now traverses things recursively?
  3. if yes: Incremental parsing planned/possible?
  4. if yes: api to incremental parsing for language server error recovery/robust parsing?

I guess you have probably made up your mind about 1-2 with a draft implementation, so I wanted to ask about it.

Validark commented 1 year ago
  1. Yes, this implementation should be quite favorable towards embedded.
    • Reducing the memory footprint of the AST is one of my goals, which is another step towards supporting 100MB of RAM. Of course, the biggest eater of RAM by a long shot is LLVM afaik, so cutting the memory consumption of the initial stages of the compiler by half or two-thirds only helps so much.
    • I use SIMD for two things: parallel table lookups (4-bit indices into a 16-byte vector) for UTF-8 validation, and matching character classes like newlines, quotes, identifier characters, etc. The first one, I think, would have to be done byte-by-byte without hardware support. The drawback of no SIMD is you have to not use SIMD, haha. The character classification stuff can be done with SWAR, though, and I know how to do this manually if the compiler can't figure it out automagically. So you'll still get some of the benefit of operating on 8 (or 4?) bytes at once. It will be much less of a benefit than operating on 64-byte chunks at once on the latest desktop hardware, but it's embedded; I'm guessing you didn't expect it to run at the exact same speed as a desktop.
  2. If you are talking about recursive descent over an AST, it's up to the consumer code whether they want recursive function calls to push and pop from the stack automatically or to do it manually. Even if I provided an iterator to wrap this, the consumer code would have to look pretty much the same. The only way to avoid traversing an AST is to skip generating one. Andrew Kelley wants the compiler to solve this by perhaps automatically moving the stack to the heap for potentially infinite recursive calls, or by setting a limit at compile time for stack usage and somehow giving back an OutOfMemory error if it comes to that. I personally haven't seen any proposed syntax for that yet, so I'm not sure how Andrew thinks it's going to look, but I think the horse to bet on is recursive function calls. As for my construction of the AST, I am not sure which technique I will ultimately prefer, but it won't matter either way to consumers of the AST.
  3. Yes, incremental parsing is doable. The main question is which optimizations you're allowed to do in an online context, and which optimizations you can still use in an offline context without making your code bloated and complicated. I would prefer the technique of just copying all the reusable data over to a new buffer, and not bothering with convoluted tree structures which try to reuse data and would force us to do things in much less efficient ways. Matklad pointed out that Carmack had this idea for video games: you only need two buffers, the previous frame and the current frame, and you copy between them as whichever one becomes stale switches. Everything is more efficient with this strategy except having to do the potentially massive copy. Although we're theoretically doubling the peak RAM usage, with this strategy we can use structures that are far less than half the size and much faster to traverse in memory, so it should more than pay for itself. It also means we can use the same logic for offline compilation and it will still be optimal.
  4. Definitely. I'd be open to help fleshing out an API. I've never looked at Language Server Protocol in any depth so I don't know what API consumers want but I can use my imagination for some of it. I assume LSP will tell me which bytes changed and what they are now but I don't know much else.
Validark commented 1 year ago

As for my construction of the AST, I am not sure which technique I will ultimately prefer, but it won't matter either way to consumers of the AST.

Actually there is one way in which it could matter. I can determine the maximum depth and consumers should be able to tell with one check whether they'll have enough stack space.

Validark commented 1 year ago

using the Zig compiler on embedded (100-300MB RAM) is a potential use case

Could you give an example of representative embedded hardware? Do they typically have 64-bit trailing zero count? (I think I may have heard of hardware with only a 32-bit ctz operation one time?) Should I make a fallback where only 32 bit operations are efficient?

matu3ba commented 1 year ago

Could you give an example of representative embedded hardware?

Probably the more common ones are ARMv7, which can be used with different compatible DRAM sizes; typical here is 512 MB. See for example the BeagleBone. ARMv8 and later all support 64-bit mode or are 64-bit only.

More niche would be these Pine64 things, here RISC-V based: https://pine64.com/product/128mb-ox64-sbc-available-on-december-2-2022/ (those use integrated PSRAM).

64-bit trailing zero count

On typical embedded low-cost devices with sufficient RAM (>32MB) you typically have an MMU and MPU and a 32-bit core, not 64-bit. I think it is infeasible to target anything below 128MB, which is roughly what people might use for a webserver and to debug basic stuff without JTAG (I need to check the numbers again on how much memory Linux and a remote debugger need, because we don't use them at work). See https://developer.arm.com/documentation/dui0068/b/ARM-Instruction-Reference/ARM-general-data-processing-instructions/CLZ for CLZ.

Should I make a fallback where only 32 bit operations are efficient?

That depends on the level of efficiency. Luckily, Zig has ARMv7 CI-tested and code-generated, so you can take that as a baseline. I don't think (premature) optimization is worth the hassle, unless it prevents significant performance drops.

matu3ba commented 1 year ago

There is currently some stuff missing for ARM that the ABI mandates, so if you run into problems, let me know or complete the things yourself.

Validark commented 1 year ago

I don't think (premature) optimization is worth the hassle, unless it prevents significant performance drops.

It is not premature optimization to think about the performance of your software on target devices; in fact, the whole point of this repo is optimization. In Zig we can switch things to 32-bit super easily for low-end devices. We just need a comptime conditional which checks for the CPU features we're looking for. I might order one of these devices and test out performance there.

Validark commented 1 year ago

I just got done putting in a lot of hours learning SWAR techniques and implementing pretty efficient SWAR fallbacks. I have not committed it yet, as the code still needs some reorganizing, but it's in the works!

Validark commented 11 months ago

Update: I tested the tokenizer on a RISC-V SiFive U74 and it's ~1.5x faster than the legacy version! With some more work I might be able to improve it further, but I am happy with this result nonetheless.

matu3ba commented 11 months ago

Sounds awesome. I'm super curious on the SWAR techniques and on the perf graph once you feel it is ready to be shown. :)

Validark commented 11 months ago

Sounds awesome. I'm super curious on the SWAR techniques and on the perf graph once you feel it is ready to be shown. :)

I think it would be fun to present it on Zig showtime. We'll see!