Sladuca / sha256-prover-comparison

Some very rough benchmarks between sha256 circuits in different proving systems

Why does your Starky SHA256 benchmark run so much faster than Plonky2? #3

Open patrickmao93 opened 1 year ago

patrickmao93 commented 1 year ago

Awesome work on implementing the SHA256 algo in Starky. I found another guy's impl of SHA in Plonky2. The proof generation times differ wildly: your 15-SHA proof runs in milliseconds, while their Plonky2 benchmark takes 16s for a single SHA.

I ran both your benchmark and theirs (tweaked to 15 SHAs) with the same rate_bits = 3, and the results still differ by 4 orders of magnitude.

Have you checked out their implementation of SHA256 in Plonky2? If so, maybe you could sprinkle some insight or thoughts?

Sladuca commented 1 year ago

I haven't looked into this that deeply, so I can't say I know for sure exactly why, but I can list a bunch of things that come to mind:

  1. The unit of comparison here is "compression function invocations", because that's where 99% of the cost is. Their example prints the number of blocks in the hash - that's the number of compression function invocations. Their example does 45 blocks (message size 2828 bytes), so to compare, we need to make their example hash a 15-block message (at most 951 bytes, once you account for padding). On my MBP this takes ~6s to prove with their example. So the difference is within 2 orders of magnitude, not 4.
  2. It isn't exactly an apples-to-apples comparison, because starky is just invoking the compression function 15 times, so it doesn't include constraints for padding and chunking the input, while the plonky2 circuit does include them.
  3. In starky, we can turn rate_bits all the way down to 1, which makes FRI significantly faster (4X+). In plonky2, we can't turn it below 3.
  4. In the starky implementation, everything is hand-written polynomial constraints instead of gates. In practice, what this means is that we can represent our constraints with far fewer trace cells, since we're not limited to the gates that are available. For instance, in starky we can constrain the xor of two bits using only three trace cells. In contrast, the plonky2 example instantiates a gate for each arithmetic operation (or small batch of them), and each gate allocates its own input/output cells in the trace - so each 1-bit xor in that circuit ends up using many more trace cells. I suspect the plonky2 example could be much faster with some custom gates for performing bitwise logic.
  5. The stark is written with a custom layout. In particular, the 929-column layout is wide enough to represent a full round of the compression function in a single row of the trace, minimizing the number of "wasted" trace cells. The plonky2 example is currently using the standard layout, which has 135 columns, 80 of which are "routable". What this means is that the circuit builder has to figure out how to place all of the gates into the trace, and I suspect some cells are being "wasted". I suspect the plonky2 example could be faster with a config chosen so that there's little/no wasted area.
  6. This might not have that large an impact, but there are some implementation reasons why we may expect starky to be faster overall than plonky2, like the lack of dynamic dispatch / vec copying in starky.
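To make the block counting in point 1 concrete, here's a quick sketch (not from either repo) of how many 64-byte blocks - and hence compression function invocations - SHA-256 needs for a given message length:

```rust
/// Number of 64-byte blocks SHA-256 processes for a `len`-byte message:
/// the message, a 1-byte 0x80 delimiter, and an 8-byte length field,
/// rounded up to a whole block. Each block is one compression call.
fn sha256_blocks(len: usize) -> usize {
    (len + 1 + 8 + 63) / 64
}

fn main() {
    // Their 2828-byte example message needs 45 blocks, while a 15-block
    // message tops out at 15 * 64 - 9 = 951 bytes.
    println!("{} {}", sha256_blocks(2828), sha256_blocks(951));
}
```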
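On point 3, the effect of rate_bits is easy to quantify: the low-degree extension the prover commits to grows by a factor of 2^rate_bits, while each FRI query contributes roughly rate_bits bits of soundness. A back-of-the-envelope sketch (my own arithmetic, not numbers from either repo, and ignoring proof-of-work grinding):

```rust
/// Size of the low-degree extension the prover commits to:
/// the blowup factor is 2^rate_bits.
fn lde_size(trace_len: usize, rate_bits: u32) -> usize {
    trace_len << rate_bits
}

/// Rough FRI query count to reach `security_bits` of soundness:
/// each query contributes about `rate_bits` bits.
fn fri_queries(security_bits: u32, rate_bits: u32) -> u32 {
    (security_bits + rate_bits - 1) / rate_bits
}

fn main() {
    // rate_bits = 1 means a 2x LDE instead of the 8x at rate_bits = 3,
    // paid for with ~3x as many queries (bigger proof, faster prover).
    println!("{} {}", lde_size(1 << 16, 1), lde_size(1 << 16, 3));
    println!("{} {}", fri_queries(100, 1), fri_queries(100, 3));
}
```

This tradeoff is also part of why the starky proof ends up so large.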
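The three-cell XOR from point 4 is the standard arithmetization z = x + y - 2xy over boolean-constrained cells; a minimal sketch of the constraint polynomial (plain integers here, standing in for field elements):

```rust
/// XOR constraint over three trace cells: evaluates to zero exactly when
/// z == x ^ y, assuming x and y are already constrained to be boolean
/// (via the usual b * (b - 1) = 0 constraint).
fn xor_constraint(x: i64, y: i64, z: i64) -> i64 {
    x + y - 2 * x * y - z
}

fn main() {
    for x in 0..2i64 {
        for y in 0..2i64 {
            // The constraint vanishes on the correct output...
            assert_eq!(xor_constraint(x, y, x ^ y), 0);
            // ...and is nonzero on the wrong one.
            assert_ne!(xor_constraint(x, y, 1 - (x ^ y)), 0);
        }
    }
    println!("ok");
}
```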
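For point 5, plonky2's CircuitConfig exposes the column counts directly, so the layout is tunable without touching the circuit. A sketch, assuming plonky2's field names at the time of writing (untested, treat as illustrative):

```rust
use plonky2::plonk::circuit_data::CircuitConfig;

/// Sketch: start from the standard recursion config (135 wires, 80
/// routable - the layout the plonky2 example inherits) and override the
/// column counts to better fit the circuit's shape, reducing wasted cells.
fn tuned_config() -> CircuitConfig {
    CircuitConfig {
        num_wires: 135,       // total advice columns in the trace
        num_routed_wires: 80, // columns the builder can route between gates
        ..CircuitConfig::standard_recursion_config()
    }
}
```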

It's also worth mentioning there are inherent tradeoffs here. For instance, the starky proof is quite large (~2MB), and starky doesn't support zk, so to use it in practice you might need to add a few steps of recursion on top of the initial proof.

Also, Jump recently published some plonky2 gadgets worth checking out; they might be better optimized: https://github.com/JumpCrypto/plonky2-crypto