This is pretty significant.
Profiling (see #214) shows that upwards of 60% of time is spent on computing various Merkle roots in this kind of case. The use of Merkle trees for syntax is what enables pruning and helps with sharing, but we do have to run at least one SHA256 compression function per node when computing a Merkle root. In this example there is slightly more than one node per serialized byte (i.e. per WU). For each node, we not only need to perform a CMR computation but an IMR computation as well, which hashes not only the node but also the input and output type Merkle roots. Additionally, the type Merkle roots themselves need to be computed, which again is potentially on the order of the size of the DAG, as it is in this test case.
The coverage analysis tool shows that we can make up to 4 calls to the compression function per node (which is approximately per WU). If we estimate the cost of the compression function at 225 ns, and if we want to maintain our goal of doing no more than 500 ns of work per WU, then we are in trouble: we can only afford 2 calls to the compression function per node, and that leaves basically no time for any of the other type inference work.
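To make the arithmetic explicit, here is a back-of-the-envelope check using the figures above (the numbers are the estimates quoted in this issue, not benchmark results):

```c
/* Back-of-the-envelope check using the estimates quoted above. */
#include <stdio.h>

int main(void)
{
    const double ns_per_compression    = 225.0; /* estimated SHA256 compression cost */
    const double ns_budget_per_wu      = 500.0; /* goal: at most 500 ns of work per WU */
    const int    compressions_per_node = 4;     /* upper bound from the coverage tool */

    /* With roughly one node per WU, the hashing cost per WU is about: */
    double hashing_ns_per_wu = compressions_per_node * ns_per_compression; /* 900 ns */

    printf("hashing per WU: %.0f ns vs. budget %.0f ns\n",
           hashing_ns_per_wu, ns_budget_per_wu);
    printf("compression calls that fit in the budget: %d\n",
           (int)(ns_budget_per_wu / ns_per_compression)); /* 2 */
    return 0;
}
```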
A couple of options on how to deal with this issue come to mind:
1. We can presume/require that nodes have SHA2 hardware acceleration

   @apoelstra suggests hashing would be 100x faster with hardware acceleration. I think the improvements would be more modest than that, but it would certainly be faster. We could benchmark with this extended instruction set and see if that solves the problem.
2. We could eat into the budget

   We currently use a budget mechanism to require a certain amount of witness weight to cover the cost of evaluation (which can be exponential in the size of the program). Stuffing the annex with empty data may be necessary to cover the cost of evaluation. (Technically tapscript does the same budgeting; it's just that the annex trick isn't required for anything but the most obscene programs.)

   We could extend this budget mechanism to count the number of nodes and cover the cost of all the hashing using the same budgeting system; a rough sketch of what such accounting could look like follows this list.

   This solution is my current favourite.
3. We could revamp the entire serialization format

   I devised the current bit-based serialization format to try to squeeze Simplicity programs to be as small as possible. But this issue suggests that the current serialization format is "too good".

   We could revamp the format entirely to assign smaller codewords to jets. Jets have no parameterization whatsoever, and thus their CMRs and IMRs could be entirely precomputed, so they wouldn't really incur compression costs. The Simplicity combinators could then be assigned longer, multibyte codewords, which would automatically consume sufficient WUs to cover all their hashing and other processing costs (a toy illustration of the codeword idea also follows this list).

   At the same time we could replace the current bit-based serialization format with a byte-based one. I'm sure this would make a lot of developers happy, since they would be able to parse byte streams instead of bit streams.

   I would probably also separate the witness data into a separate witness stack item while we are at it, as there is no pressure on us to keep the number of bytes down to an absolute minimum in light of this issue.
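To make option 2 concrete, here is a rough sketch of how node-count-based hashing costs could be charged against the same weight budget, with annex padding making up any shortfall. This is not the actual budget rule; the constants are the ballpark figures quoted above, and the function and example sizes are purely illustrative:

```c
/* Rough sketch only: estimates how much annex padding (in weight units) a
 * program would need if hashing were charged against the same budget that
 * covers evaluation.  The constants are the ballpark figures from this
 * thread, not measurements of the current implementation. */
#include <stddef.h>
#include <stdio.h>

#define NS_PER_COMPRESSION    225 /* estimated SHA256 compression cost */
#define NS_BUDGET_PER_WU      500 /* target: at most 500 ns of work per WU */
#define COMPRESSIONS_PER_NODE   4 /* upper bound from the coverage tool */

/* Weight units of annex padding needed so that the total weight covers the
 * hashing cost of node_count nodes (hypothetical accounting). */
static size_t padding_wu_needed(size_t node_count, size_t serialized_wu)
{
    size_t hashing_ns  = node_count * COMPRESSIONS_PER_NODE * NS_PER_COMPRESSION;
    size_t required_wu = (hashing_ns + NS_BUDGET_PER_WU - 1) / NS_BUDGET_PER_WU;
    return required_wu > serialized_wu ? required_wu - serialized_wu : 0;
}

int main(void)
{
    /* Example sizes only: roughly one node per serialized byte (WU),
     * as in the profiled test case. */
    size_t nodes = 100000, wu = 100000;
    printf("padding needed: %zu WU\n", padding_wu_needed(nodes, wu));
    return 0;
}
```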
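And to illustrate the codeword idea in option 3: jets carry precomputed CMRs and IMRs, so they can get short codes and be charged no hashing, while combinators get longer codewords whose consumed weight pays for their hashing. Every jet name, codeword length, and compression count below is invented for illustration only:

```c
/* Toy illustration of option 3, not a proposed format.  Jets have
 * precomputed CMRs/IMRs, so they are charged no hashing and get short
 * codewords; combinators get multi-byte codewords so the weight they
 * consume covers their hashing cost.  All entries are invented examples. */
#include <stdint.h>

typedef struct {
    const char *name;
    uint8_t codeword_bytes;  /* bytes consumed from a byte-based stream */
    uint8_t compressions;    /* SHA256 compressions charged per node */
} opcode_cost;

const opcode_cost example_table[] = {
    /* Jets: CMR/IMR known in advance, so no per-node hashing. */
    { "jet_example_a", 1, 0 },
    { "jet_example_b", 2, 0 },
    /* Combinators: codeword length chosen so weight covers hashing. */
    { "comp",          2, 4 },
    { "case",          2, 4 },
    { "disconnect",    3, 4 },
};
```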
> We can presume/require that nodes have SHA2 hardware acceleration

https://github.com/rust-bitcoin/rust-bitcoin/pull/1962 suggests a 6x-10x improvement using hardware acceleration.
> We could eat into the budget

I believe we will need something like this annex witness stuffing anyways for some exotic applications that we don't yet know. Can we be sure that the new serialization format in 3. will be sufficient?

Instead of annex stuffing, can this also be solved by users adding extra weight via the disconnect fragment? At redeem time, they can reveal disconnected fragments to increase the weight. If this works, there is no need to go the annex route, and it allows us to keep the format.
@sanket1729
> I believe we will need something like this annex witness stuffing anyways for some exotic applications that we don't yet know. Can we be sure that the new serialization format in 3. will be sufficient?
Yes, we definitely need to support annex stuffing -- and we already do support this; we have some utility methods in the Rust lib to compute how much stuffing to do, etc. But the question is whether we should require so much stuffing that it typically 4xes (or more) the program size and negates the segwit discount (and more).
@roconnor-blockstream
> We could revamp the entire serialization format
Whew. These benefits do all sound great. But it would cause us significant scope creep (at a swag I'd expect design alone to take a month, then at least a month each for the Haskell, C and Rust implementations to update, and this wouldn't be entirely parallelizable). And we are already oversubscribed.
What if today we just did the annex thing (eliminating the segwit discount) for MVP purposes; then for "Simplicity 2" which we could target next year we created a new serialization format, and also hopefully got some input about what jets/combinators are commonly used in practice.
Since the two formats are easily interconvertible, our tooling could support both, and I don't think users would need to know or care. They'd just see that there was a "v2" format which was almost always cheaper on-chain.
I'm a little bit scared that a random program caused so much trouble. Is there a way to search for these hyperexpensive programs?
> We can presume/require that nodes have SHA2 hardware acceleration
If we want Simplicity to run on Bitcoin eventually, then hardware acceleration must become part of the minimum specification of a full node. The latest Raspberry Pi seems to support it.
> We could eat into the budget
This is the best solution we have right now. In my Elements script asset tests, only a few programs required padding. If it helps, I could provide a concrete number.
> We could revamp the entire serialization format
Encoding combinators with more bits than necessary sounds interesting. Switching the encoding to bytes would make a lot of things easier, but it would also require us to change a lot of code. Bits allow us to tune the combinator encoding length more precisely than bytes.
Before we do anything, I would try to find more hyperexpensive programs. Right now, we have one data point.
I agree with @apoelstra that we should delay invasive changes until we have released a first stable version of Simplicity.
In my test example, even using sha_ni intrinsics, the program still takes more than 20% longer than our target. It's better than 300% longer, but it still doesn't reach our goals.
Fixed in #241.
I found a program that seems to exceed the expected runtime given its cost.
The program is constructed as follows:
Importantly, the type signature of `f(n)` differs depending on how it is used. Sharing happens after the program is constructed.

Let's build a C test vector that measures the runtime. The Elements test suite is inaccurate because it runs the same program multiple times.
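For reference, a minimal single-run timing harness could look like the sketch below. `run_program_once()` is a hypothetical placeholder for whatever entry point the C test vector ends up calling; the point is only that the program is parsed and evaluated exactly once per measurement, rather than amortized over repeated runs as in the Elements test suite.

```c
/* Minimal single-run timing sketch (POSIX clock_gettime).
 * run_program_once() is a hypothetical placeholder for the actual
 * evaluation entry point of the C test vector; the goal is to time
 * exactly one evaluation instead of amortizing over repeated runs. */
#include <stdio.h>
#include <time.h>

extern int run_program_once(void); /* hypothetical: decode + typecheck + evaluate once */

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    int result = run_program_once();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    printf("result: %d, elapsed: %.0f ns\n", result, elapsed_ns);
    return 0;
}
```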