RGB-WG / rgb-core

RGB Core Library: consensus validation for private & scalable client-validated smart contracts on Bitcoin & Lightning
https://spec.rgb.tech
Apache License 2.0

Avoid repeated state transition validation #122

Closed zoedberg closed 1 year ago

zoedberg commented 1 year ago

While using RGB on testnet we noticed slow send times and especially slow refresh operations. Therefore, we created a project, rgb-lib-stress-test, whose purpose is to detect where rgb-lib spends most of its time during these operations.

The project creates and uses 2 rgb-lib wallets: the first wallet issues an RGB20 asset and then sends it to the second wallet, and the two wallets continue exchanging the asset back and forth for the desired number of times.

The first strange pattern we noticed was that operations became progressively slower as the number of transfers increased.

Then, using a patched rgb-core containing extra debug logs (branch debug_consignment_validation), we found out that most of the time was spent validating the consignment. In particular, we noticed a pattern that explains why validation time increases with the number of transitions contained in the consignment. The validate_contract method calls validate_branch for each endpoint transition, so each endpoint is validated along with all of its ancestors on each call. This way a single call to validate_contract ends up verifying the same node (transition) multiple times.

For example, after 4 rounds there are 7 endpoint transitions and the method processes the same nodes multiple times. Here is an example list of nodes and the number of times each gets processed in a single call to validate_contract:

{'9a9936e19cc9a33003c2299b9568733936886ad5ece0ca860528325a6c375fd3': 12,
'0f139171ad5544ffcf450e120e5fc67037c6a1524947f5ee3179400615331b45': 12,
'6acbaa0f3117d442747d849c223c64d32fb5688824a47b6a8d4e83d3e590f89b': 7,
'2aa3375e8211334c1ff44f97fe250672924da6fa903502e1c889cf98a84f933d': 4,
'7cd08aa1467271c966b94ba778e8787fc93c99cc5e6e6bc200e3269a20151e14': 3,
'6b7e047c4055b7d125fe50bcbf7041872edbfdeedb14f19a2d591ea75328d245': 3,
'03cec8ea1182a16d64d98a43aaba8a37a60740e3a2c5efd19b96e361e970a138': 2,
'2176af248349e65d7cdc7662110830708b2a9b2bf78cdab532ffd278c93bdc62': 1}
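The shape of these counts follows directly from the traversal: each endpoint triggers a full walk up its own ancestry, so transitions near the start of the history are visited once per endpoint that descends from them. A minimal sketch of the pattern (hypothetical types and simplified signatures, not the actual rgb-core code):

```rust
use std::collections::HashMap;

type NodeId = &'static str;

// Walks from `node` up through all of its ancestors, counting visits.
fn validate_branch(
    node: NodeId,
    parents: &HashMap<NodeId, Vec<NodeId>>,
    visits: &mut HashMap<NodeId, usize>,
) {
    *visits.entry(node).or_insert(0) += 1; // real code would validate here
    for parent in parents.get(node).into_iter().flatten().copied() {
        validate_branch(parent, parents, visits);
    }
}

// One full ancestry walk per endpoint: without memoization, a node shared
// by N endpoints gets validated N times in a single call.
fn validate_contract(endpoints: &[NodeId], parents: &HashMap<NodeId, Vec<NodeId>>) {
    let mut visits = HashMap::new();
    for &endpoint in endpoints {
        validate_branch(endpoint, parents, &mut visits);
    }
    println!("{visits:#?}"); // reproduces skewed counts like the list above
}
```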

I think we could add a DB table (if one is not already available) where we save processed node IDs, and modify the validate_branch method to record processed elements and check the list so it can skip any already-validated nodes. This should fix the main slowness issue while also adding an extra cache layer that prevents clients from processing the same nodes multiple times across different API calls.

@dr-orlovsky this is probably not the best solution but what do you think? Do you have any suggestions on how to improve this?

dr-orlovsky commented 1 year ago

Thank you for the detailed investigation! I think that adding a DB table will not work: while multiple consignments may have the same state transitions with the same transition ids, they may have different concealed and revealed items and must be re-validated for each consignment. So instead, an in-memory index of validated state transitions should be kept during each consignment validation. I will work on doing that.
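A sketch of what such an in-memory index could look like (hypothetical names and signatures, not the final implementation; the set lives only for the duration of a single consignment validation, so re-validation across consignments still happens as required):

```rust
use std::collections::{HashMap, HashSet};

type NodeId = &'static str;

fn validate_branch(
    node: NodeId,
    parents: &HashMap<NodeId, Vec<NodeId>>,
    validated: &mut HashSet<NodeId>,
) {
    if !validated.insert(node) {
        return; // already validated within this consignment - skip
    }
    // ... perform the actual per-transition checks here ...
    for parent in parents.get(node).into_iter().flatten().copied() {
        validate_branch(parent, parents, validated);
    }
}

// The endpoint loop stays, but the set is shared across all endpoints,
// so each transition is now processed exactly once per consignment.
fn validate_contract(endpoints: &[NodeId], parents: &HashMap<NodeId, Vec<NodeId>>) {
    let mut validated = HashSet::new();
    for &endpoint in endpoints {
        validate_branch(endpoint, parents, &mut validated);
    }
}
```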

cryptoquick commented 1 year ago

Excellent work on making the stress-test, @zoedberg!

Also, in BitMask everything we do with RGB is done in-memory; here's a reference in case you ever need it, @dr-orlovsky: https://github.com/diba-io/bitmask-core/tree/development/src/operations/rgb

We decided early on that it'd be better to do it this way since it'd be a more flexible approach across platform contexts.

nicbus commented 1 year ago

The increase in transfer times I noticed while writing the stress test was enough to deem it too long for a user to wait. To give some context, the first transfer already required a total of ~30 seconds, and by the 8th transfer (back and forth 4 times) it already took ~5 minutes. Results may vary from machine to machine and you're welcome to double-check by trying out the stress test. These numbers are the sum of the time required by both sender and receiver, with no intermediate delays, and they should give a general idea of the impact of the issue.

Using Iris and sending assets back and forth between two wallets also yields a noticeable slowdown in just a few iterations, which is where our initial analysis started.

Given these numbers, I think an in-memory solution would not resolve the core issue, as the node would still need a very long time to validate all transitions (be it at startup or upon first operating on a contract). This would be especially true for consignments containing a long transition history. Matters would be even worse on devices with constrained resources, like mobile devices, where CPU time is not abundant and battery consumption also has to be taken into account.

Saving verified transitions to DB should also help with first-time validation: a node seeing a contract for the first time via a consignment with a long history needs to validate all past transitions, which takes a lot of time, so validation could get interrupted mid-way and have to restart from the beginning. In this scenario, if already-validated transitions were saved to DB, validation could resume from where it stopped, making it more robust on top of saving a lot of time.

Going back to the statement that "multiple consignments [...] may have different concealed and revealed items", I'm not sure I understand why. I thought past transitions were immutable and so could be verified only once by each node. This probably does not apply to the last transition in the consignment, which I guess will have the same node id but different concealed and revealed items when it is spent, so that one will probably need to be re-validated and not saved to DB, although in this case the node would actually be validating its own transition, which might not be necessary.

@dr-orlovsky could you please explain in more detail why past transitions need to be re-validated and especially why their concealed and revealed items might change over time?

dr-orlovsky commented 1 year ago

It is not the validation that consumes the time but the Electrum server requests. Of course, when you have 100 state transitions it is better to do 100 Electrum requests than 100,000 (when each transition is validated multiple times) - but even 100 Electrum requests will take forever. Using a database would not solve the issue in any way, and we can't have a database in RGB Core, which is I/O-less and should be WASM-compilable. So:

  1. The short-term solution is to avoid repeated Electrum requests for the same txid, with an in-memory cache of already checked data (see the sketch after this list).

  2. The long-term solution is to get rid of Electrum and have BP Node done, with ultra-fast batch requests.
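A sketch of point 1 (hypothetical types; `fetch_from_electrum` stands in for whatever Electrum client call is actually used): each txid hits the network at most once per validation run, and repeated lookups are served from the in-memory map:

```rust
use std::collections::HashMap;

type Txid = [u8; 32];

#[derive(Clone)]
struct Tx; // placeholder for a parsed bitcoin transaction

struct CachingResolver {
    cache: HashMap<Txid, Tx>,
}

impl CachingResolver {
    // Returns the cached transaction, querying the server only on a miss.
    fn resolve(&mut self, txid: Txid) -> &Tx {
        self.cache
            .entry(txid)
            .or_insert_with(|| fetch_from_electrum(txid))
    }
}

// Stand-in for the actual Electrum round-trip (network I/O lives here).
fn fetch_from_electrum(_txid: Txid) -> Tx {
    Tx
}
```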

fedsten commented 1 year ago

@dr-orlovsky not everybody has the same knowledge of RGB Core as you do; could you elaborate on why exactly we cannot have a database? Also, could you clarify whether or not past transitions are actually immutable? If they are, I am sure that one way or another we can find a better solution that does not re-validate them multiple times.

dr-orlovsky commented 1 year ago

The main thing is that a DB does not solve any part of the problem, which is related to the use of the slow Electrum server and to repeated validation. Repeated validation within a consignment is solvable without a DB; and no DB can replace the Electrum server - unless it is the size of the Electrum index, in which case that DB is called an "Electrum replacement" and is BP Node.

RGB Core is a library implementing the RGB consensus protocol. The consensus protocol can't rely on a specific DB, since this would result in implicit consensus rules and unexpected hardforks, as was the case with the unexpected Bitcoin Core hardfork caused by Berkeley DB limits back in 2013.

Also, if RGB Core had a DB, it would have a lot of trouble running on hardware wallets etc., which have to perform the validation.

The only acceptable solution (which was planned from day 1) is to replace the slow Electrum server with a different, RGB-optimized indexer: BP Node.

fedsten commented 1 year ago

Thanks for the reply. Could you confirm whether or not past transitions are actually immutable? I thought they were, but from your previous answer, where you said that "while multiple consignments may have the same state transitions with the same transition ids, they may have different concealed and revealed items and must be re-validated for each consignment", it seems they may not be. I want to make sure I understand correctly how RGB works, so I would really appreciate it if you could clarify this.

dr-orlovsky commented 1 year ago

Sorry for failing to reply earlier.

The transitions are indeed immutable! What can change is our knowledge about the transition data (because of zero-knowledge and the reveal/conceal procedures).

A transition is validated in the parts which are known or which can be validated via zero-knowledge proofs. For unknown data, validation does not happen, and some schemata may allow that. For instance, the NFT token data may not be known for tokens not owned by the validating party - and that's normal. Such tokens may be invalid - and this would not violate the validity of other NFT tokens. However, a new consignment may contain a state transition previously validated by the local node - but with revealed info about an invalid NFT token which was previously unknown to the node. In this case the consignment has to be rejected - but the user must not lose his already-owned tokens.

dr-orlovsky commented 1 year ago

In other words:

RGB is a partially replicated state machine. And while the local state is valid, it is partial - and a consignment may deliver new partial state which, in the parts previously unknown, can be invalid. Thus, such a consignment has to be rejected. But this new state may be contained only in state transitions, i.e. in the parts of them which were previously unknown to the local node.

A schema may prohibit validating a state transition without provable information about all of its outputs (for instance, this is the case for the current fungible assets schema) - but another schema may allow it (which is fine for the NFT case described above). Thus, we can't rely on the fact that if a state transition was valid for part of its outputs, it will also be valid for the other outputs once their state is revealed.
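A toy model of why a bare node id is not a sound cache key across consignments (illustrative types only, not rgb-core's): validation can only judge the revealed parts of a transition, so a verdict computed under one consignment says nothing about outputs that consignment kept concealed and a later one reveals:

```rust
enum OutputState {
    Concealed,                // only a commitment is known; nothing to check
    Revealed { valid: bool }, // full data is known and can be checked
}

struct Transition {
    outputs: Vec<OutputState>,
}

// Passes if every *revealed* output checks out; concealed outputs are
// skipped, which some schemata (e.g. the NFT case above) may allow.
fn validate_known_parts(t: &Transition) -> bool {
    t.outputs.iter().all(|o| match o {
        OutputState::Concealed => true,
        OutputState::Revealed { valid } => *valid,
    })
}

// The same transition id can therefore be "valid" in consignment A, where
// a bad output is still Concealed, and invalid in consignment B, where it
// has been revealed - only (node id, revealed set) would be a sound key.
```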

In other words, consignments are validated without a stash - and that's why the validation can be run by the RGB Core library without RGB Node connectivity (and by hardware wallets with no RGB Node available to them). The only thing needed is access to a bitcoin transaction index based on a validated blockchain.

dr-orlovsky commented 1 year ago

BP Node is in any case the very next must-have thing - we already require it to complete the LN part (which can't run without re-indexing the blockchain on top of the index done by Bitcoin Core and Electrum, due to the short channel ids from the BOLT specs). I have already been designing its high-performance skeleton for the last two months, and the first version is expected during February.

nicbus commented 1 year ago

I think this would be easier to tackle if we split it into two issues:

  1. (bug) re-validating the same transition multiple times in a single consignment validation call (validate_contract -> validate_branch)
  2. (enhancement) speed up consignment validation in general and across rgb-core Validator::validate calls (db, electrum, bp-node)

I suggest keeping this issue for the bug (1), as the initial post is mainly about it, and creating a new issue for the enhancement (2), where we can continue the related discussion.

@dr-orlovsky let me know if you agree and whether you want me to open the new one.

nicbus commented 1 year ago

The enhancement part has been split into #123, as agreed with @dr-orlovsky on Telegram.

This issue is thus now only related to the validation slowness bug inside a single Validator::validate call. @dr-orlovsky can you please replace the enhancement tag with bug, for clarity?

Thinking about this issue and comparing the stress-test output (from branch debug_consignment_validation) with rgb-core's code, in an attempt to help fix this as soon as possible: it seems to me that the for loop around validate_branch in validate_contract (src/validation.rs, line 420 as of commit ee04bb8) might not be necessary, as validate_branch already recurses up to all parents of the starting node.

If that's the case, an in-memory cache might not be necessary, but I leave the final decision to @dr-orlovsky, as I only partially know the code and might be wrong.