Question about FlushAtDepth

TianyueLi1227 commented 4 months ago

Hello, I noticed that in the tree.go, there is a function named FlushAtDepth. However in the test part, there are also TODOs say

//TODO: fix this test when we take a final decision about FlushAtDepth API.

I am curious about the current issue with FlushAtDepth, is there anything wrong with its functionality? What kind of context is realted to this API? Thank you very much with the help!

jsign commented 4 months ago

The original use of FlushAtDepth was to serialize mutated nodes in the tree and save them to disk after block processing. That worked but was inefficient since each node serialization was done in the flushFunc passed as a parameter.

The serialization is CPU-bound and the recursive nature meant we were using a single core. For that reason, and since node serialization was the most important use case, BatchSerialize was created, which is more optimized for the task.

TianyueLi1227 commented 4 months ago

Thank you very much for your answer! This is a great update for how to serialize the node in memory into a HashedNode into disk. I am using go-verkle for my own project recently, but I am troubled by the huge occupation of the memory since I stored huge amount of the data into the verkle tree, and it costs around 60GB of memory to run my program properly. Therefore, I am really keen on this Serialize method in order to save the memory.

I just got couple questions for how to use BatchSerialize method.

I notice that there is a limit of 1024 node to be serialize at once. This is ideal when the verkle tree is small. But what should I do if I am trying to serialize a huge verkle tree when the number of serialized node is hard to predicte?
I notice that the HashedNode type is not fully implemented. Do you intend to replace NodedNode with SerializedNode when inserting/getting values from verkle tree so that it could read from the disk? Does that mean I need to construct a way of relating the verkleNode flushed into the disk with its location in the original verkle tree so that it could be recovered when needed?

Thank you very much! Looking forward to your response!

jsign commented 4 months ago

The 1024 is preallocated space. Slices in Go can grow, so there's no limit.
HashedNode is implemented, which doesn't mean it can resolve all expected methods of Node. Is a type of node that only has the "hash" information, but nothing else. It's used to represent a node that we had stored on disk. The idea is that when serializing a tree, we skip HashedNode since that means those didn't change. You can see in the tree mutating methods that whenever we find a HashedNode we use a provided resolver function to pull that node from disk.

TianyueLi1227 commented 4 months ago

Thank you very much for the explanation. I am getting more understanding of the hashedNode. Recently, I am implementing a LRU like funcationality of based on verkle Tree, which will decide when to put some nodes into the disk, and when to put them back to verkle Tree. Since I did not find the implementation of transforming some serializedNode into the hashedNode on verkle tree after the BatchSerialize. Does that means go-verkle library is expecting its users to determine the way how they want to build the NodeResolveFn and decide when to put the serializedNode outputed by BatchSerialization? If not, could you please remind me if there is any useful function I might missed.

Also, by my current design, given a path to some leafNode, I will flush nodes after some certain predetermined depth (say 10), which is kind of similar to FlushAtDepth but it will have a provided path at first and only flush after some depth. My question here is whether this kind of flush is feasible. Personally, I find this flush mechinism to be more suitable to BatchSerialize because they all go down the branch to the leaveNode. But I am worried there could be some details I missed.

jsign commented 4 months ago

Does that means go-verkle library is expecting its users to determine the way how they want to build the NodeResolveFn and decide when to put the serializedNode outputed by BatchSerialization?

Correct. The library doesn't impose any decision on that, so that's for the library client to decide.

My question here is whether this kind of flush is feasible. Personally, I find this flush mechinism to be more suitable to BatchSerialize because they all go down the branch to the leaveNode. But I am worried there could be some details I missed.

If depth is important to you, then FlushAtDepth sounds reasonable. It might not be the fastest way but should be fine. If not you could use BatchSerialize to get perf benefits, and filter out the depth after you get the result. You might end up doing CPU work that gets discarded, but might still be faster than FlushAtDepth, it depends on how many things are in memory.

Let me know when is as good moment to close this issue as resolved.

jsign commented 4 months ago

I'd assume the original question was answered.

ethereum / go-verkle

Question about FlushAtDepth #437