ethereumjs / ethereumjs-monorepo

Monorepo for the Ethereum VM TypeScript Implementation
2.57k stars 744 forks source link

Feature request for WalkController: extracting nodes by level #2856

Open TimDaub opened 1 year ago

TimDaub commented 1 year ago

To use the ethereumjs/trie library for set reconciliation, we go breadth-first, level by level, take those nodes as an array and then send them to remote node for comparing their hashes; we implement a function descend as follows

https://github.com/attestate/kiwistand/blob/e2c8562fc202a7b5450a1d28e9774b52ca512409/src/store.mjs#L162-L220

For that, we keep track of a decrementing value we call level until it reaches zero, in a naive attempt to figure out how far in the trie we can traverse into. Note that the key length of a node itself cannot be used as a substitute for this, as a PMT in Ethereum features several optimizations e.g. ExtensionNodes, which I don't know how to e.g. handle in terms of their key length.

While recently coming across this while investigating a synchronization bug, I noticed that the level variable won't decrement often enough if the trie has a big branching factor, as this unit test shows.

test("store.descend function should correctly handle high branching factor", async (t) => {
  env.DATA_DIR = "dbtestA";
  const trieA = await store.create();

  // Manually create a trie with a high branching factor
  await trieA.put(Buffer.from("0100", "hex"), Buffer.from("A", "utf8"));
  await trieA.put(Buffer.from("0101", "hex"), Buffer.from("B", "utf8"));
  await trieA.put(Buffer.from("0102", "hex"), Buffer.from("C", "utf8"));
  await trieA.put(Buffer.from("0103", "hex"), Buffer.from("D", "utf8"));
  await trieA.put(Buffer.from("0104", "hex"), Buffer.from("E", "utf8"));
  await trieA.put(Buffer.from("0200", "hex"), Buffer.from("F", "utf8"));

  const root = trieA.root();
  const originalLevel = 2;
  const currentLevel = 2;
  const nodes = await store.descend(trieA, root, originalLevel, currentLevel);

  // Check that all nodes are included in the result
  t.is(nodes.length, 6); // it fails and nodes.length is 2

  await rm("dbtestA", { recursive: true });
});

So I've been playing around a bit today and I'm noticing that it is pretty difficult to figure out a nice way to keep track of the tries current level within onFound. E.g. allChildren, walkTrie and onFound don't permit additional data to be passed in, but then how is one supposed to e.g. keep track how often allChildren has been called already. I've actually played several options through also with gpt4 today and none of them really seemed elegant to me, for why I now opened this issue to shed some light on the problem. I think ideally, I could just get all nodes of a level from the walkController in the native implementation.

However, I do feel like using JS's https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy Proxy could actually help keeping track and achieving what I want. We could make onFound accept the level variable, which is then internally used to keep track of the initial level. Furthermore, every node's allChildren could be manipulated with the Proxy pattern where ... but no essentially not because the implementation of allChildren is too complex to proxy any behavior in multiple parts of the module

/**
   * Run all children of a node. Priority of these nodes are the key length of the children.
   * @param node - Node to get all children of and call onNode on.
   * @param key - The current `key` which would yield the `node` when trying to get this node with a `get` operation.
   */
  allChildren(node: TrieNode, key: Nibbles = []) {
    if (node instanceof LeafNode) {
      return
    }
    let children
    if (node instanceof ExtensionNode) {
      children = [[node.key(), node.value()]]
    } else if (node instanceof BranchNode) {
      children = node.getChildren().map((b) => [[b[0]], b[1]])
    }
    if (!children) {
      return
    }
    for (const child of children) {
      const keyExtension = child[0] as Nibbles
      const childRef = child[1] as Buffer
      const childKey = key.concat(keyExtension)
      const priority = childKey.length
      this.pushNodeToQueue(childRef, childKey, priority)
    }
  }

So conceptually, the only way I can think of keeping track of the level externally would be through a pretty weird javascript object that keeps track of all paths I'm traversing down and maintaining a level variable (as opposed to passing this through in the invocation scope), or implementing a sort-of walkController myself.

In fact, I have actually attempted to build my own breadth first trie traversal but only with non-successful results (slight modification from the above mentioned (and buggy) function descend):

export async function descend(
  trie,
  root,
  originalLevel,
  currentLevel,
  exclude = []
) {
  const emptyRoot = Buffer.from(
    "56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
    "hex"
  );
  if (currentLevel === 0 && isEqual(emptyRoot, trie.root())) {
    return [
      {
        level: 0,
        key: Buffer.alloc(0),
        hash: trie.root(),
        node: null,
      },
    ];
  }

  let nodes = [];
  // NOTE: We're marking the variable "nodeRef" as _ wildcard here as it
  // doesn't reliably always return the node's hash, but e.g. sometimes the
  // composition of branches of a BranchNode (we think).
  const onFound = async (_, node, key, walkController) => {
    const nodeHash = hash(node);
    // TODO: Would be better if this was a set where all the hashes are included
    // e.g. as strings? It seems very slow to look up something using find.
    const match = exclude.find((markedNode) => isEqual(markedNode, nodeHash));
    // NOTE: The idea of the "exclude" array is that it contains nodes that
    // have matched on the remote trie, and so we don't have to send them along
    // in a future comparison. Hence, if we have a match, we simply return.
    if (match) return;

    if (currentLevel === 0) {
      if (node instanceof LeafNode) {
        const fragments = [key, node.key()].map(nibblesToBuffer);
        key = Buffer.concat(fragments);
      } else {
        key = nibblesToBuffer(key);
      }

      nodes.push({
        level: originalLevel,
        key,
        hash: nodeHash,
        node,
      });
    } else if (node instanceof BranchNode || node instanceof ExtensionNode) {
      const children =
        node instanceof BranchNode
          ? node.getChildren()
          : [[node.key(), node.value()]];
      for (const child of children) {
        const childNodes = await descend(
          trie,
          hash(child[1]),
          originalLevel,
          currentLevel - 1,
          exclude
        );
        nodes.push(...childNodes);
      }
    }
  };
  try {
    await trie.walkTrie(root, onFound);
  } catch (err) {
    if (err.toString().includes("Missing node in DB")) {
      elog(err, "descend: Didn't find any nodes");
      return nodes;
    } else {
      throw err;
    }
  }
  return nodes;
}

This is mainly because it then would really complicate my self-implemented hashing function with an alternative representation of some of the trie nodes

// NOTE: https://ethereum.github.io/execution-specs/diffs/frontier_homestead/trie/index.html#ethereum.frontier.trie.encode_internal_node
export function hash(node) {
  // I had to add these if statements to parse arrays into Branch, Extension and Leaf nodes
  if (Array.isArray(node)) {
    if (node.length === 2 && Array.isArray(node[1]) && node[1].length === 17) {
      node = BranchNode.fromArray(node[1]);
    } else if (
      node.length === 2 &&
      Buffer.isBuffer(node[0]) &&
      Buffer.isBuffer(node[1])
    ) {
      node =
        node[0].length === 2
          ? new ExtensionNode(node[0], node[1])
          : new LeafNode(node[0], node[1]);
    } else {
      throw new Error("Invalid node format");
    }
  }

  if (
    node instanceof BranchNode ||
    node instanceof LeafNode ||
    node instanceof ExtensionNode
  ) {
    const encoded = rlp.encode(node.raw());
    if (encoded.length < 32) return node.serialize();
    return Buffer.from(keccak256(encoded));
  } else if (Buffer.isBuffer(node) && node.length > 0) {
    return node;
  } else {
    throw new Error(
      "Must be BranchNode, LeafNode or ExtensionNode or non-empty buffer."
    );
  }
}

Besides, I've found out in this process that nodeRef does not always returns a Buffer, which represents the hash of a node, but sometimes it also returns the above special case for handling trie nodes in this internal representation.

So a couple more thoughts here to conclude this:

TimDaub commented 1 year ago

Update, I now forked the WalkController to pass in a level variable. Not sure how useful you deem this to the main library: https://github.com/attestate/kiwistand/commit/ecdad2b2d52d7e0200e0371322fd76d63f056e0b, but I could send a PR for that

holgerd77 commented 1 year ago

Hi there, we are in the process of preparing (breaking) releases for the monorepo atm which is time consuming, so I don't think we'll have the ressources to judge on this extensive proposal short-term (sorry for that).

TimDaub commented 1 year ago

are you expecting breaking changes on the trie library interface? I‘m using it quite a bit, also internal structures and it‘d be quite bad if it was broken. Can I get in touch with the maintainer in case?

btw: no worries, I forked the walkcontroller so it‘s all good

holgerd77 commented 1 year ago

are you expecting breaking changes on the trie library interface? I‘m using it quite a bit, also internal structures and it‘d be quite bad if it was broken. Can I get in touch with the maintainer in case?

btw: no worries, I forked the walkcontroller so it‘s all good

We do not have a specific maintainer for our trie library (actually: for all our libraries). Yes, these releases will change the API for all our libraries to some extend, if you are interested in a change summary see https://github.com/ethereumjs/ethereumjs-monorepo/pull/2832.

Yes, this will need some upgrade work if you plan to upgrade. But that's the way of libraries I guess, we want to evolve them, but also do not make it too hard for users to upgrade? Sorry, not possible to solve this target conflict. 😋

You should be able to stay on the current version for some time if you do not want to have the work right now, especially the trie library is not change-intense in terms of EIP or API changes/additions.

ScottyPoi commented 1 year ago

@TimDaub --

The "optimizations" you mentioned kind of obscure my ability to conceive of the Ethereum Trie in "Levels" like an ordinary merkle tree.

The "Patricia" structure is condensed prior to the Trie hashing, which is how we end up with ExtensionNodes, and why some LeafNodes have more "nibbles" than others.

I find some contradictions if I try to think about "Level 10" of such a structure, since some nodes will hold longer sections (more nibbles) of the key path.

Is Level 10:

  1. "the set of nodes that are each 10 nodes away from the root" or
  2. "the set of nodes 10 nibbles down in all directions"

Either way, 10 levels down one path will look very different than 10 levels down another path.

TimDaub commented 1 year ago
  1. and here‘s that function: https://github.com/attestate/kiwistand/blob/c26ba04f5c3354787a16ea48a299858d9981369d/src/store.mjs#L166

But we could also make it based on nibbles. Tbh, I was surprised that there isn‘t a good way to descend as part of the library itself.