Could you please help me understand how root hash is calculated?

kovalensky commented 1 year ago

I'm kinda lost in understanding of torrent.py's merkle root hash creation, I'm trying to write a function in php to calculate root hashes of files, I can create v1 implementations but still struggling with v2, could you please provide me a simple one block python code function to generate root hash from given file so I could inspect it and learn how to write something equal in php, I know it's too much to ask, but I'm just lost in hash functions, thanks.

alexpdev commented 1 year ago

Have you read the documentation on it? https://www.bittorrent.org/beps/bep_0030.html

I'm kinda lost in understanding of torrent.py's

The merkle roots are not calculated in torrent.py they are calculated in hasher.py. The entire version 2 implementation is contained in the the HasherV2 class.

def merkle_root(blocks: list) -> bytes:
    """
    Calculate the merkle root for a seq of sha256 hash digests.
    Parameters
    ----------
    blocks : list
        a sequence of sha256 layer hashes.
    Returns
    -------
    bytes
        the sha256 root hash of the merkle tree.
    """
    if blocks:
        while len(blocks) > 1:
            blocks = [
                sha256(x + y).digest() for x, y in zip(*[iter(blocks)] * 2)
            ]
        return blocks[0]
    return blocks

class HasherV2:
    """
    Calculate the root hash and piece layers for file contents.

    Iterates over 16KiB blocks of data from given file, hashes the data,
    then creates a hash tree from the individual block hashes until size of
    hashed data equals the piece-length.  Then continues the hash tree until
    root hash is calculated.

    Parameters
    ----------
    path : str
        Path to file.
    piece_length : int
        Size of layer hashes pieces.
    """

    def __init__(self, path: str, piece_length: int):
        """
        Calculate and store hash information for specific file.
        """
        self.path = path
        self.root = None
        self.piece_layer = None
        self.layer_hashes = []
        self.piece_length = piece_length
        self.num_blocks = piece_length // BLOCK_SIZE
        with open(self.path, "rb") as fd:
            self.process_file(fd)

    def process_file(self, fd: str):
        """
        Calculate hashes over 16KiB chuncks of file content.

        Parameters
        ----------
        fd : TextIOWrapper
            Opened file in read mode.
        """
        while True:
            blocks = []
            leaf = bytearray(BLOCK_SIZE)
            # generate leaves of merkle tree

            for _ in range(self.num_blocks):
                size = fd.readinto(leaf)
                if not size:
                    break
                blocks.append(sha256(leaf[:size]).digest())

            # blocks is empty mean eof
            if not blocks:
                break
            if len(blocks) != self.num_blocks:
                # when size of file doesn't fill the last block
                # when the file contains multiple pieces
                remaining = self.num_blocks - len(blocks)
                if not self.layer_hashes:
                    # when the there is only one block for file
                    power2 = next_power_2(len(blocks))
                    remaining = power2 - len(blocks)

                # pad the the rest with zeroes to fill remaining space.
                padding = [bytes(32) for _ in range(remaining)]
                blocks.extend(padding)
            # calculate the root hash for the merkle tree up to piece-length

            layer_hash = merkle_root(blocks)
            self.layer_hashes.append(layer_hash)
        self._calculate_root()

    def _calculate_root(self):
        """Calculate root hash for the target file."""
        self.piece_layer = b"".join(self.layer_hashes)
        hashes = len(self.layer_hashes)
        if hashes > 1:
            pow2 = next_power_2(hashes)
            remainder = pow2 - hashes
            pad_piece = [bytes(HASH_SIZE) for _ in range(self.num_blocks)]
            for _ in range(remainder):
                self.layer_hashes.append(merkle_root(pad_piece))
        self.root = merkle_root(self.layer_hashes)

An oversimplified explanation is that you break the file into pieces just like you do in the version 1 protocol, but instead of using piece-length blocks of the file you instead use 16KiB blocks of the file. Then once you have the entire file broken into 16KiB blocks and their sha256 hashes, you iterate through them pairwise and hash their hashes together, which would leave you with exactly half the number of hashes. Then you apply the same process over again until you finally reach the root hash.

kovalensky commented 1 year ago

@alexpdev Thank you🎉, I managed to make it work in tmrr

alexpdev commented 1 year ago

@kovalensky Right on!

kovalensky commented 1 year ago

Hello @alexpdev again, hope you are doing great. I had a question, that went over my head for a couple of days now.

Can we validate root Merkle hashes for genuinity? The thing is, if we fetch torrent info metadata from peers, by some extent, they could send us fake metadata, with fake hashes. This could be useful for deceiving DHT search engines that show file hashes in web pages for example. And since clients can upgrade connections catching both formats, from v1 to v2 and vice versa from peers, and if clients have errors with validating root hashes of files using v2 connection, they will fallback to v1 (which could have completely different data than files we needed).

piece_layers dictionary in .torrent files have this format:

root_hash : layers_string

My idea was that if we could validate those "root_hash"es, calculating them from layers_string, we could reject fake torrents.

I tried to calculate root hash from layers_string using previous code, but didn't succeed, or maybe I'm doing something wrong?

alexpdev commented 1 year ago

I tried to calculate root hash from layers_string using previous code, but didn't succeed, or maybe I'm doing something wrong?

You must be doing something wrong because that is the only way in which to get the root hash to begin with.

from v1 to v2 and vice versa from peers, and if clients have errors with validating root hashes of files using v2 connection, they will fallback to v1 (which could have completely different data than files we needed).

THis isn't accurate, torrents exchanging data on v1 and v2 swarms simultaneously are required to describe the exact same data. That is why it is necessary to include b0 padded files in the pieces list in the info dictionary for hybrid torrent files.

I highly recommend reading the following links, You may have to read them more then once. I think I had to read the V2 BEP 12 times before I finally figured out what they were even talking about.

https://www.bittorrent.org/beps/bep_0005.html https://www.bittorrent.org/beps/bep_0030.html https://www.bittorrent.org/beps/bep_0051.html https://www.bittorrent.org/beps/bep_0052.html

kovalensky commented 1 year ago

I tried to calculate root hash from layers_string using previous code, but didn't succeed, or maybe I'm doing something wrong?

You must be doing something wrong because that is the only way in which to get the root hash to begin with.

Didn't see your answer notification, probably I tried to find pieces root based on string versions of piece layers, I had to do it with binary format and see if it works, will tell if it will work

alexpdev / torrentfile

Could you please help me understand how root hash is calculated? #166