7sDream / torrent_parser

A .torrent file parser and creator for both Python 2 and 3
MIT License
148 stars 22 forks

add functions to calculate v1 and v2 info hashes of torrent files #14

Open milahu opened 1 year ago

milahu commented 1 year ago

currently the v1 hash appears only in this test

https://github.com/7sDream/torrent_parser/blob/23b9e110beb5b91c5498b286bd9d8cce83cfc076/tests/test_info_hash.py#L15-L20

expected:

torrent = torrent_parser.parse_torrent_file("input.torrent", hash_raw=True) 

info_hash_v1_raw = torrent_parser.get_info_hash_v1_raw(torrent) # -> bytes
info_hash_v1_hex = torrent_parser.get_info_hash_v1_hex(torrent) # -> string

info_hash_v2_raw = torrent_parser.get_info_hash_v2_raw(torrent) # -> bytes
info_hash_v2_hex = torrent_parser.get_info_hash_v2_hex(torrent) # -> string

get_info_hash_v2 simply uses hashlib.sha256 instead of hashlib.sha1

binascii.hexlify can be avoided by using hexdigest

info_hash_v1 = hashlib.sha1(info_bytes).hexdigest()
info_hash_v2 = hashlib.sha256(info_bytes).hexdigest()
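A minimal sketch of the four requested helpers, assuming the caller already has the bencoded bytes of the info dict. The function names and the `sample_info` value are hypothetical, chosen only for illustration, and none of this is part of torrent_parser's current API:

```python
import hashlib


def info_hash_v1_raw(info_bytes):
    """SHA-1 digest of the bencoded info dict (20 raw bytes)."""
    return hashlib.sha1(info_bytes).digest()


def info_hash_v1_hex(info_bytes):
    """Same digest as a 40-character hex string; no binascii needed."""
    return hashlib.sha1(info_bytes).hexdigest()


def info_hash_v2_raw(info_bytes):
    """SHA-256 digest of the bencoded info dict (32 raw bytes)."""
    return hashlib.sha256(info_bytes).digest()


def info_hash_v2_hex(info_bytes):
    """Same digest as a 64-character hex string."""
    return hashlib.sha256(info_bytes).hexdigest()


# Hand-written bencoded info dict, used only for illustration.
sample_info = (
    b"d6:lengthi12e4:name8:test.txt12:piece lengthi16384e6:pieces20:"
    + b"\x00" * 20
    + b"e"
)

print(info_hash_v1_hex(sample_info))
print(info_hash_v2_hex(sample_info))
```

The helpers take bytes rather than a parsed torrent, so how those bytes are obtained (re-encoding or slicing the original file) stays a separate decision.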

related: https://stackoverflow.com/questions/46025771/python3-calculating-torrent-hash


stupid question: does parse_torrent_file preserve the key order of the info dict? since python 3.7, dict preserves insertion order by default

https://stackoverflow.com/questions/19749085/calculating-the-info-hash-of-a-torrent-file

Be observant that the example torrent file given by Arvid, both the root-dictionary and the info-dictionary is unsorted. According to the bencode specification a dictionary must be sorted. However the agreed convention when a info-dictionary for some reason is unsorted, is to hash the info-dictionary raw as it is (unsorted), as explained by Arvid above.

7sDream commented 1 year ago

parse_torrent_file will use Python's default dict, but it provides a use_ordered_dict argument, to use collections.OrderedDict. This parameter should be used in this scenario. It seems that this test is not rigorous. I will fix it when I have time.

And for now, you can just use the code in the test file (with use_ordered_dict=True) to calculate the info hash. For more context, see Issue #13.

7sDream commented 1 year ago

Oh, I see that the sorting you want to discuss is not what I had in mind.

If you are talking about lexicographic order, the current implementation does not follow this. There is no mandatory requirement for the dictionary to be in order when parsing, and it will not actively perform sorting operations during encoding.

But this does not seem to affect the calculation of the info hash, as long as the key order of the info dict produced by the encoding step is the same as in the original file bytes (which is what adding the use_ordered_dict=True parameter ensures).
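To illustrate why the key order matters, here is a small sketch with a hypothetical `bencode` helper that, like the behavior described above, writes dict keys in iteration order and never sorts them. It is a stand-in written for this example, not torrent_parser's encoder:

```python
import hashlib
from collections import OrderedDict


def bencode(value):
    """Minimal bencode encoder; dict keys are written in iteration order."""
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        body = b"".join(bencode(k) + bencode(v) for k, v in value.items())
        return b"d" + body + b"e"
    raise TypeError("cannot bencode %r" % type(value))


# Same content, two key orders: as found in a (non-sorted) file vs. sorted.
original_order = OrderedDict([("name", "a.txt"), ("length", 5)])
sorted_order = OrderedDict(sorted(original_order.items()))

hash_original = hashlib.sha1(bencode(original_order)).hexdigest()
hash_sorted = hashlib.sha1(bencode(sorted_order)).hexdigest()

# The encodings differ byte for byte, so the digests differ as well.
print(hash_original == hash_sorted)
```

Only an encoding that reproduces the original key order gives back the info hash other clients compute for the file.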

milahu commented 1 year ago

There is no mandatory requirement for the dictionary to be in order when parsing

this would be nice to preserve the infohash

it provides a use_ordered_dict argument, to use collections.OrderedDict

this is needed only for python2 and then torrent_parser should use OrderedDict automatically, to preserve the infohash

import sys

if sys.version_info[0] == 2:
    # python2 dicts do not keep insertion order, so fall back to OrderedDict
    from collections import OrderedDict
    result = OrderedDict()
else:
    result = dict()

in python3 (3.7+), dict preserves insertion order, like an OrderedDict

>>> dict(b=2, a=1)
{'b': 2, 'a': 1}

>>> { "b": 2, "a": 1 }
{'b': 2, 'a': 1}

alternative solution: the parser could calculate the infohashes from raw source bytes of the info dict, and store the infohashes in attributes of the result data dict. calculating sha1 and sha256 digests should be cheap enough to make this default for parse_torrent_file. internally, only the raw hashes are stored. the _hex attributes return _raw.hex() (in python3)

torrent = torrent_parser.parse_torrent_file("input.torrent") 

if torrent.has_v1:
    info_hash_v1_raw = torrent.info_hash_v1_raw # -> bytes
    info_hash_v1_hex = torrent.info_hash_v1_hex # -> string

if torrent.has_v2:
    info_hash_v2_raw = torrent.info_hash_v2_raw # -> bytes
    info_hash_v2_hex = torrent.info_hash_v2_hex # -> string

alternatively, we could store the source locations of the info dict, and the user has to read the file again and calculate the digest manually. but IMO, the infohash is always useful when dealing with torrent files