bittorrent / bittorrent.org

Transitioning to stronger hash function #58

Closed the8472 closed 7 years ago

the8472 commented 7 years ago

With the published collision attack on SHA-1, BitTorrent may become less attractive for use cases that need not just simple integrity checking (for which it is still perfectly fine) but also authentication.

We should come up with a plan to transition to stronger hashes for all uses, ideally while maintaining backwards compatibility for a transition period.

Thoughts:

the8472 commented 7 years ago

And if we're doing merkle with a new hash algorithm that also means we don't have to pay attention to backwards compatibility with BEP 30, instead we can incorporate new ideas from issue 29.

the8472 commented 7 years ago

Also add hash agility to avoid another migration in the future.

arvidn commented 7 years ago

I was just going to mention that. It's a good opportunity to start with a hash tree from scratch.

ssiloti commented 7 years ago

I agree that this should be rolled into a new merkle tree torrent spec. Clients should only have to support two types of torrents: legacy/SHA1 and merkle/\<new hash function>.

arvidn commented 7 years ago

Maybe even going so far as to have a tree per file. But maybe that's too much (it would have quite significant consequences for the internals of clients).

the8472 commented 7 years ago

I'm sort of in favor of having a tree per file, but I recognize the complexity it adds. It would alleviate some headaches about appending to torrents and deduping files. But to be backwards-compatible we'll have to include BEP 47 paddings because it requires piece alignment. In a from-scratch spec they would be implicit.

On a more procedural note, how would we handle declaring parts of BEP3 as legacy?

gubatron commented 7 years ago

@dionyziz I think your input on hash functions to consider/benchmark for the purposes of hashing file pieces (and merkle tree creation) for integrity verification (and other operations) of torrent files would be most appreciated, given your field of expertise. (My guess is that SHA-256 and SHA-3 would be the first candidates, but I have no clue about the performance of these functions vs. the old SHA-1.)

ssiloti commented 7 years ago

@the8472 The body of BEP3 should be left unmodified and all changes described in a new BEP. Depending on how extensive the changes are, it might make sense to use BEP3 as the base for the new BEP to create a new stand-alone protocol spec. Once the new BEP is finalized we can add an "Updated by" or "Obsoleted by" header to BEP3 referencing it.

bramcohen commented 7 years ago

The breakage of sha1 is a collision rather than a reversal, so it isn't a huge problem for BitTorrent (if someone sends you a torrent with broken data in it they can screw you in lots of other and worse ways), but this may be a good impetus for doing all the backwards-incompatible things we'd like to do.

There are three backwards-incompatible things I think are good ideas:

(1) Change the infodict to be a bencoded string of its own rather than an included object. This will get around the problems of bencoding being not really canonical. This is a trivial no-brainer if we're already doing something incompatible.
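Treating the info dict as an opaque byte string sidesteps re-encoding entirely: every client hashes the exact same bytes, so nothing depends on bencoding being canonical. A minimal sketch (the bencoded string below is a made-up example, not a real torrent):

```python
import hashlib

# Hypothetical info dict, kept as the raw bencoded bytes it arrived in.
# Because we never decode-then-re-encode before hashing, two clients can
# never disagree about key ordering or integer formatting.
raw_info = b"d4:name4:test12:piece lengthi16384ee"

infohash = hashlib.sha256(raw_info).hexdigest()
```
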

(2) Switch to separate hashing for the different files in a multifile torrent. There's no technical downside to this except that it's a pain to implement. We should probably just bite the bullet and do it.

(3) Switch to sha256 and using a merkle tree. For various reasons sha256 is clearly the right hash to use. The main problem with bep30 is that it results in the same file hashing to different roots based on the piece size, which is completely unnecessary. The hashing should go down to fixed size terminals, no more than 16kib but maybe 8 or 4 for more flexibility, and it should not change with the piece size. The piece size should still be specified in the info dict but it's there so that peers are all speaking the same language when they discuss what they have, without affecting hashes. This requires that piece sizes only be powers of two, so apologies to anyone fond of setting piece sizes to prime numbers.
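A minimal sketch of such piece-size-independent hashing, assuming SHA-256 and fixed 16 KiB terminals (the odd-node and empty-input conventions here are placeholders, not anything from a spec):

```python
import hashlib

LEAF_SIZE = 16 * 1024  # fixed terminal size, independent of piece size

def merkle_root(data: bytes) -> bytes:
    """Hash `data` as a binary tree over fixed-size leaves. Because the
    leaf size is a constant, the root never changes with the piece size."""
    leaves = [hashlib.sha256(data[i:i + LEAF_SIZE]).digest()
              for i in range(0, len(data), LEAF_SIZE)] or [hashlib.sha256(b"").digest()]
    while len(leaves) > 1:
        if len(leaves) % 2:  # duplicate the last node on odd layers (one possible convention)
            leaves.append(leaves[-1])
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```
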

Suggestions for other worthwhile changes which can only be done incompatibly are welcome. Most extensions don't require incompatibility.

kyhwana commented 7 years ago

For an example, see the .torrent files inside https://keybase.pub/kyhwana/sha-2/, folders a/ and ab/. The "a.pdf" files inside a/ and ab/ have the same SHA-1 but different SHA-256 values (and are different images). Note that I set the piece size to 64KB, so the pieces have the same SHA-1s.

I imagine you could position the changed data inside pieces at the right offsets to send malicious pieces and have it verify as well.

dionyziz commented 7 years ago

This is a security problem for BitTorrent, because users may trust a .torrent file download, for example using an HTTPS link on a trusted domain and trusting the PKI system, but may not trust their peers, because their network could be compromised. If a SHA1 collision can be found between a malicious binary file (e.g. a virus) and a non-malicious binary file (e.g. a video game), this is a very reasonable attack: It is a reasonable threat model to assume the network is compromised, but HTTPS is not.

I recommend upgrading to SHA256.

bramcohen commented 7 years ago

To be clear: The current problem with sha1 is generation of collisions, not reversals, so if you can trust the source of the .torrent file but not peers you're still fine, and that's the usual threat model. That said it's a problem worth fixing and the only reason we haven't done so already is that it requires a backwards incompatible change.

dionyziz commented 7 years ago

Trusting the source of a .torrent file but not your peers is not fine. The attack could work like this: Initially, the attacker generates two different files, A and B. The attacker also creates a torrent file for A and makes sure that there exists a SHA1 collision with the contents of file B. Let's say these are C++ source code files. A is an innocent and useful program, while B is a malicious program. Then the attacker sends A to a trusted third party, for example the package maintainer of a certain linux distribution. The trusted third party manually verifies that the source code for A is benevolent and that the torrent contains the correct hash and vouches for its validity by hosting the verified .torrent file on their HTTPS website. The attacker subsequently replaces the file contents of A on the network with the contents of B during the download by modifying peer data in transit. The torrent client will not be able to detect this tampering.

This attack does not require a reversal, only a collision generation.

zookozcash commented 7 years ago

I'd like to put in a plug for BLAKE2. It has a deeper security margin than SHA-256 has and it is much faster. BLAKE2 is widely used. It is already included in your favorite crypto library.

BLAKE2 is a descendant of DJB's Salsa20/Chacha20 designs, and its immediate ancestor BLAKE1 was thoroughly reviewed during the SHA3 competition. (Those are slides that I presented at ACNS 2013.)

I'd be interested in a measurement of whether hash function performance difference between SHA-256 and BLAKE2 makes a difference in BitTorrent's usage. Even if hash function performance isn't a critical issue in this case, BLAKE2 is a fine choice.

bramcohen commented 7 years ago

A reasonable argument can be made for blake2. Its advantages are that it's faster in software and has a larger hash size. Its disadvantages are that it's less standard, meaning people will give me more grief about it if I use it, and that it's slower if there's hardware acceleration which is likely to become commonplace for sha256 and nothing else, and that the speed of this operation doesn't matter much anyway (yes I'm talking out of both sides of my mouth here) and that the longer key length takes up more space without a practical improvement to the already ludicrous security parameter. With using merkle roots maybe the longer length doesn't matter in practice. Or blake2s could be used for shorter hash lengths and better performance on 512 bit inputs. Or maybe blake2b could be used for the terminals and truncated and blake2s could be used climbing up the tree. (blake2b is slower on 512 bit inputs because its block size is 1024.) Truth be known I'd be fine with any of these options, I just want to do the thing which will cause the least bullshit complaining.

zookozcash commented 7 years ago

You can have BLAKE2b or BLAKE2s generate shorter outputs. This is similar to truncating the outputs, except that the intended output length is mixed in so that H(x, gimme-bytes=24) is not equal to H(x, gimme-bytes=32)[:24]. Either way.
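Python's `hashlib` exposes this parameterization directly, so the point is easy to check (a quick illustration, not part of any spec):

```python
import hashlib

x = b"some torrent piece data"
h24 = hashlib.blake2b(x, digest_size=24).digest()
h32 = hashlib.blake2b(x, digest_size=32).digest()

# BLAKE2 mixes the requested output length into its parameter block, so a
# native 24-byte digest differs from the first 24 bytes of a 32-byte digest.
assert h24 != h32[:24]
```
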

BLAKE2 is fairly widely accepted nowadays, I think. I'm not sure if people would give you grief about using it. It tends to be the default hash function in new crypto designs: things like the Password Hashing Competition, libsodium, etc.

jeremyBanks commented 7 years ago

Is this expected to be a one-time update to a new algorithm, or should some accommodation be made for future algorithm changes, as the8472 suggested?

Presumably, a new default algorithm such as SHA-2-256 or Blake2b will be specified. But the spec could also describe a standard way of indicating the hash algorithm in-band. Clients could choose to implement support for additional algorithms, and they could be adopted as needed without further changes to the protocol.

One option for indicating this on the wire or in magnet links could be the "multihash" digest format used by IPFS (draft spec seems to be here). This adds a ~2-byte prefix to each digest indicating the algorithm used and the digest length (kind-of similar to the BitTorrent protocol handshake prefix). There would be minimal additional complexity for clients that don't implement algorithm flexibility: they would just verify and discard or add the default two-byte hash prefix (e.g. 0x1220 for SHA-2-256, or 0x1214 if it's truncated to 160 bits).
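For illustration, a minimal sketch of the multihash framing described above (per the draft multihash spec, `0x12` identifies sha2-256 and `0x20` is the 32-byte digest length; the length is a varint in general, but one byte suffices here):

```python
import hashlib

def multihash_sha256(data: bytes) -> bytes:
    """Prefix a SHA-256 digest with its multihash header bytes."""
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest

mh = multihash_sha256(b"hello")
# A legacy-minded client can simply verify and strip the two-byte prefix.
assert mh[:2] == b"\x12\x20"
```
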

bramcohen commented 7 years ago

Hash algorithm extensibility is a delusion. If a change is made to use a hash algorithm which isn't currently defined then that change will, by definition, be incompatible, even if you play a semantic game of pretending you were prepared for it. In protocols in general extensibility mechanisms are a huge source of complexity and security problems, and all the new hash functions we're talking about here should be good until the end of time, so it's better to pick one and call it a day.

the8472 commented 7 years ago

@bramcohen

(1) Change the infodict to be a bencoded string of its own rather than an included object. This will get around the problems of bencoding being not really canonical. This is a trivial no-brainer if we're already doing something incompatible.

This seems unnecessary. Most of those problems are solved or worked around these days. And it would prevent a transition format that supports both legacy and new hashes.

(2) Switch to separate hashing for the different files in a multifile torrent. There's no technical downside to this except that it's a pain to implement. We should probably just bite the bullet and do it.

agreed

(3) Switch to sha256 and using a merkle tree. For various reasons sha256 is clearly the right hash to use. The main problem with bep30 is that it results in the same file hashing to different roots based on the piece size, which is completely unnecessary. The hashing should go down to fixed size terminals, no more than 16kib but maybe 8 or 4 for more flexibility, and it should not change with the piece size. The piece size should still be specified in the info dict but it's there so that peers are all speaking the same language when they discuss what they have, without affecting hashes.

great idea

This requires that piece sizes only be powers of two, so apologies to anyone fond of setting piece sizes to prime numbers.

😢

Hash algorithm extensibility is a delusion. If a change is made to use a hash algorithm which isn't currently defined then that change will, by definition, be incompatible

Strong disagreement. It obviously is not about forwards compatibility in the sense of supporting not-yet-existing hash functions. It is about making future migrations possible without changing any structural aspects of the protocol. Yes, clients would still need updating to include a new hash function, but all the code branches deciding which hash to use would already be there. To force implementers to spend time on that now, we could specify two hash functions. Since functions of vastly different design usually don't fall at the same time, that would also mean that if one falls we'd still have another in operation, and enough time to add a third one to the spec while deprecating the vulnerable one.

All it takes is a "blake2" or "keccak" string in the info dict and possibly magnets.

Unlike interactive protocols we wouldn't need any negotiations either. Torrents would simply contain a declaration of what they are. Clients can then either complain that they don't support it yet or that something is obsolete and should not be trusted.
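A sketch of what that client-side dispatch could look like (the `"hash function"` key and the whitelist are hypothetical illustrations, not spec; `hashlib.new` resolves standard algorithm names):

```python
import hashlib

# Hypothetical whitelist a future spec revision could extend or shrink.
SUPPORTED = {"sha256", "blake2b"}

def hasher_for(info: dict):
    """Pick the hash implementation a torrent declares, or fail loudly."""
    name = info.get("hash function", "sha256")
    if name not in SUPPORTED:
        raise ValueError(f"unsupported or obsolete hash function: {name!r}")
    return hashlib.new(name)

h = hasher_for({"hash function": "blake2b"})
h.update(b"piece data")
digest = h.digest()
```
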

and all the new hash functions we're talking about here should be good until the end of time, so it's better to pick one and call it a day.

Unlike asymmetric cryptography, those hash functions are not built on any mathematically hard problem. And historically, hash functions have had somewhat limited lifetimes (http://valerieaurora.org/hash.html). Or maybe you're saying the end of time will soon be upon us?

@zookozcash

You can have BLAKE2b or BLAKE2s generate shorter outputs.

We will probably need different truncations, since we can accommodate 256-bit or even 512-bit hashes in some places but have to make do with 160-bit lengths in other parts of the protocol (DHT, trackers, handshake).

zookozcash commented 7 years ago

https://tahoe-lafs.org/~zooko/preimage-attacks-color.html began life as an attempt to update and correct that Valerie Aurora piece. My piece is currently missing right-hand columns for 2015 and 2016, in which nothing changed. I haven't checked, but I assume the shattered.io attack on sha1 was just using reference [51] from https://tahoe-lafs.org/~zooko/preimage-attacks-color.html

the8472 commented 7 years ago

@bramcohen

but this may be a good impetus for doing all the backwards-incompatible things we'd like to do.

Something that may also be useful for shrinking the metadata size is to use directory trees instead of flat lists. In some multifile torrents the per-file information, not the piece hashes, actually dominates the size.

files: [{
  path: ["aaaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbb","f1"]
},{
  path: ["aaaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbb","f2"]
},{
  path: ["aaaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbb","f3"]
},{
  path: ["aaaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbb","f4"]
},{
  path: ["aaaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbb","f5"]
}]

takes up a lot of space in torrents with many small files. Instead we could do

files: [{
  path: ["aaaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbb"],
  files: [{
    path: ["f1"]
  },{
    path: ["f2"]
  },{
    path: ["f3"]
  },{
    path: ["f4"]
  },{
    path: ["f5"]
  }]
}]

gubatron commented 7 years ago

Hash algorithm extensibility is a delusion. If a change is made to use a hash algorithm which isn't currently defined then that change will, by definition, be incompatible

Strongly disagree here. Here's a great example of where this could be very useful.

Last night I was thinking about the problem Bitcoiners have with the initial sync of the blockchain and how it could greatly be improved by all nodes using libtorrent.

If a torrent's metadata specified which hashing algorithm was used for its pieces in the info section, Bitcoin clients could seed and announce (on the DHT) each block as a separate merkle torrent, where the merkle hash in the torrent metadata would correspond to the merkle root in the Bitcoin block header. Other cryptocurrencies could make use of this too, no matter what hashing algorithm they use, as long as it were specified.

dionyziz commented 7 years ago

@gubatron admittedly, the bitcoin case doesn't need authentication: the blockchain can be downloaded over untrusted torrent peers, even using an untrusted bittorrent file. Authenticity can subsequently follow by verifying the proof of work contained within.

gubatron commented 7 years ago

@dionyziz yes, in the case of bitcoin, if nodes were to sync using torrents, the only way to trust the torrents (no matter their source) is if their inner merkle-tree root matches the merkle root in the corresponding block header for that torrent.

I guess I might as well ask, as this is the part that is still missing in my mind.

Do torrents support variable piece sizes?

In the case of regular file sharing, can you have smaller pieces so that a piece doesn't span across the end of a file and the beginning of another one?

The idea would be to make each transaction in the block the equivalent of a piece that gets hashed as one of the merkle tree nodes, so that the merkle root in the block header is the same as the one defined inside the torrent. This would be one hell of a custom torrent, but I think it could fly if variable piece sizes are doable in the torrent metadata.

the8472 commented 7 years ago

In the case of regular file sharing, can you have smaller pieces so that a piece doesn't span across the end of a file and the beginning of another one?

This is more or less achieved by aligning files in the piece-space via paddings in the end. Clients not aware of paddings download zeroes. Those that are effectively download a shorter piece.

If you define one file per block then you can have that today.

But I think bitcoin is a distraction from bittorrent's core uses. It's nice if it can be mapped nicely onto BT, but the main goal is to distribute files.

arvidn commented 7 years ago

@bramcohen regarding distinguishing the leaf size of a hash tree from the piece size, I think Steven had a good point (in the other thread about a new merkle tree extension). Fundamentally, the only reason to have pieces larger than 16 KiB is to not spend a lot of bandwidth on BITFIELD and HAVE messages. However, it might make more sense to provide alternatives that are more compact/efficient than those. I spent some time on an extension for that (but there's some more work to be done there). There's also the binmap paper.

MuxZeroNet commented 7 years ago

No one has mentioned magnet links yet. Will existing magnet link URNs also be vulnerable? Do we have to migrate to a new URN like this?

>>> import base64
>>> import hashlib

>>> s = hashlib.sha512(b"Replace SHA-1 with stronger message digest functions!").digest()
>>> base64.b32encode(s).decode()
'EYVH4MBTKSZQOARCINCZZXUARJFP2SY4GXZITGS27VSZKUANE27XMNUZZ52O6UP6GUKS3LMVSGWYVBKAL3CLY5AXKN4C4HDYLBY4J2A='

>>> new_infohash = base64.b32encode(s).decode().lower().rstrip("=")
>>> new_urn = "bsha"
>>> new_maglink = "magnet:?xt=urn:%s:%s" % (new_urn, new_infohash)
>>> new_maglink
'magnet:?xt=urn:bsha:eyvh4mbtkszqoarcinczzxuarjfp2sy4gxzitgs27vszkuane27xmnuzz52o6up6guks3lmvsgwyvbkal3cly5axkn4c4hdylby4j2a'

arvidn commented 7 years ago

In principle the piece and hash-tree hash function could be different from the one used to calculate the info-hash, but it wouldn't make much sense to upgrade the SHA-1 of pieces without also upgrading the info-hash function. Magnet links normally use hex encoding for the info-hash, though (early versions used base32, but that's mostly phased out).

the8472 commented 7 years ago

@MuxZeroNet

Will existing magnet link URNs also be vulnerable?

Hashes of existing, honestly created data are never vulnerable to collision attacks, since a collision attack requires the attacker to craft both colliding inputs. This is not specific to bittorrent; it's a consequence of what collision attacks do. But it is difficult to tell whether a given hash predates the first collision attack on a hash function.

Do we have to migrate to a new URN like this?

Are you talking about the syntax of magnets (btih) or the hash function used? We might be able to mostly keep the syntax, or simply extend it to keep both legacy and forward compatibility. But we will need a new hash function to avoid certain attack scenarios that would make bittorrent less viable for some distribution use-cases.

@arvidn

Fundamentally, the only reason to have pieces be larger than 16kiB is to not spend a lot of bandwidth on the BITFIELD- and HAVE messages.

It also limits the worst-case memory use for keeping track of which pieces each peer has. If they are at 50% and select their pieces at random instead of in batches, then clever in-memory encodings won't save you.

Not to mention that such extensions would increase the complexity of a new protocol if they were strict dependencies. I think being able to specify piece sizes larger than 16 KiB would help keep things simple.

the8472 commented 7 years ago

As noted in the draft, we could make the BEP 6 (fast extension) state machine part of the new protocol without importing the fast extension wholesale.

bramcohen commented 7 years ago

I'm going to post replies here for now rather than on the draft but will start commenting over there later.

With respect to hash parameterization: if somebody wants to move off the new standard again at some point in the future, they're free to use a new key in the dictionary to indicate so. There are already plenty of good hooks for that, and adding more now won't help anything.

For the hash algorithm specifically, I'm open to discussion and feedback, but am declaring that I'm only taking two candidates seriously: sha256 and blake2b with 512 bit hashes. The advantage of sha256 is that it's probably good until the end of time and faster with hardware acceleration and the standard thing the actual cryptographers advocate. The advantage of blake2b is that it's clearly good until the end of time and faster in software than sha256. The disadvantage of blake2b is that the hashes are 512 bits, which isn't a big deal for the size of merkle roots but results in extra bandwidth transferring them over the wire.

For extension purposes, it's an interesting question of how to have torrents which support both the old and the new formats. To minimize mischief it makes sense to put the new info inside the info dict as a bencoded string under a new key, so that peers using the old data format are at least verifying that the new data is corresponding.

New and old compatibility will of course require padding files in old-style data.

New and old compatibility is sounding increasingly like a huge pain, but if people think it's worth it we should go ahead.

It may be good to have the leaves go down to 8 or 4 KiB instead of 16, but I'd like to see numbers on how much of a performance hit that takes. There's also a question in the peer protocol of whether requests can be made for amounts of data greater or smaller than 16KiB and how hash info is requested. This is getting a bit into the weeds and I'll have to read the draft as a starting point.

I'm leery of forcing piece sizes to be no greater than 16 KiB out of concerns about bandwidth, memory, and CPU usage. We've discussed bitfield compression before, but that does nothing when someone has half the file, and trying to get too cute would get in the way of rarest first. I'll have to go read through the existing proposals later.

New style magnet links can be recognized by having a longer key. Making the new info be inside the old info keeps the old ones working.

It clearly makes sense to require a fast extension style state machine. The legacy one is simply inferior. The other fast extensions aren't so important, but I'll have to refresh my memory of what they are before having an opinion.

Having multiple files in the same directory be listed next to each other is clearly a good idea. It feels cleaner to make directory nesting follow bencoding nesting, so in each directory there's a list of files and a list of subdirs, and each subdir has a name and possibly more subdirs.

That's it for now. Hopefully this results in mostly agreement and productive discussion, although there will probably be at least one flame war.

the8472 commented 7 years ago

New and old compatibility is sounding increasingly like a huge pain, but if people think it's worth it we should go ahead.

I think it is essential to get people to use it. Almost nobody is using merkle torrents right now because there's no clear upgrade path.

It clearly makes sense to require a fast extension style state machine.

Ok, I'll add that.

To minimize mischief it makes sense to put the new info inside the info dict as a bencoded string under a new key, so that peers using the old data format are at least verifying that the new data is corresponding.

I have different ideas here. But the flamewar can wait until I actually have written down how to create the hybrid format. I'm trying to design the new format first and hybrid then as an extension from the perspective of new and old format.

With respect to hash parameterization: If somebody wants to move off the new standard again at some point in the future they're free to use a new key in the dictionary to indicate so. There are already plenty good hooks for that, and adding more now won't help anything.

Normally future extensions, i.e. new keys, must be backwards compatible. If we change hashes and want to avoid going through the whole hybrid-torrent exercise again, we need a key today that can eventually say "I'm a future version" so that clients can say "I don't understand this, an update is needed" instead of failing in unspecified ways. That's basically hash agility starting with one available option and a note that others might be added in the future. More than one option is not necessary, but it may help to weed out any problems that would otherwise arise in the future when we try to switch again.

bramcohen commented 7 years ago

It makes some sense to have a hook which unambiguously indicates 'this uses a new incompatible feature' as opposed to 'this thing is in error'. There's no need for that to be specific to hash functions though.

zookozcash commented 7 years ago

Some cryptographers do recommend BLAKE2. If you talk to a cryptographer who studies number theory, they'll recommend SHA-256. If you talk to a cryptographer who studies ciphers and hashes, they'll recommend SHA-3 or BLAKE2.

As for the output size, it is totally okay to generate 256-bit outputs from BLAKE2b. It's documented in the papers, standardized in the RFC, and supported in the API.

bramcohen commented 7 years ago

The feedback I've heard from cryptographers has mostly been advocating sha256 over sha3. It's certainly the only one which is getting hardware acceleration. I haven't heard many comparisons to blake2.

I'd like to hear feedback on hash sizes though. If there's no meaningful issue with 512 bits then the conservative approach is to simply go with blake2b with 512 bit hashes.

zookozcash commented 7 years ago

I'm surprised to hear of cryptographers advocating for SHA-256 over SHA-3. That's interesting.

About hash size, BLAKE2b can generate any size output. What size do you want!? (I recommend 256-bit.)

bramcohen commented 7 years ago

My data on sha256 vs sha3 is from the years following the standardization of sha3, so maybe people feel differently now.

The output size should either be 256 bits or 512 bits. Either is fine.

the8472 commented 7 years ago

I guess in the metadata format we can do without hash flexibility if we add a more generic "requires new version" indicator in the info dict. But in magnets we'll still need something that indicates the algorithm if we eventually want to upgrade; otherwise there would be an automatic downgrade attack if clients just had to try several different algorithms. Either the multihash format mentioned by @jeremyBanks or an additional string in the magnet would work.
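As a sketch of what a self-describing magnet digest could look like — the `0x12`/`0x20` prefix follows the multihash registry entry for SHA-256, while the `urn:btmh:` prefix and the helper name are illustrative, not a settled format:

```python
import hashlib

def multihash_sha256(data: bytes) -> bytes:
    # Multihash layout: <hash-code><digest-length><digest>.
    # 0x12 is the registered multihash code for SHA-256; 0x20 (32) its length.
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest

# A magnet link built this way carries the algorithm alongside the digest,
# so a client can reject unknown or deprecated hash functions outright
# instead of trying several algorithms (which would invite a downgrade attack).
info = b"...bencoded info dictionary..."
mh = multihash_sha256(info)
magnet = "magnet:?xt=urn:btmh:" + mh.hex()
```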

ssiloti commented 7 years ago

A 256 bit digest is surely sufficient. SHA-256 seems like the obvious choice. It has seen ample cryptanalysis with no breaks against the full round function and it is the most likely to have hardware support.

Tilka commented 7 years ago

some hash benchmarks by Daniel J. Bernstein: https://bench.cr.yp.to/results-sha3.html

bramcohen commented 7 years ago

The relevant benchmark is relative speeds of doing hash trees with 16KiB vs. 8KiB vs. 4KiB leaves.
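A rough way to get a feel for that comparison — illustrative Python using SHA-256 as a stand-in; a real benchmark would use the actual candidate hashes and tuned implementations:

```python
import hashlib
import timeit

data = bytes(1 << 22)  # 4 MiB of zeroes standing in for a piece

def hash_leaves(leaf_size: int) -> list:
    # Hash the data as a sequence of fixed-size leaves, as a hash tree would.
    return [hashlib.sha256(data[i:i + leaf_size]).digest()
            for i in range(0, len(data), leaf_size)]

for leaf in (16384, 8192, 4096):
    t = timeit.timeit(lambda: hash_leaves(leaf), number=3)
    print(f"{leaf // 1024:2d} KiB leaves: {len(data) // leaf} hashes, {t:.3f}s")
```

Halving the leaf size doubles the number of leaf hashes (and tree-interior hashes), which is where the per-leaf overhead of each candidate function shows up.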

the8472 commented 7 years ago

What would be the use of smaller leaf sizes?

bramcohen commented 7 years ago

Smaller leaf sizes would allow smaller requests, so low transfer rates could work better. 16KiB is more than ten times the MTU, so it's a little on the big side.

the8472 commented 7 years ago

Request messages are offset-length based; you can already make smaller requests today, you just have to delay hash-checking.

the8472 commented 7 years ago

Can I at least have a list of acceptable candidates for the hash function? I'll just pick one for the sake of having something in the draft. The state of discussion is a bit ambiguous right now.

zookozcash commented 7 years ago

BLAKE2b


bramcohen commented 7 years ago

The hashes likely to have anybody behind them are sha256, sha3, blake2b, and blake2s (and maybe a mix of blake2b and blake2s). I continue to maintain that making it parameterizable is a bad idea, and after getting feedback from some more crypto people it sounds like sha256 is viewed as fine by everybody and is depended on by everything, so that should be viewed as the default.

bramcohen commented 7 years ago

The advantage of going smaller than 16KiB is that if you split smaller requests across peers then you can hash check them individually. That said, no one seems interested in going smaller on requests, so I'll just drop that thought. But since nobody ever deviates from 16KiB, why don't we make that the forced amount? Then requests would just specify an index without a request size, meaning less complexity and fewer concerns about which request sizes are and are not acceptable. That last point matters because peers can force extra data to be sent to them after a choke: by requesting large piece sizes, an already-started request has to be finished even after the choke happens.

the8472 commented 7 years ago

The advantage of going smaller than 16KiB is that if you split smaller requests across peers then you can hash check them individually.

Reducing leaf hash sizes doesn't help there unless you also reduce the piece size. Clients do not have access to the leaf hashes until after they have verified a whole piece and can derive them from the data itself.
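As a sketch of deriving the leaves from the data itself — assuming SHA-256, 16 KiB leaves, and zero-hash padding to a power of two, none of which is settled in this thread:

```python
import hashlib

LEAF_SIZE = 16384  # 16 KiB, matching the block size discussed above

def merkle_root(piece: bytes) -> bytes:
    # Split the verified piece into 16 KiB leaves and hash each one.
    level = [hashlib.sha256(piece[i:i + LEAF_SIZE]).digest()
             for i in range(0, len(piece), LEAF_SIZE)]
    # Pad to a power of two with zero hashes (one possible convention).
    while len(level) & (len(level) - 1):
        level.append(bytes(32))
    # Combine pairwise until a single root remains.
    while len(level) > 1:
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

A client that has verified a whole piece can recompute any leaf or interior hash this way without ever having been sent them.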

Why don't we make that the forced amount?

At this point? Ontological inertia. Source code exists; let's reuse it unless there is a good justification for the effort of changing it. We're already doing a fairly big redesign of the metadata; there's no need to touch the wire protocol.

At least in principle, low-bandwidth clients could make <16KiB requests to keep the congestion-induced latency for control messages low (perhaps coupled to the µTP MTU). I don't think any implementation actually does this, but at least it's possible.

arvidn commented 7 years ago

16 KiB has been (for a long time) the upper bound on request sizes (which I think makes sense).

It still makes sense to request smaller blocks for the tail piece, which is done today. With more tail pieces, which the proposed hash tree would create, it becomes even more useful to be able to request smaller blocks.