Open casey opened 4 years ago
@nijynot Just cleaned this up, expanded it, and formatted it!
One potentially useful resource here is the SSB protocol: https://ssbc.github.io/scuttlebutt-protocol-guide/
It's a simple, minimally configurable authenticated/encrypted stream protocol that's easy to implement and doesn't have much hair. Already in use in the SSB decentralized ecosystem.
This is great stuff, thanks for the link!
The peer discovery mechanisms currently used in BitTorrent, including the DHT, trackers, and peer exchange, leave much to be desired.
This issue is a dump of notes on their problems and on ways peer discovery might be improved.
Trackers currently must have a stable domain name or IP address. Ideally, trackers should be able to hand out a public key, and then be reachable by looking up that public key in a DHT. This would remove the need for a stable domain name or IP address. Additionally, it would allow all connections to trackers to be authenticated and end-to-end encrypted without needing an SSL cert from the centralized CA infrastructure.
Trackers should be able to operate over Tor, or a similar mixnet.
The BitTorrent Mainline DHT is easily surveilled. If my understanding of how it works is correct, one could create a large number of DHT nodes with different node IDs, and use them to observe DHT lookups, which would allow collecting all metainfo hashes. Then, a client could be used to download those metainfo files and torrent contents, allowing an attacker to learn about all torrents, and all peers sharing those torrents.
All traffic should be encrypted, even if it isn't authenticated, in order to raise the cost of mass surveillance by requiring active MITM proxying.
The fact that trackers, peer exchange, and the DHT are separate seems due more to historical accident than design considerations. Ideally a single mechanism could handle all of these different patterns.
Although, generally speaking, overly general systems are bad, a directory that allows looking up items by public key or hash seems like a good candidate for a general-purpose system that isn't tied to any particular application. BitTorrent demonstrates this. The Mainline DHT is not part of BitTorrent's data-exchange protocol, and the Mainline DHT can be used for non-BitTorrent use-cases.
Sybil attacks on the Mainline DHT should be made more costly. This might be done via low-value Lightning Network (LN) payments, or per-connection PoW.
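As a rough illustration of the per-connection PoW idea, here is a minimal hashcash-style sketch. The node hands each new connection a fresh random challenge, and the client must find a counter whose SHA-256 hash has a given number of leading zero bits. The function names, the choice of SHA-256, and the difficulty value are all assumptions made for this example, not part of any existing protocol.

```python
import hashlib
import os


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a counter whose hash meets the target."""
    counter = 0
    while True:
        digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return counter
        counter += 1


def verify(challenge: bytes, counter: int, difficulty: int) -> bool:
    """Node side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty


challenge = os.urandom(16)  # fresh per connection, chosen by the node
DIFFICULTY = 12             # hypothetical; roughly 4096 hashes on average
proof = solve(challenge, DIFFICULTY)
print(verify(challenge, proof, DIFFICULTY))
```

The asymmetry is the point: solving costs about 2^difficulty hashes, while verifying costs one, so an attacker running many Sybil nodes pays the cost on every connection they open.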
The keyspace should be at least 256 bits, as opposed to BitTorrent's 160. I'm not sure if there's much value in keys greater than this.
The identifiers that nodes query for should be fully opaque, and receiving a query should not tell you anything about the topic of the query. This is in contrast to the Mainline DHT, where receiving a query gives you the metainfo hash of the torrent in question.
An alternate scheme might involve, for a BitTorrent-like application, hashing the content, deriving a private key from the hash, deriving a public key from the private key, and then hashing the public key to derive the identifier used to query the directory.
The content hash would be shared out of band with the agents you want to be able to retrieve the contents. Agents could then derive the ID, query the directory for peers, and connect to those peers using a connection that is both authenticated and encrypted with the derived pubkey.
Queries to the directory would reveal no information about the content, and would not provide attackers with the ability to retrieve the content themselves.
Connections that are authed and encrypted with the derived pubkey could only be man-in-the-middled by attackers who had the privkey or content hash. The additional protection this provides would be determined by how widely the content hash were shared, but would certainly be far better than the casual mass surveillance that is possible today; attackers would have to learn about content hashes out-of-band, and store all content hashes and derived privkeys that they wished to surveil.
The above
content -> hash -> privkey -> pubkey -> id
scheme could be simplified. Public keys could be replaced by a symmetric key, either hash(content) itself or hash(hash(content)). Additionally, the pubkey could be used directly as the id. I would have to think a lot about it, but these simplifications make me nervous. These simplifications don't provide any tangible performance benefit.
Additionally, using the hashed pubkey provides some defense against the future development of large quantum computers.
Also, I suspect that other desirable features might be enabled by the use of symmetric crypto.
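The content -> hash -> privkey -> pubkey -> id derivation can be sketched with the standard library alone. The notes do not name any primitives, so everything concrete here is an assumption: SHA-256 for both hashes, HMAC with a hypothetical "privkey" label for the hash -> privkey step, and X25519 (written out from the RFC 7748 ladder) for privkey -> pubkey. The ladder below is for illustration only and is not constant time.

```python
import hashlib
import hmac

P = 2**255 - 19   # Curve25519 field prime
A24 = 121665      # ladder constant from RFC 7748
BASE_POINT = (9).to_bytes(32, "little")


def x25519(scalar: bytes, u_coord: bytes) -> bytes:
    """RFC 7748 Montgomery ladder. Illustration only: not constant time."""
    k = bytearray(scalar)
    k[0] &= 248                   # clamp the scalar as RFC 7748 requires
    k[31] = (k[31] & 127) | 64
    k = int.from_bytes(k, "little")
    x1 = int.from_bytes(u_coord, "little") & ((1 << 255) - 1)
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in reversed(range(255)):
        bit = (k >> t) & 1
        if swap ^ bit:
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a, b = (x2 + z2) % P, (x2 - z2) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, z2 = x3, z3
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")


def derive(content: bytes):
    """content -> hash -> privkey -> pubkey -> id, as described in the notes."""
    content_hash = hashlib.sha256(content).digest()
    # Hypothetical domain-separation label; any fixed label would do.
    privkey = hmac.new(content_hash, b"privkey", hashlib.sha256).digest()
    pubkey = x25519(privkey, BASE_POINT)
    directory_id = hashlib.sha256(pubkey).digest()
    return content_hash, privkey, pubkey, directory_id


content_hash, privkey, pubkey, directory_id = derive(b"example content")
print("query the directory for:", directory_id.hex())
```

Only directory_id is ever sent to the directory. Anyone holding content_hash can recompute the whole chain, while a node that sees only directory_id would have to invert SHA-256 to learn even the pubkey.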
Private trackers could be implemented simply by peer-lookup by public key, combined with stats reporting directly to peers. The private flag in torrent metainfo is enforced by policy only, so there is no loss in security, privacy, or confidentiality. The private flag could be an indication from a peer returning peers that the user agent should avoid combining swarms. Public trackers could simply omit this indication.
Using a pubkey itself, as opposed to its hash, as the query topic would allow entries in the directory to be signed, so that attackers who do not possess the privkey could not insert entries into the directory. However, this would allow nodes to learn the pubkey. To avoid this, an additional derivation step could be used:
content -> hash -> privkey -> pubkey -> privkey -> pubkey
This would allow directory entries to be signed and verified by arbitrary nodes, while not revealing a public key that is used elsewhere in the protocol.
Should the directory have a mode where items can be looked up by hash? This is how the Mainline DHT operates, but if users want to perform lookups for a hash, they could derive a keypair, look up the pubkey (or derived identifier), receive a message containing the contents, and then verify that it matches the hash. I'm not sure what supporting hash-based lookups actually gets you over this, and key-based lookups seem more general.
Look-up types:
Query a public key, and receive a record which is signed by that public key. (DNS)
Query a hash, and receive a record which is the preimage of that hash.
Query a public key, and receive the IP and port of a peer that can auth and encrypt a connection with that public key.
Not all of these lookup types need dedicated support. It would be ideal to have a single lookup type which can be used by applications to implement the above.
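One way to picture a single lookup type carrying the hash-based case: the directory only maps opaque keys to records, and the client does the verification. The sketch below uses a toy in-memory directory. ToyDirectory, lookup_by_hash, and the choice of SHA-256 are all hypothetical names and assumptions for illustration.

```python
import hashlib
from typing import Dict, Optional


class ToyDirectory:
    """Stand-in for an untrusted directory: it stores whatever it is given."""

    def __init__(self) -> None:
        self._records: Dict[bytes, bytes] = {}

    def put(self, key: bytes, record: bytes) -> None:
        self._records[key] = record

    def get(self, key: bytes) -> Optional[bytes]:
        return self._records.get(key)


def lookup_by_hash(directory: ToyDirectory, digest: bytes) -> Optional[bytes]:
    """Generic get, then a client-side check that the record is the preimage."""
    record = directory.get(digest)
    if record is None or hashlib.sha256(record).digest() != digest:
        return None  # missing or forged record
    return record


directory = ToyDirectory()
record = b"some record"
directory.put(hashlib.sha256(record).digest(), record)
print(lookup_by_hash(directory, hashlib.sha256(record).digest()))
```

The directory needs no special hash mode here. The same get call could serve the signed-record case, with the client checking a signature instead of a preimage.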
Convergent encryption is a cool idea. Usually, file contents are encrypted, but the communication channel itself could be authed and encrypted.
Trackers should be able to generate and share a secret for two peers, so they can auth and encrypt.
Would it be valuable to rotate identifiers periodically? You could derive the identity of torrents using a timestamp, or a recent Bitcoin block hash. Could this make analysis more difficult?
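A minimal sketch of the timestamp variant, assuming the rotating identifier is an HMAC of the epoch number keyed by a secret that only swarm members know, such as the content hash. The epoch length, the function name, and the use of HMAC-SHA-256 are assumptions for this example. A recent Bitcoin block hash could take the place of the epoch number.

```python
import hashlib
import hmac
import time

EPOCH_SECONDS = 3600  # hypothetical rotation period


def rotating_id(secret: bytes, now: float, epoch_seconds: int = EPOCH_SECONDS) -> bytes:
    """Derive the identifier that is valid during the epoch containing `now`."""
    epoch = int(now // epoch_seconds)
    return hmac.new(secret, epoch.to_bytes(8, "big"), hashlib.sha256).digest()


secret = hashlib.sha256(b"example content").digest()  # stands in for the content hash
print(rotating_id(secret, time.time()).hex())
```

Without the secret, an observer cannot link the identifier from one epoch to the identifier from the next. Peers with slightly different clocks would probably need to query the adjacent epochs as well.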
Queries for an ID end in one of three ways:
I don't think any further query types are necessary. If the peer is interested in further data, they can request it via the authed, encrypted connection to the peer they found. The connection should support an arbitrary duplex data stream, so peers can either proceed to communicate with some other, arbitrary protocol over that stream, or they can communicate some information that allows them to initiate another connection.
Service discovery can be initiated by bootstrap nodes, mDNS, or brute-force scan of the IPv4 address space.
Node IDs should always be pubkey hashes, so you can authenticate and encrypt outgoing queries.
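A small sketch of what this buys the querying side: after a handshake, the key the remote node presented can be checked against the ID that was dialed. SHA-256 and the function names are assumptions here, and the handshake itself is out of scope.

```python
import hashlib
import hmac


def node_id_for(pubkey: bytes) -> bytes:
    """Node ID is the hash of the node's public key."""
    return hashlib.sha256(pubkey).digest()


def presented_key_matches(dialed_id: bytes, presented_pubkey: bytes) -> bool:
    """True if the key shown in the handshake hashes to the ID we dialed."""
    return hmac.compare_digest(dialed_id, node_id_for(presented_pubkey))


pubkey = bytes(range(32))  # placeholder bytes, not a real key
print(presented_key_matches(node_id_for(pubkey), pubkey))
```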