bitmagnet-io / bitmagnet

A self-hosted BitTorrent indexer, DHT crawler, content classifier and torrent search engine with web UI, GraphQL API and Servarr stack integration.
https://bitmagnet.io/
MIT License

Non-compliant DHT implementation #11

Closed: the8472 closed this issue 6 months ago

the8472 commented 8 months ago

https://github.com/bitmagnet-io/bitmagnet/blob/84adcf9726df403ef4d5116d4c5d9b27b4950145/internal/dht/server/server.go#L198-L206

This looks like it's not servicing any incoming requests. This is obviously not a spec-compliant DHT implementation, which means it's harmful to a P2P network because it uses resources without providing services in return.

If you expect many people to actually use this software, this will lead to abuse of the network, which will mean compliant implementations will have to implement defenses to detect such behavior and refuse to cooperate with such peers.

I recommend writing a proper BEP5 implementation as a base (or using a library) and then adding BEP51 support on top.

The same applies to downloading torrent metadata. If it's stored on disk, then such a server should also be able to serve it to other clients that ask.

mgdigital commented 8 months ago

Thanks @the8472, I appreciate the feedback. This issue can remain open until a solution is reached whereby users of this app are not deemed to be abusing the network without giving anything back.

A few issues, considerations and possibilities:

In order to service a torrent metadata request we must be able to reassemble the full meta info. Bitmagnet allows discarding the majority of this - such as the pieces and, in some cases, the files info - and is configured to do so by default. It can also be configured to save all the info that would be required, but it is not currently able to service such requests. The aim is to save the information that would be useful to a person browsing the indexed content - with a magnet link, the extra data is extraneous for this purpose and would consume a prohibitive amount of disk space without some sort of filtering or curation.

As a counterargument to the idea that users of this app are abusing the network: by indexing hashes, the intention is that users will (hopefully) ultimately download, seed and share content that would otherwise have gone undiscovered. The seeding-in-place and federation features that you can find on the roadmap would go further in allowing users to give back. I appreciate that neither of these points addresses the concern you've raised at the DHT protocol compliance level - though I'd argue it's possible to "give back" in other ways too.

A viable protocol-level solution could look something like this: we'd have a TTL on the full-fat metadata, so for example we'd be able to service a request for anything that was indexed in the last 24 hours, after which the pieces info would be discarded. The BEP5 and BEP51 implementations should be doable too.
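
As a very rough sketch of what that could look like (hypothetical table and column names, not the current schema):

```go
package sketch

import (
	"context"
	"database/sql"
	"time"
)

// pruneFullMetadata nulls out the heavy pieces blob for torrents indexed
// longer ago than ttl (e.g. 24 hours). Table and column names here are
// hypothetical, purely to illustrate the TTL idea.
func pruneFullMetadata(ctx context.Context, db *sql.DB, ttl time.Duration) error {
	cutoff := time.Now().Add(-ttl)
	_, err := db.ExecContext(ctx,
		`UPDATE torrent_infos SET pieces = NULL
		 WHERE created_at < $1 AND pieces IS NOT NULL`,
		cutoff,
	)
	return err
}
```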

Finally, I originally posted this in one small Lemmy community, stating that it was an alpha preview and looking for input from beta testers and early adopters. It has since made its own way onto Hacker News. The DHT implementation was largely borrowed from Magnetico (https://github.com/boramalper/magnetico), a widely used app which is likewise unable to service these requests.

I'm open to further input, constructive suggestions or even better pull requests addressing this issue, especially from those familiar with these protocols as you clearly are.

the8472 commented 8 months ago

In order to service a torrent metadata request we must be able to reassemble the full meta info. Bitmagnet allows discarding the majority of this - such as the pieces and, in some cases, the files info - and is configured to do so by default. It can also be configured to save all the info that would be required, but it is not currently able to service such requests.

Not ideal, but understandable. I was briefly pondering whether a sparse download could work, but that would prevent verification of the infohash.
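
To spell out why: the v1 infohash is the SHA-1 of the complete bencoded info dictionary, so any bytes you skip make the check impossible. In Go terms (untested sketch):

```go
package sketch

import (
	"bytes"
	"crypto/sha1"
)

// verifyInfoDict checks a fully downloaded, bencoded info dictionary
// against the expected v1 infohash. Every byte of the info dict is needed,
// which is why a sparse/partial metadata download can't be verified.
func verifyInfoDict(bencodedInfo []byte, expectedInfohash [20]byte) bool {
	sum := sha1.Sum(bencodedInfo)
	return bytes.Equal(sum[:], expectedInfohash[:])
}
```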

I have not spent as much time on thinking about fairness in metadata transfers compared to DHT operations, so I don't have much good advice here. Just try to think about efficiency and keeping the burden on others small.

A viable protocol-level solution could look something like this: we'd have a TTL on the full-fat metadata, so for example we'd be able to service a request for anything that was indexed in the last 24 hours, after which the pieces info would be discarded.

Well, random nodes wouldn't even know to request metadata from your peer unless you just happen to get propagated via PEX at the time, but that would mean keeping connections open, which has its own cost. Perhaps adding the indexer's IP to magnet links could make sense for the initial metadata download.
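
If I remember right, the magnet URI section of BEP 9 defines an x.pe parameter for exactly this kind of direct peer hint. Roughly (illustrative sketch, placeholder naming):

```go
package sketch

import "fmt"

// magnetWithPeerHint appends the indexer's own reachable address to a
// magnet link via the x.pe peer-address parameter (magnet URI format,
// BEP 9), giving clients a known source for the initial metadata download.
func magnetWithPeerHint(infohashHex, indexerHostPort string) string {
	return fmt.Sprintf("magnet:?xt=urn:btih:%s&x.pe=%s", infohashHex, indexerHostPort)
}
```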

You could also consider keeping the metadata in the long term (for users that have enough space) because having the metadata enables things (such as local file lookup) that the magnet link doesn't. Downloading a .torrent file synthesized from the info dictionary also allows a torrent client to start downloading sooner compared to using a magnet link.

You could also distinguish between v1 and v2 torrents. A (non-hybrid) v2 torrent's info dictionary should be much smaller than a v1 torrent's, so they'd be easier to keep. Granted, non-hybrid torrents may still be fairly rare; I haven't done a survey on that. Edit: Having thought about it, this might not work either, since v2 torrents without the piece layers outside the info dictionary should be considered invalid per spec, so keeping the partial torrent around would only be of limited use.

As a counterargument to the idea that users of this app are abusing the network: by indexing hashes, the intention is that users will (hopefully) ultimately download, seed and share content that would otherwise have gone undiscovered.

There is a mismatch between the amount of metadata you need to index to get a useful-to-random-user index of the network vs. the number of torrents they will then go on to download and upload in an equitable manner. Due to the random nature of hashes it's basically "everything" vs. "a few".

The DHT implementation was largely borrowed from Magnetico (https://github.com/boramalper/magnetico) a popular and widely used app which is likewise unable to service these requests.

I raised similar issues back when it was originally announced. At that time it was even more exploitative: it returned responses intended to manipulate other nodes into sending it more traffic. It was one among several harmful implementations that led me to work on making DHT nodes more robust against such attacks.

I'm open to further input, constructive suggestions or even better pull requests addressing this issue, especially from those familiar with these protocols as you clearly are.

If you have questions I can answer them (or you can come to ##bittorrent on libera.chat), but Go isn't my preferred language so I'm most likely not going to offer code contributions.

mgdigital commented 7 months ago

Hi @the8472 , can I invite any feedback on the BEP5 and BEP51 implementations here: https://github.com/bitmagnet-io/bitmagnet/pull/21

A few comments with reference to https://www.bittorrent.org/beps/bep_0005.html:

Each node has a globally unique identifier known as the "node ID." Node IDs are chosen at random from the same 160-bit space as BitTorrent infohashes [2]. A "distance metric" is used to compare two node IDs or a node ID and an infohash for "closeness." Nodes must maintain a routing table containing the contact information for a small number of other nodes. The routing table becomes more detailed as IDs get closer to the node's own ID. Nodes know about many other nodes in the DHT that have IDs that are "close" to their own but have only a handful of contacts with IDs that are very far away from their own.

In Kademlia, the distance metric is XOR and the result is interpreted as an unsigned integer. distance(A,B) = |A xor B| Smaller values are closer.
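
In code terms the quoted metric is tiny (a sketch assuming 20-byte IDs compared big-endian):

```go
package sketch

import "bytes"

type nodeID [20]byte

// distance implements the quoted metric: XOR the two 160-bit IDs and treat
// the result as an unsigned integer (compared bytewise, big-endian, here).
func distance(a, b nodeID) (d nodeID) {
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// closerToTarget reports whether a is closer to target than b is.
func closerToTarget(target, a, b nodeID) bool {
	da, db := distance(target, a), distance(target, b)
	return bytes.Compare(da[:], db[:]) < 0
}
```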

When a node wants to find peers for a torrent, it uses the distance metric to compare the infohash of the torrent with the IDs of the nodes in its own routing table. It then contacts the nodes it knows about with IDs closest to the infohash and asks them for the contact information of peers currently downloading the torrent. If a contacted node knows about peers for the torrent, the peer contact information is returned with the response. Otherwise, the contacted node must respond with the contact information of the nodes in its routing table that are closest to the infohash of the torrent. The original node iteratively queries nodes that are closer to the target infohash until it cannot find any closer nodes. After the search is exhausted, the client then inserts the peer contact information for itself onto the responding nodes with IDs closest to the infohash of the torrent.

I might have missed something, but this concept of "closeness" seems completely arbitrary to me, and no more preferable than simply selecting known good peers at random, which is what I'm doing in this implementation. If someone can convince me that doing this is beneficial then we can look at it as a follow-up, but I figure that in either case quickly getting out a basic implementation which satisfies the protocol contract is preferable in the first instance to following this to the letter.

The routing table covers the entire node ID space from 0 to 2^160. The routing table is subdivided into "buckets" that each cover a portion of the space. An empty table has one bucket with an ID space range of min=0, max=2^160. When a node with ID "N" is inserted into the table, it is placed within the bucket that has min <= N < max. An empty table has only one bucket so any node must fit within it. Each bucket can only hold K nodes, currently eight, before becoming "full." When a bucket is full of known good nodes, no more nodes may be added unless our own node ID falls within the range of the bucket. In that case, the bucket is replaced by two new buckets each with half the range of the old bucket and the nodes from the old bucket are distributed among the two new ones. For a new table with only one bucket, the full bucket is always split into two new buckets covering the ranges 0..2^159 and 2^159..2^160.

I haven't done this; I'm simply maintaining a set of expiring tables of node IDs, addresses, hashes and good/bad peers. Same as for the previous point: it could be done as a follow-up, although I'm slightly skeptical about the benefit/complexity ratio here.

A few bits still to cover off:

the8472 commented 7 months ago

I might have missed something but this concept of "closeness" seems completely arbitrary to me,

Have you read the kademlia paper? Closeness is important to get O(log(N)) performance.

Key-value pairs are only stored in specific sections of the keyspace (rather: on the nodes nearest to the destination), so any lookup will want to make progress towards that target. Node A randomly sampling node B's randomly populated routing table would only have a small (albeit non-zero) chance of getting back one (or preferably multiple) contacts that are yet closer than what A already knows. On the other hand, if nodes structure their routing tables in a way that prefers locality (knowing more about their local keyspace) and also return a closeness-to-query-destination-sorted sample in RPC responses, then iterative lookups converge much faster towards the target region.
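
Schematically, an iterative lookup looks something like this (a rough Go sketch with the networking stubbed out, not production logic):

```go
package sketch

import (
	"bytes"
	"sort"
)

type contact struct {
	id   [20]byte
	addr string
}

func xorDist(a, b [20]byte) (d [20]byte) {
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}

// findNode stands in for a find_node/get_peers RPC that returns the closest
// contacts the queried node knows for target; stubbed out for the sketch.
var findNode = func(c contact, target [20]byte) []contact { return nil }

// lookup keeps a closest-first candidate list seeded from the routing
// table, queries the best not-yet-queried contact, and merges in whatever
// closer nodes it returns, so each round converges on the target region.
// Real implementations add alpha parallel queries, timeouts and dedup.
func lookup(target [20]byte, seed []contact, k int) []contact {
	candidates := append([]contact(nil), seed...)
	queried := map[string]bool{}
	for {
		sort.Slice(candidates, func(i, j int) bool {
			di, dj := xorDist(candidates[i].id, target), xorDist(candidates[j].id, target)
			return bytes.Compare(di[:], dj[:]) < 0
		})
		progressed := false
		for i := 0; i < len(candidates) && i < k; i++ {
			if queried[candidates[i].addr] {
				continue
			}
			queried[candidates[i].addr] = true
			candidates = append(candidates, findNode(candidates[i], target)...)
			progressed = true
			break
		}
		if !progressed { // the k closest known contacts have all been queried
			if len(candidates) > k {
				candidates = candidates[:k]
			}
			return candidates
		}
	}
}
```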

Note that the list-of-contacts maintained for individual lookups and the routing table are separate things. For BEP51 one would want to use a different approach compared to other iterative lookups, I've written up one possible implementation approach here: https://github.com/the8472/mldht/blob/master/docs/keyspace-traversal.md

mgdigital commented 7 months ago

I get it now thanks.

FWIW, and to the point you made on HN, I think more of the little bedroom projects such as this one would "bother" to implement BEP5 properly if the link I've posted above weren't the first Google result and there were some easier-to-find resources - it's rather impenetrable, at least for the first 5 reads. I've found some good resources now; I'll get round to putting them on the website at some point.

For context, this project started with me stumbling on the archived Magnetico project and naively thinking "I can pinch this and spruce it up a bit and make it more usable and it might be quite popular", and here we are 5 days after me making it public. I knew zero about the protocols and Kademlia and have been cobbling things together with what I could find. It basically worked and seemed good enough from the (selfish) point of view of the app's functionality. It's been a fun and steep learning curve.

Getting this right is the top priority for the project at the moment but will take a bit more time.

the8472 commented 7 months ago

bittorrent.org hosts the official specifications; it's not implementation guidance or a tutorial on decentralized network architecture.

I'm not sure how to improve it without turning it into a 10-page document. Well, the link to the paper needs to be fixed of course....

mgdigital commented 7 months ago

This is pretty good if you haven't seen it already: https://kelseyc18.github.io/kademlia_vis/basics/1/

the8472 commented 7 months ago

Ah well, that page (like many others) is based on the Kademlia preprint paper, not the final paper. It does explain some basic concepts fine, but the routing table in particular is outdated and will cause confusion if you try to square it with BEP5.

See these SO posts for additional explanations: https://stackoverflow.com/q/47507317/1362755 https://stackoverflow.com/a/51161732/1362755

mgdigital commented 7 months ago

It's reassuring that others seem to agree this is as clear as mud :)

mgdigital commented 7 months ago

Hi @the8472 , I think I've just about cracked this. That PR has gotten pretty big (as the crawler has been rewritten around the new routing table), but the meat of the changes as far as DHT compliance is concerned is here:

Routing table: https://github.com/bitmagnet-io/bitmagnet/tree/dht-responder-bep5/internal/protocol/dht/ktable/btree

BEP 5 & 51 responses: https://github.com/bitmagnet-io/bitmagnet/tree/dht-responder-bep5/internal/protocol/dht/responder

The routing table implementation doesn't exactly look like 160 "buckets" that can "split", but it is nearly functionally equivalent. It's basically a binary tree that lets things in based on various rules. So as per your SO post (https://stackoverflow.com/questions/32129978/highly-unbalanced-kademlia-routing-table/32187456#32187456), we'll let a new item C into the table if:

To this I have added:

These last rules became necessary because (for this application) you want to be constantly discovering new nodes to sample info hashes from, and without this you soon get to the point where no more will fit in the fullest buckets.

One source of confusion was these conventions: https://wiki.theory.org/BitTorrentSpecification#peer_id - this seems very suboptimal if you're trying to achieve even keyspace coverage, as peer ID prefixes will inevitably coalesce around the popular BT clients. It seems like it would be better to indicate the client using a suffix instead; I can only assume that for most BT clients the DHT is a secondary concern. Weirdly, most of the peers I discover don't seem to have these prefixes - are there a lot of clients not following this? The list seems to cover most of the main ones. For the moment I'm generating a completely random client ID.

I'm going to look separately into how we can respond to some meta info requests. I'll probably end up keeping a limited selection of them in memory so we can service some requests....

the8472 commented 7 months ago

These last rules became necessary because (for this application) you want to be constantly discovering new nodes to sample info hashes from, and without this you soon get to the point where no more will fit in the fullest buckets.

You're conflating the routing table and the lookup contacts here. The routing table is long-lived (per §2.2 of the paper). Each lookup (§2.3) on the other hand maintains its own ephemeral set of contacts that is separate but seeded from the routing table. A keyspace-wide sweep for infohash sampling is a special kind of lookup.

If neither of the above rules pass then evict the furthest node in the table to make room for C (assuming C would not itself be the furthest)

You seem to be missing the replacement cache of the buckets then? See §4.1 of the paper. This way the buckets can maintain an oldest-contact-first policy for the main entries and a least-recently-seen policy for the replacements, which balances stability and having fresh nodes on hand when a replacement is needed.
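
Roughly the shape I mean (an illustrative sketch only, not bitmagnet's code):

```go
package sketch

import "time"

const k = 8

type contact struct {
	id   [20]byte
	addr string
}

type entry struct {
	c        contact
	lastSeen time.Time
}

// bucket is a rough sketch of a k-bucket with a replacement cache
// (Kademlia §2.2 / §4.1): main entries are stable, oldest first, and only
// evicted when unresponsive; replacements hold fresh candidates, newest last.
type bucket struct {
	entries      []entry
	replacements []entry
}

// seen is called when a node responds to us or sends a valid query.
func (b *bucket) seen(c contact, now time.Time) {
	for i := range b.entries {
		if b.entries[i].c.addr == c.addr {
			b.entries[i].lastSeen = now
			return
		}
	}
	if len(b.entries) < k {
		b.entries = append(b.entries, entry{c, now})
		return
	}
	// Bucket full: don't evict the stable entries, remember c as a replacement.
	for i := range b.replacements {
		if b.replacements[i].c.addr == c.addr {
			b.replacements[i].lastSeen = now
			return
		}
	}
	b.replacements = append(b.replacements, entry{c, now})
	if len(b.replacements) > k {
		b.replacements = b.replacements[1:] // drop the least recently seen candidate
	}
}

// unresponsive swaps a dead main entry for the freshest replacement.
func (b *bucket) unresponsive(addr string) {
	for i := range b.entries {
		if b.entries[i].c.addr == addr && len(b.replacements) > 0 {
			last := len(b.replacements) - 1
			b.entries[i] = b.replacements[last]
			b.replacements = b.replacements[:last]
			return
		}
	}
}
```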

One source of confusion was these conventions: https://wiki.theory.org/BitTorrentSpecification#peer_id - this seems very suboptimal if you're trying to achieve even keyspace coverage as peer ID prefixes will inevitably coalesce around the popular BT clients.

peer_id has nothing to do with the DHT, it's only used in the TCP and tracker protocols.

mgdigital commented 7 months ago

I don't think I've deviated much from the specification here. There are separate processes keeping track of the time last seen and pinging/evicting as appropriate. The table is long-lived, and as long as you're not the furthest node and you're responsive, you'll remain in the table indefinitely. I don't store the buckets sorted by time last seen, but with a K of 160 it's an inexpensive sort. As per the paper:

The most important procedure a Kademlia participant must perform is to locate the k closest nodes to some given node ID

And this is what the table is optimised for.

A keyspace-wide sweep for infohash sampling is a special kind of lookup.

Yes the implementation is using a separate lookup for this.

on the other hand maintains its own ephemeral set of contacts that is separate but seeded from the routing table

I'll look at the replacement cache for seeding the table with senders of unsolicited requests. For the crawler, a peer would always have been verified as responsive before attempting to insert it in the table.

peer_id has nothing to do with the DHT, it's only used in the TCP and tracker protocols.

Okay I'll just keep it random then - it can't be completely unrelated as I see many peers with these prefixes (but also many without).

the8472 commented 7 months ago

Any DHT node that puts its peer ID in its node ID is doing something really strange. Are you sure you're not looking at the optional version field?

I don't think I've deviated much from the specification here.

I don't understand what you're doing. Lookup contact accounting and the persistent routing table are quite different things. It's not like you have to optimize one at the expense of the other.

mgdigital commented 7 months ago

Are you sure you're not looking at the optional version field?

Nope, I'm looking at the peer ID. For example, I can see a peer with prefix 2d71423432353... that would correspond to qBittorrent 4.2.4; maybe it has that ID completely by chance but it seems unlikely.

I don't understand what you're doing... find node/get peers/get lookups require just a distance-to-target-ID list

I think this is where the misunderstanding is. The app itself is not trying to find any particular target peer/torrent, but rather index whatever can be found by chance. For this purpose any node ID is fine and the distance is not relevant - the structure of my routing table serves only for servicing incoming requests with peers close to a requested target. (However, the fact that the crawler is querying peers from the routing table will have the side effect of biasing it towards discovering peers and hashes close to its own client ID.)

It's possible that in future the app will want to handle such "in flight" traversals of the DHT like a typical BT client would, but not right now. At that point, then yes, this lookup would be a separate thing from the routing table.

sample-infohashes .... has no age preference, it can contain tons of non-visited candidate contacts, its home-ID is constantly moving

For this I'm taking a random sample of the table I'm using to service get_peers requests. In my implementation this uses the same binary tree as for the peers table - I assume this structure is also appropriate here, as you'd want to prefer knowing about hashes close to your own ID. Eviction will be based on the least recently received incoming get_peers or announce request.

the8472 commented 7 months ago

maybe it has that ID completely by chance but it seems unlikely.

I grepped some DHT traffic dump for id20:-qB and didn't find any matches. It's not common enough that it would show up immediately at least.

I assume this structure is also appropriate here as you'd want to prefer knowing about hashes close to your own ID.

Actually you only want this for the main routing table, which is also used to respond to incoming queries, so you can supply them with a high-resolution view of your vicinity (in case they're looking for something there).

For random sampling of info-hashes you don't really need any locality. You either need a structured sweep over the entire keyspace or a sort of random walk, neither of which need any locality relative to your node ID. A random walk is likely going to be inefficient though due to the interval restriction if you end up hitting the same node multiple times (e.g. because it's popular for some reason or there are some biases in the method).
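
One simple way to structure a sweep (simpler than the adaptive approach in the doc I linked) is to space lookup targets evenly across the keyspace, e.g. (sketch):

```go
package sketch

import "encoding/binary"

// sweepTargets yields n lookup targets spread evenly across the keyspace
// by stepping the most significant 64 bits, so a full sweep visits every
// region roughly once instead of random-walking (which tends to revisit
// the same nodes and runs into the sampling interval restriction).
func sweepTargets(n int) [][20]byte {
	targets := make([][20]byte, 0, n)
	step := ^uint64(0) / uint64(n)
	for i := 0; i < n; i++ {
		var t [20]byte
		binary.BigEndian.PutUint64(t[:8], uint64(i)*step) // evenly spaced prefix
		targets = append(targets, t)
	}
	return targets
}
```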

mgdigital commented 7 months ago

didn't find any matches

I think I know how this happened - I've been assuming the client ID from the BT handshake for the meta info request would be the same client ID used on the DHT which I'm now guessing is wrong.

you only want this for the main routing table, which is also used to respond to incoming queries, so you can supply them with a high-resolution view of your vicinity

For random sampling of info-hashes you don't really need any locality.

But sample_infohashes is meant to return only info hashes for which you're able to service a get_peers request, which in turn would be taken as a random sample from the locally biased high res view.

the8472 commented 7 months ago

I think I know how this happened - I've been assuming the client ID from the BT handshake for the meta info request would be the same client ID used on the DHT which I'm now guessing is wrong.

I did explicitly say that it's part of the TCP peer protocol and tracker announces...

But sample_infohashes is meant to return only info hashes for which you're able to service a get_peers request, which in turn would be taken as a random sample from the locally biased high res view.

That's for servicing incoming requests, which happens from your local announce storage (whose contents will cluster around your node ID). I was talking about the state a node has to maintain to find new contacts to send outgoing sample_infohashes requests to. You can't do that (efficiently) from your local routing table, because the routing table should be biased towards your node ID, so you have to maintain separate lookup state with candidate contacts. Note that the sample_infohashes RPC also contains the usual target/nodes mechanism besides the samples.
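
For reference, the response side is just this shape per BEP 51 (a sketch; bencoding and random sample selection are left out):

```go
package sketch

// sampleInfohashesResponse sketches the reply dictionary of a BEP 51
// sample_infohashes query: "samples" is a byte string of concatenated
// 20-byte infohashes drawn from what we could actually serve via get_peers,
// "num" is the total we hold, and "nodes"/"interval" work as in other
// queries. For brevity the first n stored hashes are taken as the sample.
func sampleInfohashesResponse(nodeID [20]byte, stored [][20]byte, compactNodes []byte, intervalSecs int) map[string]interface{} {
	n := len(stored)
	if n > 20 {
		n = 20
	}
	samples := make([]byte, 0, n*20)
	for _, ih := range stored[:n] {
		samples = append(samples, ih[:]...)
	}
	return map[string]interface{}{
		"id":       nodeID[:],
		"interval": intervalSecs, // seconds before the requester should sample us again
		"num":      len(stored),  // total infohashes in local storage
		"samples":  samples,
		"nodes":    compactNodes, // compact node info closest to the requested target
	}
}
```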

mgdigital commented 6 months ago

Hi @the8472 , this PR will be merged imminently if there's no further feedback, after which I'll close this issue. I've addressed the points discussed above (I'm no longer evicting nodes from the table and am now using a replacement cache). I'll open a further issue for servicing meta info requests.

the8472 commented 6 months ago

Just from skimming over the PR it looks like the basics are there. I have not looked closely enough to check for correctness.

Just to probe the previous misunderstanding: You have a routing table to service incoming requests... and then what do you do to send sample infohashes requests yourself? Just occasionally pull a few nodes from the routing table + whatever contacts you see from incoming requests?

mgdigital commented 6 months ago

what do you do to send sample infohashes requests yourself? Just occasionally pull a few nodes from the routing table + whatever contacts you see from incoming requests?

Basically correct, yep - the nodes for sending find_node and sample_infohashes requests come from a combination of:

the8472 commented 6 months ago

Yeah ok, sounds reasonable.

If you haven't already, then you may want to keep track of which nodes you've recently sent a sample request to, so you don't spam a node that contacts you repeatedly, e.g. because they're neighbors and send find_node queries to keep their routing table up to date. Depending on the interval they set, you wouldn't get fresh samples out of them anyway.
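
Something small like this would do (an illustrative sketch, naming is mine):

```go
package sketch

import (
	"sync"
	"time"
)

// sampleThrottle remembers, per node address, when we're next allowed to
// send it a sample_infohashes request, based on the interval it returned
// last time (or a default until we've heard one).
type sampleThrottle struct {
	mu   sync.Mutex
	next map[string]time.Time
}

func newSampleThrottle() *sampleThrottle {
	return &sampleThrottle{next: make(map[string]time.Time)}
}

// allow reports whether addr may be sampled now and, if so, records the
// backoff so repeated contact from that node doesn't trigger more requests.
func (t *sampleThrottle) allow(addr string, interval time.Duration, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if until, ok := t.next[addr]; ok && now.Before(until) {
		return false
	}
	t.next[addr] = now.Add(interval)
	return true
}
```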

mgdigital commented 6 months ago

you may want to keep track of which nodes you've recently sent a sample request to

Yes, it is doing this already (for nodes in the routing table), and it will also back off sending sample requests to any node that didn't return any unknown hashes in the last request. Incidentally, I found it excessive that most nodes seem to return the maximum interval of 6 hours even though they report having a high num of samples available, but we generally discover enough new nodes to keep sampling from anyway. When querying nodes from the routing table it will prefer the least recently responded; separately from this there is also outgoing rate limiting of 1 request per IP per second (which we wouldn't be hitting for any sustained period anyway), so we shouldn't be spamming individual nodes...