anacrolix opened 1 month ago
Using common regexes would probably be best. I've already gathered quite a few of them while testing dhtc.
That would also solve unwanted languages, at least some of them, since it's possible to regex out Unicode ranges.
The ability to apply them retroactively to the DB would be nice too.
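For illustration, a name filter along these lines can be sketched in a few lines of Python. The patterns and the `wanted` helper below are made up, not taken from dhtc; the Unicode-range patterns show how whole scripts can be excluded:

```python
import re

# Hypothetical filter rules -- a real deployment would use its own list.
UNWANTED = [
    re.compile(r"\b(cam|ts|telesync)\b", re.IGNORECASE),  # legacy/low-quality release tags
    re.compile(r"[\u0400-\u04FF]"),                       # Cyrillic block
    re.compile(r"[\u4E00-\u9FFF]"),                       # CJK Unified Ideographs
]

def wanted(name: str) -> bool:
    """Return True if a torrent or file name passes every filter."""
    return not any(p.search(name) for p in UNWANTED)

names = ["Some.Movie.2023.1080p", "Фильм.2001.CAMRip", "电影合集"]
print([n for n in names if wanted(n)])  # only the first name survives
```

The same predicate can be run over both torrent names and file names, and re-running it over stored rows gives the retroactive DB cleanup mentioned above.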
Great, are the filters on file names, torrent names or something else?
There is already a little filtering on seeder count, which is a huge filter for quality.
Preferably both; dhtc does both, and I found that very useful, as I only care about a restricted range of formats. Additionally, it's useful to filter out old/legacy stuff (which some countries still use, due to limited bandwidth for example, or out of odd habits...).
Also, I would like you to answer whether you do any illegal content filtering internally (I'm assuming that you do); it's a major concern, and both dhtc and bitmagnet do it.
As in you need to ensure there's filtering, or you don't want a filter? What's the concern?
Obviously, I wouldn't want to keep the very illegal content in the index (like child abuse stuff)...
I have another question: there are some torrents with thousands of files in them, so I'd imagine that checking them can slow things down considerably. Is there any possibility of limiting the checking/crawling of the big ones?
Yeah by default the indexer refuses to add infos that exceed a certain size. Additionally I think the number of files in a torrent weighs negatively on how strongly it matches a search.
I've just stumbled upon a terabyte torrent with tens of thousands of files inside, with its whole content indexed... That's not good. What's the limit?
The problem is that this particular torrent is actually valuable, but loading it fully is highly resource-hungry. It also has two levels of subfolders, and the name of the torrent is wholly nondescript, usually one to four letters, so without crawling either the files or the subfolders it's not possible to get any valuable metadata... Also, the torrent file itself, for these massive ones, can be 5-10MB in size...
Anyway, I've seen a few of those which would be excluded with my filters, and that would definitely save some processing time.
That's interesting. In my database the largest stored info is 3MB, compressed down from 14.5MB. And it's not unpopular stuff; it quickly drops down to just over 1MB for the larger infos. The average compression ratio is around 25%.
I checked the code and the limit is 100MiB. So if the metainfo is 100MiB or more on the wire, the indexer won't bother fetching it. The hash data is thrown away for v1 torrents, so it might be that the largest infos here were significantly larger with the hash data.
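The gate described here can be sketched as follows. This is not the indexer's actual code; the 100 MiB figure is the limit quoted above, and the function names are made up:

```python
import math

# Skip metainfo that is too large on the wire (100 MiB limit from the thread).
MAX_INFO_BYTES = 100 * 1024 * 1024

def should_fetch_info(wire_size: int) -> bool:
    """Refuse to fetch an info whose announced wire size is at or over the limit."""
    return wire_size < MAX_INFO_BYTES

def file_count_penalty(file_count: int) -> float:
    """Toy ranking penalty that grows with the log of the file count."""
    return math.log1p(file_count)

print(should_fetch_info(5 * 1024 * 1024))    # True: a 5 MiB info is accepted
print(should_fetch_info(200 * 1024 * 1024))  # False: a 200 MiB info is refused
```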
Here's the raw data if you're interested
+-------------------+----------+-----------+---------+------------+
| a/b | a | b | seeders | file_count |
+-------------------+----------+-----------+---------+------------+
| 0.205589256054437 | 2.9729 | 14.460386 | 4 | 111701 |
| 0.191750191516944 | 2.500792 | 13.041927 | 92 | 104431 |
| 0.183786494534063 | 1.707072 | 9.288343 | 8 | 112212 |
| 0.251504958904418 | 1.673055 | 6.652175 | 0 | 59772 |
| 0.209207257029524 | 1.65587 | 7.914974 | 4 | 80857 |
| 0.171661471037849 | 1.636341 | 9.532372 | 0 | 126776 |
| 0.524022050457096 | 1.549267 | 2.956492 | 7 | 36395 |
| 0.182204938628354 | 1.45907 | 8.007851 | 2 | 95505 |
| 0.514530407225356 | 1.458474 | 2.834573 | 1 | 41764 |
| 0.263741551349621 | 1.371402 | 5.199795 | 2 | 46268 |
| 0.318458899275291 | 1.331034 | 4.17961 | 1 | 51697 |
| 0.269195629008345 | 1.328527 | 4.935173 | 0 | 53592 |
| 0.32190846034417 | 1.261651 | 3.919285 | 1 | 56519 |
| 0.312036648534522 | 1.25385 | 4.018278 | 19 | 31001 |
| 0.423762693436988 | 1.24336 | 2.934095 | 0 | 24615 |
| 0.353076277549578 | 1.215088 | 3.441432 | 38 | 70119 |
| 0.599473579558578 | 1.19389 | 1.991564 | 0 | 23293 |
| 0.500415743265555 | 1.175981 | 2.350008 | 1 | 33412 |
| 0.358333126731059 | 1.170727 | 3.267147 | 2 | 31112 |
| 0.350919751250581 | 1.152797 | 3.285073 | 2 | 31626 |
+-------------------+----------+-----------+---------+------------+
a is the compressed size in MB of the file listing for a torrent, b is the uncompressed size, and a/b is the compression ratio.
I assume you're poking around in the torrent database attached to cove.
Please send me the infohash(es) for any torrents that cause you problems with cove. I've had a few in the past that required optimization to handle their enormous size, and having real-world jumbo torrents that cause issues in my implementations is really helpful.
I've sent you the hashes via the website's mail.
Side note: after more than a week of crawling, it indexes on average ~20 hashes per ~5 min. Is that normal? It seems really sluggish. It also completely misses some chunks of common stuff.
No, you should get between 80 and 200 infos every 5 minutes. I think the actual count is much higher, but the search index filters out stuff that doesn't have many seeders.
> 80 and 200 infos every 5 minutes

Well, it's been 18-28 for days, and it's on a half-gig fiber link with very low latency. Edit: the issue was caused by the tunnel I used, which somehow slowed crawling; after disabling it, it's back to normal.
And also, I really, really need filters; there's an unimaginable amount of useless stuff indexed.
After it got into full swing, I can finally tell that you aren't doing any internal filtering for the highly undesirable and very illegal content. So I'm stopping it until there's filtering.
From https://github.com/anacrolix/cove/issues/10#issuecomment-2260581015. Filter what gets added to the DHT index. @barolo