anacrolix opened 1 month ago
Using common regexes would probably be best. I've already gathered quite a few of them while testing dhtc.
That would also solve unwanted languages, at least some of them, since it's possible to regex out Unicode ranges.
The ability to apply them retroactively to the DB would be nice too.
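For illustration, a name filter along these lines can be sketched in a few lines of Python. The patterns and the `wanted` helper below are made up, not taken from dhtc; the Unicode-range patterns show how whole scripts can be excluded:

```python
import re

# Hypothetical filter rules -- a real deployment would use its own list.
UNWANTED = [
    re.compile(r"\b(cam|ts|telesync)\b", re.IGNORECASE),  # legacy/low-quality release tags
    re.compile(r"[\u0400-\u04FF]"),                       # Cyrillic block
    re.compile(r"[\u4E00-\u9FFF]"),                       # CJK Unified Ideographs
]

def wanted(name: str) -> bool:
    """Return True if a torrent or file name passes every filter."""
    return not any(p.search(name) for p in UNWANTED)

names = ["Some.Movie.2023.1080p", "Фильм.2001.CAMRip", "电影合集"]
print([n for n in names if wanted(n)])  # only the first name survives
```

The same predicate can be run over both torrent names and file names, and re-running it over stored rows gives the retroactive DB cleanup mentioned above.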
Great, are the filters on file names, torrent names or something else?
There is already a little filtering on seeder count, which is a huge filter for quality.
Preferably both; dhtc does both, and I found that very useful, as I only care about a restricted range of formats. Additionally, it's useful to filter out old/legacy stuff (which some countries still use, due to limited bandwidth for example, or out of odd habits...).
Also, I would like you to answer whether you do any illegal content filtering internally (I'm assuming that you do); it's a major concern, and both dhtc and bitmagnet do it.
As in you need to ensure there's filtering, or you don't want a filter? What's the concern?
Obviously, I wouldn't want to keep the very illegal content in the index (like child abuse stuff)...
I have another question: there are some torrents with thousands of files in them, so I'd imagine that checking them can slow things down considerably. Is there any possibility of limiting the checking/crawling of the big ones?
Yeah by default the indexer refuses to add infos that exceed a certain size. Additionally I think the number of files in a torrent weighs negatively on how strongly it matches a search.
I've just stumbled upon a terabyte torrent with tens of thousands of files inside, with its whole content indexed... That's not good. What's the limit?
The problem is that this particular torrent is actually valuable, but loading it fully is highly resource-hungry. It also has two levels of subfolders, and the name of the torrent is wholly nondescript, usually one to four letters, so without crawling either the files or the subfolders it's not possible to get any valuable metadata... Also, the torrent file itself, for these massive ones, can be 5-10MB in size...
Anyway, I've seen a few of those which would be excluded with my filters, and that would definitely save some processing time.
That's interesting. In my database the largest stored info is 3MB, compressed down from 14.5MB. And it's not unpopular stuff; it quickly drops down to just over 1MB for the larger infos. The average compression ratio is around 25%.
I checked the code and the limit is 100MiB. So if the metainfo is 100MiB or more on the wire, the indexer won't bother fetching it. The hash data is thrown away for v1 torrents, so it might be that the largest infos here were significantly larger with the hash data.
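The gate described here can be sketched as follows. This is not the indexer's actual code; the 100 MiB figure is the limit quoted above, and the function names are made up:

```python
import math

# Skip metainfo that is too large on the wire (100 MiB limit from the thread).
MAX_INFO_BYTES = 100 * 1024 * 1024

def should_fetch_info(wire_size: int) -> bool:
    """Refuse to fetch an info whose announced wire size is at or over the limit."""
    return wire_size < MAX_INFO_BYTES

def file_count_penalty(file_count: int) -> float:
    """Toy ranking penalty that grows with the log of the file count."""
    return math.log1p(file_count)

print(should_fetch_info(5 * 1024 * 1024))    # True: a 5 MiB info is accepted
print(should_fetch_info(200 * 1024 * 1024))  # False: a 200 MiB info is refused
```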
Here's the raw data if you're interested
+-------------------+----------+-----------+---------+------------+
| a/b | a | b | seeders | file_count |
+-------------------+----------+-----------+---------+------------+
| 0.205589256054437 | 2.9729 | 14.460386 | 4 | 111701 |
| 0.191750191516944 | 2.500792 | 13.041927 | 92 | 104431 |
| 0.183786494534063 | 1.707072 | 9.288343 | 8 | 112212 |
| 0.251504958904418 | 1.673055 | 6.652175 | 0 | 59772 |
| 0.209207257029524 | 1.65587 | 7.914974 | 4 | 80857 |
| 0.171661471037849 | 1.636341 | 9.532372 | 0 | 126776 |
| 0.524022050457096 | 1.549267 | 2.956492 | 7 | 36395 |
| 0.182204938628354 | 1.45907 | 8.007851 | 2 | 95505 |
| 0.514530407225356 | 1.458474 | 2.834573 | 1 | 41764 |
| 0.263741551349621 | 1.371402 | 5.199795 | 2 | 46268 |
| 0.318458899275291 | 1.331034 | 4.17961 | 1 | 51697 |
| 0.269195629008345 | 1.328527 | 4.935173 | 0 | 53592 |
| 0.32190846034417 | 1.261651 | 3.919285 | 1 | 56519 |
| 0.312036648534522 | 1.25385 | 4.018278 | 19 | 31001 |
| 0.423762693436988 | 1.24336 | 2.934095 | 0 | 24615 |
| 0.353076277549578 | 1.215088 | 3.441432 | 38 | 70119 |
| 0.599473579558578 | 1.19389 | 1.991564 | 0 | 23293 |
| 0.500415743265555 | 1.175981 | 2.350008 | 1 | 33412 |
| 0.358333126731059 | 1.170727 | 3.267147 | 2 | 31112 |
| 0.350919751250581 | 1.152797 | 3.285073 | 2 | 31626 |
+-------------------+----------+-----------+---------+------------+
a is the compressed size in MB of the file listing for a torrent, b is the uncompressed size, and a/b is the compression ratio.
I assume you're poking around in the torrent database attached to cove.
Please send me the infohash(es) for any torrents that cause you problems with cove. I've had a few in the past that required optimization to handle their enormous size, and having real-world jumbo torrents that cause issues in my implementations is really helpful.
I've sent you the hashes via the website's mail.
Side note: after more than a week of crawling, it indexes on average ~20 hashes per ~5 min. Is that normal? It seems really sluggish. It also completely misses some chunks of common stuff.
No, you should get between 80 and 200 infos every 5 minutes. I think the actual count is much higher, but the search index filters out stuff that doesn't have many seeders.
> 80 and 200 infos every 5 minutes

Well, it's been 18-28 for days, and it's on a half-gig fiber link with very low latency. Edit: the issue was caused by the tunnel I used, which somehow slowed crawling; after disabling it, it's back to normal.
And also, I really, really need filters; there's an unimaginable amount of useless stuff indexed.
After it got into full swing, I can finally tell that you aren't doing any internal filtering for the highly undesirable and very illegal content. So I'm stopping it until there's filtering.
From https://github.com/anacrolix/cove/issues/10#issuecomment-2260581015. Filter what gets added to the DHT index. @barolo