Open nodiscc opened 3 months ago
- [x] I have checked the existing issues to avoid duplicates
- [x] I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue
Is your feature request related to a problem? Please describe
The documentation should mention the expected disk space requirements when the DHT crawler is enabled, relative to the number of torrents indexed, since this is by far the most demanding system requirement, and as it is dictated by factors outside the control of the user (total size of the bittorrent DHT... is there an estimate for this somewhere?)
Describe the solution you'd like
Document on https://bitmagnet.io/faq.html a few examples of DB sizes relative to the number of torrents, for example:
Disk space used by the database depends on the total number of indexed torrents and will tend to grow indefinitely. For example, these are the requirements you should expect for:
- 50k indexed torrents: ???GB
- 143k indexed torrents: 2.5GB
- 1 million indexed torrents: ???GB
The value of 2.5GB for 143k torrents is from a measurement on my test instance. This puts the average size of a torrent at ~18KB. It would be interesting to see the numbers from instances with a lower/higher number of indexed torrents, and use that as an estimate.
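The per-torrent figure above can be reproduced with a small calculation. This is only a sketch: the function name is made up for illustration, and it assumes the reported "2.5GB" is really GiB (1024-based), which is what gives roughly 18 KiB.

```python
def avg_kib_per_torrent(db_size_gib: float, torrent_count: int) -> float:
    """Average database footprint per indexed torrent, in KiB.

    Assumes db_size_gib is expressed in GiB (1 GiB = 1024 * 1024 KiB).
    """
    if torrent_count <= 0:
        raise ValueError("torrent_count must be positive")
    return db_size_gib * 1024 * 1024 / torrent_count


# Measurement from the test instance described above: 2.5 GB for 143k torrents.
print(f"{avg_kib_per_torrent(2.5, 143_000):.1f} KiB per torrent")
```

With decimal units (1 GB = 1,000,000 KB) the same measurement comes out slightly lower, at about 17.5 KB per torrent.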
I will report a separate issue about a potential setting to hard-limit the DB size.
Describe alternatives you've considered
Documenting the expected disk space requirements related to total run time, since the number of indexed torrents depends on the total time spent crawling.
Additional context
Somewhat related to #70 which would help keep the database size in check.
At 884K indexed my Postgres data/base directory is at 6.7G
One important thing to note is that the size is heavily influenced by the amount of file data you store.
Example:
If you store 10,000 torrents and only 1 file name for each torrent, that's an additional 10,000 records, total: 20,000. Now change this to 100 file names per torrent and your database has 1,010,000 records compared to the 20,000.
Of course these stats of database size could be listed as the default configured file info size.
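The record-count arithmetic in the example above can be sketched as follows. The model is deliberately simplified and is an assumption for illustration only: it counts one record per torrent plus one record per stored file name, and ignores any other tables or indexes the real schema may have.

```python
def total_records(torrents: int, files_per_torrent: int) -> int:
    """One record per torrent plus one record per stored file name."""
    return torrents + torrents * files_per_torrent


print(total_records(10_000, 1))    # 1 file name per torrent
print(total_records(10_000, 100))  # 100 file names per torrent
```

Under this model the record count grows linearly with the number of file names kept per torrent, which is why the file threshold setting matters so much for database size.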
Currently on the FAQ page we have:
> You should allow roughly 50GB of disk space per 10 million torrents, which should suffice for several months of crawling, however there is no upper limit to how many torrents might ultimately be crawled.
I agree better documentation on this would be good - but at the moment things are changing at a rapid pace that will affect disk space usage, and we're just getting to the stage where people have had it running long enough to get some better numbers about the current implementation. The next thing will be rule based workflows that can auto-delete and do other things that will affect this - so maybe we should come back to this in a few months and aim to make some better docs on this when things are more stable?
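The FAQ guideline quoted above (roughly 50GB per 10 million torrents) can be turned into a rough projection for other torrent counts. This is a sketch, not an official sizing tool: the names are hypothetical, and it assumes disk usage scales linearly with the torrent count, which the measurements in this thread only loosely support.

```python
# Rate taken from the FAQ guideline: roughly 50 GB per 10 million torrents.
FAQ_GB_PER_TORRENT = 50 / 10_000_000


def projected_db_size_gb(torrent_count: int,
                         gb_per_torrent: float = FAQ_GB_PER_TORRENT) -> float:
    """Linear projection of database size in GB for a given torrent count."""
    if torrent_count < 0:
        raise ValueError("torrent_count must not be negative")
    return torrent_count * gb_per_torrent


for count in (50_000, 1_000_000, 10_000_000):
    print(f"{count:>10,} torrents: ~{projected_db_size_gb(count):.2f} GB")
```

Passing a different `gb_per_torrent` lets you plug in a rate measured on your own instance instead of the FAQ figure.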
> One important thing to note is that the size is heavily influenced by the amount of file data you store. Example: If you store 10,000 torrents and only 1 file name for each torrent, that's an additional 10,000 records, total: 20,000. Now change this to 100 file names per torrent and your database has 1,010,000 records compared to the 20,000. Of course these stats of database size could be listed as the default configured file info size.
That is correct, but an estimate with the default settings would suffice to start with, and it would give you an idea of what hardware you need to at least test and play around with it.
> At 884K indexed my Postgres data/base directory is at 6.7G
This puts the average torrent size at 6.7×1024×1024÷884000 = ~7.9KiB vs my estimated 18KiB
> that the size is heavily influenced by the amount of file data you store

Yes, hence the need to use averages, which are useful for estimation. People with a higher number of indexed torrents should have averages closer to reality, with less bias. It would be interesting to compare databases with about the same number of torrents.
> there is no upper limit to how many torrents might ultimately be crawled.
That was my guess, hence https://github.com/bitmagnet-io/bitmagnet/issues/187
https://bitmagnet.io/faq.html#what-are-the-system-requirements-for-bitmagnet
I did not see this section, it was right under my nose /facepalm, however
> roughly 50GB of disk space per 10 million torrents
This puts the average torrent size at 5.2KiB... why so much difference between our 3 measurements? I think more samples are needed.
I've checked just now and am on 67GB for 13.5 million torrents. A couple of things to bear in mind:
select pg_size_pretty(pg_database_size('bitmagnet'))
For me:
Meaning:
Though I feel that disk IO throughput is more a limiting factor than disk size when you use HDDs. Had a DB much bigger and was struggling to keep up writes.
```
bitmagnet=# select pg_size_pretty(pg_database_size('bitmagnet'));
 pg_size_pretty
----------------
 4145 MB
(1 row)

bitmagnet=# select count(*) from torrents;
 count
--------
 528570
(1 row)

bitmagnet=# select count(*) from torrent_files;
  count
---------
 7086417
(1 row)
```
I'm at 78 GB for 7.059.136 Torrents
> more space is used at the start - once most of the popular stuff from TMDB is stored locally this should level off
I did not think about that, there is some database space used for TMDB data
> Are we measuring the same way?

I was relying on netdata PostgreSQL DB size monitoring, but it's consistent with the results I get from select pg_size_pretty(pg_database_size('bitmagnet'))
Thanks everyone for the metrics, I will start a table below and update it every time someone posts their db stats. After a while it could be added to the documentation, hopefully.
| number of torrents | db size (GB) | average per torrent (KB) | notes |
|---|---|---|---|
| 143 000 | 2.5 | 17 | |
| 528 000 | 4.1 | 7.8 | |
| 884 000 | 6.7 | 7.6 | |
| 985 847 | 15.7 | 15.9 | DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000 |
| 7 059 136 | 78 | 11 | |
| 9 228 000 | 145 | 15.7 | DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000 |
| 13 500 000 | 67 | 5 | |
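The averages in the table can be recomputed from the figures reported in this thread. The sketch below takes the sizes from the individual comments (about 15.7 GB for the 985 847-torrent instance and 145 GB for the 9 228 000-torrent instance) and assumes decimal units (1 GB = 1,000,000 KB). The function names are illustrative, and the pooled average is only a rough indicator because the instances use different settings.

```python
# (torrent count, database size in GB) as reported by commenters in this thread.
DATA_POINTS = [
    (143_000, 2.5),
    (528_000, 4.1),
    (884_000, 6.7),
    (985_847, 15.7),    # DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000
    (7_059_136, 78.0),
    (9_228_000, 145.0),  # DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000
    (13_500_000, 67.0),
]


def kb_per_torrent(torrents: int, size_gb: float) -> float:
    """Average size per torrent in KB, using decimal units."""
    return size_gb * 1_000_000 / torrents


def pooled_kb_per_torrent(points) -> float:
    """Average over all instances combined (total size / total torrents)."""
    total_torrents = sum(count for count, _ in points)
    total_gb = sum(size for _, size in points)
    return kb_per_torrent(total_torrents, total_gb)


for count, size_gb in DATA_POINTS:
    print(f"{count:>12,} torrents: {kb_per_torrent(count, size_gb):5.1f} KB/torrent")
print(f"pooled: {pooled_kb_per_torrent(DATA_POINTS):.1f} KB/torrent")
```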
To add another data point, I have 9 228 000 torrents with a total of 291 283 000 files, stored in 145 GB, using the config option DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000 (to ensure file information is stored even on excessively large torrents, default cutoff is to store at most 100 files per torrent).
This means I have 31 files per torrent on average, over twice what kde99 got above. The largest torrent in my database contains 10870 files. 4.5% of torrents exceed the default DHT_CRAWLER_SAVE_FILES_THRESHOLD of 100 files.
Average size per torrent is correspondingly a bit larger at 16KB/torrent, or 535 bytes per file.
I agree that disk throughput is a much bigger factor. If you are using cheap consumer SSDs you also really feel the wear Bitmagnet puts on the disk. If I'm interpreting my disk stats correctly, Bitmagnet has written a total of about 180TB in service of creating this 145GB database.
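The figures in this comment can be checked with a few divisions. This is a sketch under stated assumptions: the variable names are made up, the "145 GB" database size is treated as GiB for the per-file figure (which is what reproduces about 535 bytes per file), and the write ratio treats both 180TB and 145GB as decimal units.

```python
GIB = 1024 ** 3

torrents = 9_228_000
files = 291_283_000
db_size_bytes = 145 * GIB  # assumption: the reported 145 GB is 1024-based

files_per_torrent = files / torrents
bytes_per_file = db_size_bytes / files

# Ratio of total bytes written to final database size (decimal units assumed).
write_ratio = 180e12 / 145e9

print(f"{files_per_torrent:.1f} files per torrent")
print(f"{bytes_per_file:.0f} bytes per file")
print(f"~{write_ratio:.0f}x more data written than the database holds")
```

A write ratio above a thousand is consistent with the point that disk wear and throughput, rather than capacity, may be the tighter constraint.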
I have 985 847 torrents with a total of 29 345 871 files, stored in 15 736 869 347 bytes, using the config option DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000, for an average size of 15 962 bytes per torrent.