bitmagnet-io / bitmagnet

A self-hosted BitTorrent indexer, DHT crawler, content classifier and torrent search engine with web UI, GraphQL API and Servarr stack integration.
https://bitmagnet.io/
MIT License

Document disk space requirements #186

Open nodiscc opened 3 months ago

nodiscc commented 3 months ago

Is your feature request related to a problem? Please describe

The documentation should mention the expected disk space requirements when the DHT crawler is enabled, relative to the number of torrents indexed. This is by far the most demanding system requirement, and it is dictated by factors outside the user's control (the total size of the BitTorrent DHT... is there an estimate for this somewhere?).

Describe the solution you'd like

Document on https://bitmagnet.io/faq.html a few examples of DB sizes relative to the number of torrents, for example:

Disk space used by the database depends on the total number of indexed torrents and will tend to grow indefinitely. For example, these are the requirements you should expect for:

  • 50k indexed torrents: ???GB
  • 143k indexed torrents: 2.5GB
  • 1 million indexed torrents: ???GB

The value of 2.5GB for 143k torrents is from a measurement on my test instance. This puts the average size of a torrent at ~18KB. It would be interesting to see the numbers from instances with a lower/higher number of indexed torrents, and use that as an estimate.
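
If it helps to gather more samples, the measurement can be taken with standard Postgres functions. This is only a sketch, assuming the default bitmagnet database name and the torrents table that appears elsewhere in this thread:

-- total database size, human-readable
select pg_size_pretty(pg_database_size('bitmagnet'));

-- rough average bytes per indexed torrent (integer division, good enough for an estimate)
select pg_database_size('bitmagnet') / count(*) as avg_bytes_per_torrent from torrents;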

I will report a separate issue about a potential setting to hard-limit the DB size.

Describe alternatives you've considered

Documenting the expected disk space requirements relative to total run time, since the number of indexed torrents depends on the total time spent crawling.

Additional context

Somewhat related to https://github.com/bitmagnet-io/bitmagnet/issues/70 which would help keep the database size in check.

Nicolaj-H commented 3 months ago

At 884K indexed my Postgres data/base directory is at 6.7G

DyonR commented 3 months ago

One important thing to note is that the size is heavily influenced by the amount of file data you store. Example:
If you store 10,000 torrents and only 1 filename for each torrent, that's an additional 10,000 records, for a total of 20,000. Now change this to 100 filenames per torrent and your database has 1,010,000 records instead of 20,000. Of course, any documented database sizes could assume the default configured file info limit.
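
To check your own ratio, something along these lines should work (a sketch assuming the torrents and torrent_files table names that appear elsewhere in this thread):

-- average number of stored file records per torrent
select (select count(*) from torrent_files)::numeric
     / (select count(*) from torrents) as files_per_torrent;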

mgdigital commented 3 months ago

Currently on the FAQ page we have:

You should allow roughly 50GB of disk space per 10 million torrents, which should suffice for several months of crawling, however there is no upper limit to how many torrents might ultimately be crawled.

I agree better documentation on this would be good - but at the moment things are changing at a rapid pace in ways that will affect disk space usage, and we're only just getting to the stage where people have had it running long enough to produce reliable numbers for the current implementation. The next thing will be rule-based workflows that can auto-delete and do other things that will affect this - so maybe we should come back to this in a few months and aim to write better docs once things are more stable?

Nicolaj-H commented 3 months ago

One important thing to note is that the size is heavily influenced by the amount of file data you store. Example: If you store 10,000 torrents and only 1 filename for each torrent, that's an additional 10,000 records, for a total of 20,000. Now change this to 100 filenames per torrent and your database has 1,010,000 records instead of 20,000. Of course, any documented database sizes could assume the default configured file info limit.

That is correct, but an estimate with the default settings would suffice to start with, and it would give you an idea of what hardware you would need to at least test and play around with it.

nodiscc commented 3 months ago

At 884K indexed my Postgres data/base directory is at 6.7G

This puts the average torrent size at 6.7×1024×1024÷884000 ≈ 8KiB, vs my estimated 18KiB.

that the size is heavily influenced by the amount of file data you store

Yes, hence the need to use averages, which are useful for estimation. People with a higher number of indexed torrents should have averages closer to reality, with less bias. It would also be interesting to compare databases with about the same number of torrents.
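
To see where the space actually goes when comparing databases, a per-table breakdown can help. This is a sketch using only standard Postgres statistics views, nothing bitmagnet-specific:

-- largest tables first, including indexes and TOAST data
select relname, pg_size_pretty(pg_total_relation_size(relid)) as total_size
from pg_stat_user_tables
order by pg_total_relation_size(relid) desc
limit 10;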

there is no upper limit to how many torrents might ultimately be crawled.

That was my guess, hence https://github.com/bitmagnet-io/bitmagnet/issues/187

https://bitmagnet.io/faq.html#what-are-the-system-requirements-for-bitmagnet

I did not see this section - it was right under my nose /facepalm. However:

roughly 50GB of disk space per 10 million torrents

This puts the average torrent size at 5.2KiB... why is there so much difference between our three measurements? I think more samples are needed.

mgdigital commented 3 months ago

I've checked just now and am on 67GB for 13.5 million torrents. A couple of things to bear in mind:

  • more space is used at the start - once most of the popular stuff from TMDB is stored locally this should level off
  • are we measuring the same way?

kde99 commented 2 months ago

For me:

bitmagnet=# select pg_size_pretty(pg_database_size('bitmagnet'));
 pg_size_pretty
----------------
 4145 MB
(1 row)

bitmagnet=# select count(*) from torrents;
 count
--------
 528570
(1 row)

bitmagnet=# select count(*) from torrent_files;
  count
---------
 7086417
(1 row)

Meaning:

  • ~7.8KB per torrent
  • ~13 files per torrent

Though I feel that disk IO throughput is more of a limiting factor than disk size when you use HDDs. I had a much bigger DB and was struggling to keep up with writes.

Aaron2550 commented 2 months ago

I'm at 78 GB for 7,059,136 torrents

nodiscc commented 2 months ago

more space is used at the start - once most of the popular stuff from TMDB is stored locally this should level off

I did not think about that; there is some database space used for TMDB data.

Are we measuring the same way?

I was relying on netdata's postgresql db size monitoring, but it's consistent with the results I get from select pg_size_pretty(pg_database_size('bitmagnet'))

Thanks everyone for the metrics. I will start a table below and update it every time someone posts their db stats. Hopefully, after a while, it can be added to the documentation.

| number of torrents | db size (GB) | average per torrent (KB) | notes |
|--------------------|--------------|--------------------------|-------|
| 143 000 | 2.5 | 17 | |
| 528 000 | 4.1 | 7.8 | |
| 884 000 | 6.7 | 7.6 | |
| 985 847 | 15.7 | 15.9 | DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000 |
| 7 059 136 | 78 | 11 | |
| 9 228 000 | 145 | 15.7 | DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000 |
| 13 500 000 | 67 | 5 | |

leofidus commented 2 months ago

To add another data point, I have 9 228 000 torrents with a total of 291 283 000 files, stored in 145 GB, using the config option DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000 (to ensure file information is stored even on excessively large torrents; the default cutoff is to store at most 100 files per torrent).

This means I have 31 files per torrent on average, over twice what kde99 got above. The largest torrent in my database contains 10870 files. 4.5% of torrents exceed the default DHT_CRAWLER_SAVE_FILES_THRESHOLD of 100 files.

Average size per torrent is correspondingly a bit larger at 16KB/torrent, or 535 bytes per file.
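
For anyone who wants the same breakdown for their own database, here is a sketch of the kind of query that produces these numbers (assuming torrent_files keys on an info_hash column, which is a guess on my part):

-- files-per-torrent distribution: largest torrent, average, share over the 100-file default
select max(n) as largest_torrent_files,
       round(avg(n), 1) as avg_files_per_torrent,
       round(100.0 * count(*) filter (where n > 100) / count(*), 1) as pct_over_100_files
from (select info_hash, count(*) as n
      from torrent_files
      group by info_hash) t;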

I agree that disk throughput is a much bigger factor. If you are using cheap consumer SSDs you also really feel the wear Bitmagnet puts on the disk. If I'm interpreting my disk stats correctly, Bitmagnet has written a total of about 180TB in service of creating this 145GB database.
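
If you'd rather get a write-volume figure from Postgres itself instead of whole-disk stats, the background writer counters give a rough lower bound on PostgreSQL 16 and earlier (a sketch assuming the default 8KB block size; it excludes WAL and any filesystem-level write amplification):

-- approximate bytes flushed through shared buffers since the stats were last reset
select pg_size_pretty(
         (buffers_checkpoint + buffers_clean + buffers_backend) * 8192
       ) as buffers_written
from pg_stat_bgwriter;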

orzFly commented 1 month ago

I have 985 847 torrents with a total of 29 345 871 files, stored in 15 736 869 347 bytes, using the config option DHT_CRAWLER_SAVE_FILES_THRESHOLD=500000, giving an average size of 15 962 bytes per torrent.