bitmagnet-io / bitmagnet

A self-hosted BitTorrent indexer, DHT crawler, content classifier and torrent search engine with web UI, GraphQL API and Servarr stack integration.
https://bitmagnet.io/
MIT License

Database size optimizations #191

Open ioqy opened 3 months ago

ioqy commented 3 months ago

Is your feature request related to a problem? Please describe

While having a look at the database, I found that using a different foreign key (or, in some cases, any foreign key at all) could save more than 10% of disk space.

Describe the solution you'd like

Every table that has a relation to a torrent is linked via the info hash. I added an ID column of type bigserial (or bigint) to the torrents table, used the new ID as the foreign key in the tables torrent_files and torrents_torrent_sources, and removed the info_hash column from those tables (keeping it only in torrents, of course). This saved more than 12% of space in a test database I converted.
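A minimal sketch of what that migration could look like (column and constraint names are my own guesses, not bitmagnet's actual schema):

```sql
-- Add a compact 8-byte surrogate key to torrents.
ALTER TABLE torrents ADD COLUMN id bigserial UNIQUE;

-- Backfill the new key into a child table, then drop the
-- 20-byte info_hash column it used to carry.
ALTER TABLE torrent_files ADD COLUMN torrent_id bigint;

UPDATE torrent_files tf
   SET torrent_id = t.id
  FROM torrents t
 WHERE t.info_hash = tf.info_hash;

ALTER TABLE torrent_files
  ALTER COLUMN torrent_id SET NOT NULL,
  ADD FOREIGN KEY (torrent_id) REFERENCES torrents (id),
  DROP COLUMN info_hash;
```

The same backfill-and-drop would be repeated for torrents_torrent_sources and any other table keyed on the info hash.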

| table | records | table size before [GB] | table size after [GB] | indexes size before [GB] | indexes size after [GB] | difference [GB] | difference [%] |
|---|---|---|---|---|---|---|---|
| torrents | 6,292,151 | 0.94 | 0.99 | 0.932 | 1.04 | 0.16 | +8.44 % |
| torrent_files | 30,949,330 | 3.9 | 3.45 | 4.82 | 3.86 | -1.41 | -16.17 % |
| torrents_torrent_sources | 6,531,913 | 2.58 | 2.55 | 0.782 | 0.363 | -0.45 | -13.36 % |
| **total** | 43,773,394 | 7.42 | 6.99 | 6.534 | 5.263 | -1.70 | -12.19 % |
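As a rough sanity check on those numbers (my own estimate, not from the measurements above): a v1 info hash is 20 bytes and a bigint is 8, and each torrent_files row carries the hash three times (once in the column, twice in indexes), so the expected saving is about 36 bytes per row:

```sql
-- Back-of-envelope: 30,949,330 rows × 3 copies × (20 − 8) bytes
-- ≈ 1.04 GiB, in the same ballpark as the observed 1.41 GB
-- (indexes also shed per-page overhead, so the real saving is larger).
SELECT pg_size_pretty(30949330::bigint * 3 * (20 - 8));
```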

Of course the info_hash field should also be replaced in the other tables I didn't mention.

Maybe the info-hash in the field torrent_contents.id could also be replaced by the ID.

Additional context

Using a small foreign key column and storing the text/data in a separate table could also help other tables that store text which may seem small because it's only a few characters — the same text repeated a few million times adds up quickly.

Some other fields that I haven't tried, but which could be replaced with a smallserial or smallint foreign key (which takes up only 2 bytes; the maximum value of 32,767 should suffice because these fields have very few unique values):

Maybe even torrent_hints.content_id with a normal integer, because there are more unique values.

I don't know how the bloom filters work, but maybe the unique values of the fields torrents_torrent_sources.bfsd and torrents_torrent_sources.bfpe could also be stored in a separate table and only be referenced by a foreign key. In my database the table torrents_torrent_sources has 6,531,913 records and the two bloom filter fields have only 20,513 and 19,745 unique values, which looks like a lot of duplicates to me.
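One way that deduplication could look — a sketch assuming bfsd/bfpe are plain binary columns; table and column names here are invented:

```sql
-- Store each distinct bloom filter exactly once.
CREATE TABLE bloom_filters (
  id     serial PRIMARY KEY,
  filter bytea NOT NULL UNIQUE
);

-- Reference the shared filters with 4-byte keys instead of
-- repeating the filter bytes on every source row.
ALTER TABLE torrents_torrent_sources
  ADD COLUMN bfsd_id integer REFERENCES bloom_filters (id),
  ADD COLUMN bfpe_id integer REFERENCES bloom_filters (id);
```

With ~20k distinct filters across 6.5M rows, each filter's bytes would then be stored once instead of ~300 times on average.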

mgdigital commented 3 months ago

Hi, changing the primary key for torrents to anything other than the info hash would be a complex rework with several downsides I can think of — the main one being that the info hash is a globally recognised ID (unlike an auto-incrementing bigint), so it can be used across a distributed system for things like deduplicating/merging data.

So there would have to be a lot of value-add to justify this, and I'm not convinced 10% saved disk space is worth it.

You've suggested a few other optimisations, and I think each would need to be discussed on its own merits, with an understanding of the dev work it would entail. If there is any low-hanging fruit where we can make a big improvement with minimal impact, those are the kinds of optimisations we'd want to find.

ioqy commented 3 months ago

Shouldn't the info hash be more or less an additional INNER JOIN torrents ON ... for all queries?

It was more than 12% on my test database, and depending on the number of files per torrent it could be more, since each torrent_files row stores the info hash 3 times (once in the column and twice in indexes). So for torrents with many files this would save a lot more. And I guess when storing the torrent pieces it would be an even bigger improvement, because there would be even more records per torrent.

IMHO not storing the same "large" value multiple times in many tables is a low hanging fruit and I would be surprised if there could be more space saved with less work.

mgdigital commented 2 months ago

Shouldn't the info hash be more or less an additional INNER JOIN torrents ON ... for all queries?

Yes, this is possible, but it would require updating nearly every single query in the application, and always having to join to that table would incur a performance hit. So it would be a ton of work for a marginal disk space saving and a reduction in performance.
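For illustration, a lookup that today filters torrent_files directly by hash would become a two-table query (table/column names assumed, hash value is a placeholder):

```sql
-- Before: info_hash lives on torrent_files itself.
SELECT f.path, f.size
  FROM torrent_files f
 WHERE f.info_hash = $1;

-- After: every such query needs an extra join through torrents
-- to resolve the hash to the surrogate key.
SELECT f.path, f.size
  FROM torrent_files f
  JOIN torrents t ON t.id = f.torrent_id
 WHERE t.info_hash = $1;
```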

I'd be surprised if 12% disk space couldn't be saved with much less disruptive changes, but at the moment performance improvements are more valuable to find than disk space savings.

rraymondgh commented 2 weeks ago

One thing I've done is partition torrent_files, as I have very limited SSD space. This allows moving older data (based on created_at) to a tablespace located on an HDD. Current testing shows this has limited impact on performance, with the benefit of moving ~30% of storage to the HDD, where space is not limited. I'll continue testing and possibly propose this as a solution after running this setup for 3 or 4 weeks (at week 2 at the moment). I will also look at torrent_contents. My database is ~4M torrents currently and is growing by ~150k a day.
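A sketch of that setup (assuming monthly range partitions on created_at, a simplified column list, and a tablespace named hdd — not the exact commands used above):

```sql
-- Tablespace backed by the cheap/large disk.
CREATE TABLESPACE hdd LOCATION '/mnt/hdd/pgdata';

-- Declarative range partitioning on created_at.
CREATE TABLE torrent_files (
  info_hash  bytea       NOT NULL,
  path       text        NOT NULL,
  size       bigint      NOT NULL,
  created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- Recent data stays on SSD; older partitions are moved
-- to the HDD tablespace once they go cold.
CREATE TABLE torrent_files_2024_01
  PARTITION OF torrent_files
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

ALTER TABLE torrent_files_2024_01 SET TABLESPACE hdd;
```

Queries that filter on created_at would then also benefit from partition pruning, which may explain the limited performance impact observed.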