JorenSix / Panako

The Panako acoustic fingerprinting system.
GNU Affero General Public License v3.0

Should the DB Index be in RAM? Big collection - multiple indexes? #23

Closed james-cook closed 2 years ago

james-cook commented 2 years ago

I am trying to index a very large collection of audio programmes (90000 items, ca. 1/2 hr per programme). This collection has grown over 20 years, and my intention is to weed out the many duplicates.

I have started indexing using the standard settings. The olaf db and cache are at ca. 10GB each for the 4500 files analysed so far. This suggests an overall index size of 200GB each. (I know reducing the sample rate would reduce the index size.)
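The 200GB figure follows from a straight linear extrapolation of the numbers above. A minimal sketch of that arithmetic, assuming index size grows in proportion to the number of files (the function name is only illustrative):

```python
def estimate_index_size_gb(size_so_far_gb, files_so_far, total_files):
    """Linearly extrapolate the final index size from a partial indexing run.

    Assumes every file contributes roughly the same amount of index data,
    which holds only if the programmes are of similar length.
    """
    return size_so_far_gb * total_files / files_so_far


# Figures from the post: ca. 10GB after 4500 of 90000 files.
estimate = estimate_index_size_gb(10, 4500, 90000)
print(f"Estimated index size: {estimate:.0f} GB")
```

The same estimate applies separately to the db and to the cache, since both were at ca. 10GB.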

I have 16GB RAM on the current server, with the same again as swap. The server has about 2TB free space left.

My question concerns querying/building the index. I can imagine it should/must be in RAM - is this correct?

If this is the case I was thinking of breaking the index into fragments of ca. 10GB, i.e. scan until the index reaches ca. 10GB, then stop and restart with a fresh index for the next load of files (at the current rate, about 4500 files per index). I would slightly change the config before each run of panako to make this happen. This also has the "advantage" of being runnable on multiple machines (all accessing the same directory but different selections of files). The disadvantage would be having to query against 20 or so indexes (though this can be automated using bash).
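The batching part of this plan can be sketched independently of panako itself. The snippet below only splits a file list into fixed-size groups, one group per index; the file names are hypothetical placeholders, and how each batch is then handed to panako (and with which config) is deliberately left out:

```python
def split_into_batches(paths, batch_size):
    """Split a list of file paths into consecutive batches of at most batch_size."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


# Hypothetical file names standing in for the real collection of 90000 items.
paths = [f"programme_{n:05d}.mp3" for n in range(90000)]

# About 4500 files per index, as in the post.
batches = split_into_batches(paths, 4500)
print(f"{len(batches)} batches, first batch has {len(batches[0])} files")
```

Each batch could be written to its own list file so that several machines can work on different batches without overlap.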

Would this approach make sense? Or are my assumptions about RAM and the index incorrect, and an index file of 200GB is fine?

Thanks for any insights

JorenSix commented 2 years ago

Hey,

Only a small part of the index is cached in RAM. With a fast SSD, query speeds should still be fine even with such a large index.

Note that I have never tested 45 000h of audio in a single index (I did go to about 7000h), but since the index is a tree-based structure (and search is O(log n)) the additional hours should not be a problem.
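To get a feel for why logarithmic search scales well here, the rough calculation below compares the tested 7000h with the planned 45 000h. It assumes the number of index entries is proportional to hours of audio and counts levels as in a binary tree; a tree with a larger fan-out would add even fewer levels. This is an illustration of the log n argument, not a measurement of Panako.

```python
import math

hours_tested = 7000          # largest index size mentioned as tested
hours_target = 90000 * 0.5   # 90000 programmes of ca. 1/2 hr each

growth = hours_target / hours_tested

# With entries proportional to hours, a log2(n) lookup gains log2(growth)
# extra levels, regardless of how many entries one hour produces.
extra_levels = math.log2(growth)

print(f"Index grows by a factor of {growth:.1f}")
print(f"Extra binary-tree levels per lookup: {extra_levels:.1f}")
```

So a roughly sixfold larger index costs fewer than three additional comparison levels per lookup under these assumptions.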