JohnDoee / autotorrent2

Cross-seed matching and torrent lifecycle tool
https://johndoee.github.io/autotorrent2/
MIT License

Performance - Perform a multiline insert into the SQLite DB or only insert newly found files #14

Closed. newadventure079 closed this issue 2 years ago

newadventure079 commented 2 years ago

DB calls can be "expensive", so rather than performing inserts one at a time for the files that are found (which is the slowest method), a multi-row insert would yield a tremendous performance improvement.

Instead of a single INSERT OR IGNORE INTO files (name, path, size, normalized_name) VALUES (?, ?, ?, ?) SQL statement, you could insert multiple rows at once:

INSERT OR IGNORE INTO files (name, path, size, normalized_name) VALUES
(?, ?, ?, ?),
(?, ?, ?, ?),
(?, ?, ?, ?),
(?, ?, ?, ?)

You could batch the inserts every 1,000 or 5,000 rows or whatever.

https://stackoverflow.com/questions/1609637/is-it-possible-to-insert-multiple-rows-at-a-time-in-an-sqlite-database
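As a rough sketch of the batching idea in Python (not autotorrent2's actual code; the helper name, batch size and connection handling are made up for illustration), executemany inside a single transaction has the same effect as a hand-built multi-row VALUES list:

import sqlite3

BATCH_SIZE = 1000  # illustrative; tune to 1,000/5,000/whatever

def insert_files(conn, rows):
    """rows: iterable of (name, path, size, normalized_name) tuples."""
    sql = (
        "INSERT OR IGNORE INTO files (name, path, size, normalized_name) "
        "VALUES (?, ?, ?, ?)"
    )
    batch = []
    with conn:  # one transaction for the whole call
        for row in rows:
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                conn.executemany(sql, batch)
                batch.clear()
        if batch:
            conn.executemany(sql, batch)  # flush the remainder

# insert_files(sqlite3.connect("autotorrent.db"), scanned_rows)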

Inserting one row at a time might be why my scan times went from 10 mins to 60 mins

Or you could insert only new files into the DB. You would SELECT everything out of the DB once and put it into a dict (a dict key lookup is much faster than running a SELECT for each file comparison). Then, for each file you come across, compare the filename, file path and file size; if it's in the dict, it's already in the DB and you don't need to perform an insert. With this method only brand new files are inserted, which would cut down on the number of inserts dramatically.
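A sketch of that dict/set approach (the helper name is made up; the key is the same name/path/size triple described above):

def find_new_files(conn, scanned_files):
    """scanned_files: iterable of (name, path, size, normalized_name) tuples."""
    # Pull the existing rows once; a set membership test is much cheaper
    # than running one SELECT per scanned file.
    known = set(conn.execute("SELECT name, path, size FROM files"))
    return [
        row for row in scanned_files
        if (row[0], row[1], row[2]) not in known  # filename, filepath, filesize
    ]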

newadventure079 commented 2 years ago

I just noticed there are tons of these calls too when adding a torrent:

DEBUG:autotorrent.db:179:Doing query: 'SELECT name, path, size, normalized_name, unsplitable_root FROM files WHERE normalized_name = ? AND size = ?' with args: ['0293.jpg', 189127]
DEBUG:autotorrent.db:179:Doing query: 'SELECT name, path, size, normalized_name, unsplitable_root FROM files WHERE normalized_name = ? AND size = ?' with args: ['0294.jpg', 212266]
DEBUG:autotorrent.db:179:Doing query: 'SELECT name, path, size, normalized_name, unsplitable_root FROM files WHERE normalized_name = ? AND size = ?' with args: ['0295.jpg', 212786]
DEBUG:autotorrent.db:179:Doing query: 'SELECT name, path, size, normalized_name, unsplitable_root FROM files WHERE normalized_name = ? AND size = ?' with args: ['0296.jpg', 218251]
DEBUG:autotorrent.db:179:Doing query: 'SELECT name, path, size, normalized_name, unsplitable_root FROM files WHERE normalized_name = ? AND size = ?' with args: ['0297.jpg', 194618]

These could also be grabbed with a single multi-row SELECT per torrent instead of one query per file encountered.

Basically, you don't want to run a SQL statement per iteration inside a loop. It's a recipe for poor performance.
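For example, the per-file lookups could be collected per torrent and sent as one query. This is only a sketch: the helper name is made up, it relies on SQLite 3.15+ row-value support, and the batch has to stay under SQLite's bound-variable limit.

def lookup_torrent_files(conn, wanted):
    """wanted: list of (normalized_name, size) pairs from a single torrent."""
    placeholders = ", ".join(["(?, ?)"] * len(wanted))
    query = (
        "SELECT name, path, size, normalized_name, unsplitable_root FROM files "
        "WHERE (normalized_name, size) IN (VALUES " + placeholders + ")"
    )
    args = [value for pair in wanted for value in pair]
    return conn.execute(query, args).fetchall()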

JohnDoee commented 2 years ago

I offload both the multi-row insert and the new-file discovery to SQLite instead of doing them by hand in Python.

sqlite3 in Python has auto-commit disabled by default, and I commit every N queries, where N is defined at https://github.com/JohnDoee/autotorrent2/blob/master/src/autotorrent/db.py#L11. That has the same effect as the multi-row insert you described. You can try setting it to 1 and see how fast it is.
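Roughly speaking, the existing behaviour amounts to this pattern (a simplification for illustration, not the real db.py code; the constant stands in for the value linked above):

COMMIT_INTERVAL = 10000  # stand-in for the constant linked above

def insert_one_by_one(conn, rows):
    sql = (
        "INSERT OR IGNORE INTO files (name, path, size, normalized_name) "
        "VALUES (?, ?, ?, ?)"
    )
    for i, row in enumerate(rows, 1):
        conn.execute(sql, row)   # single-row INSERT...
        if i % COMMIT_INTERVAL == 0:
            conn.commit()        # ...but only one transaction per 10,000 rows
    conn.commit()

Because sqlite3 opens a transaction implicitly and only ends it on conn.commit(), all the single-row INSERTs between commits share one transaction, which is where most of the cost of one-at-a-time inserts normally comes from.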

There are two things I think slow down your scan.

Currently a normal scan does not have deleted-file support either; to add that, I'd have to add another thing I wanted to avoid: the hash bucket queue stuff.

flowchart TD

    A[Disk scanner /mnt/sda] -->|Put paths in queue| C(Path hasher)
    B[Disk scanner /mnt/sdb] --> |Put paths in queue| C
    C --> |Hash each path and put in a bucket depending on prefix| C
    C --> |Flush a bucket when it is too full| D(Database saver)
    D --> |Pull current bucket for hash prefix from DB and find differences| D
    D --> |Update changes to database| E(Database)

Each box in the flowchart is its own thread, and there is a queue between each of them. This is both memory efficient and very speedy. Because you know the differences, you only have to pull the file size for the new files (and assume files do not change in size).

This is the algorithm I used in another project.
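A rough sketch of that pipeline, assuming nothing about autotorrent2's real schema (the path_hash column, bucket size and helper names are all made up for illustration):

import hashlib
import os
import queue

path_queue = queue.Queue(maxsize=10000)   # disk scanners -> path hasher
bucket_queue = queue.Queue(maxsize=100)   # path hasher -> database saver
BUCKET_FLUSH_SIZE = 5000                  # flush a bucket when it gets this big

def disk_scanner(root):
    """One thread per mount point, e.g. /mnt/sda and /mnt/sdb."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path_queue.put(os.path.join(dirpath, name))
    path_queue.put(None)  # tell the hasher this scanner is done

def path_hasher(scanner_count):
    """Hash each path and put it in a bucket depending on the hash prefix."""
    buckets = {}
    finished_scanners = 0
    while finished_scanners < scanner_count:
        path = path_queue.get()
        if path is None:
            finished_scanners += 1
            continue
        digest = hashlib.sha1(path.encode("utf-8", "surrogateescape")).hexdigest()
        prefix = digest[:2]  # 256 buckets
        buckets.setdefault(prefix, []).append((digest, path))
        if len(buckets[prefix]) >= BUCKET_FLUSH_SIZE:
            bucket_queue.put((prefix, buckets.pop(prefix)))
    for prefix, bucket in buckets.items():  # flush whatever is left
        bucket_queue.put((prefix, bucket))
    bucket_queue.put(None)

def database_saver(conn):
    """Pull the stored bucket for each prefix and only insert the differences."""
    while True:
        item = bucket_queue.get()
        if item is None:
            break
        prefix, bucket = item
        stored = {
            row[0]
            for row in conn.execute(
                "SELECT path_hash FROM files WHERE path_hash LIKE ?", (prefix + "%",)
            )
        }
        new_files = [(d, p) for d, p in bucket if d not in stored]
        conn.executemany(
            "INSERT INTO files (path_hash, path, size) VALUES (?, ?, ?)",
            [(d, p, os.path.getsize(p)) for d, p in new_files],  # stat only new files
        )
        conn.commit()

Each of these functions would run in its own thread (with the sqlite3 connection owned by the database-saver thread), and the reverse difference per bucket, stored rows that were not scanned, is where deleted-file support would come from.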

In contrast, the current setup looks like this:

flowchart LR

A[Scan disks] --> |Put paths in database transaction| C(Database)
C --> |Flush database every 10000 queries|C

Which is very simple.

To me it seems more efficient to let users optimize on their own end with at2 scan -p /mnt/data/new-data-here and by deleting torrents where there is no expectation of a match, i.e. keeping your torrent file folder slim.

What I am working on is a feature where you can compare a whole torrent site with all your local data and find potential torrents to seed, i.e. intersect 1 million+ local files with 1 million+ torrents in an efficient way.
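One way such an intersection could be done without a per-file query, purely as an illustration and not the planned implementation: load the torrent file list into a temp table and let SQLite do a single indexed join.

def match_site_dump(conn, torrent_files):
    """torrent_files: iterable of (torrent_id, normalized_name, size) tuples."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS torrent_files "
        "(torrent_id TEXT, normalized_name TEXT, size INTEGER)"
    )
    conn.execute("DELETE FROM torrent_files")
    conn.executemany(
        "INSERT INTO torrent_files (torrent_id, normalized_name, size) "
        "VALUES (?, ?, ?)",
        torrent_files,
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_torrent_files "
        "ON torrent_files (normalized_name, size)"
    )
    # One join instead of millions of per-file lookups.
    return conn.execute(
        "SELECT t.torrent_id, f.path FROM torrent_files t "
        "JOIN files f ON f.normalized_name = t.normalized_name AND f.size = t.size"
    ).fetchall()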

DEBUG:autotorrent.db:179:Doing query: 'SELECT name, path, size, normalized_name, unsplitable_root FROM files WHERE normalized_name = ? AND size = ?' with args: ['0293.jpg', 189127]

When I benchmark the sqlite3 queries I make, each query takes about 0.005 ms. It is of course a benchmark and not a completely real workload, but the actual query time is negligible. That is against a database with 80k files, though.
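For reference, this kind of per-query number can be reproduced with a quick standalone benchmark (the 80k-row table here is synthetic, so the result will only be in the same ballpark):

import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (normalized_name TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    (("%05d.jpg" % i, i) for i in range(80000)),
)
conn.execute("CREATE INDEX idx_files ON files (normalized_name, size)")
conn.commit()

query = "SELECT * FROM files WHERE normalized_name = ? AND size = ?"
runs = 10000
total = timeit.timeit(
    lambda: conn.execute(query, ("40000.jpg", 40000)).fetchall(), number=runs
)
print("%.4f ms per query" % (total / runs * 1000))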

TL;DR: I might put the path scans in threads, but all the other stuff feels like too much.

newadventure079 commented 2 years ago

OK, sounds good. Thanks for all your great work. What I can also do, if I add a new folder, is run at2 scan -p /mnt/data/new-data-here instead of a full data scan. I had forgotten the -p switch was added to at2.

JohnDoee commented 2 years ago

I moved the disk scan to a threaded design. It was fast to do and I hope that it'll speed up the scan.

If you can test the performance of the current master branch then I would be very happy.

newadventure079 commented 2 years ago

I tested it and it was 10 minutes slower. It feels faster, though. My server is busy right now (although I ran both the 1.0.3 and master tests while it was busy), so it might take a few days to get a more accurate result. If you tested it and it's faster, I'm sure it'll be fine.

JohnDoee commented 2 years ago

Added the thread change in 1.1.0

Most of the performance gains, for me, were in the multi-disk scan. In my tests ls -R was only a little bit faster than both the old and new scan, and that one doesn't do stat as far as I know.