eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources, with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Suggestion - Add API for files in processing queue by checksum #1627

Open Snify89 opened 2 years ago

Snify89 commented 2 years ago

When uploading a bunch of files to docspell, you can use the "checkfile-checksum-by-id" API to see if a document is already in docspell (i.e. processed). However, during the upload there may be files that are still queued for processing. For those, the "checkfile-checksum-by-id" API returns false, so a possible duplicate document (one that is currently processing or queued) can still be uploaded as long as the corresponding item is not completely processed.

Checking beforehand whether a checksum is already queued or processed would prevent uploading the same file multiple times.
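
For illustration, a rough sketch of the pre-upload check described above (Python; the endpoint path, auth header and response field are assumptions based on the API name mentioned here, not verified against the actual docspell API):

```python
import hashlib

import requests

BASE_URL = "http://localhost:7880"   # docspell restserver (assumed default port)
AUTH_TOKEN = "..."                   # session token obtained from a prior login call

def sha256_of(path: str) -> str:
    """Hash the file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def already_an_item(path: str) -> bool:
    """Ask docspell whether an *item* with this checksum already exists.

    Files that were uploaded but are still in the processing queue are not
    reported by this check - that is exactly the gap this issue is about.
    """
    checksum = sha256_of(path)
    r = requests.get(
        f"{BASE_URL}/api/v1/sec/checkfile/{checksum}",   # assumed endpoint path
        headers={"X-Docspell-Auth": AUTH_TOKEN},
    )
    r.raise_for_status()
    return bool(r.json().get("exists"))
```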

eikek commented 2 years ago

Here I'm not so sure how to do it. The queue currently doesn't really know what kind of jobs are executing (it is possible, of course, but might not be so easy). Another option could be to preprocess the files with a tool like fdupes or czkawka, or to move this check into the client…

Snify89 commented 2 years ago

As soon as Docspell receives any document, it should be hashed first. Then the document is either processed right away (no matter how) or placed in some queue for later processing (no matter when). In other words, the document is either on its way to becoming an item/attachment or it gets "discarded" as a duplicate. I don't care what the processing does or doesn't do; I am only interested in whether the checksum of a file is currently being processed or queued (or not).

Just think of a folder with 5 big files that all have the same hash, uploaded one by one. I can check beforehand whether their checksum already belongs to an item, which is fine. But while the first file is still being processed (and is not an item yet), the identical second file gets uploaded, and so on. So in the worst case, 5 files with the same hash are being processed until one of the jobs finally becomes an item.

This shouldn't be too hard to implement?! Another API endpoint that queries a table with the saved hashes of uploaded files still in the queue?! I might be wrong though... No offense ^_^
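
Something along these lines (a very rough sketch of the proposal only, not docspell code, which is Scala; the table and column names are made up, and `db.exists` is a hypothetical helper):

```python
def checksum_known(db, collective: str, checksum: str) -> bool:
    """Proposed check: a checksum counts as 'known' if it already belongs to
    an item/attachment OR to an uploaded file whose job has not finished."""
    # 1) checksum already belongs to a processed item/attachment
    if db.exists(
        "SELECT 1 FROM attachment_source WHERE collective = ? AND checksum = ?",
        (collective, checksum),
    ):
        return True
    # 2) checksum belongs to an uploaded file that is still waiting or running
    return db.exists(
        "SELECT 1 FROM upload_queue WHERE collective = ? AND checksum = ?"
        " AND state IN ('waiting', 'running')",
        (collective, checksum),
    )
```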

Of course, there is still the option to clear duplicates first, etc. But with the joex "nodes" architecture, this could be useful for any custom docspell client.

Also, joex should internally avoid processing more than one file with the same hash and keep track of them.

eikek commented 2 years ago

To me it doesn't make much sense to send 5 large files over the network only to drop 4 of them; that can be done perfectly fine before sending them. Also, hashing is a very expensive operation: it requires reading the entire file. Currently the hash is indeed computed on upload, but I wanted to move this into the job executor. It is not good when files get large.

Currently the job also does a duplicate check. So even if there are 5 identical files in the queue, only one will survive as long as they are processed sequentially.

Snify89 commented 2 years ago

The large files were only meant as a timing example; the size doesn't matter here, as long as an upload is quicker than the processing itself. I would first hash any file that arrives in Docspell (no matter the source), then look up whether that hash is already being processed, and discard the queued duplicate if the same hash is found or if the file is already an item/attachment (or something along those lines).

Currently: If you upload 2 NEW files with the same hash (but different filenames) via the UI, both get processed and both get added to Docspell with different item IDs

Btw. I use pool-size = 2 at the moment.

eikek commented 2 years ago

Currently: If you upload 2 NEW files with the same hash (but different filenames) via the UI, both get processed and both get added to Docspell with different item IDs

Btw. I use pool-size = 2 at the moment.

It won't if you use pool-size=1 :)

The size of a file matters resource-wise. The upload would take longer (waiting for the checksum to be computed) and block the client. You would also still have this race condition if you upload two files in parallel.
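
A toy illustration of that kind of check-then-act race (nothing docspell-specific): two parallel tasks each check for a duplicate and then create an item, and both pass the check before either item exists. Sequential processing (pool-size=1) avoids this because the second check sees the first item:

```python
import threading
import time

existing_items = set()        # stands in for items already in the database

def process(checksum: str, results: list) -> None:
    if checksum in existing_items:      # duplicate check at the start of the job
        results.append("skipped duplicate")
        return
    time.sleep(0.1)                     # stands in for the actual processing work
    existing_items.add(checksum)        # the file finally becomes an item
    results.append("created item")

results: list = []
workers = [threading.Thread(target=process, args=("abc123", results)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(results)   # with two parallel workers, both report "created item"
```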

An uploaded file is likely to stay, so it is stored first (it must be stored anyway to compute the checksum), and then a job is submitted that will eventually do all the processing, including a duplicate check based on the checksum; it might remove the file in the process. Currently that check doesn't look into the queue, because, due to how the queue is constructed, it is not very efficient to do this at that point.

To me the use case described here still sounds more like a bulk upload that you probably do only once (or not often) at the beginning? I think the far more common case is not uploading the same files, so I wouldn't want to wait for the checksum in all cases. It is safer and faster to remove your duplicates before doing the bulk upload. Tools like the ones mentioned above are pretty good at that; you won't suffer from race conditions this way, and you also don't transfer files over the network that are already known to be duplicates.
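
For the bulk-upload case, a small sketch of that client-side approach (hash locally, skip duplicates, then upload; the upload URL is an assumption, e.g. a docspell "source" upload endpoint):

```python
import hashlib
import os

import requests

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def unique_files(folder: str):
    """Yield one file per checksum; local duplicates never hit the network."""
    seen = set()
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        digest = sha256_of(path)
        if digest in seen:
            continue
        seen.add(digest)
        yield path

def bulk_upload(folder: str, upload_url: str) -> None:
    for path in unique_files(folder):
        with open(path, "rb") as f:
            requests.post(upload_url, files={"file": f}).raise_for_status()
```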