LycheeOrg / Lychee

A great looking and easy-to-use photo-management-system you can run on your server, to manage and share photos.
https://lycheeorg.github.io/
MIT License
3.4k stars 301 forks

[Enhancement] Speed up server import process #1597

Open TheSeventhCode opened 1 year ago

TheSeventhCode commented 1 year ago

To give some context, I have a photo library of around 350k photos and videos. Luckily, Lychee can import via symlinks and skip duplicates, which makes it good for keeping these photos in just one location. The problem is the repeated importing. At least every two weeks, multiple new photos and videos are added to that library. To show them in Lychee, an import from the server has to be run, which skips duplicates and creates symlinks.

The problem is that checking for duplicates takes too long. With the current implementation, it would take hours upon hours of just going through everything to see if it's already present and in the same condition.

I'm not sure what exactly is done during the duplicate skip, but if it's just checking whether the file already exists, it seems to be quite inefficient at that. If it does more, can that “more” be disabled? I don't “change” already present content, so the only thing that has to be checked for is new file paths.

I hope there is some solution to this; otherwise, it makes it difficult to use this tool for libraries of this size.

ildyria commented 1 year ago

I'm not sure what exactly is done during the duplicate skip, but if it's just checking whether the file already exists, it seems to be quite inefficient at that.

The way we check for duplicates is straightforward: compute the hash of the picture, then search the DB for any collision. There is not much that can be done in that regard. :(
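A minimal command-line sketch of that two-step check, assuming the stored checksum is SHA-1 (Lychee's actual hash choice may differ):

```shell
# Sketch only, not Lychee's code. Assumption: checksums are SHA-1.
printf 'abc' > /tmp/photo.jpg                    # stand-in for a real image file
hash=$(sha1sum /tmp/photo.jpg | cut -d' ' -f1)   # step 1: hash the whole file
# step 2: a single lookup on the checksum column decides duplicate vs. new
echo "SELECT id FROM photos WHERE checksum = '$hash';"
```

Note that step 1 has to read every byte of the file, which is why large libraries pay the cost even when nothing changed.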

A lighter way to check would be to take the file name and verify that it does not already exist in the database. But this poses a risk with generic names like _R5_1234.jpg, where different photos can share a filename.

Proposition:

TheSeventhCode commented 1 year ago

That could work. The hashing is probably what slows the import down the most. But how exactly is the DB built? I'm sure some things there could be sped up as well. Even if a DB query were made for every image, if the complete file path (or something like that) is the primary key, or an index in general, that shouldn't take too long, would it?

ildyria commented 1 year ago

Each image in Lychee is associated with a row in the photos table in the database. On import, the checksum of the image is stored in the database (so that we can check for duplicates). And obviously the checksum column in that table is indexed to speed up the search.
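To illustrate why the index matters, here is a toy version using the sqlite3 CLI (Lychee itself goes through Laravel's query builder and the real photos table has many more columns):

```shell
# Toy photos table: an index on checksum turns each duplicate check into
# a cheap B-tree probe instead of a full table scan.
db=/tmp/photos_demo.db
rm -f "$db"
sqlite3 "$db" "
  CREATE TABLE photos (id INTEGER PRIMARY KEY, checksum TEXT);
  CREATE INDEX photos_checksum_index ON photos(checksum);
  INSERT INTO photos (checksum) VALUES ('a9993e3647');
  SELECT id FROM photos WHERE checksum = 'a9993e3647';
"
```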

There are multiple things that can take time:

  1. computing the hash of the image.
  2. executing the SQL query.

To give you an idea, here is an illustration of the time required to access data in different parts of your computer:

Cache L1 - There is a sandwich in front of you.
Cache L2 - Walk to the kitchen and make a sandwich.
RAM - Drive to the store, purchase sandwich fixings, drive home and make a sandwich.
HDD - Drive to the store. Purchase seeds. Grow seeds... harvest lettuce, wheat, etc. Make a sandwich.

Even if those two steps may seem relatively fast as single events, doing them one by one for every file will still be slow in the end.

Unless we use a different strategy for such processing, the only gains to be had are from optimizing the sequential process. If PHP could be multi-threaded, it would make this significantly easier, as we could use parallelism over the list of images.
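Just to show what parallelism over the file list would buy, here is an illustration outside of PHP (this is not Lychee's importer; it only parallelizes the hashing step with xargs):

```shell
# Hash a folder with 4 parallel workers. The DB lookups would still be
# sequential, but the CPU-bound hashing overlaps across cores.
mkdir -p /tmp/import_demo
printf 'one' > /tmp/import_demo/a.jpg
printf 'two' > /tmp/import_demo/b.jpg
find /tmp/import_demo -type f -print0 | xargs -0 -n1 -P4 sha1sum > /tmp/hashes.txt
wc -l < /tmp/hashes.txt   # one checksum line per file
```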

TheSeventhCode commented 1 year ago

Yeah, if multi-threading were available, I'm sure quite a few things could be sped up. How much do you think disk speed determines the import process? Like going from a normal SATA SSD to something like an M.2 NVMe drive.

In general, the different factors would be:

I wonder now what would have the most impact.

Babyforce commented 1 year ago

Hello, I just wanted to add my two cents about this issue. I have a similar problem where syncing the files takes forever, not because of their number (which I'm aware would increase the sync time no matter what) but because of the big files I host. I have quite a lot of files above 200 MB each, and the server's CPU is just an Intel Celeron. Computing the hash of those big files takes an enormous amount of time, so much that I would consider it a waste of time and energy (in these hard times when power costs much more).

I would think just checking the names instead of the checksum would save a huge amount of time and energy (while also preventing Lychee from finding real file changes, but my files are not subject to such changes, so I could afford that).

If accuracy is an issue, maybe saving the exact number of bytes per file in the database and comparing it against the real files would be more efficient while staying a bit more accurate? I would think it is extremely unlikely that a modified file would have the exact same number of bytes as the previous one (but maybe I'm wrong).
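The size comparison could look like this (a sketch of the idea, not an existing Lychee feature; `stat -c %s` is GNU coreutils, on macOS/BSD you would use `stat -f %z`):

```shell
# Reading st_size touches only file metadata, never the file contents,
# so it is nearly free compared to hashing a 200 MB file.
printf '12345' > /tmp/img_demo.jpg
size=$(stat -c %s /tmp/img_demo.jpg)
echo "size=$size bytes"   # would be compared against a size stored in the DB
```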

I started working on a dirty bash script for this purpose (using the API), and sadly I don't know PHP well enough to even try adding this functionality myself. Not having such a feature is really blocking for me, as I cannot start using Lychee at all unless I decide to add albums manually, which would be really tedious... Everything is hosted on a SATA SSD, so it should be fast enough.

bushibot commented 1 year ago

Yeah, I'm having trouble importing large folders (a thousand or two images). Every time the session times out or it otherwise hangs, it has to start over. Seems like it should be able to track what was last processed and only restart for new items? Or items added since the last run date?

ildyria commented 1 year ago

@bushibot Your use case would be better suited by using command line: https://github.com/LycheeOrg/Lychee/blob/master/app/Console/Commands/Sync.php

bushibot commented 1 year ago

@bushibot Your use case would be better suited by using command line: https://github.com/LycheeOrg/Lychee/blob/master/app/Console/Commands/Sync.php

I’d like to understand more about how to do that? It doesn’t seem quite as straightforward as opening up the console and typing sync.php. Getting data in has already turned into a multi-day project and it’s only getting worse… running it in the background via the CLI might help make it more robust 😝

d7415 commented 1 year ago

I’d like to understand more about how to do that?

This should hopefully cover it: https://lycheeorg.github.io/docs/faq_general.html#can-i-set-up-lychee-to-watch-a-folder-for-new-images-and-automatically-add-them-to-albums
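For reference, the FAQ's approach boils down to a cron entry along these lines (the paths are assumptions for a typical install; adjust them to yours):

```shell
# crontab fragment: nightly sync at 02:00, run from the Lychee root
# so that `artisan` resolves relative to the current directory.
0 2 * * * cd /var/www/html/Lychee && php artisan lychee:sync /uploads/import/ >/dev/null 2>&1
```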

bushibot commented 1 year ago

I’d like to understand more about how to do that?

This should hopefully cover it: https://lycheeorg.github.io/docs/faq_general.html#can-i-set-up-lychee-to-watch-a-folder-for-new-images-and-automatically-add-them-to-albums

Cool, not sure how I missed that, but thanks. That said, how would I use symlinks rather than copying? I see it in the script, just not sure how to flag for it?

bushibot commented 1 year ago

I’d like to understand more about how to do that?

This should hopefully cover it: https://lycheeorg.github.io/docs/faq_general.html#can-i-set-up-lychee-to-watch-a-folder-for-new-images-and-automatically-add-them-to-albums

Or not so easy. I'm running the unraid docker image. I opened a console to test but got back:

root@35539f6d2b63:/photodump# php artisan lychee:sync /Kittens
Could not open input file: artisan
x1ntt commented 7 months ago

I'd like to understand more about how to do that?

This should hopefully cover it: https://lycheeorg.github.io/docs/faq_general.html#can-i-set-up-lychee-to-watch-a-folder-for-new-images-and-automatically-add-them-to-albums

Or not so easy. I'm running the unraid docker image. I opened a console to test but got back:

root@35539f6d2b63:/photodump# php artisan lychee:sync /Kittens
Could not open input file: artisan

To solve this problem, run the command from the directory that contains the artisan file and pass the absolute path of the folder to import, like the following:

root@038a06b0c059:/var/www/html/Lychee# php artisan lychee:sync /uploads/import/