PaulLereverend / NextcloudDuplicateFinder

Save some space by finding your duplicate files
GNU Affero General Public License v3.0

Out of memory error on command line, endless spinner in browser #6

Open nfriedly opened 4 years ago

nfriedly commented 4 years ago

I'd love to use this app, but I'm having trouble, possibly related to having quite a large amount of data.

My setup is an Unraid server running the Nextcloud Docker image from linuxserver.io. Docker and Nextcloud, as well as my user data (only a few megabytes), are all on an SSD, but I have the External Storage app configured with a local share from Unraid that is ~25TB.

When I go to the web UI, I just get the spinner forever, like in #1. If I open the browser's dev tools I can see that there's a request to /apps/duplicatefinder/files that gets a 504 Gateway Timeout failure from nginx/1.18.0 after a minute or so.

So I tried opening a shell in the Docker container and running occ duplicates:find-all (after figuring out the correct prefix from #2).

That took a few minutes and then failed like so:

occ duplicates:find-all
PHP Fatal error:  Allowed memory size of 536870912 bytes exhausted (tried to allocate 20480 bytes) in /config/www/nextcloud/lib/private/Files/Cache/Cache.php on line 175

(536870912 bytes is about half a gigabyte)

While it was running, I didn't hear any hard drives spin up, although the CPU was pegged at basically 100%. Then I noticed it was still pegged after the command failed, with two instances of php7 -f /config/www/nextcloud/cron.php running at 50% load each.

I'm not really sure if cron.php was related to duplicate finder, but it seems plausible.

I restarted the Docker instance and ran occ duplicates:find-all again. This time the CPU load stayed lower, around 30-40%, and there was some disk activity, but it still ended with the same out-of-memory error as above. Unlike the first run, the CPU load returned to near zero when the command finished.

I'm fairly new to NextCloud, but if there's anything you'd like me to try or logs you'd like to see, please let me know.

daniel-a-h commented 4 years ago

I have the same problem. I've seen yesterday's MR trying to reduce the memory usage, but I don't think that approach is going to work. I think the hashes need to be stored somewhere else until it's time for the comparison.

An MD5 hash is 128 bits (16 bytes), so the raw hashes themselves are not the real problem: half a gigabyte, where most PHP installations throw the error we see above, could hold tens of millions of them. What eats the memory is PHP's per-element overhead for arrays and strings, plus the paths and other data stored alongside each hash, which pushes the practical ceiling down to far fewer files.

I'd love to use this app, but I'm working with closer to a million files, not a couple of thousand.

We could save the hashes in the DB, in a dedicated table, possibly keyed on the oc_filecache primary key as a foreign key, and keep it current via the \OCP\Files\Events* events (since oc_filecache may still contain already-deleted files). That way the duplicate detection would always be up to date, with no need to ever wait for a scan.
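Roughly what I have in mind, as a sketch only (MySQL-flavoured; the table and column names are made up here, not anything the app actually ships):

-- hypothetical cache table for file hashes
CREATE TABLE oc_duplicatefinder_hashes (
    fileid BIGINT UNSIGNED NOT NULL,   -- mirrors oc_filecache.fileid, removed again on delete events
    hash   CHAR(32)        NOT NULL,   -- MD5 hex digest of the file contents
    PRIMARY KEY (fileid),
    INDEX hash_idx (hash)              -- duplicates are simply rows sharing a hash
);

Finding duplicates would then be a single GROUP BY hash HAVING COUNT(*) > 1 query instead of an in-memory comparison, and the event listeners would only ever hash files that were actually added or changed.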

nfriedly commented 4 years ago

I think yesterday's changes are still an improvement, and FWIW, it looks like it doesn't hash every file, only the ones where multiple files share the same size.

But your point still stands: trying to store everything in memory is bound to hit the limit at some point, even with aggressive memory optimizations.

On the other hand, my server has 16GB of RAM, and I'd be perfectly happy to let this thing use the majority of it for a day or two while it churns through the filesystem. That, coupled with some memory optimizations, might actually be good enough.
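In the meantime, bumping the limit for a single CLI run should at least let it use that RAM. A rough sketch (the occ path is inferred from the error message above, the value is arbitrary, and depending on your setup it may need to run as the user that owns config/config.php):

# one-off run with a higher memory limit; path and value are assumptions for this setup
php -d memory_limit=8G /config/www/nextcloud/occ duplicates:find-all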

water-man commented 4 years ago

I don't know about the technicalities, but here's a quick workaround to get results: run the scan in smaller chunks instead of over everything at once.

Yes, I know this misses duplicates that end up in different chunks, but I found it very predictable to work out which areas the duplicates would be in (rough sketch below).
Hope this helps somebody!
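Concretely, something like this, assuming your version of the command accepts a user filter such as -u (check occ duplicates:find-all --help first; both the flag and the user names here are placeholders):

# scan one user at a time; each run is its own PHP process, so memory is freed between chunks
for user in alice bob; do
    php -d memory_limit=1G /config/www/nextcloud/occ duplicates:find-all -u "$user"
done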

GTP95 commented 3 years ago

I solved this problem by increasing the PHP memory limit set in the "memory-limit.ini" file (I don't remember the file's path, and note that I'm using a Docker container; with a different setup you may have to edit php.ini instead). But the application should report this problem via the web UI, otherwise you keep waiting for a scan that will never finish.
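For reference, the setting in question is PHP's memory_limit directive; whichever file your setup uses (memory-limit.ini, php.ini, or a conf.d snippet), the change is a single line, e.g. (the value here is only an illustration, size it to your server):

; raise PHP's memory ceiling for the Nextcloud PHP process
memory_limit = 1G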

jth134 commented 3 years ago

> I solved this problem by increasing the PHP memory limit […]

What did you raise the limit to? Did you double it?

Dulanic commented 3 years ago

I found this issue due to running into this myself. Trying to find a workaround.

spaquin3 commented 3 years ago

I was able to get past the 504 timeout error with these settings in my reverse proxy config: in /etc/nginx/conf.d/, create custom_proxy_settings.conf and add the following (not sure which one fixes the problem):

client_max_body_size 10g;
proxy_connect_timeout 600s;
proxy_send_timeout 600s;
proxy_read_timeout 600s;
fastcgi_send_timeout 600s;
fastcgi_read_timeout 600s;

Dulanic commented 3 years ago

I was hoping PR https://github.com/PaulLereverend/NextcloudDuplicateFinder/pull/25 would help with this, but it didn't :( Mine still errors out even with the CLI.

root@93678e3fc2a1:/# occ duplicates:find-all
Start scan... user: *redacted*
No duplicate file
...end scan
Start scan... user: *redacted*
No duplicate file
...end scan
Start scan... user: *redacted*
No duplicate file
...end scan
Start scan... user: *redacted*
No duplicate file
...end scan
Start scan... user: *redacted*
PHP Fatal error:  Allowed memory size of 536870912 bytes exhausted (tried to allocate 4096 bytes) in /config/www/nextcloud/lib/private/Files/View.php on line 185

nfriedly commented 3 years ago

PR #25 was reverted - https://github.com/PaulLereverend/NextcloudDuplicateFinder/commit/c020135d573e309ff791940c36be6adf6f2b20f8

I expect it will resolve the issue whenever it lands "for real"