advplyr / audiobookshelf

Self-hosted audiobook and podcast server
https://audiobookshelf.org
GNU General Public License v3.0
6.39k stars 451 forks source link

[Bug]: Performance Issues Due to External Storage (CIFS) and Transient Inode Values #2509

Open babatonga opened 8 months ago

babatonga commented 8 months ago

Describe the issue

I've noticed that the partial scan in Audiobookshelf takes several hours to complete, and during this process, there is a high CPU load. For comparison, Plex only requires a few seconds for a similarly sized library. Both, audiobooks and Plex movies, are mounted through a CIFS mount. I perform manual partial scans after making changes.

Issue Details:

Environment:

Additional Information: The observed behavior is significantly different from the performance exhibited by Plex in a similar setup. The manual execution of partial scans is necessary after changes (because watcher can't watch smb shares). The duration of the scan has increased with the size of the library; currently, I have 1952 library entries (and 125532 audio files).

Note: Feel free to request any additional information needed to investigate this issue further.

Steps to reproduce the issue

  1. Configure a library containing a folder that mounts remote storage (tested with cifs and sshfs).
  2. Initiate a partial scan.

Audiobookshelf version

v2.7.1

How are you running audiobookshelf?

Docker

nichwall commented 8 months ago

I'm not sure exactly how Plex does their scans, but I think they only use file structure/names, and don't need to parse all of the metadata on files for changes. Reading the metadata tags of media files requires the transmission of the file instead of just looking at the file name and properties.

Part of the slowdown you're seeing is probably due to needing to load all of the library files into RAM to check the metadata tags. If you change your scanner settings to not look at the audio file tags at all, does that speed up the scan in ABS?

https://www.audiobookshelf.org/guides/book-scanner#book-metadata-parsing

babatonga commented 8 months ago

I'll give that a shot, although Plex also processes tags. Interestingly, when I add just one new movie, Plex efficiently reads only the tags for that specific addition. However, when I add just a small audiobook with one file, the process still takes several hours. This leads me to believe that Audiobookshelf might be rereading all metadata from every audiobook during a library scan, instead of focusing solely on the new additions.

advplyr commented 8 months ago

The part that takes the longest for Abs scans is the ffprobe on each audio file. Audio files will only be probed for a library item if it is new or a change was detected.

My guess is that Abs is detecting some change on all of your library items and it may be the inode value.

You can find out by enabling debug logs in the settings -> logs page. Then when running the scan a bunch of logs will come out about changes that are detected.

Here is the javascript of a log you will want to look for

libraryScan.addLog(LogLevel.DEBUG, `Library item "${existingLibraryItem.relPath}" changed: [${existingLibraryItem.changed()?.join(',') || ''}]`)

and

libraryScan.addLog(LogLevel.DEBUG, `Library file for library item "${libraryItemPath}" key "${key}" changed from "${existingLibraryFile.metadata[key]}" to "${scannedLibraryFile.metadata[key]}"`)

After that log is printed this tells the scanner that a library file was changed for that library item and a full scan is done on it.

babatonga commented 8 months ago

My guess is that Abs is detecting some change on all of your library items and it may be the inode value.

You're right. I just analysed my latest scan log (inside metadata/logs/scans/) and logexpert finds 1952 lines with the string Library file \"path/to/audiobooks/folder\" changed and 376222 times the second log message, here the interesting parts:

...\"mtimeMs\" changed from \"1699894581000\" to \"1699894581578\"
...\"ctimeMs\" changed from \"1699894581000\" to \"1699894581578\"
...\"birthtimeMs\" changed from \"0\" to \"1699831480927\"

It's always the same three keys that got detected as changed and birthtimeMs is always 0.

advplyr commented 8 months ago

Only those values are changing? Is there something with your mounted file system that is modifying the files?

babatonga commented 8 months ago

I'm not aware of any changes to the files. Could this possibly be a general issue with CIFS? Alternatively, it's possible that I've chosen a somewhat suboptimal structure. The SMB storage is initially mounted directly into Unraid via CIFS (/mnt/remotes/storage), and then it's remounted within the Docker container running Audiobookshelf under /audiobooks.

babatonga commented 8 months ago

If you change your scanner settings to not look at the audio file tags at all, does that speed up the scan in ABS?

I have recently disabled the 'Audio file meta tags' slider, and the last scan now took approximately 30 minutes (for 7 new audiobooks). While there is an improvement (3.5 hours with option enabled), and my folder structure %albumartist%\%series%\%year% - %album%[ '['%series% %series-part%']'] helped recognize the new books adequately, it's not as refined as with metadata tags.

advplyr commented 8 months ago

I think it is a coincidence that disabling audio file meta tags in library scanner settings improved the speed. This is because we are running an ffprobe on the audio files regardless of whether you are using the meta tags from ffprobe because we also get the codec, bitrate, etc. from that ffprobe.

I'm not an expert on CIFS mounts but I did set one up in my unraid a long time ago and it was so painfully slow I gave up on it.

You could do a test yourself by opening the console in unraid for Abs and run a stat on one of your book folders.

Example from my unraid container:

/audiobooks/Terry Goodkind/Sword of Truth # stat Wizards\ First\ Rule/
  File: Wizards First Rule/
  Size: 35              Blocks: 0          IO Block: 4096   directory
Device: 32h/50d Inode: 649362776505984201  Links: 1
Access: (0775/drwxrwxr-x)  Uid: (   99/ UNKNOWN)   Gid: (  100/   users)
Access: 2021-09-21 04:19:40.275291238 -0700
Modify: 2023-10-29 11:49:43.152104687 -0700
Change: 2023-10-29 11:49:43.152104687 -0700

Those times are what Abs is saying is changing between scans so if you note some times on a book then without doing anything check again later or the next day to see if they changed magically.

babatonga commented 8 months ago

That's looking good so far, dates seem to be valid (that is the date I added the audiobook):

/audiobooks/Vasily Mahanenko/Survival Quest # stat 2018\ -\ Der\ Weg\ des\ Schamanen\ \[Survival\ Quest\ 1]/
  File: 2018 - Der Weg des Schamanen [Survival Quest 1]/
  Size: 0               Blocks: 0          IO Block: 1048576 directory
Device: 33h/51d Inode: 319619      Links: 2
Access: (0777/drwxrwxrwx)  Uid: (   99/ UNKNOWN)   Gid: (  100/   users)
Access: 2023-11-11 06:16:08.442448000 +0100
Modify: 2023-11-11 06:16:08.442448000 +0100
Change: 2023-11-11 06:16:08.442448000 +0100

but IO Block and size have some silly values, that could be because of cifs. I will run that again after a the next scan in a few days, then we'll see if dates got changed.

I think it is a coincidence that disabling audio file meta tags in library scanner settings improved the speed

I just checked the new log file (from the faster scan), and indeed, there is no entry indicating that something changed. So, it seems like a coincidence. Before the next scan, I will restart the container.

babatonga commented 8 months ago

After remounting the CIFS share and restarting the Audiobookshelf container, the device and inode values changed:

/audiobooks/Vasily Mahanenko/Survival Quest # stat 2018\ -\ Der\ Weg\ des\ Schamanen\ \[Survival\ Quest\ 1]/
  File: 2018 - Der Weg des Schamanen [Survival Quest 1]/
  Size: 0               Blocks: 0          IO Block: 1048576 directory
Device: 73h/115d        Inode: 593163      Links: 2
Access: (0777/drwxrwxrwx)  Uid: (   99/ UNKNOWN)   Gid: (  100/   users)
Access: 2023-11-11 06:16:08.442448000 +0100
Modify: 2023-11-11 06:16:08.442448000 +0100
Change: 2023-11-11 06:16:08.442448000 +0100

Perhaps that's sufficient for Audiobookshelf to detect changes?

advplyr commented 8 months ago

Yeah the inode is something Abs monitors to handle renames. In some network file systems you can configure this to use the inode value from the underlying file system instead of constructing new ones. I can't recall the specifics with that but maybe in your unraid setup you can find a setting about inode values.

babatonga commented 8 months ago

There is an option for CIFS serverino which enables using server-side inode numbers. Without this setting those numbers will be random. I'm currently using the plugin dlandon/unassigned.devices to handle CIFS mounts and can't set that option by myself. But I could try to create a feature request there to support that option, but the owner doesn't like adding too many configuration options, because it can get too confusing (just opened there another feature request a few days before ^^)

advplyr commented 8 months ago

Ah yes that is the one. The reason that this is important is because if you change the name of a folder Abs will have no way of knowing it is the same folder without having the inode as an identifier. In Abs it would create a duplicate book and show the older one as missing. When renaming a folder the inode doesn't change so Abs can use that to check if it is the same.

nichwall commented 8 months ago

Am I correct in saying the inode should be ignored if the paths match?

I remember there was a change a few months ago to check the file path instead of just the inode when the scanner was rewritten, but can't find the exact commit/discussion about the change (somewhere around 2.4.x or 2.5.x). If the file path matches and the modification time matches, could the inode just be updated to the new value without probing everything, so if the file is moved/renamed after the scan it could still use the inode? Not sure if that would complicate things later.

https://github.com/advplyr/audiobookshelf/blob/e76af3bfc2e972dbca7851c96191f1c16a10d229/server/scanner/LibraryScanner.js#L156

advplyr commented 8 months ago

Am I correct in saying the inode should be ignored if the paths match?

I remember there was a change a few months ago to check the file path instead of just the inode when the scanner was rewritten, but can't find the exact commit/discussion about the change (somewhere around 2.4.x or 2.5.x). If the file path matches and the modification time matches, could the inode just be updated to the new value without probing everything, so if the file is moved/renamed after the scan it could still use the inode? Not sure if that would complicate things later.

https://github.com/advplyr/audiobookshelf/blob/e76af3bfc2e972dbca7851c96191f1c16a10d229/server/scanner/LibraryScanner.js#L156

This is correct and could be an area of improvement. In this case from the logs it looks like the modified time is also changing which is how we determine whether someone may have updated meta tags on the audio file. When the stat was ran in the console it did not show the modified time changed so I'm curious if resolving the inode changing issue would also resolve the modified time changing.

babatonga commented 8 months ago

I've reviewed my recent logs, and this time, all entries indicate that only the inode was changed: ... \" key \"ino\" changed from \"827328\" to \"1121761\"","levelName":"DEBUG","level":1}

I'm not sure what went wrong during the initial scan; perhaps it was related to migrating the setup (configs / database / backup) to Unraid. Additionally, I switched from sshfs to cifs. Could the timestamps be different because of that?

Maybe an additional setting in the Library Settings could help here, allowing the user to choose whether to compare inode or "just" libraryFolderId, path, relPath, mtime, ctime and birthtime during a scan to detect changes. This setting could be used in the checkLibraryItemData function inside server/scanner/LibraryItemScanData.js to add the attribute ino or not to the list keysToCompare.

If the inode value is needed elsewhere, you can also set existingLibraryItem.ino = this.ino (and save it) at the appropriate location if the setting to not use the inode is enabled. However, in my case, the value has changed several times before and has never disrupted further operations.

martinjgrunwald commented 8 months ago

I have been using ABS since Version 2.4 and this behavior has only happened to me recently, I believe since 2.7.0, even though I didn't change anything about my mount structure. ABS was never quite as fast as Plex but the difference used to be around the factor 5. I have about 900 audio books / 50.000 audio files. That takes around 20s for Plex and it used to take about a minute for ABS. Now the time for ABS has increased to around 40 minutes. I can't say for sure if the CPU usage was as high as it is now since a spike for one minute would not have been as obvious as it is now when the scan takes that long

babatonga commented 8 months ago

I tested the entire setup locally by removing 'ino' from 'keysToCompare', and it worked. However, since I had several TB of free space on my unraid array, I moved all my audio files there and now only use external storage for mirroring/backing up the files. The performance has significantly improved, not just in scanning but also in response time when playing the audiobooks. So, personally, I consider the issue resolved. However, perhaps introducing a setting to disable/update inodes during scanning could help circumvent problems with inconsistent inode values.

martinjgrunwald commented 8 months ago

Where exactly did you remove 'ino' from 'keysToCompare'?

advplyr commented 8 months ago

That could cause more confusion since the file watcher uses the inode to check if items are the same in order to detect filename changes. Do you happen to know why you weren't/aren't able to update the filesystem to use static unique inodes on files?

babatonga commented 8 months ago

Do you happen to know why you weren't/aren't able to update the filesystem to use static unique inodes on files?

There seemed to be issues with stale file handles, which is why the option for that is not included in 'unassigned.devices' (https://github.com/dlandon/unassigned.devices/issues/99). One could probably mount the whole thing manually without the plugin, but I didn't test that anymore since I switched to local storage.

advplyr commented 8 months ago

Hm okay, I'm not sure I understand the response to your FR there. I'll backlog this issue for now if/until some more users come forward with being unable to use serverino

martinjgrunwald commented 7 months ago

I am at a bit of a loss here. What can I do to avoid these long scan times and massive performance surges when scanning my library? My log is also full of notices that only 'ino' changed. Again, this wasn't always the case. Also, I rely on having my audiobooks on a different system than ABS, having them on the same one is not an option for me

babatonga commented 7 months ago

If you rely on external volumes, it's best to ensure that the inode values of the server providing the volume or other static ones are used. For CIFS, there's an option called "serverino" for this purpose. However, I'm not aware of such an option for sshfs; it seems it doesn't exist for all external mount types. If that's not possible, then we'll have to wait until multiple users report this issue and the developers find a solution.

JohanPotgieter commented 7 months ago

I have the same issue of very long scan times since version 2.7. In the past I could add a few books and the scan would be done in 30 seconds. Now it takes hours. My ABS is run on docker and mounts a external NTFS volume. My setup has not changed in any way.

advplyr commented 6 months ago

Abs relies on your file system having static inode values. I'm not sure of another way to handle renamed files, we need a way to determine that it is the same file after a rename.

sydlexius commented 5 months ago

Abs relies on your file system having static inode values. I'm not sure of another way to handle renamed files, we need a way to determine that it is the same file after a rename.

I take it using file hashes/checksums/fingerprints are out of the question?

aureateflux commented 3 months ago

I know this is a bit late, but I noticed that @babatonga's OS is Unraid. I use Unraid as well, and when I was setting up backups I came across this support thread on setting up Borgmatic for Unraid.

In that thread it notes that for Borg to work properly, you need to use mtime instead of inode:

Files cache set to use "mtime,size" - Very important as unRAID does not have persistent inode values

This is literally the only place I've seen this mentioned—despite multiple very deep dives into the topic, I've never found any other corroboration of this claim. But if true, it definitely explains the longer scan times on Unraid or similar OSes.

advplyr commented 3 months ago

I'm not sure what Borgmatic is but it is not true that Unraid does not have persistent inode values. Maybe they are talking about something specific within Unraid. I use Unraid along with many users and have persistent inode values

aureateflux commented 3 months ago

Borgmatic is an automation wrapper for Borg backup.

And yeah, that's why I made sure to point out I've never seen that claim anywhere else.