advplyr / audiobookshelf

Self-hosted audiobook and podcast server
https://audiobookshelf.org
GNU General Public License v3.0
6.26k stars 437 forks source link

[Bug]: Books being "overwritten" during scans due to mistaken inode identity #1447

Open wtanksleyjr opened 1 year ago

wtanksleyjr commented 1 year ago

Describe the issue

Docker-compose running ABS 2.2.12 on Ubuntu Linux 22.04; all folders mapped to the same 8TB HDD (not SDD). Prefer OPF data, and ODMs as well. Only one Library configured (I've reported this in the past when I had two libraries, but I wasn't able to capture any real information at the time).

I've noticed this many times, but always too late to get useful logs: I'd remember loading a book, but when I search for it, it isn't there. So I re-download. Later, I'd find a book that has the same cover as the one I re-downloaded but some other book's title. When I look at its folder and files, I find that the title matches the files, but the cover is from the newly missing book.

Interestingly, it turns out the "missing" book is right there on the drive where I'd put it, and absolutely nothing is ever suspicious about either the original book or the one that replaced it leaving only its cover - nothing's mixed up on the disk. So this is not a data loss problem, only metadata confusion.

So I've finally noticed this happening in time to actually capture logs -- the problem was that this is apparently random, so it usually gets books I don't notice because I loaded them a long time ago, or for which my logs are long gone.

The smoking gun is a line saying something like [2023-01-27 03:12:49] DEBUG: [Scanner] scanFolderUpdates: Library item found by inode value "olderbook => newerbook" (Scanner.js:584). (Those two are the names of the folders directly containing the books.) I can see this line is intended to handle hard links, but every time it appears in my repo it's mistaken, there's no hard link there (if you want I'll deep-scan for duplicate inode numbers, but I've already confirmed none of the three books in this example share any common inode numbers for any of their files).

My first example involves two "=>" lines that I actually WATCHED swap the same cover's title twice - it was random that I was looking in the right place at the right time. The original book is "The Skull of the World"; it had been loaded and Matched on the 22nd, and today the 26th I happened to be looking at newly loaded books, and I saw the cover for Skull of the World appear with the title "American Resistance", which I'd just gotten. Within a few second, the title changed again to "Unfree Speech" (still keeping the same "Skull" cover). Needless to say, both of those books had just been moved into the repo - but now I only have one book with the wrong cover and a jamble of metadata left over because the nonfiction books didn't have a subtitle or series so the resulting book had both of those from the original fiction book.

How's that for an intro? Any questions? Do I need to provide logs? Can I trim them a little? I've got them saved in full.

Steps to reproduce the issue

  1. Reproducing the issue seems unlikely. This is triggered by a combination of loading a new book, plus the existence of an older book, plus unknown and possibly random factors.

Audiobookshelf version

2.2.12

How are you running audiobookshelf?

Docker

advplyr commented 1 year ago

I'm not sure what is going on here but it would be helpful to see what the actual inode value is so I added it in the logs for the latest release. If you have this happen again that same line of the log you shared will include the inode value.

wtanksleyjr commented 1 year ago

100% will check for that.

-Wm

On Sat, Feb 11, 2023 at 3:55 PM advplyr @.***> wrote:

I'm not sure what is going on here but it would be helpful to see what the actual inode value is so I added it in the logs for the latest release. If you have this happen again that same line of the log you shared will include the inode value.

— Reply to this email directly, view it on GitHub https://github.com/advplyr/audiobookshelf/issues/1447#issuecomment-1426901479, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ7H6L3WCKK3Y5T2W7SF63WXARFVANCNFSM6AAAAAAUIIVNMY . You are receiving this because you authored the thread.Message ID: @.***>

ceramicwhite commented 1 year ago

@advplyr

This happens on my server sometimes after I do scan.

A couple things I've noticed:

I have v2.2.15 running and the cover problem just happened again but I don't see the Inode values and debug is on. Here's the log if it helps: 2023-02-18.txt

Book with incorrect cover is Trackers (li_o1kwryisuc929ch8ix) Cover being shown is Polar Shift (li_83e4qfl4ce94b1ibtq)

Hope this helps

wtanksleyjr commented 1 year ago

Wow, it happened ... I didn't expect to ever see it again, but this time I got something recent enough that I even have the logs!

163 [2023-03-22 15:31:58] DEBUG: [Scanner] scanFolderUpdates: Library item found by inode value=63328265. "Solo Leveling, Vol. 2 (Novel) - B09GHJTKSJ => 9524067" (Scanner.js:585)

Here's a lightly edited log (I searched for the two books concerned and kept everything between the two islands of log entries that mention them). I have the full logs as well, if you want.

narrowed-log.txt

(The second item is in a folder with a numeric name, in case you're wondering - it's a unique ID. The other one is how I have Libation configured to save my Audible books.)

The sequence of events here was that I'd returned and deleted "Solo Leveling", so the item is marked missing; but in the meantime I checked out "The Staff Engineer's Path". Apparently the latter item is somehow seen to have the same inode as the former (you can now see its value right above).

One question ... what file is being examined for an inode there?

advplyr commented 1 year ago

Ah that could be it. Just to clarify, you first deleted the book folder (which would have freed up that inode value), then you added the new book which got incorrectly matched?

The inode being used is the folder

wtanksleyjr commented 1 year ago

That's correct in this case, although not in any of the previous cases (those all involved both books still existing).

Additionally, when you compare inodes, are you guarding against comparing them across filesystems (inodes are not unique across filesystems)? Can you even tell from inside Docker? Perhaps this is the cause of the other guy who posted here involving both an SSD and HDD. In my case I know they're the same filesystem and even the same Docker mount point so they're sure to be unique.

On Wed, Mar 22, 2023 at 1:52 PM advplyr @.***> wrote:

Ah that could be it. Just to clarify, you first deleted the book folder (which would have freed up that inode value), then you added the new book which got incorrectly matched?

The inode being used is the folder

— Reply to this email directly, view it on GitHub https://github.com/advplyr/audiobookshelf/issues/1447#issuecomment-1480244946, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ7H6OXGXOUV6USY3AFGCLW5NRCPANCNFSM6AAAAAAUIIVNMY . You are receiving this because you were mentioned.Message ID: @.***>

wtanksleyjr commented 1 year ago

My gut feeling is that this matching is there for only one special case: if the "newly discovered" folder is actually an already-scanned folder. The implication is you're detecting renames, which makes sense and I don't know why it's generating false positives except for this new case.

But it seems to me the current method isn't the greatest way to handle renames, because it creates both an orphaned book object and a new book object for the rename even when it's correct. And the new book object contains two layers of metadata, the result of the initial scan on the newer folder with the nonblank items from the older item applied over it (which, by the way, easily produces very bad results). It would seem better to process the rename detection by changing the folder name on the old book's object, instead of by creating a new book and then overwriting its metadata.

I'm trying to think of ways to handle this correctly, and it's hard to do. One thing that makes sense is to never treat a discovered inode as a rename when the book that has the old inode still exists -- that's a quick test that would completely stop all of the cases I reported above even though we still don't know why they happen.

That wouldn't stop my new problem, of course. But that's going to be a very rare problem, since it only happens if I leave a deleted item just sitting there while I load new items. It's rather a bad problem, mind you, because it can result in completely overwriting a new book's info with an old book. But I can't think of how to prevent it apart from keeping it rare ... maybe when a book's marked as completely missing, remove its inode record after finishing the scans as completely as possible? That way rename detections can still be done the old way, but inode reuse won't kill us (unless the system reuses the inode really quickly).

advplyr commented 1 year ago

But it seems to me the current method isn't the greatest way to handle renames, because it creates both an orphaned book object and a new book object for the rename even when it's correct. And the new book object contains two layers of metadata, the result of the initial scan on the newer folder with the nonblank items from the older item applied over it (which, by the way, easily produces very bad results). It would seem better to process the rename detection by changing the folder name on the old book's object, instead of by creating a new book and then overwriting its metadata.

This was how it was originally built but there is a bug somewhere. I opened an issue for this a while ago https://github.com/advplyr/audiobookshelf/issues/1161

This issue came up again and has me looking at why some filesystems have such small inode values. This comment had some insight https://github.com/advplyr/audiobookshelf/issues/1733#issuecomment-1527178564

I don't think there is a way to detect the actual filesystem from inside docker but if there is that would be the best solution.

advplyr commented 1 year ago

I'm trying to think of ways to handle this correctly, and it's hard to do. One thing that makes sense is to never treat a discovered inode as a rename when the book that has the old inode still exists -- that's a quick test that would completely stop all of the cases I reported above even though we still don't know why they happen.

This is a good idea.

For reference the inode value on my XFS filesystems are this long 649362779914890946. So even if I had my audiobooks spread across 100 drives the likelihood of a collision would be very small.

zoredache commented 1 year ago

I was finally got around to importing a bunch of audio books I picked up from old Humble Bundles in the past to m4b and adding them to my library. I added like ~60 books at once, and this trashed the metadata for like 50 books got trashed.

My system has is using zfs, and I have separate datasets I use for my audible backups, and the directory I use for books I purchased outside audible.

As an example. One of the books was the Big Finish book Phantasmagoria which I added to purchased_misc doing this trashed the metadata for the book 'The Happiness Hypothesis'.

Stat output

$ stat \
  audible_backup/The\ Happiness\ Hypothesis\ \[B07D5HTRPZ\]/ \
  purchased_misc/Phantasmagoria/

  File: audible_backup/The Happiness Hypothesis [B07D5HTRPZ]/
  Size: 6               Blocks: 17         IO Block: 512    directory
Device: 30h/48d Inode: 8717        Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-09-05 07:39:03.923754797 +0000
Modify: 2022-09-05 07:39:06.943694478 +0000
Change: 2022-09-06 05:55:48.840891042 +0000
 Birth: -
  File: purchased_misc/Phantasmagoria/
  Size: 3               Blocks: 1          IO Block: 512    directory
Device: 2eh/46d Inode: 8717        Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-07-12 21:45:04.321678078 +0000
Modify: 2023-07-12 21:45:04.321678078 +0000
Change: 2023-07-14 05:13:27.441428101 +0000
 Birth: -

Audiobookshelf logs

2023-07-14 05:13:29 DEBUG [Watcher] File Added /content/audiobooks/purchased_misc/Phantasmagoria/Phantasmagoria.m4b
2023-07-14 05:13:29 DEBUG [Watcher] Modified file in library "audiobooks" and folder "fol_0o7kftyldks4h6jd8o" with relPath "/Phantasmagoria/Phantasmagoria.m4b"
2023-07-14 05:13:33 INFO [Server] 1 Files Changed
2023-07-14 05:13:33 DEBUG [Scanner] Scanning file update groups in folder "fol_0o7kftyldks4h6jd8o" of library "audiobooks"
2023-07-14 05:13:33 DEBUG [Scanner] scanFolderUpdates fileUpdateGroup [object Object]
2023-07-14 05:13:33 DEBUG [Scanner] scanFolderUpdates: Library item found by inode value=8717. "The Happiness Hypothesis [B07D5HTRPZ] => Phantasmagoria"
2023-07-14 05:13:33 DEBUG [Scanner] Folder update for relative path "Phantasmagoria" is in library item "The Happiness Hypothesis [B07D5HTRPZ]" - scan for updates
2023-07-14 05:13:33 WARN [LibraryItem] Check scan item changed folder fol_41bzqfhnyn7l3wnjak -> fol_0o7kftyldks4h6jd8o
2023-07-14 05:13:33 DEBUG [LibraryItem] "The Happiness Hypothesis [B07D5HTRPZ]" check scan cover removed
2023-07-14 05:13:33 DEBUG Library Item "/content/audiobooks/purchased_misc/Phantasmagoria" Media file scan took 79ms with 1 audio files averaging 79ms per MB
2023-07-14 05:13:33 DEBUG [Book] Tracks being rebuilt...!
2023-07-14 05:13:33 DEBUG [Book] setChapters: Using embedded chapters in first audio file /content/audiobooks/purchased_misc/Phantasmagoria/Phantasmagoria.m4b
2023-07-14 05:13:33 DEBUG [FfmpegHelpers] Extract Cover Cmd: ffmpeg -i /content/audiobooks/purchased_misc/Phantasmagoria/Phantasmagoria.m4b -y -map 0:v -frames:v 1 /metadata/items/li_ee8ct39mbne8iuzu8g/cover.jpg
2023-07-14 05:13:33 DEBUG [FfmpegHelpers] Cover Art Extracted Successfully
2023-07-14 05:13:33 DEBUG [Scanner] Saved embedded cover art "/metadata/items/li_ee8ct39mbne8iuzu8g/cover.jpg"
2023-07-14 05:13:33 DEBUG [LibraryItem] Success saving abmetadata to "/metadata/items/li_ee8ct39mbne8iuzu8g/metadata.abs"
2023-07-14 05:13:34 DEBUG [DB] Library Items updated 1
2023-07-14 05:13:34 DEBUG [Scanner] Folder scan results [object Object]

Anyway my first suggestion is if you see an 'File Added' that where the inode matches an existing book. Add a check so that you don't update the old book, if the files still exist for the old book.

advplyr commented 11 months ago

As of 2.4.3 the scanner is no longer checking for inode first. It is looking for exact matching file path first then falling back to inode. There is more to be done here because you could still have this collision for renames.