advplyr / audiobookshelf

Self-hosted audiobook and podcast server
https://audiobookshelf.org
GNU General Public License v3.0

[Bug]: Podcasts duplicate in UIs #2785

Open BlackHoleFox opened 3 months ago

BlackHoleFox commented 3 months ago

Describe the issue

Audiobookshelf seems to be generating, or somehow finding, duplicate episodes of multiple podcasts I have added and listen to. This is similar to https://github.com/advplyr/audiobookshelf/issues/2122 but doesn't create duplicate files on disk and happens in versions after that bugfix.

Weirdly this doesn't happen with every single podcast in my library, just a few. The ones I notice the most are:

Nothing jumps out in my logs either, though let me know if I should enable debug logs for a week or so to help.

I've deleted duplicates from the library multiple times at this point, but they just seem to be coming back. For example, AMCA's (A More Civilized Age's) library entry has every new episode duplicated right now:

[screenshot: AMCA library entry with every new episode duplicated]

One of each pair is not linked to an RSS episode:

[screenshot: one episode of each pair not linked to an RSS episode]

And there are no duplicates on disk; they exist only in the database/UI. They show up duplicated in both the web UI and the Android app:

[screenshot: duplicates visible in the web UI and Android app]

Steps to reproduce the issue

  1. Add https://amorecivilizedage.net/rss as a podcast in your library
  2. Enable Schedule Automatic Episode Downloads. I have it running every day at midnight.
  3. Wait a few days

Audiobookshelf version

2.8.0

How are you running audiobookshelf?

Docker

advplyr commented 3 months ago

Can you disable the watcher in server settings as a test to see if you get duplicates on any episodes downloaded after that?

BlackHoleFox commented 3 months ago

Turned it off and restarted audiobookshelf's container. I'll clean up the duplicates in the two library items I mentioned and see if they come back.

BlackHoleFox commented 3 months ago

Update: It seems to be the library scanner causing these? It runs periodically, which would explain why they show up at a delay. It auto-ran in the last hour or two and the duplicates showed up. I cleaned out the duplicates from both podcasts once more and manually started a scan from Settings --> Libraries --> Scan. The scan finished and told me some items updated:

[screenshot: scan summary reporting updated items]

And then they showed up again on the podcast's library page. All the duplicated episodes also showed up in the Newest Episodes/recently added row. The actual newest episode is one further to the right, off screen: [screenshot]

I can reliably reproduce this now with the steps above.

advplyr commented 3 months ago

Can you share some details about how you are mapping volumes in Docker and what file system/OS you are using?

BlackHoleFox commented 3 months ago

Yeah, here you are:

My audiobookshelf data is split between two places: the metadata and config are stored on the host, but the podcasts themselves are stored elsewhere (for size and reliability reasons) and mounted via SMB on the host running the container. From there, the podcasts folder is bind-mounted into Docker so audiobookshelf can see it. Here's the relevant part of my Docker Compose config:

volumes:
      - /home/user/deployment-data/audiobookshelf/audiobooks:/audiobooks:rw
      - /mnt/media/podcasts:/podcasts:rw
      - /home/user/deployment-data/audiobookshelf/metadata:/metadata
      - /home/user/deployment-data/audiobookshelf/config:/config

BlackHoleFox commented 2 months ago

@advplyr Hiyo, I'm back with more info after more digging around. I started under the hypothesis that this might have been something specific to my SMB share setup, but after reviewing all the audiobookshelf logs with debug logging turned on, I saw nothing pointing at the storage location as the cause of the weird broken episode duplicates.

Starting this off, I double- (and triple-, etc.) checked that there aren't duplicates in the RSS feed:

[screenshot: RSS feed showing a single entry for the episode]

And now: here are the debug and scanner logs from initiating a scan after removing one of the duplicate podcast episodes from the UI. It's a little odd that it says "updated" here but the scanner log says "new podcast episode".

[LibraryScanner] Library scan e1099d4c-caca-4ea5-a6aa-6508b2343855 completed in 0:07.0 | 0 Added | 1 Updated | 0 Missing
[ApiCacheManager]library.afterUpdate: Clearing cache
[LibraryScan] Scan log saved "/metadata/logs/scans/2024-04-28_e1099d4c-caca-4ea5-a6aa-6508b2343855.txt"
[LibraryController] Scan complete

And then the relevant part of the scan log. The rest of it just says every other podcast is up to date:

{"timestamp":"2024-04-28T04:35:57.662Z","message":"New library file found with path \"/podcasts/shared/A More Civilized Age -  A Star Wars Podcast/82 - Always Two There Are and Brothers of the Broken Horn (Rebels 20 - 21).mp3\" for library item \"A More Civilized Age -  A Star Wars Podcast\"","levelName":"INFO","level":2}
{"timestamp":"2024-04-28T04:35:57.663Z","message":"Library item \"A More Civilized Age -  A Star Wars Podcast\" changed: [size,lastScan]","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.246Z","message":"Mapping metadata to key tagComment => description: "TRIMMED","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.249Z","message":"Mapping metadata to key tagSubtitle => subtitle: TRIMMED","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.251Z","message":"Mapping metadata to key tagDate => pubDate: Wed, 20 Mar 2024 10:00:00 +0000","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.252Z","message":"Mapping metadata to key tagTitle => title: 82: Always Two There Are and Brothers of the Broken Horn (Rebels 20 - 21)","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.253Z","message":"New Podcast episode \"82: Always Two There Are and Brothers of the Broken Horn (Rebels 20 - 21)\" added","levelName":"INFO","level":2}
{"timestamp":"2024-04-28T04:35:58.298Z","message":"Mapping metadata to key tagAlbum => title: A More Civilized Age: A Star Wars Podcast","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.299Z","message":"Mapping metadata to key tagArtist => author: A More Civilized Age","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.301Z","message":"Mapping metadata to key tagGenre => genres: TV & Film:After Shows, Leisure:Animation & Manga","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.302Z","message":"Mapping metadata to key tagLanguage => language: en","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.303Z","message":"Mapping metadata to key tagPodcastType => podcastType: episodic","levelName":"DEBUG","level":1}
{"timestamp":"2024-04-28T04:35:58.311Z","message":"Found metadata file \"/metadata/items/51b244e1-bfd2-4e50-8bc3-a9e81954d099/metadata.json\"","levelName":"INFO","level":2}
{"timestamp":"2024-04-28T04:35:58.407Z","message":"Library item \"Scanline Media Patron Podcasts\" is up-to-date","levelName":"DEBUG","level":1}

When I remove one of these broken episodes, nothing indicates a problem:

[LibraryItem] Library item "51b244e1-bfd2-4e50-8bc3-a9e81954d099" updated

Looking at the all-episodes list, both the normal and the broken episode appear (stealth strike). The second goes away temporarily when I delete it from the UI, but it always returns after a library scan. [screenshot]

So with not much progress on the surface, I cloned the database to my computer and opened it up as well :) Despite deleting the broken episodes in the UI, they still had a database entry, which looks very wrong compared to the correct one:

[screenshot: broken duplicate row next to the correct row in the database]

There's a bunch of "haunted" items like this in my library throughout multiple podcasts, and those all have broken duplicate database entries:

[screenshot: more broken duplicate database entries across other podcasts]

The interesting/important thing is that when I deleted the broken record from the podcastEpisodes table, pushed the modified database to my server, and started the Docker container, it properly scrubbed the haunted episode away. When I scan the library, everything reports up-to-date and the duplicate doesn't reappear. Checking the RSS feed for new episodes doesn't find any junk either.

Pretty puzzled how they got there, honestly. Nothing weird or dangerous has been done to the database or filesystem. Maybe it's the result of broken RSS feeds churning and leaving remnants in the database if the creators re-uploaded episodes, etc.?

So that was a lot, but I'm interested in what you think of the broken database rows as someone who actually knows what everything should look like. Maybe a possible fix could be improving the scanner to look for any rows with a NULL index (and maybe no enclosureSize + enclosureType as well) and deleting them as a cleanup task? I could also manually scrub the database once but honestly these might eventually trickle back in, so an automatic purger could be better?
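
If anyone else wants to check their database for rows like this, here's a rough, read-only sketch. It assumes an SQLite copy of the server database, the podcastEpisodes table, and the index/enclosureSize/enclosureType columns mentioned above; the database filename and the other selected columns are guesses, so adjust them for your own copy.

import sqlite3

# Sketch only: list podcastEpisodes rows that have a NULL index and no
# enclosure info, i.e. the "haunted" duplicates described above. Read-only
# on purpose, since rows like this aren't necessarily safe to delete.
DB_PATH = "absdatabase.sqlite"  # guess; point this at your own copy

conn = sqlite3.connect(DB_PATH)
conn.row_factory = sqlite3.Row

# "index" is a reserved word in SQL, so it has to be quoted in the query.
rows = conn.execute(
    'SELECT id, title, "index", enclosureSize, enclosureType '
    'FROM podcastEpisodes '
    'WHERE "index" IS NULL AND enclosureSize IS NULL AND enclosureType IS NULL'
).fetchall()

for row in rows:
    print(dict(row))

conn.close()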

advplyr commented 2 months ago

The enclosure fields are only populated if the episode was downloaded from an RSS feed. Not all podcast episodes need to come from an RSS feed; you may have audio files in your file system that are a podcast but don't have an associated RSS feed. So we wouldn't want to remove them automatically from the db like you are suggesting.

That is what the duplicates you are seeing are: episodes that were scanned in from the file system and were not matched with an existing episode during the scan, so a new episode was created.
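
As a toy sketch of that flow (this is not the actual scanner code, and the matching rule here is made up for illustration): the scanner walks the files in the item's folder and tries to match each one to an existing episode; any file it can't match becomes a brand-new episode, with the enclosure fields left empty because that episode didn't come from an RSS download.

# Toy illustration only, not Audiobookshelf code. The matching rule used here
# (exact match on the file's tag title) is hypothetical; the point is just that
# an unmatched file becomes a second episode record for the same file on disk.

def scan_folder(scanned_files, episodes):
    for f in scanned_files:
        match = next((e for e in episodes if e["title"] == f["tagTitle"]), None)
        if match is None:
            # No match -> new episode record pointing at the same audio file,
            # with enclosure fields NULL because it wasn't an RSS download.
            episodes.append({"title": f["tagTitle"], "audioFile": f["path"],
                             "enclosureType": None, "enclosureSize": None})
    return episodes

# Episode record created earlier by the RSS downloader (title stored from the feed):
db = [{"title": "Always Two There Are and Brothers of the Broken Horn",
       "audioFile": "/podcasts/example/82 - Always Two There Are.mp3",
       "enclosureType": "audio/mpeg", "enclosureSize": 123456789}]

# The same file as the scanner sees it, with a slightly different tag title:
files = [{"path": "/podcasts/example/82 - Always Two There Are.mp3",
          "tagTitle": "82: Always Two There Are and Brothers of the Broken Horn"}]

print(len(scan_folder(files, db)))  # -> 2, i.e. a "duplicate" episode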

Still, more information is needed to figure out what is going on. If you can enable Debug logs in the Logs page of settings, this will provide more information during the scan. Then re-create the issue and we should see where both episodes are being created: one will be from the scanner, while the other will be created during the RSS feed download.

Also, if you can, update to the latest version, v2.9.0.

BlackHoleFox commented 2 months ago

So we wouldn't want to remove them automatically from the db like you are suggesting.

Got it 👍. Does the same train of thought apply to the NULL index field observation as well?

They are episodes that were scanned in from the file system and were not matched with an existing episode while being scanned in so a new episode was created.

Sounds pretty strange, honestly. I don't manually edit the filesystem Audiobookshelf is working with, and I get everything from RSS. Could filesystem metadata changes (mtime, ACLs, etc.) break this (and only for a specific set of episodes)? There also have never been duplicate files for these episodes on disk. The ghost episode points to the same exact MP3 file the real episode does, which is why I can't check the "hard delete" option for it in the UI; that would delete the source file and leave the true episode broken.

If you can enable Debug logs in the Logs page of settings this will provide more information during the scan.

All the logs I shared above were already captured with Debug logging enabled :rip: There wasn't any more detail available from the scan that "brought back" a deleted ghost episode.

Then re-create the issue and we should see where both episodes are being created, one will be from the scanner while the other will be getting created during the RSS feed download.

Yeah I will try, but I still don't know what actually causes these to be created. They have just randomly appeared so far.

Also if you can be on the latest version v2.9.0

I updated a few days ago, so all the logs above come from 2.9.0 :) Appreciate the help so far though.