advplyr / audiobookshelf

Self-hosted audiobook and podcast server
https://audiobookshelf.org
GNU General Public License v3.0

[Enhancement] Support parsing comma separated genres #3127

Open NitzanNougat opened 4 months ago

NitzanNougat commented 4 months ago

What happened?

Most of my audiobooks are sourced from Audible via Libation.
When I initially downloaded these books, I did so without metadata.
However, for those downloaded with metadata, it is in CSV format rather than JSON (I have no idea if that would make a difference).

The issue arises in both cases, with and without metadata:

For example, for the book Cosmos by Carl Sagan, the genres are currently listed as: Genres: "Astronomy, Cosmology, Biological Sciences, Atmospheric Sciences, Physics"

The entire genre string is treated as a single genre, making it impossible to search for the book by individual genres.

What did you expect to happen?

I would expect the genres to be parsed as individual entries: Genres: "Astronomy", "Cosmology", "Biological Sciences", "Atmospheric Sciences", "Physics"

Each genre should be recognized as a separate entry, so that filtering or searching by genre gives accurate results.
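A minimal sketch of the expected behavior, for illustration only. The function name `split_genres` and its `delimiters` parameter are hypothetical and are not taken from the Audiobookshelf code base; the sketch only shows what treating the comma as a separator would produce for the Cosmos example.

```python
def split_genres(raw, delimiters=(",",)):
    """Split a raw genre tag on each delimiter, trimming whitespace and dropping empty pieces.

    Hypothetical helper for illustration; not the Audiobookshelf implementation.
    """
    parts = [raw]
    for delimiter in delimiters:
        # Split every piece collected so far on the current delimiter.
        parts = [piece for part in parts for piece in part.split(delimiter)]
    return [part.strip() for part in parts if part.strip()]


raw_tag = "Astronomy, Cosmology, Biological Sciences, Atmospheric Sciences, Physics"
genres = split_genres(raw_tag)
print(genres)
```

With the comma included as a delimiter, the single string becomes five separate entries that can each be filtered on.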

Steps to reproduce the issue

1. Import an Audible audiobook by Libation without metadata/with .csv metadata.
2. Install ABS v2.10.1 via docker compose.
3. Scan the new libraries.

Audiobookshelf version

v2.10.1

How are you running audiobookshelf?

Docker

What OS is your Audiobookshelf server hosted from?

Linux

If the issue is being seen in the UI, what browsers are you seeing the problem on?

None

Logs

csv meta data example:

{"timestamp":"2024-06-22T18:15:35.804Z","message":"\"The Last Command\" Getting metadata with precedence [folderStructure, audioMetatags, nfoFile, txtFiles, opfFile, absMetadata]","levelName":"DEBUG","level":1}
{"timestamp":"2024-06-22T18:15:35.804Z","message":"setChapters: Using embedded chapters in first audio file /audiobooks/Timothy Zahn/The Thrawn Trilogy/Book 3 - The Last Command/The Last Command Track 1.m4b","levelName":"DEBUG","level":1}
{"timestamp":"2024-06-22T18:15:36.958Z","message":"Success saving abmetadata to \"/metadata/items/fae01eb8-881d-41fd-928e-35b3c58213c9/metadata.json\"","levelName":"DEBUG","level":1}
{"timestamp":"2024-06-22T18:15:36.958Z","message":"Created new library item \"Timothy Zahn/The Thrawn Trilogy/Book 3 - The Last Command\"","levelName":"INFO","level":2}

no metadata example:

{"timestamp":"2024-06-22T18:14:50.395Z","message":"\"Cosmos꞉ A Personal Voyage\" Getting metadata with precedence [folderStructure, audioMetatags, nfoFile, txtFiles, opfFile, absMetadata]","levelName":"DEBUG","level":1}
{"timestamp":"2024-06-22T18:14:50.396Z","message":"setChapters: Using embedded chapters in first audio file /audiobooks/Carl Sagan/Cosmos꞉ A Personal Voyage/Cosmos꞉ A Personal Voyage Track 1.m4b","levelName":"DEBUG","level":1}
{"timestamp":"2024-06-22T18:14:51.177Z","message":"Success saving abmetadata to \"/metadata/items/5fc031e3-52ba-4806-a693-16708693a3ba/metadata.json\"","levelName":"DEBUG","level":1}
{"timestamp":"2024-06-22T18:14:51.177Z","message":"Created new library item \"Carl Sagan/Cosmos꞉ A Personal Voyage\"","levelName":"INFO","level":2}

Additional Notes

No response

mikiher commented 4 months ago

Just to be clear - this has nothing to do with csv. Audiobookshelf doesn't import metadata from csv files. The metadata is usually read from the audio file itself (or from some other sources supported by ABS, which don't include csv). Libation by default embeds the metadata into the audio file (this is controlled in Libation by Settings -> Audio File Settings -> Allow Libation to fix up audiobook metadata).

Anyway, I did reproduce the behavior you describe, and I'll try to fix it.

advplyr commented 4 months ago

Comma was intentionally left out when I set this up a few years ago. I believe that some genres from Audible have commas in them so if we split on comma then it would break those genres. We should confirm this before adding comma, it may not actually be an issue but I remember intentionally leaving comma out.

advplyr commented 4 months ago

Found an example: https://api.audnex.us/books/B01CUKULGA

"genres": [
  {
    "asin": "18574597011",
    "name": "Mystery, Thriller & Suspense",
    "type": "genre"
  },
  {
    "asin": "18580606011",
    "name": "Science Fiction & Fantasy",
    "type": "genre"
  },
  {
    "asin": "18574621011",
    "name": "Thriller & Suspense",
    "type": "tag"
  }
]

mikiher commented 4 months ago

We cannot ignore, though, a quite significant data source (Libation), that seems to always put commas between genres.

Between getting all Libation multi-genre tags wrong (which also pollutes the genres data in ABS), and sometimes splitting a genre mistakenly, the latter seems preferable.

But let me first try to think if there's some heuristic that will let us eat the cake and leave it whole.


advplyr commented 4 months ago

This issue has been brought up before with Libation (https://github.com/advplyr/audiobookshelf/issues/2539). I've never used it; maybe they have an option to not use commas?

Even though there is no official spec for delimiters between multiple genres, the ones we use are pretty widely adopted, and I'm not aware of any meta tagging software that supports comma.

As far as data sources go, I would guess Audible is the vast majority. I'm not opposed to supporting comma delimiters if it can be non-disruptive, but this is certainly not a bug.

Related: https://github.com/advplyr/audiobookshelf/issues/1864 https://github.com/advplyr/audiobookshelf/issues/1998

mikiher commented 4 months ago

So, just to have some data points about this: Audible has a page that shows its Level 1 and 2 categories (which are used as genres in metadata). These aren't all the genres, since there are also some lower-level categories that don't appear on this page, but I think it gives some notion of what Audible genres look like. I scraped the data into a Google sheet and ran a couple of stats.

Out of 212 unique genres, 13 contain a comma (~6%). All of the ones containing a comma are of the form "A, B & C".

I'm not sure exactly what to do with this info yet, just wanted to share.
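For illustration, the two checks described above (does a genre contain a comma, and does it have the "A, B & C" shape) could be sketched as follows. The regular expression and the helper name are assumptions, and the sample list contains only the genre names quoted in this thread, not the scraped 212-entry sheet; the 13 and 212 figures are the ones reported above.

```python
import re

# Assumed pattern for names shaped like "A, B & C":
# two or more comma-separated parts, then " & " and a final part.
LIST_AND_FORM = re.compile(r"[^,&]+(?:, [^,&]+)+ & [^,&]+")


def is_list_and_form(name):
    """Return True if the whole name looks like 'A, B & C' (hypothetical helper)."""
    return LIST_AND_FORM.fullmatch(name) is not None


# Genre names quoted in this thread only; the full scraped list is not reproduced here.
sample_genres = [
    "Mystery, Thriller & Suspense",
    "Science Fiction & Fantasy",
    "Thriller & Suspense",
    "Fitness, Diet & Nutrition",
]
with_comma = [name for name in sample_genres if "," in name]
all_same_form = all(is_list_and_form(name) for name in with_comma)

# Share reported above: 13 of 212 unique genres contain a comma.
comma_share = 13 / 212
print(with_comma, all_same_form, f"{comma_share:.1%}")
```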

NitzanNougat commented 4 months ago

Hi, thanks for the quick reply :)

tbh, I don't mind splitting these unique examples down the middle. For example, for the genre "Fitness, Diet & Nutrition," I'm okay if "Fitness" ends up as a separate genre. It might even help if I'm searching for just "Fitness," as it would show up in that category instead of only under "Fitness, Diet & Nutrition," which might be specific to Audible.

I'm thinking a possible (ugly) idea might just be to check for the unique cases you mentioned, specifically from Audible:

For genre tags that don't contain one of the unique genres, just use ',' as a regular separator. For the unique genres, maybe remove the substring from the genreTag, then separate by ',' and insert the unique genre back later (or something like it, but cleaner)?
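One way this idea could look, as a rough sketch rather than a proposed patch. The function name, the placeholder scheme, and the short `known_comma_genres` list are all hypothetical; a real list would have to come from the scraped Audible categories.

```python
def split_genres_protecting_known(raw, known_comma_genres):
    """Split on commas, but keep known comma-containing genres intact.

    Hypothetical helper: known names are masked with placeholder tokens,
    the rest is split on ',', and the masked names are restored afterwards.
    """
    placeholders = {}
    masked = raw
    # Longest names first, so a longer known genre is never partially consumed.
    for index, name in enumerate(sorted(known_comma_genres, key=len, reverse=True)):
        if name in masked:
            token = f"\x00{index}\x00"  # contains no comma, so it survives the split
            placeholders[token] = name
            masked = masked.replace(name, token)
    pieces = [piece.strip() for piece in masked.split(",")]
    return [placeholders.get(piece, piece) for piece in pieces if piece]


# Assumed sample of comma-containing Audible genres (names taken from this thread).
known_comma_genres = ["Mystery, Thriller & Suspense", "Fitness, Diet & Nutrition"]

raw_tag = "Mystery, Thriller & Suspense, Science Fiction & Fantasy, Fitness"
naive = [piece.strip() for piece in raw_tag.split(",")]
protected = split_genres_protecting_known(raw_tag, known_comma_genres)
print(naive)
print(protected)
```

The naive split breaks "Mystery, Thriller & Suspense" into two entries, while the protected split keeps it whole and preserves the original order. Because this relies on plain substring matching, it only works for genres that are in the known list.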

Thanks!

mikiher commented 4 months ago

In the meantime, until this is resolved, running a match in Audiobookshelf with Audible.com as provider will get this fixed for you effortlessly.

NitzanNougat commented 4 months ago

I updated to the newest version of ABS and ran a match.
Afterward, I noticed that the genres are still the same. Do I need to delete all the genres and run a match again?

Anyhow, I noticed that book tags are separated by commas, though I didn't check this before the update.
And tbh, searching by tags instead of genres works well enough for me.

mikiher commented 4 months ago

In Audiobookshelf Settings, there's an option called "Prefer matched metadata". Turn that option on, and then matching will override existing metadata.

NitzanNougat commented 4 months ago

Great, it has overridden the previous genres, and it didn't split up genres like Mystery, Thriller & Suspense.

FYI, I have found only 1 genre that it didn't split up: [Wars & Conflicts, Greece, Civilization], which should be 3 separate genres, but that is a minor edge case.

Really appreciate the quick help!

mikiher commented 4 months ago

@advplyr going back to the original discussion - from my perspective, we're trying to get as much data as possible from the input audio file, with the highest accuracy possible.

With that view in mind, what I'm trying to do is to get genres from Libation-encoded audio files with ~94% accuracy (given the stats we have from the Audible category page), instead of getting them wrong almost every time there's more than one genre. To check this, I looked at the Libation export data from my own Audible library. The library contains 451 books, of which 374 have more than 1 genre. This means that accuracy using the current scanning algorithm would be ~((451-374)/451)=~17%.

So I'm trying to trade 17% accuracy for 94% accuracy. Plus, I'm willing to scrape all genres containing a comma from Audible (I don't think their list of genres is very dynamic), and match against these, so we're 100% accurate on Libation-encoded books.

Does this make sense?
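The arithmetic behind the two figures, spelled out with the numbers quoted in this thread. Note that the ~94% is the share of unique genres without a comma, used here as a rough stand-in for per-book accuracy, as in the comment above.

```python
# Figures quoted in this thread.
total_books = 451
multi_genre_books = 374
unique_genres = 212
genres_with_comma = 13

# Current behavior: only single-genre books get their genres right.
current_accuracy = (total_books - multi_genre_books) / total_books

# Splitting on commas: only genres that themselves contain a comma would be split wrongly.
comma_split_accuracy = 1 - genres_with_comma / unique_genres

print(f"current: {current_accuracy:.1%}")          # 77 / 451
print(f"comma split: {comma_split_accuracy:.1%}")  # 199 / 212
```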

mikiher commented 4 months ago

> Great, it has overridden the previous genres, and it didn't split up genres like Mystery, Thriller & Suspense.

Yes, that's expected. The provider we use returns genres one by one, not as a comma-separated list, so we can tell the genres for sure.

> FYI, I have found only 1 genre that it didn't split up: [Wars & Conflicts, Greece, Civilization], which should be 3 separate genres, but that is a minor edge case.

Can you tell me the book name and author for which this happened?


NitzanNougat commented 4 months ago

A War Like No Other: How the Athenians and Spartans Fought the Peloponnesian War, by Victor Davis Hanson

advplyr commented 4 months ago

I think it will be confusing if we split comma-separated lists but leave the Audible genres that contain commas intact. Has anyone opened an issue with this software, which is the only one embedding genres with commas? The algorithm should be straightforward about which delimiters we support. I don't mind splitting those Audible genres up personally, but we may have other users using commas in their genres. I can ask in the Discord.