advplyr / audiobookshelf

Self-hosted audiobook and podcast server
https://audiobookshelf.org
GNU General Public License v3.0
6.37k stars 450 forks source link

[Bug]: metadata.json parser does not work for series sequence with whitespace #2380

Open wtanksleyjr opened 10 months ago

wtanksleyjr commented 10 months ago

Describe the issue

[EDIT: my example is wrong, I'll post a comment showing why. I'm not editing this apart from this note.]

When scanner finds a new book with a metadata.json file, it doesn't parse the series field correctly if the series doesn't already exist. For example, suppose the file contains "series": [ "The Pact #1"].

If there's already a series named The Pact defined, this will add a new book with its ordering=1 (as expected). But if there's no series defined with that name, the loader will create a new series named The Pact #1 (so it parses the ordering as part of the series name).

Steps to reproduce the issue

  1. Put an audio file and a metadata.json file together as indicated above, using a series name not already present on your setup.
  2. Move the folder containing both into the path of the scanner.
  3. Edit the resulting book, and edit its Series field. Observe that the whole string is placed in the series name.
  4. Delete the book entry and clean up.
  5. Add the same series name to some other book (by hand).
  6. Repeat steps 1-2 again. Observe that this time, the book's series loads as expected.

Audiobookshelf version

2.6.0

How are you running audiobookshelf?

Docker

wtanksleyjr commented 10 months ago

When scanner finds a new book with a metadata.json file, it doesn't parse the series field correctly if the series doesn't already exist. For example, suppose the file contains "series": [ "The Pact #1"].

My mistake. In order to trigger this bug, the series number HAS to include extra text. For an actual example, "Corum #4, Dramatized Adaptation" will trigger this (that's by Moorcock BTW). The intent of this extra text is to allow a series to include multiple editions of each book, so you might have Corum #1 as narrator-only, while #4 might be available both narrated and Graphical Audio. Audible actually includes this text inside the series number, and ABS does support that when doing a Match.

So it turns out your parser's perfectly fine with "The Pact #1" and so on; it's only when you include more stuff in the series-order field that it gets hairy.

Sorry about that, I hope this helps.

advplyr commented 10 months ago

Just to clarify, in some series you are storing the sequence as "4, Dramatized Adaptation"?

I'm not sure yet if we should support this now that we have moved to sqlite because in order to sort the sequences properly we have to cast it to a number. Currently there is no restriction as to what you can put in there but I think this should be changed to require it to be a number I just haven't thought of a good way to roll that out without breaking things for users already using alphanumeric.

advplyr commented 10 months ago

The parser that is pulling out the sequence number doesn't work for Corum #4, Dramatized Adaptation specifically because of the spaces. Corum #4,DramatizedAdaptation would work but as I mentioned above I don't think this should be done since the sequence may not be sorted correctly

wtanksleyjr commented 10 months ago

I kind of want to protest, I don't want to lose the information Audible includes in that field. But ... I also don't totally mind. Even though I had assumed that would be preserved, so I also stored some info in those fields (for example I have podcast audiobooks I've used the alphabet to number). I'll have to find some way to search for those and fix them by hand.

However, let's ignore that for now ... the only thing needed here is parsing the series name, and whatever solution works is what we'll have to do.

On Mon, Dec 4, 2023 at 12:49 PM advplyr @.***> wrote:

The parser that is pulling out the sequence number doesn't work for Corum

4, Dramatized Adaptation specifically because of the spaces.

Corum #4,DramatizedAdaptation would work but as I mentioned above I don't think this should be done since the sequence may not be sorted correctly

— Reply to this email directly, view it on GitHub https://github.com/advplyr/audiobookshelf/issues/2380#issuecomment-1839452744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ7H6OZBSRN4N54T7TVP43YHYZNXAVCNFSM6AAAAABADXHP4KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMZZGQ2TENZUGQ . You are receiving this because you authored the thread.Message ID: @.***>

advplyr commented 10 months ago

This seems like it may be a mistake on audibles side. It doesn't make sense that would be part of the sequence. Here is the link to the API output for future reference https://api.audible.com/1.0/catalog/products/1648816800?response_groups=series,product_details

Do you have any more examples of audible doing that?

As a solution to this we should automatically split at the first space and don't include anything after. Same with user input

wtanksleyjr commented 10 months ago

It's present in almost every Graphical Audio book; as I mentioned the Match function returns the same results. The "Corum" series is the example I used. It's less common than plain integers (but don't forget decimal numbers), but still not rare.

One other comment: the order can be empty as well, and it turns out that the parser does the same thing in that case (it creates a new series starting in the same name but ending in " #").