Kareadita / Kavita

Kavita is a fast, feature rich, cross platform reading server. Built with the goal of being a full solution for all your reading needs. Set up your own server and share your reading collection with your friends and family.
http://www.kavitareader.com
GNU General Public License v3.0

Filename keywords for comic specials don't work properly #1618

Open pssandhu opened 1 year ago

pssandhu commented 1 year ago

Describe the bug This page of the wiki lists special keywords that can be used in filenames to get Kavita to mark them as specials. Several of these keywords do not work properly.

To Reproduce Steps to reproduce the behavior:

  1. Have a library with this folder structure (the files have no metadata in them):
Library Root/
    |-- Batman Beyond (1999)/
        |-- Batman Beyond 01.cbz
        |-- Batman Beyond ...
        |-- Batman Beyond 06.cbz
        |-- Batman Beyond Annual 01.cbz
        |-- Batman Beyond Annual 02.cbz
  2. Scan the library
  3. The issues show up in Kavita correctly but the annuals are not there

Expected behavior The annuals should be in the library and marked as specials

Additional context On a fresh install the annuals showed up as a separate series called Batman Beyond Annual but I'm unable to reproduce that after renaming the files. Might need to do a fresh install to see this.

I tested various filenames one at a time (shown below). Some of these are unlikely to be real filenames like Batman Beyond Book.cbz or Batman Beyond Annual.cbz but I thought testing these might be useful.

Treated as a special:

01 Annual Batman Beyond.cbz
Batman Beyond Omnibus.cbz
Batman Beyond Omnibus (1999).cbz
Batman Beyond TPB.cbz
Batman Beyond Bonus.cbz
Batman Beyond Specials.cbz
Batman Beyond OneShot.cbz

Treated as a volume:

Batman Beyond TPB v01.cbz

Not added to library:

Batman Beyond Annual 01 (1999).cbz
Batman Beyond Annual 01.cbz
Batman Beyond Annual #001.cbz
Batman Beyond Annual (1999).cbz
Batman Beyond Annual.cbz
1 Annual Batman Beyond.cbz
Batman Beyond Omnibus 01.cbz
Batman Beyond Omnibus 1.cbz
Batman Beyond TPB 1.cbz
Batman Beyond Book 01.cbz
Batman Beyond Book.cbz

majora2007 commented 1 year ago

Hi @tjarls, this issue looks to come from https://github.com/Kareadita/Kavita/pull/1531, where specials were word bounded to prevent false positives. However, from testing, I'm seeing that "Batman Beyond Annual" and "Ippo - Artbook" are not considered specials.

From the regex: |\d.+?\WAnnual|Annual\W\d.+?|, you'd think it'd work, but it does not match. Any ideas?
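To see why it does not match, the two Annual alternatives can be tried in isolation. This is a minimal sketch using Python's `re` rather than .NET (Kavita itself is C#). It assumes no regex flags and feeds in the bare name, which may differ from the string Kavita actually passes to the pattern. Both alternatives require a digit next to "Annual", so a name without a number cannot match.

```python
import re

# Only the two "Annual" alternatives quoted above; no flags (an assumption).
ANNUAL_FRAGMENT = re.compile(r"\d.+?\WAnnual|Annual\W\d.+?")

def matched_text(name):
    """Return the matched substring, or None when the fragment does not match."""
    m = ANNUAL_FRAGMENT.search(name)
    return m.group(0) if m else None

# No digit on either side of "Annual", so neither alternative can match.
print(matched_text("Batman Beyond Annual"))       # None
# A digit follows "Annual", so the second alternative matches.
print(matched_text("Batman Beyond Annual 01"))    # Annual 01
# A digit precedes "Annual", so the first alternative matches.
print(matched_text("Action Comics 2021 Annual"))  # 2021 Annual
```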

majora2007 commented 1 year ago

It seems the \d is our culprit, so what if we just left \d out and changed it to .\WAnnual, which would still behave very similarly?

majora2007 commented 1 year ago

I settled on \b(?:\d.+?(\W|-|^)Annual|Annual(\W|-|$))\b (just looking at Annual). This lets us pass all the unit tests and works against this case as well.
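A quick check of that Annual-only pattern, again as a Python `re` sketch with no flags (an assumption; Kavita is C# and may apply options not shown here). It suggests the bare keyword matches with or without a number, while a longer word that merely starts with "Annual" does not. The last name in the list is a hypothetical example.

```python
import re

# The Annual-only pattern quoted above, copied verbatim; no flags (an assumption).
ANNUAL_ONLY = re.compile(r"\b(?:\d.+?(\W|-|^)Annual|Annual(\W|-|$))\b")

def is_special(name):
    """True when the Annual-only pattern matches anywhere in the name."""
    return ANNUAL_ONLY.search(name) is not None

names = (
    "Batman Beyond Annual",
    "Batman Beyond Annual 01",
    "Action Comics 2021 Annual",
    "Batman Beyond Annuals",  # hypothetical name: "Annual" followed by a word character
)
for name in names:
    print(name, "->", is_special(name))
```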

tjarls commented 1 year ago

That behaviour predates #1531 and has nothing to do with word bounding the keywords used for specials. I suspect the intention was indeed to only match Annual alongside a year or just a number. Comic annuals almost always have a date or at least a number alongside the word "annual" (for example Action Comics 2021 Annual, Amazing Spider-Man Annual 2). Here's the original regex from before the word bounding changes:

@"(?<Special>Specials?|OneShot|One\-Shot|\d.+?(\W|_|-)Annual|Annual(\W|_|-)\d.+?|Extra(?:(\sChapter)?[^\S])|Book \d.+?|Compendium \d.+?|Omnibus \d.+?|[_\s\-]TPB[_\s\-]|FCBD \d.+?|Absolute \d.+?|Preview \d.+?|Art Collection|Side(\s|_)Stories|Bonus|Hors Série|(\W|_|-)HS(\W|_|-)|(\W|_|-)THS(\W|_|-))",

The unit test for Annual Days of Summer is further evidence that this was indeed legacy behaviour, so there hasn't been any regression on that side.
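The legacy behaviour can be reproduced with only the Annual alternatives of the regex above. This is a Python `re` sketch rather than .NET, with no flags assumed. For contrast it also runs the Annual-only pattern quoted earlier in the thread, which does match the series name.

```python
import re

# Annual alternatives from the legacy regex above (a digit is required next to "Annual").
LEGACY_ANNUAL = re.compile(r"\d.+?(\W|_|-)Annual|Annual(\W|_|-)\d.+?")
# Annual-only pattern quoted earlier in the thread (no digit required).
WIDENED_ANNUAL = re.compile(r"\b(?:\d.+?(\W|-|^)Annual|Annual(\W|-|$))\b")

def compare(name):
    """Return (legacy_matches, widened_matches) for a file or series name."""
    return (
        LEGACY_ANNUAL.search(name) is not None,
        WIDENED_ANNUAL.search(name) is not None,
    )

print(compare("Annual Days of Summer"))      # (False, True)
print(compare("Action Comics 2021 Annual"))  # (True, True)
```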

On the other hand, the widening of the match for Annual and, even more so, for Absolute, is introducing quite a few additional false positives. Some examples I have spotted in my own library

While there are mechanisms to tag individual issues as special, there isn't a way of doing the opposite, short of actually changing the series names of those affected.

So I do not think it's a good idea to widen the filename special matching mechanism this much for Absolute and, to a lesser extent, for Annual without a year. There are plenty of other words that are occasionally used to denote a special edition of a series (Definitive, Collected, Album, Digest, ...) that are not automatically detected. Covering all of them is both unrealistic and unnecessary, as explicit tagging with SP#, for example, or setting the ComicInfo "special" tag already covers those cases. We do not have a "not-special" tag for the reverse use case. So the current fix looks to me like a regression, where we are trading minor annoyances that are already easily solvable for an issue without a solution and often a far worse user experience (all issues from a series suddenly tagged as special and no longer properly sorted).

Finally, if we decide to keep the current fix, the regular expression is unnecessarily complex. To achieve the desired goal of matching Annual, Absolute, etc., a more efficient regex would simply be:

$@"\b(?:{CommonSpecial}|Annual|Book \d.+?|Compendium|Omnibus|FCBD \d.+?|Absolute|Preview|Hors[ -]S[ée]rie|TPB|HS|THS)\b"
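As a rough illustration of how that simplified pattern behaves, here is a Python `re` sketch. `COMMON_SPECIAL` stands in for the interpolated `{CommonSpecial}` fragment, whose contents are not shown in this thread, so the value below is a placeholder assumption borrowed from the legacy regex. Flags are also assumed to be absent, and "Absolutely Anything" is a hypothetical title.

```python
import re

# Hypothetical stand-in for {CommonSpecial}; not Kavita's actual value.
COMMON_SPECIAL = r"Specials?|OneShot|One\-Shot"

SIMPLIFIED = re.compile(
    rf"\b(?:{COMMON_SPECIAL}|Annual|Book \d.+?|Compendium|Omnibus|FCBD \d.+?"
    rf"|Absolute|Preview|Hors[ -]S[ée]rie|TPB|HS|THS)\b"
)

def keyword(name):
    """Return the keyword text that triggered the match, or None."""
    m = SIMPLIFIED.search(name)
    return m.group(0) if m else None

print(keyword("Batman Beyond Omnibus"))  # Omnibus
print(keyword("Batman Beyond Book 01"))  # Book 01
print(keyword("Annual Days of Summer"))  # Annual
print(keyword("Absolutely Anything"))    # None
```

The word boundaries keep "Absolutely" from matching, but a series whose title contains one of the keywords as a whole word, such as Annual Days of Summer, is still picked up.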

majora2007 commented 1 year ago

Hmmm...you make some good points and it's good to also learn more about how some keywords are used in comics, as I do not collect them myself. As we can support this via SP# or the ComicInfo Special tag, it does make sense to reduce false positives rather than support looser rules.

Do you suggest removing Absolute and Annual (without year) altogether? I need guidance on comic support. I'm also not sure how often these keywords are used.

robsonsobral commented 1 year ago

First comment! I just have found this project and I'm still reading everything about it. It looks really cool.

Why not require separators? The word "Annual" would then only be recognized if there's a separator such as - before it. #, -, (, ), [ and ] could be used as separators.

Is that a bad idea?
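One way to read this proposal, sketched with Python's `re`: treat "Annual" as a special keyword only when one of the listed separators appears directly before it, optionally followed by whitespace. The pattern and names below are illustrative assumptions, not anything that exists in Kavita.

```python
import re

# Hypothetical pattern: a separator (# - ( ) [ ]), optional whitespace,
# then "Annual" as a whole word.
SEPARATED_ANNUAL = re.compile(r"[#\-()\[\]]\s*Annual\b")

def is_separated_annual(name):
    """True when "Annual" is preceded by one of the proposed separators."""
    return SEPARATED_ANNUAL.search(name) is not None

print(is_separated_annual("Batman Beyond - Annual 01"))      # True
print(is_separated_annual("Batman Beyond (Annual) 01"))      # True
print(is_separated_annual("Batman Beyond Annual 01"))        # False
print(is_separated_annual("Annual Days of Summer"))          # False
print(is_separated_annual("Amazing Spider-Man Annual 001"))  # False
```

The hyphen inside "Spider-Man" does not count, because the separator has to sit immediately before "Annual". The trade-off is that existing files named without a separator would need renaming.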

zzyzx-dc commented 1 year ago

This occurred for me as well. As a test, I created a file Amazing Spider-Man issue 001 as well as Amazing Spider-Man Annual 001. They're lumped together as the same issue in Kavita.

"OneShot" followed by a number also results in the special being listed as a duplicate of an issue in the library view, so I think the keywords are simply ignored if there is an issue number.

"Annual" or "OneShot" by itself is recognized and placed under the "Special" pane in the library view.

Finally, putting the number in parentheses, like 'Annual (001)', separates the issues appropriately into the Specials pane. They are treated as having no number, probably because Kavita ignores parentheses, but the number still appears in the displayed name, so that seems like a decent workaround for now.

majora2007 commented 3 months ago

I know this is an old issue, but I'm now considering removing special keyword parsing. Would love feedback on the other issue: https://github.com/Kareadita/Kavita/issues/2967