Problem searching for comic with many special characters and a slash in its series title

hotcereal commented 6 months ago

Description of the bug

The only comic that I can immediately test with is "Batman/Superman: World’s Finest (2022)"

Searching for any issue results in empty results, but manually placing the comic in its respective folder does allow it to be matched if named properly.

For ongoing series, it presents a problem wherein users will have to manually track and download comics as they come out. For older comics, manually importing the comic is a one-time ordeal; much less frustrating. It is worth noting that the comic is available via GC and the links appear to be bountiful for each issue.

To Reproduce

1. Go to 'Add Volume'.
2. Search for "Batman/Superman: World’s Finest (2022)" and click 'Add Volume'.
3. Click 'Search Monitored', or search manually by clicking the human glyph icon.
4. The search returns no results: an empty list.

Expected behaviour

See results, download them, have them mapped/matched.

Screenshots

Version info

Kapowarr version: v1.0.0-beta-4
Python version: 3.8.17.final.0
Database version: 14

Additional context

This is the only comic that has this specific issue. I can't figure out what the cause could be, but my leading hunch is that the slash sits directly between two letters, the same way it would in a file path. That makes me believe Kapowarr may be looking for a directory (to some extent) as opposed to the comic itself.

If it's not that, my only other guess is the abundance of special characters. The name of the series has 2 ( / and -) and a lot of the issue titles feature a semicolon. However, semicolons in issue titles are relatively common.

Casvt commented 6 months ago

I don't know exactly where or when I fixed this, but on the development branch it seems to already be fixed. So just wait for the next release and you'll have the fix too.

hotcereal commented 6 months ago

Would the cause of this problem also explain why issues unmatch when their file name has numbers at the beginning, or numbers that don't pertain to the issue or volume?

For example:

The Wicked + The Divine (2014) - 001 Once Again - [2014-06-30].cbr will work fine with little to no issue. However, if the file's name were 001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr, then a new scan will make it unmatch.

The same issue can be found with titles like The Wicked + The Divine (2014) - 035 1-2-3-4! ; The Curse in My Hands - [2018-04-30] but removing the 1-2-3-4 will make it match perfectly fine.

Casvt commented 6 months ago

To start with, these are all unrelated problems. The search issue had nothing to do with Kapowarr thinking it's a folder. Matching files to volumes is not related either. But that isn't a problem.

Kapowarr matches files with an algorithm that uses patterns to extract data from the filename. Based on this data, it checks if the file is a match and for which issue. The algorithm does not work completely correctly with these filenames, for various reasons.

The filename The Wicked + The Divine (2014) - 001 Once Again - [2014-06-30].cbr works fine:

{
    "series": "The Wicked The Divine",
    "year": 2014,
    "volume_number": 1,
    "special_version": null,
    "issue_number": 1.0,
    "annual": false
}
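
As a rough illustration of what pattern-based extraction looks like (a minimal sketch, not Kapowarr's actual code; `extract_filename_data`, `YEAR_RE` and `ISSUE_RE` are names made up for this example, and only three of the fields are covered):

```python
import re
from typing import Optional, TypedDict


class FileData(TypedDict):
    series: str
    year: Optional[int]
    issue_number: Optional[float]


# Hypothetical patterns, for illustration only; these are not Kapowarr's.
YEAR_RE = re.compile(r"\((\d{4})\)")
ISSUE_RE = re.compile(r"\)\s*-\s*(\d+(?:\.\d+)?)\b")


def extract_filename_data(filename: str) -> FileData:
    """Pull series, year and issue number out of a filename."""
    year_match = YEAR_RE.search(filename)
    issue_match = ISSUE_RE.search(filename)

    # Everything before the "(year)" is treated as the series name.
    raw_series = filename[: year_match.start()] if year_match else filename
    # Replace special characters such as "+" and collapse the whitespace.
    series = " ".join(re.sub(r"[^\w\s]", " ", raw_series).split())

    return {
        "series": series,
        "year": int(year_match.group(1)) if year_match else None,
        "issue_number": float(issue_match.group(1)) if issue_match else None,
    }


print(extract_filename_data(
    "The Wicked + The Divine (2014) - 001 Once Again - [2014-06-30].cbr"
))
# {'series': 'The Wicked The Divine', 'year': 2014, 'issue_number': 1.0}
```

The sketch assumes the `Series (Year) - Issue Title` layout; anything that deviates from it needs its own pattern, which is where the trouble with the other filenames starts.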

The filename 001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr does not work, and I agree with that. Here, 001 Once Again means issue 1 and issue title Once Again. But how could we differentiate that from 100 Bullets Volume 2 Issue 3, where the volume is called 100 Bullets? We can fix it by explicitly stating that the next number is the issue number. See below:

# BAD: 001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr
{
    "series": "001 Once Again The Wicked The Divine",
    "year": 2014,
    "volume_number": 1,
    "special_version": "tpb",
    "issue_number": null,
    "annual": false
}
# GOOD: Issue 001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr
{
    "series": "", # Series name after issue number is not supported; it'll be extracted from the folder name
    "year": 2014,
    "volume_number": 1,
    "special_version": null,
    "issue_number": 1.0,
    "annual": false
}

The filename The Wicked + The Divine (2014) - 035 1-2-3-4! ; The Curse in My Hands - [2018-04-30] doesn't work because the algorithm mistakes part of the issue title for an issue range (it thinks the file covers the issues -3 to 4). I most likely cannot fix this (without breaking other stuff), but I'll at least try and will get back to you either way.
```
{
    "series": "The Wicked The Divine",
    "year": 2014,
    "volume_number": 1,
    "special_version": null,
    "issue_number": [
        -3.0,
        4.0
    ],
    "annual": false
}
```
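
To show how a misread like this can come about, here is a small sketch with a hypothetical range pattern. It is an assumption for illustration only and not the pattern Kapowarr uses: it accepts two numbers joined by a dash, allows the first to be negative, and requires that no digits follow. The date suffix is left out of the input to keep the example small.

```python
import re
from typing import Optional, Tuple

# Hypothetical range pattern: two numbers joined by a dash, the first
# optionally negative, with no further digits later in the string.
RANGE_RE = re.compile(r"(-?\d+)-(\d+)(?=\D*$)")


def naive_issue_range(name: str) -> Optional[Tuple[float, float]]:
    """Return the (start, end) of an issue range, or None if none is found."""
    match = RANGE_RE.search(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(naive_issue_range(
    "The Wicked + The Divine (2014) - 035 1-2-3-4! ; The Curse in My Hands"
))
# (-3.0, 4.0)
```

The same pattern reads a genuine range such as `001-003` correctly, which is why tightening it without breaking other filenames is not trivial.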

hotcereal commented 6 months ago

Insanely useful information, thank you.

Casvt commented 6 months ago

Okay I checked the code and tested stuff out.

Firstly, I advised adding 'Issue' to the start to explicitly signify that the number is the issue number. Reading the code, I was reminded of an alternative that I had added support for, which is adding a dash:

# BAD: 001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr
{
    "series": "001 Once Again The Wicked The Divine",
    "year": 2014,
    "volume_number": 1,
    "special_version": "tpb",
    "issue_number": null,
    "annual": false
}
# GOOD: Issue 001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr
{
    "series": "", # Series name after issue number is not supported; it'll be extracted from the folder name
    "year": 2014,
    "volume_number": 1,
    "special_version": null,
    "issue_number": 1.0,
    "annual": false
}
# ALSO GOOD: 001 - Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr
{
    "series": "", # Series name after issue number is not supported; it'll be extracted from the folder name
    "year": 2014,
    "volume_number": 1,
    "special_version": null,
    "issue_number": 1.0,
    "annual": false
}
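
The rule behind these three cases can be sketched as follows. This is a hedged illustration under my own assumptions, not the project's code; `LEADING_ISSUE_RE` and `leading_issue_number` are hypothetical names. A number at the start of the filename only counts as the issue number when it is prefixed with "Issue" or directly followed by a dash.

```python
import re
from typing import Optional

# Hypothetical pattern: a leading number is only an issue number when it is
# prefixed with "Issue" or directly followed by " - ".
LEADING_ISSUE_RE = re.compile(
    r"^(?:issue\s+(\d+)\b|(\d+)\s+-\s)",
    re.IGNORECASE,
)


def leading_issue_number(filename: str) -> Optional[float]:
    """Return the explicitly marked leading issue number, if there is one."""
    match = LEADING_ISSUE_RE.match(filename)
    if match is None:
        return None
    # Exactly one of the two groups is filled, depending on the alternative.
    return float(match.group(1) or match.group(2))


for name in (
    "001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr",
    "Issue 001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr",
    "001 - Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr",
    "100 Bullets Volume 2 Issue 3.cbr",
):
    print(leading_issue_number(name), "<-", name)
```

With this rule, `100 Bullets Volume 2 Issue 3` keeps its leading number as part of the series name, while the two explicitly marked filenames yield issue 1.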

Secondly, we have the 1-2-3-4 issue title messing up the algorithm. I managed to fix the algorithm without breaking anything else (which is a small miracle):

# The Wicked + The Divine (2014) - 035 1-2-3-4! ; The Curse in My Hands - [2018-04-30]
{
    "series": "The Wicked The Divine",
    "year": 2014,
    "volume_number": 1,
    "special_version": null,
    "issue_number": 35.0,
    "annual": false
}

This fix will be available in the next release.

EDIT:

Just for reference, there is a file in the project full of tests for the algorithm. Every single filename in that file needs to be processed correctly by the algorithm, at all times and independently of the others. That makes it pretty complicated to alter the algorithm (in order to fix it for filenames like the ones you presented) without breaking it for some other filename in that file. Together, these tests try to cover as many filename formats as possible.
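
The idea of such a test file can be sketched as a table of filenames with expected results, each checked on its own. Everything below is a simplified stand-in written for this example: `issue_number`, `PATTERNS` and `CASES` are hypothetical and do not reflect the real test file or the real algorithm.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical stand-in for the extraction algorithm: the first pattern
# that matches decides the issue number.
PATTERNS = [
    re.compile(r"^issue\s+(\d+)\b", re.IGNORECASE),  # "Issue 001 ..."
    re.compile(r"^(\d+)\s+-\s"),                     # "001 - ..."
    re.compile(r"\(\d{4}\)\s+-\s+(\d+)\b"),          # "... (2014) - 001 ..."
]


def issue_number(filename: str) -> Optional[float]:
    for pattern in PATTERNS:
        match = pattern.search(filename)
        if match:
            return float(match.group(1))
    return None


# Each entry is (filename, expected issue number), taken from this thread.
CASES: List[Tuple[str, Optional[float]]] = [
    ("The Wicked + The Divine (2014) - 001 Once Again - [2014-06-30].cbr", 1.0),
    ("Issue 001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr", 1.0),
    ("001 - Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr", 1.0),
    ("001 Once Again - The Wicked + The Divine (2014) - [2014-06-30].cbr", None),
    ("The Wicked + The Divine (2014) - 035 1-2-3-4! ; The Curse in My Hands - [2018-04-30]", 35.0),
]


def failing_cases() -> List[Tuple[str, Optional[float], Optional[float]]]:
    """Run every case independently and collect the ones that fail."""
    failures = []
    for filename, expected in CASES:
        actual = issue_number(filename)
        if actual != expected:
            failures.append((filename, expected, actual))
    return failures


print(failing_cases())
```

Any change to a pattern has to leave this list of failures empty, which is what makes a fix for one filename risky for all the others.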

hotcereal commented 5 months ago

Thanks for this, I genuinely appreciate it. It'll definitely help me organize my files in a more cohesive way going forward.

Casvt / Kapowarr

Problem searching for comic with many special characters and a slash in its series title #147