ajslater / comicfn2dict

Parse common comic filenames and return a dict of metadata attributes
GNU General Public License v3.0

Additional Test Cases #2

Closed lordwelch closed 7 months ago

lordwelch commented 8 months ago

I've included some of the test cases from ComicTagger (CT) that this project doesn't handle the same way. These are a bit opinionated, so not all of them necessarily need to be "fixed". I put in comments explaining how or why CT handles most of them. I also left out the scan_info/remainders, as CT and this project have different cleanup strategies for them.

Also note that CT (with the complicated parser) parses all of these the same way whether or not the # is present. The exception is the first one: without the #, CT doesn't find the issue number.

FNS.update(
    {  # Issue number starting with a letter requested in https://github.com/comictagger/comictagger/issues/543
        "batman #B01 title.cbz": {
            "ext": "cbz",
            "issue": "B01",
            "series": "batman",
            "title": "title",
        },  # Leading issue number is usually an alternate sequence number
        "52 action comics #2024.cbz": {
            "ext": "cbz",
            "issue": "2024",
            "series": "action comics",
            "alternate": "52",
        },  # 4 digit issue number
        "action comics 1024.cbz": {
            "ext": "cbz",
            "issue": "1024",
            "series": "action comics",
        },  # Only the issue number. CT ensures that the series always has a value if possible
        "#52.cbz": {
            "ext": "cbz",
            "issue": "52",
            "series": "52",
        },  # CT treats double-underscore the same as double-dash
        "Monster_Island_v1_#2__repaired__c2c.cbz": {
            "ext": "cbz",
            "issue": "2",
            "series": "Monster Island",
            "volume": "1",
        },  # I'm not sure there's a right way to parse this. This might also be a made-up filename; I don't remember
        "Super Strange Yarns (1957) #92 (1969).cbz": {
            "ext": "cbz",
            "issue": "92",
            "series": "Super Strange Yarns",
            "volume": "1957",
            "year": "1969",
        },  # Extra - in the series
        " X-Men-V1-#067.cbr": {
            "ext": "cbr",
            "issue": "067",
            "series": "X-Men",
            "volume": "1",
        },  # CT only separates this into a title if the '-' is attached to the previous word eg 'aquaman- Green Arrow'. @bpepple opened a ticket for this https://github.com/ajslater/comicfn2dict/issues/1 already
        "Aquaman - Green Arrow - Deep Target #01 (of 07) (2021).cbr": {
            "ext": "cbr",
            "issue": "01",
            "series": "Aquaman - Green Arrow - Deep Target",
            "year": "2021",
            "issue_count": "7",
        },
        "Batman_-_Superman_#020_(2021).cbr": {
            "ext": "cbr",
            "issue": "020",
            "series": "Batman - Superman",
            "year": "2021",
        },
        "Free Comic Book Day - Avengers.Hulk (2021).cbz": {
            "ext": "cbz",
            "series": "Free Comic Book Day - Avengers Hulk",
            "year": "2021",
        },  # CT assumes the volume is also the issue number if it can't find an issue number
        "Avengers By Brian Michael Bendis volume 03 (2013).cbz": {
            "ext": "cbz",
            "issue": "3",
            "series": "Avengers By Brian Michael Bendis",
            "volume": "03",
            "year": "2013",
        },  # Publishers like to re-print some of their annuals using this format for the year
        "Batman '89 (2021) .cbr": {
            "ext": "cbr",
            "series": "Batman '89",
            "year": "2021",
        },  # CT has extra processing to re-attach the year in this case
        "Blade Runner Free Comic Book Day 2021 (2021).cbr": {
            "ext": "cbr",
            "series": "Blade Runner Free Comic Book Day 2021",
            "year": "2021",
        },  # CT treats book like 'v' but also adds it as the title (matches ComicVine for this particular series)
        "Bloodshot Book 03 (2020).cbr": {
            "ext": "cbr",
            "issue": "03",
            "series": "Bloodshot",
            "title": "Book 03",
            "volume": "03",
            "year": "2020",
        },  # CT checks for the following '(of 06)' after the '03' and marks it as the volume
        "Elephantmen 2259 #008 - Simple Truth 03 (of 06) (2021).cbr": {
            "ext": "cbr",
            "issue": "008",
            "series": "Elephantmen 2259",
            "title": "Simple Truth",
            "volume": "03",
            "year": "2021",
            "volume_count": "06",
        },  # CT catches the year
        "Marvel Previews #002 (January 2022).cbr": {
            "ext": "cbr",
            "issue": "002",
            "series": "Marvel Previews",
            "year": "2022",
        },  # c2c aka "cover to cover" is fairly common and CT moves it to scan_info/remainder
        "Marvel Two In One V1 #090  c2c.cbr": {
            "ext": "cbr",
            "issue": "090",
            "series": "Marvel Two In One",
            "publisher": "Marvel",
            "volume": "1",
        },  # This made the parser in CT much more complicated. It's understandable that this isn't parsed on the first few iterations of this project
        "Star Wars - War of the Bounty Hunters - IG-88 (2021).cbz": {
            "ext": "cbz",
            "series": "Star Wars - War of the Bounty Hunters - IG-88",
            "year": "2021",
        },  # The addition of the '#1' turns this into the same as 'Aquaman - Green Arrow - Deep Target' above
        "Star Wars - War of the Bounty Hunters - IG-88 #1 (2021).cbz": {
            "ext": "cbz",
            "issue": "1",
            "series": "Star Wars - War of the Bounty Hunters - IG-88",
            "year": "2021",
        },  # CT treats '[]' as equivalent to '()', catches DC as a publisher and 'Sep-Oct 1951' as dates and removes them. CT doesn't catch the digital though so that could be better but I blame whoever made this atrocious filename
        "Wonder Woman #49 DC Sep-Oct 1951 digital [downsized, lightened, 4 missing story pages restored] (Shadowcat-Empire).cbz": {
            "ext": "cbz",
            "issue": "49",
            "series": "Wonder Woman",
            "title": "digital",
            "publisher": "DC",
            "year": "1951",
        },  # CT notices that this is a full date, CT doesn't actually return the month or day though just removes it
        "X-Men, 2021-08-04 (#02).cbz": {
            "ext": "cbz",
            "issue": "02",
            "series": "X-Men",
            "year": "2021",
        },  # CT treats ':' the same as '-' but here the ':' is attached to 'Now' which CT sees as a title separation
        "Cory Doctorow's Futuristic Tales of the Here and Now: Anda's Game #001 (2007).cbz": {
            "ext": "cbz",
            "issue": "001",
            "series": "Cory Doctorow's Futuristic Tales of the Here and Now",
            "title": "Anda's Game",
            "year": "2007",
        },  # This is a contrived test case. I've never seen this; I just wanted to handle it with my parser
        "Cory Doctorow's Futuristic Tales of the Here and Now #0.0.1 (2007).cbz": {
            "ext": "cbz",
            "issue": "0.1",
            "series": "Cory Doctorow's Futuristic Tales of the Here and Now",
            "year": "2007",
            "issue_count": "",
        },
    }
)
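
A minimal sketch of how a table like this could be run against a parser, assuming the parser is any callable that maps a filename to a dict. The names `failing_cases`, `strip_hash`, and `toy_parse` are hypothetical and are not part of comicfn2dict or ComicTagger; `toy_parse` is only a stand-in so the example runs on its own.

```python
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]


def strip_hash(filename: str) -> str:
    """Return the filename with every '#' removed, to build the '#'-less variant of a case."""
    return filename.replace("#", "")


def failing_cases(
    parse: Callable[[str], Metadata],
    cases: Dict[str, Metadata],
) -> Dict[str, Tuple[Metadata, Metadata]]:
    """Map each filename whose parse result differs to (expected, actual)."""
    failures = {}
    for filename, expected in cases.items():
        actual = parse(filename)
        if actual != expected:
            failures[filename] = (expected, actual)
    return failures


def toy_parse(filename: str) -> Metadata:
    """Hypothetical stand-in parser: splits off the extension and nothing else."""
    stem, _, ext = filename.rpartition(".")
    return {"ext": ext, "series": stem.strip()}


cases = {
    "batman.cbz": {"ext": "cbz", "series": "batman"},
    "action comics 1024.cbz": {
        "ext": "cbz",
        "issue": "1024",
        "series": "action comics",
    },
}

failures = failing_cases(toy_parse, cases)
print(sorted(failures))  # the stand-in parser cannot find the issue number
```

Swapping `toy_parse` for a real parser and `cases` for the dict above would list the filenames where the two projects disagree; running the same table with `strip_hash` applied to each key would check the '#'-less variants.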
ajslater commented 7 months ago

I can't thank you enough for these test cases. I changed a couple big things about the philosophy of the parser because of these. One being that I no longer divide tokens by the - character.
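
As a toy illustration of why dividing on every - is troublesome (this is not comicfn2dict's actual tokenizer, only a sketch using Python's standard `re` module): a bare split breaks hyphenated names such as IG-88, while splitting only on a dash surrounded by whitespace keeps them whole.

```python
import re

name = "Star Wars - War of the Bounty Hunters - IG-88"

# Splitting on every '-' also cuts the hyphenated name "IG-88" in two.
every_dash = [part.strip() for part in re.split(r"-", name)]

# Splitting only on a dash with whitespace on both sides keeps "IG-88" intact.
spaced_dash = re.split(r"\s+-\s+", name)

print(every_dash)
print(spaced_dash)
```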

Most of these test cases now pass with version 0.2.0.

Maybe two of the test cases pass with slightly different dict data than you provided due to small differences of opinion. Two of the test cases I elected not to fix, again due to a difference of opinion:

WONTFIX = {
    # Leading issue number is usually an alternate sequence number
    #   WONTFIX: Series names may begin with numerals.
    "52 action comics #2024.cbz": {
        "ext": "cbz",
        "issue": "2024",
        "series": "action comics",
        "alternate": "52",
    },
    # Only the issue number. CT ensures that the series always has a value if possible
    #   WONTFIX: I don't think making the series the same as the number is valuable.
    "#52.cbz": {
        "ext": "cbz",
        "issue": "52",
        "series": "52",
    },
}

I am open to new ideas and opinions about how this works, so if you feel strongly about any of this, feel free to pipe up. comicfn2dict is primarily used by comicbox, which is similar to comictagger in that it manually tags comics and reads a variety of comic tag formats, but it doesn't do any of the really useful or difficult stuff with online comic databases and identification, and it doesn't have a nice GUI.