cbanack / comic-vine-scraper

An add-on script for ComicRack that lets you copy details from Comic Vine into your comic books.

ALT_SEARCH_REGEX being ignored/not using filename if Proposed Values exist. #420

Open wingot-git opened 8 years ago

wingot-git commented 8 years ago

tl;dr: CVS seems to be ignoring ALT_SEARCH_REGEX and the filename if a Series is already populated.

Full info: I have been trying to run a complete scrape of my collection from scratch (only recently started using ComicRack) and was getting annoyed at filenames such as "Exiles 003 - Old Wounds, New Battles Part 1.cbz". CVS seems to pull out the number and then search the rest of the text as the series name, coming back to me complaining that it doesn't exist and requiring me to change the search term to just "Exiles" in this example. Every single part of each story (i.e., every single issue) fails because the number at the end changes. So I decided to set up an ALT_SEARCH_REGEX using the information here: https://github.com/cbanack/comic-vine-scraper/wiki/Advanced-Settings.

I have gone into the settings window, clicked Advanced and added this line: "ALT_SEARCH_REGEX=(?P<series>.+)\s(?P<num>\d+).+-.+(cbz|cbr)". I have also tested this regex at https://regex101.com/, which shows these results:

MATCH 1
series [0-6] Exiles
num [7-10] 003
3. [44-47] cbz
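The same check can be reproduced locally. The following is a minimal sketch, assuming the pattern's named groups are `series` and `num` (as the regex101 output indicates) and that Python's `re` module behaves like the regex engine the scraper uses; the variable names are illustrative only.

```python
import re

# Pattern as entered in the ALT_SEARCH_REGEX setting; the group names
# "series" and "num" are assumed from the regex101 output.
ALT_SEARCH_REGEX = r"(?P<series>.+)\s(?P<num>\d+).+-.+(cbz|cbr)"

filename = "Exiles 003 - Old Wounds, New Battles Part 1.cbz"

match = re.match(ALT_SEARCH_REGEX, filename)
if match:
    series = match.group("series")
    num = match.group("num")
    # span() gives the same [start-end] offsets that regex101 reports
    print("series", match.span("series"), series)
    print("num", match.span("num"), num)
    print("group 3", match.span(3), match.group(3))
else:
    series = num = None
    print("no match")
```

If the pattern is sound, the printed offsets should line up with the regex101 results above.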

As such, there doesn't seem to be an issue with the regex. But when I then run the scraper, it fails and complains that the search term "Exiles - Old Wounds" is not found. This search term is what is in my Series field in ComicRack's Info panel (I assume it was added automatically during the file import, as I did not put it there). In the CVS window (with the cover display) it does correctly list the filename. Additionally, I have found that setting Proposed Values to No and clearing the Series field for a comic will make ALT_SEARCH_REGEX work.

IMO ALT_SEARCH_REGEX (as it is tied to the filename and is a specifically added setting) should take precedence over any and all existing information in the Info panel when attempting to search Comic Vine.

cbanack commented 8 years ago

I'll have to look at the code more closely to confirm this, but if I remember right, the problem you're facing here is more an issue of the scraper asking itself "should I search Comic Vine based on the user-entered value for the comic name, or the comic's actual filename?"

99% of the time, the scraper uses the comic's filename because people do not generally fill out part of the data for a comic before they scrape it, so there's no user-entered series name to use. But in those rare cases when there is user-supplied data, the scraper uses it.

The "Proposed Values" option is effectively telling ComicRack "store the values that you can find from the filename as metadata for this comic, just as if I had typed those values in myself". In other words, it is populating your comics with user-entered metadata based on however ComicRack decides to parse each comic's filename.

So that means if Proposed Values is on, the scraper is using the "proposed" series name instead of the comic's actual filename to search Comic Vine. And it is likely not using ALT_SEARCH_REGEX at all (I'll have to check code to confirm).
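To make the described behaviour concrete, here is a rough sketch of the decision as explained above. It is not taken from the scraper's source; the function name `choose_search_source` and its parameters are hypothetical, and the real logic is still unconfirmed.

```python
def choose_search_source(series_field, filename):
    """Pick what to search Comic Vine with, per the description above.

    Hypothetical helper, not the scraper's actual code.
    """
    # A populated Series field (typed in by the user, or filled in by
    # ComicRack's "Proposed Values") wins over the filename.
    if series_field and series_field.strip():
        return ("metadata", series_field.strip())
    # Otherwise fall back to the filename, which is presumably where
    # ALT_SEARCH_REGEX would come into play.
    return ("filename", filename)


name = "Exiles 003 - Old Wounds, New Battles Part 1.cbz"
with_series = choose_search_source("Exiles - Old Wounds", name)
without_series = choose_search_source("", name)
print(with_series)
print(without_series)
```

Under this assumption, any comic with a populated Series field would never reach the filename branch, which would match the symptom reported in this issue.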

So really, an easy workaround for you is probably just to turn "Proposed Values" off for any comics that you intend to scrape. But as far as that goes, I believe that setting is turned off by default anyway--is it possible that you accidentally turned it on for a bunch of comics somehow?

wingot-git commented 8 years ago

Ok, my apologies.

I have taken a further look into this (with the Exiles comics listed above) and it appears that for those particular comics these fields are populated even with Proposed Values turned off. And if I leave Proposed Values on and delete the contents of the field, CVS uses the ALT_SEARCH_REGEX correctly. So it looks like the issue is that those fields were already populated in these files (probably by the person I got them from).

I can understand the value in using the Series field for searching (for example I have 100 Iron Man comics with filenames of IM###.cbz where it's much simpler to just enter Iron Man into the series name). I just personally would have thought that setting up a regex parser on the filename would have taken precedence anyway.
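A rough sketch of the precedence I had in mind, purely as an illustration: try the regex on the filename first, and only fall back to the Series field when the regex does not match. The function `pick_series` and its arguments are hypothetical and not part of CVS.

```python
import re

# Same pattern as in my first comment; group names assumed to be
# "series" and "num".
ALT_SEARCH_REGEX = r"(?P<series>.+)\s(?P<num>\d+).+-.+(cbz|cbr)"


def pick_series(filename, series_field, alt_regex=None):
    """Hypothetical regex-first lookup; not how CVS currently behaves."""
    if alt_regex:
        m = re.match(alt_regex, filename)
        # groupdict() avoids an error if the pattern has no "series" group
        found = m.groupdict().get("series") if m else None
        if found and found.strip():
            return found.strip()
    # The regex was missing or did not match: use the Series field
    if series_field and series_field.strip():
        return series_field.strip()
    return None


exiles = pick_series(
    "Exiles 003 - Old Wounds, New Battles Part 1.cbz",
    "Exiles - Old Wounds",
    ALT_SEARCH_REGEX,
)
iron_man = pick_series("IM001.cbz", "Iron Man", ALT_SEARCH_REGEX)
print(exiles)
print(iron_man)
```

With this ordering the Exiles files would search on the regex result, while files like IM001.cbz (which the pattern does not match) would still use the Series field.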

I don't know if you want to close this ticket given that the "issue" is solved or leave it open as an enhancement request (to respect REGEX over populated fields). Thank you either way.

cbanack commented 8 years ago

I'll leave it open for now, so I can take a closer look the next time I'm doing development on the scraper. Thanks for your input. :)