SenorSmartyPants / Comixology-Scraper

GNU General Public License v3.0

[Feature Request] Manual select #4

Open rohan-gt opened 3 years ago

rohan-gt commented 3 years ago

Hi, is it possible to add an option to manually select an issue when it isn't scraped automatically? I'd rather have Comixology info on all my books than ComicVine info.

SenorSmartyPants commented 3 years ago

There's no UI to manually select an issue. But you can edit the notes in ComicRack (or directly in ComicInfo.xml) and add the issue ID from comixology.com, for example:

https://www.comixology.com/Fantastic-Four-2018-1/digital-comic/687096

Copy the number at the end of the URL shown above and add it to the notes in this format:

[CMXDB687096]

Run the scraper again and it will use the id from the notes.
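To make the format concrete, here is a minimal sketch of turning an issue URL into the notes tag and reading the ID back out of the notes. The function names are hypothetical and this is not the scraper's own code; it only assumes the ID is the trailing number of the URL and that the tag looks like `[CMXDB<number>]`, as described above.

```python
import re

# Tag format described above: [CMXDB<issue id>]
CMXDB_TAG = re.compile(r"\[CMXDB(\d+)\]")


def cmxdb_tag_from_url(url):
    """Build a notes tag from the trailing number of a comixology.com issue URL.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"/(\d+)/?(?:[?#].*)?$", url)
    if match is None:
        raise ValueError("no numeric ID at the end of: " + url)
    return "[CMXDB{}]".format(match.group(1))


def id_from_notes(notes):
    """Return the issue ID stored in the notes text, or None if no tag is present."""
    match = CMXDB_TAG.search(notes)
    return match.group(1) if match else None


if __name__ == "__main__":
    url = "https://www.comixology.com/Fantastic-Four-2018-1/digital-comic/687096"
    print(cmxdb_tag_from_url(url))
```

The tag can sit anywhere in this sketch's notes text; whether the scraper itself is that lenient is not something the thread confirms.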

If you run ComicRack with the -ssc option, it will display the script console:

"C:\Program Files\ComicRack\ComicRack.exe" -ssc

I'd be curious to see the output from a couple of comics that aren't getting matched.

rohan-gt commented 3 years ago

@SenorSmartyPants how do I contact you? I have a few ideas and would like to contribute to this project if I can.

SenorSmartyPants commented 3 years ago

You're doing it.

rohan-gt commented 3 years ago

Okay @SenorSmartyPants some suggestions:

  1. It would be useful to download the entire Comixology metadata into a local database, with some filters like publisher to limit the data, and then simply fetch the data from it to populate the comic info.
  2. I don't know how the fuzzy matching is done at the moment, but I believe it is possible to improve the logic and match comics to a very high degree, since we are only matching digital releases and they usually have clean names as opposed to scans.
  3. It would be useful to have some kind of matching between collected editions and single issues. I believe this info is already available in Comixology. This info can be then used to detect duplicates, missing issues etc.

SenorSmartyPants commented 3 years ago

Each one of these items should have been a separate issue. But here goes:

  1. I'm not going to scrape the entire Comixology site. There is no API provided by Comixology to download all the metadata for everything. Select your issues and download for each of them.
  2. If you have specific examples of issues not being found (with filenames provided), I would be interested to see them. Getting good search results is something I am having issues with, but mostly because of Google's bot detection.
  3. This is not a library management tool. Try ComicRack for finding duplicates. If collected edition information is ever scraped, where would it be stored?

rohan-gt commented 3 years ago

  1. Ah, okay, I thought there was an API similar to ComicVine's.
  2. I'll try to get some examples out.
  3. By duplicates I meant that if you have both the trade paperback and the single issues collected in it as separate files, it would be useful to point those out. Comixology actually has the single issue links under the TPB page, so if it's possible to store the IDs of the single issues within the TPB XML, you can reference them easily.

rohan-gt commented 3 years ago

@SenorSmartyPants So I have these files, which aren't scraped:

The Books of Magic (1993) (Digital).cbr

Aquaman (2011-2016) Vol. 1 The Trench.cbr

SenorSmartyPants commented 3 years ago

These look like graphic novels or trade paperbacks. The search is currently pretty specific to single issues. I'll see what I can do (assuming I'm not blocked by Google).

Are you scraping in ComicRack or with the Mylar version?

What's in the ComicInfo.xml? If you are in ComicRack, you can select a book, right click and choose 'copy data' to get that info.

rohan-gt commented 3 years ago

@SenorSmartyPants yes, they are TPBs. I'm using ComicRack. There's no info generated, since I get a message saying 0 comics scraped, 1 skipped. It seems easy to implement. You just need to fuzzy match the name with a high percentage score (95%), along with the year if it is provided.
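A rough sketch of that idea, using the two filenames above. This is not how the scraper currently matches; `parse_filename` and `is_match` are hypothetical names, the similarity score comes from Python's standard `difflib`, and the 0.95 threshold is simply the 95% figure suggested here. How candidate titles and years would be obtained from Comixology is left out.

```python
import os
import re
from difflib import SequenceMatcher

# A four-digit year in parentheses, optionally a range such as (2011-2016).
YEAR_RE = re.compile(r"\((\d{4})(?:-\d{4})?\)")
# Any parenthesized tag, e.g. (1993) or (Digital).
PAREN_RE = re.compile(r"\([^)]*\)")


def parse_filename(filename):
    """Split a filename into (title, start year or None). Hypothetical helper."""
    stem = os.path.splitext(filename)[0]
    year_match = YEAR_RE.search(stem)
    year = int(year_match.group(1)) if year_match else None
    # Drop parenthesized tags and collapse the leftover whitespace.
    title = " ".join(PAREN_RE.sub(" ", stem).split())
    return title, year


def is_match(filename, candidate_title, candidate_year=None, threshold=0.95):
    """True if the title is similar enough and the years, when both known, agree."""
    title, year = parse_filename(filename)
    if year is not None and candidate_year is not None and year != candidate_year:
        return False
    score = SequenceMatcher(None, title.lower(), candidate_title.lower()).ratio()
    return score >= threshold


if __name__ == "__main__":
    for name in ("The Books of Magic (1993) (Digital).cbr",
                 "Aquaman (2011-2016) Vol. 1 The Trench.cbr"):
        print(parse_filename(name))
```

Treating the first year of a range as the release year is an assumption of this sketch; a TPB's listed year may well differ from the series start year, in which case the year check would need to be looser.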

SenorSmartyPants commented 3 years ago

ComicRack will parse the file name, so there are probably proposed values for at least some fields of these books. So I'd still like a copy data output. And console output, which you can get if you run CR with a shortcut like this: "C:\Program Files\ComicRack\ComicRack.exe" -ssc

You're welcome to submit a pull request as well. But I won't merge it until I can reproduce the failure myself (which is why I want the data I'm asking for).