cbanack / comic-vine-scraper

An add-on script for ComicRack that lets you copy details from Comic Vine into your comic books.

Comic Vine API returns too many results #460

Closed metalracket closed 5 years ago

metalracket commented 5 years ago

As of today, when I'm scraping my comics, it gives me a list of all series that contain any word from the title of the comic.

Example:

Scraped "Deadman: Dark Mansion of Forbidden Love" got back 1689 matches including a series titled "Love"

Updated from 1.0.92 to 1.0.96, but no change. Scrap it all.

Targg commented 5 years ago

I am having the same issue, and posted on the CR forum, but since it is overrun by bots I am not sure if anyone will see it there.

As of yesterday afternoon, I am seeing some odd behaviors with the scraper. I had been scraping about 1000 comics a day this week, and yesterday the behavior changed.

It seems to now be giving results based on any of the words in the title. For example "X-Men vs Fantastic Four" would show results for "X-Men" "Fantastic" "Four" and "vs". The results list takes literally minutes to populate (it took seconds a few days ago), and the number of results shows in the thousands.

Nothing on my end has changed. I hadn't even closed ComicRack when the behavior changed.

Is there a way to limit the number of search results? Anything with a common word or comic term (such as Spider-Man) takes forever, even when the specific title is very unique.

Thanks in advance for any help!

boshuda commented 5 years ago

I haven't investigated this at all, and haven't used the scraper in several days. However, if the behaviour has actually changed, then it's likely that ComicVine changed their search function again. They have done it before without notifying the people who use their API. Give it some time and/or check their forums.

In the meantime, the workaround that I've found most effective is to pick the word least likely to be in hundreds of matches and search on that. It's also helpful if you fill in as much of the data as reasonable first. If you visit ComicRack's forums, there is a script hidden in there somewhere that can autofill the volume. That alone can help a lot before scraping. If you can put the ComicVine id in place, that is usually the most surefire method I know of to get past search issues. Again, the ComicRack forums (unless the spam has destroyed them) have some more information on that.

Targg commented 5 years ago

To give a specific example, a search for "Untold Tales of the New Universe: Psi-Force" comes back with 3,194 results, so it's giving results for each word.

cbanack commented 5 years ago

Hi guys,

I'm sure it must be something like what boshuda says -- the API has changed. They just seem to do that every now and then.

I'll look into it more when I get a chance (probably this weekend); hopefully it's a simple fix like it was last time they did this.

Cory

cbanack commented 5 years ago

Ok, I've looked into this a bit, and I can confirm that I'm seeing the same behaviour (search terms are being OR'd now, when they used to be AND'd). It definitely slows things down, but it's nothing wrong with the scraper. Comic Vine has silently decided to change how many results their search API returns.
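To illustrate why OR'ing the terms inflates the result count, here is a minimal sketch. It is not the Comic Vine search engine; the `catalog` list and the `matches` helper are made up purely for illustration.

```python
def matches(title, terms, mode):
    """Return True if the title matches the terms under AND or OR semantics."""
    words = set(title.lower().split())
    hits = [term.lower() in words for term in terms]
    return all(hits) if mode == "AND" else any(hits)

# Hypothetical catalog of series titles (illustration only).
catalog = ["New Captain America", "Captain America", "New Mutants", "America", "Batman"]
terms = ["new", "captain", "america"]

and_results = [t for t in catalog if matches(t, terms, "AND")]
or_results = [t for t in catalog if matches(t, terms, "OR")]

print(len(and_results), len(or_results))
```

With AND only the exact series survives; with OR every title sharing a single word is returned, which is what the thousands of matches reported above look like.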

Unfortunately, I don't really have a solution to the problem -- the fix pretty much has to be on their end.

So I've posted a message in their forums explaining the matter:

https://comicvine.gamespot.com/forums/api-developers-2334/apis-search-resource-broken-1990887/

Also, I recall that we had this exact same problem about a year ago:

https://www.giantbomb.com/forums/bug-reporting-33/new-search-system-going-live-1817994/?page=2#js-message-8775319

Back then, ComicVine ultimately fixed the problem on their end. In a lot of ways, this feels like they've recently lost that fix somehow. But maybe they'll be willing to do it again.

fieldhouse commented 5 years ago

One thing that really helps as a workaround is Xelloss' Autocomplete Volume Values script.
It helps speed up the guesswork/pre-processing by adding the comicvine_volume custom value to any recognized comics. The scraping for those recognized issues goes back to the previously experienced rate (probably a little faster since it's jumping straight to the one series).

http://comicrack.cyolito.com/forum/13-scripts/39594-auto-complete-volume-values-script-beta

Make sure you grab the most recent version (1.8, last I checked):

http://comicrack.cyolito.com/forum/13-scripts/39594-script-auto-complete-volume-values-script-beta?start=10#48108

Xelloss-nakama commented 5 years ago

It is not exactly a FULL fix, but you can try my old patch (which still works this time) to scrape comics until this is fixed...

http://comicrack.cyolito.com/forum/32-news-and-announcements/33534-comic-vine-scraper?start=1630

(you have to replace a file in the excellent Comic Vine Scraper script for it to work)

What the fix does is stop asking for results when the results are "good enough". This works because the new API search engine sorts the best results first (the most similar to your search). It is not a 100% solution, but it works most of the time...

All in all, I don't think Comic Vine considers this change a bug... it is just a matter of policy about how the search engine works...

Until now, if you looked for (for example) New Captain America, the engine searched for the string "new captain america" in the comic series string... and gave just those results...

Now, as was the case many months ago, if you do the same search, the engine searches for strings that contain "new" OR "captain" OR "america" (this OR is the problem), giving hundreds and hundreds of results... This would normally not be a problem, because in a web search you stop looking when you find the result you want (as in Google), and so "the more results the better"... The problem is that the Comic Vine Scraper script loads ALL the results before continuing, and this takes time...

The good news is that the engine is well made and tries the AND searches first, so it first returns "new" AND "captain" AND "america", then "new" AND "captain", "captain" AND "america", and "new" AND "america", and lastly "new" OR "captain" OR "america". In other words, the first results match 3 words, then 2, then only 1. What my fix does is first check how many matching words the first result has (for example 3), and stop asking for results as soon as a result in the list has fewer than that (removing the results from that one onward from the view, and stopping the results loading).
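The cutoff idea described above can be sketched roughly as follows. This is not the actual patch; `count_matching_words` and `truncate_results` are hypothetical names, and the sketch assumes the results arrive already ranked best-first, as described.

```python
import re

def count_matching_words(query, title):
    """Count how many distinct query words also appear in the title."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return len(query_words & title_words)

def truncate_results(query, ranked_titles):
    """Keep results until one matches fewer words than the first result."""
    if not ranked_titles:
        return []
    best = count_matching_words(query, ranked_titles[0])
    kept = []
    for title in ranked_titles:
        if count_matching_words(query, title) < best:
            break  # everything after this point is a weaker match
        kept.append(title)
    return kept

# Hypothetical ranked response (illustration only).
ranked = [
    "New Captain America",
    "The New Captain America",
    "Captain America",
    "New Mutants",
    "America",
]
kept = truncate_results("New Captain America", ranked)
print(kept)
```

In a real scraper the same check would also be the signal to stop requesting further pages, which is where the time saving comes from.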

All the same, this type of behaviour is very processing-intensive, so I think sooner or later they will return to the old behaviour... as they did the last time.

ps: I call this a temporary fix because I don't know or understand 99% of the Comic Vine Scraper code... so I think of this as only a patch that could be breaking many of cbanack's features. Use it at your own risk until cbanack releases an official solution, if he does.

cbanack commented 5 years ago

Hey guys,

I finally got a chance to do something about this issue. (To be honest, I've been kinda hoping that Comic Vine would address the problem on their end, but since that doesn't seem like it's going to happen, I've patched a sort-of-decent solution into the official version of the scraper.)

The following is loosely based on what Xelloss did in his patch (mentioned above).

There is now a new advanced property called MAX_SEARCH_RESULTS which can be used to determine how many results the scraper pulls down from Comic Vine when you search for a comic series. The default value is set to 100, which is a good value since that's how many results come down with a single request to the Comic Vine API. (And it seems to work well enough for most scrapes.)

You can adjust this value in the advanced settings section of your scraper config, but most people will not need to do this. Just install the latest version of the scraper, and things will more or less work properly again...except when they don't of course. :)

This change assumes that the series you are looking for is always in the first 100 results that ComicVine returns, which it usually is. But if it is not, it will seem like the scraper cannot find your series. You could fix this by increasing the MAX_SEARCH_RESULTS value, but that will make all your other searches slower.
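As a rough sketch of the trade-off (not the scraper's actual code): with 100 results per request, a cap of 100 costs one request, while an uncapped search over thousands of matches costs dozens. Here `fetch_page` is a hypothetical stand-in for one Comic Vine API request, and the total of 3194 is borrowed from the example earlier in this thread.

```python
PAGE_SIZE = 100  # results per request, per the description above

def fetch_page(offset, limit, total=3194):
    """Hypothetical stand-in for one API request; returns fake series names."""
    end = min(offset + limit, total)
    return ["series-%d" % i for i in range(offset, end)]

def search(max_search_results=100, fetch=fetch_page):
    """Pull pages until the cap is reached or the results run out."""
    results = []
    requests_made = 0
    while len(results) < max_search_results:
        page = fetch(len(results), PAGE_SIZE)
        requests_made += 1
        if not page:
            break
        results.extend(page)
        if len(page) < PAGE_SIZE:
            break  # last page reached
    return results[:max_search_results], requests_made

capped, capped_requests = search(100)
uncapped, uncapped_requests = search(10 ** 6)
print(capped_requests, uncapped_requests)
```

Raising the cap only helps if the series you want is ranked below the first page, and every search pays for the extra requests.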

A better solution is to simply paste the URL of the Comic Vine webpage for your series directly into the search dialog of the scraper. This will automatically force the scraper to find that series. For example, to find Detective Comics, just search for:

http://www.comicvine.gamespot.com/detective-comics/4050-18058/

or even just:

4050-18058
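For anyone scripting around this, here is a minimal sketch of pulling the series ID out of either form. It is not how the scraper itself parses the input; `extract_series_id` is a hypothetical helper, and treating `4050-` as the series prefix is an assumption based only on the example above.

```python
import re

# Assumption: series IDs look like "4050-<digits>", as in the example above.
_ID_PATTERN = re.compile(r"(?<!\d)4050-(\d+)(?!\d)")

def extract_series_id(text):
    """Return the numeric series ID from a URL or bare ID, or None if absent."""
    match = _ID_PATTERN.search(text.strip())
    return int(match.group(1)) if match else None

from_url = extract_series_id("http://www.comicvine.gamespot.com/detective-comics/4050-18058/")
from_id = extract_series_id("4050-18058")
from_title = extract_series_id("Detective Comics")
print(from_url, from_id, from_title)
```

Both the full URL and the bare ID resolve to the same series, while an ordinary title yields `None` and would fall through to a normal search.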

Another good option might be to sort your series into folders and create a cvinfo file, as described here:

https://github.com/cbanack/comic-vine-scraper/wiki/CVINFO-and-CVDB-tags