cbanack / comic-vine-scraper

An add-on script for ComicRack that lets you copy details from Comic Vine into your comic books.

This app is getting its users banned #421

Closed edgework closed 8 years ago

edgework commented 8 years ago

We don't allow scraping of our site and anyone who uses this gets banned. You need to limit it to only make API calls or remove it entirely.

To be clear, ComicRack is fine. /api calls are fine. Any other request made by a bot isn't fine.

edgework commented 8 years ago

excerpt from our logs

"client_ip","time","type","url","referrer","user_agent","cdn","day_pass","result","bytes","client_user","proxied_ips"
"127.0.0.1","2015-10-05 19:46:31","GET","api/search/?api_key=xxx&client=cvscraper&format=xml&limit=100&resources=volume&field_list=name,start_year,publisher,id,image,count_of_issues&query=2000ad?api_key=xxx&client=cvscraper&format=xml&limit=100&resources=volume&field_list=name,start_year,publisher,id,image,count_of_issues&query=2000ad","-","-","","-",200,119407,"","-"
"127.0.0.1","2015-10-05 19:46:43","GET","api/issue/4000-438277/?api_key=xxx&client=cvscraper&format=xml?api_key=xxx&client=cvscraper&format=xml","-","-","","-",200,16876,"","-"
"127.0.0.1","2015-10-05 19:47:43","GET","api/issues/?api_key=xxx&client=cvscraper&format=xml&field_list=name,issue_number,id,image&filter=volume:19752,issue_number:1863?api_key=xxx&client=cvscraper&format=xml&field_list=name,issue_number,id,image&filter=volume:19752,issue_number:1863","-","-","","-",200,1197,"","-"
"127.0.0.1","2015-10-05 19:47:58","GET","issue/4000-442648/","-","-","","-",200,60979,"","-"

The api/ requests are fine. The issue/ request made by a bot triggers scrape protection. That you cannot do. Don't change the UA to impersonate a browser or GoogleBot either; that will also get the user banned.
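The distinction drawn above (api/ paths are allowed, anything else from a bot is not) could be enforced on the client side with a small guard. This is only a sketch under assumptions: the names `is_api_request` and `require_api_request` are hypothetical, they are not part of Comic Vine Scraper, and the check is written for plain Python 3.

```python
from urllib.parse import urlparse


def is_api_request(url):
    """Return True only when the URL's path starts with the api/ prefix."""
    # Strip a leading slash so "api/..." and "/api/..." are treated alike.
    path = urlparse(url).path.lstrip("/")
    return path.startswith("api/")


def require_api_request(url):
    """Hypothetical guard: refuse to proceed for any non-api/ URL."""
    if not is_api_request(url):
        raise ValueError("refusing non-API request: %s" % url)
    return url
```

Calling `require_api_request` immediately before each HTTP request would make a stray issue/ page fetch fail loudly in the client instead of reaching the site.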

edgework commented 8 years ago

...and yes, this worked before. Our firewall got better at detecting scraping. Comic Vine is allowed to post much of its content, including material whose copyright is held by others, because of agreements with those copyright holders. Those agreements strictly disallow anything other than a web browser from accessing the content. We cannot make any changes to accommodate your app.

cbanack commented 8 years ago

Since I have shut Comic Vine Scraper down, I will not be implementing these changes. I will, however, leave a few details to address edgework's suggestions here, in case someone decides to fork this project.

The culprit for the undesired /issue requests is the _query_issue method in the cvdb.py module. When called with the slowdata parameter set to true, it performs the /issue request that edgework is referring to above. So the obvious solution is to remove that parameter and the code in that method that queries for slow (/issue request) data.

Then you will have to remove the portions of the main app that made use of that slow data. There isn't too much--the main problem is that this will cripple the automatic cover matching algorithm, which will become a lot less effective since now it will only be able to try matching against the single comic book cover that is provided by the Comic Vine API.

cbanack commented 8 years ago

An additional change that has been requested by Comic Vine is to limit calls to their API to no more than 1 per second. This could be implemented in the cvconnection.py module, in the _get_dom method.

Unfortunately, given the number of requests that the scraper needs to make in order to perform its basic function (finding comic books in the API and downloading information about them), restricting the API to 1 call per second will very noticeably degrade the performance of the application. Some people may still be ok with it, though.
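The one-call-per-second limit described above could be sketched roughly as follows. The `RateLimiter` class is a hypothetical helper, not code from cvconnection.py, and it is written for plain Python 3 rather than adapted to the environment the scraper actually runs in. The clock and sleep functions are injectable only so the behaviour can be checked without real delays.

```python
import time


class RateLimiter(object):
    """Keeps successive calls at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous permitted call

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        waited = 0.0
        now = self._clock()
        if self._last_call is not None:
            remaining = self._min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last_call = now
        return waited


# Assumed usage: one shared limiter, consulted before every API request.
api_limiter = RateLimiter(min_interval=1.0)
```

A method like _get_dom would then call `api_limiter.wait()` before issuing each request. The cost is the one noted above: a scrape that needs N API calls takes at least about N - 1 seconds.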

cbanack commented 8 years ago

This issue has been fixed in the latest 1.0.90 release (see code changes here). Comic Vine Scraper no longer directly accesses any pages from the Comic Vine website except for api/ calls. This should mean that users no longer fall afoul of Comic Vine's 'scrape protection' algorithms.

As a result of these changes, the scraper has lost some functionality (as described in my comments above). If the Comic Vine API changes to give API users access to Comic Vine's list of alternate covers for each issue, it would then be a very simple thing to update my modifications and reinstate almost all of the reduced functionality.