QasimK / mal-scraper

MyAnimeList web scraper is a Python library for gathering data for analysis
MIT License
19 stars 9 forks

Use Wayback Machine Archive to Prevent Accidental AntiDDOS Blocking During Scraping #12

Open Skylion007 opened 8 years ago

Skylion007 commented 8 years ago

I know many people have trouble because MyAnimeList no longer whitelists new IP addresses in its anti-DDoS software, which leaves them struggling to scrape data from the website. A workaround I discovered is to access the website's archive.org backup instead of the website itself. Does this package allow you to do this? If not, it doesn't seem like it would be very difficult to add as a nice feature. You could even update archive.org's backup by requesting that pages not yet indexed by the Wayback Machine be added (through archive.org's API).
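To make the workaround concrete, here is a rough sketch of how a live page URL could be mapped to its Wayback Machine counterparts. The helper names `wayback_url` and `save_request_url` are hypothetical and not part of mal-scraper. The sketch assumes the commonly used `https://web.archive.org/web/<timestamp>/<url>` snapshot form and the "Save Page Now" form `https://web.archive.org/save/<url>`; rate limits and any authentication requirements for saving are not covered.

```python
# Hypothetical helpers (not part of mal-scraper) that only build URLs;
# they do not perform any network requests themselves.

WAYBACK_BASE = "https://web.archive.org"


def wayback_url(page_url: str, timestamp: str) -> str:
    """Build a snapshot URL for page_url near the given timestamp.

    timestamp is expected in YYYYMMDDhhmmss form. The Wayback Machine
    is assumed to redirect to the closest capture it actually holds.
    """
    if not timestamp.isdigit():
        raise ValueError("timestamp must contain digits only")
    return f"{WAYBACK_BASE}/web/{timestamp}/{page_url}"


def save_request_url(page_url: str) -> str:
    """Build the "Save Page Now" URL that asks for a fresh capture."""
    return f"{WAYBACK_BASE}/save/{page_url}"


if __name__ == "__main__":
    live = "https://myanimelist.net/anime/1"
    print(wayback_url(live, "20170101000000"))
    print(save_request_url(live))
```

A scraper could fetch the first URL instead of the live page, and request the second only for pages that have no capture yet.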

QasimK commented 8 years ago

Using the Internet Archive as a secondary source is a good idea, as it (currently) indexes MAL. The package does not currently allow this, as it is still very much a WIP. I plan to add this as a post-version-1 feature.

The Wayback Machine supports a simple API allowing you to discover the history of a page and the right URLs to use. The page itself should parse similarly.
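As a minimal sketch of that lookup, the following uses the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) to find the closest snapshot of a page. The function names are hypothetical and not part of mal-scraper, and the response layout (`archived_snapshots` → `closest` → `url`) is assumed from the publicly documented JSON shape, so treat it as illustrative rather than definitive.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_query(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability query URL for page_url.

    timestamp (YYYYMMDDhhmmss or a prefix of it) is optional; when given,
    the API is assumed to return the snapshot closest to that time.
    """
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot_url(payload: dict) -> Optional[str]:
    """Extract the closest snapshot URL from a decoded API response.

    Returns None when no snapshot is listed or it is marked unavailable.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


def find_snapshot(page_url: str, timestamp: Optional[str] = None,
                  timeout: float = 10.0) -> Optional[str]:
    """Query the API over the network and return a snapshot URL, if any."""
    query = build_availability_query(page_url, timestamp)
    with urlopen(query, timeout=timeout) as response:
        return closest_snapshot_url(json.load(response))
```

Keeping URL building and response parsing separate from the network call means the parsing can be exercised against saved responses without hitting archive.org. The returned snapshot URL could then be fetched and handed to the same parser used for live MAL pages, subject to the caveat above that the archived page only parses similarly.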

"Memento" may offer additional sources/metadata.