furkleindustries / ifdb-scraper

A simple module and CLI tool for scraping entry metadata from the Interactive Fiction Database.

Scraping pages slows down IFDB #1

Closed qdacsvx closed 5 years ago

qdacsvx commented 5 years ago

Perhaps there should be a limit on how many pages can be downloaded per search, say 50.
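As a minimal sketch of that kind of cap (the `MAX_PAGES` constant and the `fetchSearchPage` helper are illustrative names, not part of the scraper's actual code):

```js
// Hypothetical cap on how many search-result pages one run will fetch.
// MAX_PAGES and fetchSearchPage are illustrative, not real code.
const MAX_PAGES = 50;

async function fetchCappedPages(fetchSearchPage, totalPages) {
  const limit = Math.min(totalPages, MAX_PAGES);
  const pages = [];
  for (let page = 1; page <= limit; page += 1) {
    pages.push(await fetchSearchPage(page));
  }
  return pages;
}
```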

You can download a CSV database dump from the IF Archive for offline searches.

http://mirror.ifarchive.org/if-archive/info/ifdb/
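As a rough sketch, an offline search over a downloaded dump might look like the following; the `ifdb.csv` file name and the assumption of a simple, unquoted CSV are mine, not guarantees about the dump's actual format (check the archive directory above):

```js
// Naive offline search over a downloaded copy of the IFDB dump.
// Assumes a plain CSV with no embedded commas or quoting; verify the
// dump's actual format before relying on this.
const fs = require('fs');

function searchDump(csvPath, term) {
  const needle = term.toLowerCase();
  return fs
    .readFileSync(csvPath, 'utf8')
    .split('\n')
    .filter((row) => row.toLowerCase().includes(needle));
}

console.log(searchDump('ifdb.csv', 'zork'));
```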

curiousdannii commented 5 years ago

The IFDB API docs say:

The APIs are designed primarily for low-volume, user-driven applications, rather than big batches of automated requests. They're not really meant for constantly crawling the database, high-speed mirroring, etc. The servers that IFDB runs on are from a shared hosting service marketed for personal and small business sites, so they're not set up for high scalability. Please use the APIs with this in mind. If you're contemplating a usage that might put a lot of load on the servers, please consider technical measures that would minimize the impact, such as adding artificial delays while generating requests to throttle the rate.

So long as this tool is searching individual terms, I think it's fine.

qdacsvx commented 5 years ago

The options include the ability to search for multiple items.

-p, --published The years to include. Ranges are allowed.
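For instance, a run covering a range of years might be invoked like this (the exact argument syntax is illustrative, not confirmed against the tool's help text):

```
ifdb-scraper -p 2015-2019
```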

I understand that the scraper generates additional downloads to get information from individual game pages for a "deep" scrape. In that case it could request hundreds of game pages.

please consider technical measures that would minimize the impact, such as adding artificial delays while generating requests to throttle the rate.

A user of the scraper shouldn't cause a heavier server load than other IFDB users. It would be appropriate to delay five seconds between each page download when multiple downloads are necessary.
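A minimal sketch of that kind of throttle (the `urls` list is illustrative, and global `fetch` assumes Node 18 or later):

```js
// Download pages one at a time, pausing five seconds between requests
// so the scraper is no heavier on the server than an interactive user.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function downloadThrottled(urls, delayMs = 5000) {
  const pages = [];
  for (const url of urls) {
    const response = await fetch(url);
    pages.push(await response.text());
    await sleep(delayMs);
  }
  return pages;
}
```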

furkle commented 5 years ago

So, as it stands, the tool already has somewhat bespoke, hard-coded rate limits that I introduced for practical reasons, and that appear to be necessary: without them, requests made through this tool would not complete. The result is a noticeable, multi-second delay after every search-page block is parsed.

This is the line of code in which this delay occurs: https://github.com/furkleindustries/ifdb-scraper/blob/55c2ae04ca5953032c17bd6c213988a85e1199f8/handleIfdbResponse.js#L70

If there is a sense that this limit should be adjusted upwards from where it stands now, or that a simple option should be included so that especially conscientious users can do so themselves, that's fine. Anything more complex or onerous than that, which is to say artificial delays on the order of five seconds per request, resulting in expected minimums of 30 minutes or more for even a single run of the tool's intended institutional usage by the XYZZY committee, is unlikely, given that this repository has already generated more conversation about its potential impact than it has likely had real-world usage.
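If such an option were added, a minimal sketch might look like this; the `--delay` flag, its parsing, and the default value are all hypothetical, not features of this tool:

```js
// Hypothetical --delay flag letting conscientious users raise the
// inter-request pause; the 3-second default is an assumption, not the
// tool's actual value.
const DEFAULT_DELAY_MS = 3000;

function parseDelayMs(argv) {
  const i = argv.indexOf('--delay');
  if (i === -1 || argv[i + 1] === undefined) return DEFAULT_DELAY_MS;
  const seconds = Number(argv[i + 1]);
  return Number.isFinite(seconds) && seconds >= 0
    ? seconds * 1000
    : DEFAULT_DELAY_MS;
}

console.log(parseDelayMs(process.argv.slice(2)));
```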

I am also, in general, unlikely to be swayed toward doing away with large parts of the tool's intended usage just because IFDB is in an extremely awkward and frankly untenable position: it is both an indispensable community resource and a big ball of mud, so under-resourced that creating a single, up-to-date walk of a few hundred entries in less than five minutes might inconvenience others.