alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

Put a stage on timeout when it hits a rate limit #134

Closed sunu closed 4 years ago

sunu commented 4 years ago

> Works for me, although I'm not a massive fan of the notion of "namespaced stage names": they're super confusing to follow. Perhaps this could be done in the two functions that actually need to access them (timeout and check timeout)?

The main user of the namespaced stage names is get_stages. The problem with stages that aren't properly namespaced is that if the fetch stage of crawler1 is on timeout, every fetch stage across all crawlers ends up on timeout too, which is not what we want.

This is where the meaning of a stage differs between Aleph and Memorious. In Aleph, a stage named INGEST does the same thing for all datasets. In Memorious, a stage named fetch can do different things in different crawlers. With namespaced stage names, we can rate limit just the fetch stage of crawler1 instead of rate limiting the fetch stage in all crawlers.
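
To make the scoping concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes an in-memory dict in place of the real queue backend, and the helper names (namespaced_stage, timeout, check_timeout) and the "crawler.stage" key format are hypothetical, chosen only for illustration.

```python
import time

# Hypothetical in-memory store: namespaced stage name -> expiry timestamp.
# The real project would keep this in its queue backend; a dict is an assumption.
_timeouts = {}


def namespaced_stage(crawler, stage):
    """Build a key unique to one stage of one crawler (key format is assumed)."""
    return f"{crawler}.{stage}"


def timeout(crawler, stage, duration, now=None):
    """Put a single crawler's stage on timeout for `duration` seconds."""
    now = time.time() if now is None else now
    _timeouts[namespaced_stage(crawler, stage)] = now + duration


def check_timeout(crawler, stage, now=None):
    """Return True while this crawler's stage is still on timeout."""
    now = time.time() if now is None else now
    expiry = _timeouts.get(namespaced_stage(crawler, stage))
    return expiry is not None and now < expiry
```

With this, calling timeout("crawler1", "fetch", 60) affects only crawler1's fetch stage; check_timeout("crawler2", "fetch") stays False because its key is different.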