covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

End-to-end integration tests for `sources` #3

Open ryanblock opened 4 years ago

ryanblock commented 4 years ago

All `sources` scrapers (both crawl and scrape) should be subject to end-to-end integration tests, wherein both are exercised against the live cache or the internet.

- **Crawl:** if a function, it should execute and return a valid URL or an object containing `{ url, cookie }` (shape check sketched below).
- **Scrape:** it should load out of the live production cache and return a well-formed result.
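
A minimal sketch of the crawl shape check, using plain Node `assert`; the helper name, the URL regex, and the calling convention are illustrative, not the project's actual API:

```js
const assert = require('assert')

// Per this issue, a crawl function should return either a valid URL
// string or an object containing { url, cookie }.
function assertValidCrawlResult (result, name) {
  if (typeof result === 'string') {
    assert.ok(/^https?:\/\//.test(result), `${name}: crawl did not return a valid URL`)
    return
  }
  assert.ok(result && typeof result.url === 'string', `${name}: crawl object is missing url`)
  if (result.cookie !== undefined) {
    assert.strictEqual(typeof result.cookie, 'string', `${name}: cookie should be a string`)
  }
}

// Hypothetical usage inside the test runner, for each source whose
// crawl entry is a function:
// assertValidCrawlResult(crawl.url(), source.name)
```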

If the cache misses, the integration test runner can invoke a crawl for that source and write the result to disk locally to complete the test.
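
One way the runner's cache-miss fallback could look; the three helpers are injected because their real equivalents in this repo aren't pinned down here, so this is a control-flow sketch only:

```js
// getFromCache(source, date)          -> cached payload, or null on a miss
// runCrawl(source)                    -> freshly crawled payload
// writeLocalCache(source, date, body) -> persists the payload to local disk
async function getCacheEntryForTest (source, date, { getFromCache, runCrawl, writeLocalCache }) {
  const cached = await getFromCache(source, date)
  if (cached) return cached

  // Cache miss: crawl the source now, write the result to the local
  // disk cache, and hand it back so the scrape half of the test runs.
  const fresh = await runCrawl(source)
  await writeLocalCache(source, date, fresh)
  return fresh
}
```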

jzohrab commented 4 years ago

Adding some notes in a Google Doc to think through some test scenarios.

jzohrab commented 4 years ago

Every test scenario in that Google Doc has been added except for this one:

> Every date in the live prod cache should be scrapable, without returning any errors. We will iterate through all dates in the cache (potentially even every date/time, each of which identifies a data set), and the corresponding scraper should return data. Some crawlers use an array of "subregion names" (e.g. county names) to build URLs, but then change that list over time. That would result in cache misses, which must never occur.

I still think this is necessary for successful regeneration and system/data stability.
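
A minimal sketch of that date-iteration check; `listCacheDates`, `loadCacheEntry`, and the `source.scrape` signature are all assumptions made for illustration:

```js
// Iterate every date present in the prod cache for a source and
// assert that its scraper returns data without throwing. An empty
// result or an error is recorded as a failure; this is what would
// catch subregion-name lists drifting over time and causing misses.
async function testAllCachedDates (source, { listCacheDates, loadCacheEntry }) {
  const failures = []
  for (const date of await listCacheDates(source)) {
    try {
      const cached = await loadCacheEntry(source, date)
      const result = await source.scrape(cached, date)
      if (!result || (Array.isArray(result) && result.length === 0)) {
        failures.push(`${source.name} @ ${date}: scrape returned no data`)
      }
    } catch (err) {
      failures.push(`${source.name} @ ${date}: ${err.message}`)
    }
  }
  return failures
}
```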