covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License

COVID Atlas 1.0 architecture & plans #782

Closed: ryanblock closed this issue 4 years ago

ryanblock commented 4 years ago

As with any project of meaningful utility and scale, we never know all of its needs up front.

First, we build the thing, and then we see where it takes us. We learn as quickly as possible, adapt, and grow. (Who could have anticipated that governments would publish pandemic case data in PDFs or images? Or require cookies and CSRF tokens just to request a page containing basic public health data?)
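As a concrete illustration of that second point, here's a minimal sketch of the cookie-and-CSRF dance some sources force on a crawler. Everything in it is hypothetical (the token's markup, the header name, the helper itself); real sources vary wildly:

```ts
// Hypothetical sketch: fetch a landing page to collect session cookies and a
// CSRF token, then present both when requesting the actual data.
async function fetchWithCsrf(landingUrl: string, dataUrl: string): Promise<string> {
  const landing = await fetch(landingUrl);
  // Real code would parse each Set-Cookie header individually.
  const cookies = landing.headers.get('set-cookie') ?? '';
  const html = await landing.text();

  // Assume the token sits in a hidden form field; the markup varies per site.
  const token = html.match(/name="csrf_token"\s+value="([^"]+)"/)?.[1];
  if (!token) throw new Error(`No CSRF token found at ${landingUrl}`);

  const data = await fetch(dataUrl, {
    headers: { cookie: cookies, 'x-csrf-token': token },
  });
  if (!data.ok) throw new Error(`Fetch failed: ${data.status}`);
  return data.text();
}
```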

The purpose of this document is to discuss the future architecture plans¹ for COVID Atlas.

This issue assumes a semi-large-scale refactor.

I know, this can make folks feel uncomfortable. It makes me somewhat uncomfortable. It's also where we are.

A quick spoiler: scrapers may need some updating, but they will be preserved! We love our scrapers. We are not tossing out the scrapers!

Why start fresh

The initial analysis I did of the coronadatascraper codebase seemed promising for an in-flight, gradual refactor into production infrastructure.

After spending the last few weeks in the codebase, however, we surfaced deep architectural flaws that pose significant barriers to fixing core issues in our current processes.

For those who may not be aware of the problems downstream of these issues, they include such fan favorites as: Larry has to stay up until 10pm every night manually releasing the latest data set, which only he knows how to do; unexpected errors can fatally break our entire build; and even minor changes require a large degree of manual verification.
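On the "one unexpected error breaks the whole build" point: one plausible remedy (not a committed design; all names below are hypothetical) is to run each scraper inside its own failure boundary, so a single bad source degrades the output rather than aborting it:

```ts
// Minimal sketch of per-source failure isolation. A scraper that throws is
// recorded as a failure for its location; the rest of the build proceeds.
type Scraper = { location: string; run: () => Promise<unknown> };

async function runAll(scrapers: Scraper[]) {
  const results: { location: string; data: unknown }[] = [];
  const failures: { location: string; error: string }[] = [];
  for (const scraper of scrapers) {
    try {
      results.push({ location: scraper.location, data: await scraper.run() });
    } catch (err) {
      failures.push({ location: scraper.location, error: String(err) });
    }
  }
  return { results, failures };
}
```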

@lazd and I agree these issues are fundamental and must be addressed with seriousness, care, and immediacy.

Second-system syndrome

We must immediately call out a common reason refactors or rewrites may fail: second-system syndrome.

Putting aside the fact that this codebase is only a few weeks old, we still need to be clear about expectations: v1.0 will likely seem like a step back at first; it will do fewer things, and the things it does may be approached differently.

This issue is not a dropbox for every idea we have, nor a long-term roadmap. It is a plan to get us onto robust, stable production infra, and to begin phasing out parts of CDS, as quickly as possible.


What we learned from v0 (coronadatascraper) architecture

Over the last few weeks, we learned an enormous amount from coronadatascraper. Below is a summary of a few of those findings that informed this decision, and will continue to inform our architecture moving forward:

- Crawling
- Scraping
- Data normalization + tagging
- Local workflows
- Testing


Moving towards 1.0 architecture

Prerequisites


Key processes

The "core data pipeline" refers to the timely, essential processes required to publish up-to-date case data to covidatlas.com location views, our API, etc. Those key processes are listed below; a sketch of how they might chain together follows the list.

- Crawling
- Scraping
- Annotator (updating locations' metadata) ← name needs work
- Metadata updater ← name needs work
- Blob publishing (tbd)
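
To make the shape of the pipeline concrete, here's a rough sketch of these stages chained together. Every type and function below is a hypothetical stand-in; the point is that each stage consumes the previous stage's output, so any stage can be re-run in isolation:

```ts
// Hypothetical stage stubs; each would be independently runnable/schedulable.
type Cache = Map<string, string>;                      // raw source documents by URL
type CaseRecord = { location: string; cases: number };

const crawl = async (date: string): Promise<Cache> => new Map();
const scrape = async (cache: Cache): Promise<CaseRecord[]> => [];
const annotate = async (records: CaseRecord[]) => records;  // attach location metadata
const updateMetadata = async (records: CaseRecord[]) => {}; // refresh locations' metadata
const publishBlobs = async (records: CaseRecord[], date: string) => {}; // tbd

// The core data pipeline: crawl → scrape → annotate → update metadata → publish.
async function runPipeline(date: string) {
  const records = await scrape(await crawl(date));
  const annotated = await annotate(records);
  await updateMetadata(annotated);
  await publishBlobs(annotated, date);
}
```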


I'm looking forward to your thoughts, questions, feedback, concerns, encouragement, apprehension, and giddiness.

Let's discuss – and expect to see a first cut this week!


¹ Previous planning took place in https://github.com/covidatlas/coronadatascraper/issues/236 + https://github.com/covidatlas/coronadatascraper/issues/295

shaperilio commented 4 years ago

Join #v1_arch on Slack if you'd like to discuss.

jzohrab commented 4 years ago

@ryanblock - this issue can likely be closed, thoughts?

jzohrab commented 4 years ago

Closing this issue; the new architecture is up and running, even though we're still moving over. :-) Cheers all! jz