Join #v1_arch on Slack if you'd like to discuss.
@ryanblock - this issue can likely be closed, thoughts?
Closing this issue, the new architecture is up and running, even though we're still moving over. :-) Cheers all! jz
As with any project of meaningful utility and scale, we never know all of its needs up front.
First, we build the thing, and then we see where it takes us. We learn as quickly as possible, adapt, and grow. (Who could have anticipated that governments would publish pandemic case data in PDFs or images? Or require cookies and CSRF tokens just to request a page containing basic public health data?)
The purpose of this document is to discuss the future architecture plans¹ for COVID Atlas.
This issue assumes a semi-large scale refactor.
I know, this can make folks feel uncomfortable. It makes me somewhat uncomfortable. It's also where we are.
A quick spoiler: scrapers may need some updating, but they will be preserved! We love our scrapers. We are not tossing out the scrapers!
Why start fresh
The initial analysis I did of the `coronadatascraper` codebase seemed promising for an in-flight, gradual refactor into production infrastructure. After spending the last few weeks in the codebase, however, further discovery surfaced deep underlying architectural flaws that pose significant barriers to overcoming core issues in our current processes.
For those who may not be aware of the problems downstream of these issues, they include such fan favorites as: Larry has to stay up until 10pm every night manually releasing the latest data set, which only he knows how to do; unexpected errors can fatally break our entire build; and even minor changes require a large degree of manual verification.
@lazd and I agree these issues are fundamental and must be addressed with seriousness, care, and immediacy.
Second-system syndrome
We must immediately call out a common reason refactors or rewrites may fail: second-system syndrome.
Putting aside the fact that this codebase is only a few weeks old, we still need to be clear about expectations: v1.0 will likely seem like a step back at first; it will do fewer things, and the things it does may be approached differently.
This issue is not a dropbox for every idea we have, or a long-term roadmap for the future. This issue is a plan to get us into robust and stable production infra as soon as possible, and to begin phasing out parts of CDS as quickly as possible.
What we learned from v0 (`coronadatascraper`) architecture
Over the last few weeks, we learned an enormous amount from `coronadatascraper`. Below is a summary of a few of those findings that informed this decision, and will continue to inform our architecture moving forward:
Crawling
- Crawling must be timezone-aware: a `2020-04-01T00:00:00.000Z` crawl for `San Francisco, CA` must somewhere, at some point, cast its data to `2020-03-31` (midnight UTC on April 1 is still March 31 in the Pacific time zone)
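A minimal sketch of that date casting in Node, assuming the crawler knows each location's IANA time zone (the `America/Los_Angeles` name here is illustrative, not a settled design):

```js
// Cast a UTC crawl timestamp to the local calendar date of the crawled
// location. The 'en-CA' locale formats dates as YYYY-MM-DD.
function localDateOfCrawl (isoTimestamp, timeZone) {
  return new Intl.DateTimeFormat('en-CA', {
    timeZone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit'
  }).format(new Date(isoTimestamp))
}

// Midnight UTC on April 1 is still March 31 in San Francisco
localDateOfCrawl('2020-04-01T00:00:00.000Z', 'America/Los_Angeles')
// → '2020-03-31'
```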
Scraping
Data normalization + tagging
- Sources frequently disagree on names for the same place: `Dekalb County` vs. `DeKalb County`, `Alexandria City` vs. `Alexandria city`, `LaSalle Parish` vs. `La Salle Parish` (see the sketch after this list)
- Alaska has unusual jurisdictions, such as `Yakutat City and Borough`, `Skagway Municipality`, and `Hoonah-Angoon Census Area`
- Some sources aggregate data across multiple counties (e.g. `Uintah`, `Duchesne`, and `Daggett` counties) into a single entity, `Tricounty`, which requires denormalization
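As a hedged sketch of one way to match such name variants, using a hypothetical `normalize` helper that strips case, punctuation, and whitespace before comparing (real matching should also resolve names against a canonical location dataset, which this ignores):

```js
// Reduce a place name to a comparable key: lowercase, then drop
// punctuation and whitespace. Illustrative only — string-munging alone
// isn't sufficient for production normalization.
function normalize (name) {
  return name.toLowerCase().replace(/[^a-z0-9]/g, '')
}

normalize('Dekalb County') === normalize('DeKalb County')     // true
normalize('Alexandria City') === normalize('Alexandria city') // true
normalize('LaSalle Parish') === normalize('La Salle Parish')  // true
```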
Local workflows
Testing
Moving towards 1.0 architecture
Prerequisites
Key processes
- Crawling
- Scraping
- Annotator (updating locations' metadata) ← name needs work
- Metadata updater ← name needs work
- Blob publishing (tbd)
Any large published datasets that we don't want to make accessible via the dynamic API will be handled by a blob publishing operation.
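As a rough sketch of what that blob publishing step might look like, assuming an S3-style object store via the AWS SDK (the bucket name and key layout here are placeholders, not decided):

```js
const AWS = require('aws-sdk')
const s3 = new AWS.S3()

// Publish a large dataset as a single static blob rather than serving
// it from the dynamic API. Bucket and key naming are hypothetical.
async function publishBlob (dataset, date) {
  await s3.putObject({
    Bucket: 'covidatlas-published-data', // illustrative bucket name
    Key: `datasets/${date}/locations.json`,
    Body: JSON.stringify(dataset),
    ContentType: 'application/json'
  }).promise()
}
```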
I'm looking forward to your thoughts, questions, feedback, concerns, encouragement, apprehension, and giddiness.
Let's discuss – and expect to see a first cut this week!
¹ Previous planning took place in https://github.com/covidatlas/coronadatascraper/issues/236 + https://github.com/covidatlas/coronadatascraper/issues/295