deployment-gap-model-education-fund / deployment-gap-model

ETL code for the Deployment Gap Model Education Fund
https://www.deploymentgap.fund/
MIT License

Integrate Deck Data #176

Closed TrentonBush closed 2 years ago

TrentonBush commented 2 years ago

Deck provides about 10 BigQuery datasets that contain multiple tables. Deck also provides queries that answer some DBCP questions.

How do we want to integrate the deck data? Here are some ideas.

  1. Add all of the Hubble tables to our data warehouse, and use Deck's queries to create data_marts.
  2. Add just the required Hubble tables to our data warehouse, and use Deck's queries to create data_marts.
  3. Add the tables produced by Deck's queries to our data warehouse.

I think I'm in favor of Option 2 so our data marts have access to all of the raw data. Option 1 would probably be overkill because we'd be copying hundreds of gigabytes from BigQuery to postgres. Option 3 limits our access to the raw data.

A con of copying the raw Hubble tables to postgres is that they are quite large: tens of millions of rows. We'd end up getting charged some hefty egress fees. To limit those fees we should probably implement CD that runs on GCP so the egress is free. A rough sketch of what the simpler Option 3 could look like is below.
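For concreteness, here is a minimal sketch of Option 3: run one of Deck's queries against BigQuery and land only the result in our postgres warehouse. All names below (project id, dataset, table, query, connection string) are made-up placeholders, not the real Hubble or DBCP names.

```python
from google.cloud import bigquery
import sqlalchemy as sa

# Placeholder project id, not the real one
bq_client = bigquery.Client(project="dbcp-example-project")

# Stand-in for one of the queries Deck is providing
deck_query = """
    SELECT county_fips, project_count
    FROM `dbcp-example-project.hubble_example.example_table`
"""

# Pull only the (small) query result, not the full raw Hubble tables
df = bq_client.query(deck_query).to_dataframe()

# Write the derivative table into the local warehouse (placeholder connection string)
engine = sa.create_engine("postgresql://user:password@localhost:5432/dbcp")
df.to_sql("deck_example_table", engine, schema="data_warehouse",
          if_exists="replace", index=False)
```

Option 2 would look similar, except we'd copy the selected raw Hubble tables instead of query results and then build the data marts locally.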

TrentonBush commented 2 years ago

I was imagining something more like Option 3 because it seemed faster and easier, but I can see how Option 2 makes neater boundaries. I see Option 3 as analogous to how we use Zenodo: store raw data externally and use an ETL process to create derivative tables in the local warehouse.

I don't feel strongly either way about the storage aspect, but I am wary of doing a bunch of extra engineering work as a consequence, either to avoid egress fees or for the sake of developer experience (the presumably large downloads would take a long time for each `make run_etl`). At $0.12/GB egress (I think that's the right number?) and a $125/hour billing rate, each hour of engineering time has to save over a TB of egress to break even for the client. Plus, since we are capacity constrained, they would probably rather spend those hours getting their product working and leave cloud cost concerns for the future. Maybe that is still worth it (idk how big these tables are) but those are the things I would consider.
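Back-of-the-envelope, using the numbers above (both of which are my assumptions, not confirmed prices):

```python
EGRESS_COST_PER_GB = 0.12  # $/GB, assumed BigQuery egress price from this thread
HOURLY_RATE = 125.0        # $/hour, assumed billing rate

# GB of egress that one hour of engineering time must avoid to pay for itself
break_even_gb = HOURLY_RATE / EGRESS_COST_PER_GB
print(f"{break_even_gb:,.0f} GB")  # ~1,042 GB, i.e. just over 1 TB per hour of work
```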

bendnorman commented 2 years ago

Ok! Let's start with Option 3 because it seems the most straightforward. If egress pricing becomes an issue, we can come up with a plan to do more of our data processing in the cloud so we don't have to pay egress fees.

bendnorman commented 2 years ago

Ooo but this decision kind of depends on how the c3 and c4 work should be split up.