GoogleCloudPlatform / covid-19-open-data

Datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world.
Apache License 2.0

docs: understanding locations #554

Open chapmanjacobd opened 1 year ago

chapmanjacobd commented 1 year ago

Good day,

I'm trying to understand the context of place_id in various files. I know that place_id is just an identifier, but I have encountered some puzzling things. Before I dive deep into my questions, I will start light by stating my beliefs about the data and how it is joined together. If any of them are incorrect, please correct them:

$ curl -sS https://api.github.com/repos/GoogleCloudPlatform/covid-19-open-data | grep created_at
  "created_at": "2020-07-23T23:43:51Z",
$ curl -sS https://api.github.com/repos/google-research/open-covid-19-data | grep created_at
  "created_at": "2020-05-21T03:35:01Z",

How does mobility.csv relate to Global_Mobility_Report.csv?

They seem to be talking about exactly the same thing...

But it seems like they are different data products entirely:

sqlite-utils memory Global_Mobility_Report.csv "select count(distinct place_id) from t1"
[{"count(distinct place_id)": 13249}]

sqlite-utils memory mobility.csv "select count(distinct location_key) from t1"
[{"count(distinct location_key)": 7351}]

as well as with aggregated.csv:

xsv select place_id aggregated.csv | sort --unique > aggregated_place_ids.csv
xsv select place_id Global_Mobility_Report.csv | sort --unique > Global_Mobility_Report_place_ids.csv

combine aggregated_place_ids.csv not Global_Mobility_Report_place_ids.csv  | count
14283
combine Global_Mobility_Report_place_ids.csv not aggregated_place_ids.csv  | count
5913
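
For anyone without xsv or moreutils installed, the same comparison can be sketched in Python with only the standard library. This is a minimal sketch, not the repository's own tooling: the helper name `distinct_values` and the tiny in-memory CSVs are made up for illustration, and only the `place_id` column name comes from the commands above.

```python
import csv
import io


def distinct_values(lines, column):
    """Return the set of non-empty values found in `column` of a CSV source.

    `lines` can be an open file or any iterable of CSV text lines.
    """
    reader = csv.DictReader(lines)
    return {row[column] for row in reader if row.get(column)}


# Hypothetical stand-ins for aggregated.csv and Global_Mobility_Report.csv.
aggregated = io.StringIO(
    "place_id,date\n"
    "A,2020-01-01\n"
    "B,2020-01-01\n"
    "B,2020-01-02\n"
)
mobility_report = io.StringIO(
    "place_id,date\n"
    "B,2020-01-01\n"
    "C,2020-01-01\n"
)

aggregated_ids = distinct_values(aggregated, "place_id")
report_ids = distinct_values(mobility_report, "place_id")

# Same idea as `combine X not Y`: identifiers present in one file only.
only_in_aggregated = aggregated_ids - report_ids
only_in_report = report_ids - aggregated_ids

print(len(only_in_aggregated), len(only_in_report))
```

With the real files, `open("aggregated.csv", newline="")` can be passed in place of the `io.StringIO` objects.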

chapmanjacobd commented 1 year ago

After reading through more code, I think I get it now:

https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/e2f6c1c0840fa1dc301ed798f6a624781b453c19/src/pipelines/mobility/google_mobility.py
https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/15e2bdd4b1c7a523a74f42b3ada89f3686dbc882/src/pipelines/mobility/config.yaml

"Global_Mobility_Report.csv" is a source dataset that is joined with other data, via knowledge_graph.csv, to create "mobility.csv" and "aggregated.csv".
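
To make that relationship concrete, here is a rough sketch of what such a join might look like. It is not the code in google_mobility.py: the helper names, the sample rows, and the assumption that knowledge_graph.csv maps a `place_id` column to a `location_key` column are all hypothetical. If the real join works along these lines, source rows with no match in the lookup would be dropped, which is one possible reason the two files have different numbers of distinct identifiers.

```python
import csv
import io


def build_lookup(lines, key_column, value_column):
    """Map each non-empty `key_column` value to its `value_column` value."""
    reader = csv.DictReader(lines)
    return {row[key_column]: row[value_column] for row in reader if row[key_column]}


def relabel(lines, lookup, id_column="place_id", key_column="location_key"):
    """Yield source rows re-keyed by `key_column`, skipping unmatched IDs."""
    for row in csv.DictReader(lines):
        key = lookup.get(row.pop(id_column))
        if key is not None:
            yield {key_column: key, **row}


# Hypothetical stand-in for knowledge_graph.csv (column names are assumed).
knowledge_graph = io.StringIO(
    "location_key,place_id\n"
    "US,P1\n"
    "FR,P2\n"
)
# Hypothetical stand-in for Global_Mobility_Report.csv.
source_report = io.StringIO(
    "place_id,date,retail\n"
    "P1,2020-03-01,-5\n"
    "P3,2020-03-01,-9\n"
)

place_to_key = build_lookup(knowledge_graph, "place_id", "location_key")
mobility_rows = list(relabel(source_report, place_to_key))

# P3 has no entry in the lookup, so only the P1 row survives the join.
print(mobility_rows)
```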