finddx / FINDCov19Tracker

https://dsbbfinddx.github.io/FINDCov19Tracker/

Update summary #27

Closed benubah closed 2 years ago

benubah commented 2 years ago

To harmonize our data everywhere we need this PR:

  1. Filter out countries that have no last update date from the summaries of data_all, as we do in the tracker. This reduces the size of data_all.csv by >1MB.
    left_join(country_last_update_info, by = "unit") |>
    filter(!is.na(last_update))
  2. Drop avg_ values and use sum_ values instead: we need the sums rather than the averages for groups in data_all.csv because we are summarizing for a single day.
    rename(pos = avg_pos) |>
    rename_with(\(x) gsub("^sum_", "", x))
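For readers following along, the two steps can be sketched end to end on toy data. This is only an illustration, not the code from the PR: the tables `data_summary` and `country_last_update_info` are filled with made-up values, the column names `avg_all_new_tests`, `sum_all_new_tests` and `sum_cap_new_tests` are assumed, and the `select()` call is one assumed way of dropping the remaining avg_ columns.

```r
library(dplyr)

# Hypothetical summary table; values and most column names are made up.
data_summary <- tibble(
  unit = c("AAA", "BBB", "CCC"),
  avg_pos = c(0.10, 0.20, NA),
  avg_all_new_tests = c(50, 125, 20),
  sum_all_new_tests = c(100, 250, 40),
  sum_cap_new_tests = c(1.0, 2.5, 0.4)
)

# Hypothetical lookup table: "CCC" has no last update date.
country_last_update_info <- tibble(
  unit = c("AAA", "BBB"),
  last_update = as.Date(c("2021-06-01", "2021-06-02"))
)

result <- data_summary |>
  # Step 1: keep only units that have a last update date.
  left_join(country_last_update_info, by = "unit") |>
  filter(!is.na(last_update)) |>
  # Step 2: keep avg_pos as pos, drop the other avg_ columns (assumed step),
  # and strip the sum_ prefix from the remaining columns.
  rename(pos = avg_pos) |>
  select(-starts_with("avg_")) |>
  rename_with(\(x) gsub("^sum_", "", x))

print(result)
```

In this sketch the row for "CCC" is dropped because the join leaves its `last_update` as NA, and the remaining columns are `unit`, `pos`, `all_new_tests`, `cap_new_tests` and `last_update`.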

christophsax commented 2 years ago

Thanks @benubah, 1. looks good to me.

Re 2. But shouldn't avg_ and sum_ be identical if we summarize over one day only? I used avg_ because I thought there might be a case where we have multiple entries per day (we shouldn't), and averaging seems more reasonable than summing up. If you can confirm the two are the same, I am fine with your change; I just want to understand your reasoning.

benubah commented 2 years ago

> Re 2. But shouldn't avg_ and sum_ be identical if we summarize over one day only? I used avg_ because I thought there might be a case where we have multiple entries per day (we shouldn't), and averaging seems more reasonable than summing up. If you can confirm the two are the same, I am fine with your change; I just want to understand your reasoning.

avg_cap_ and sum_cap_ are the same, while avg_all_ and sum_all_ are not. In the end we need all_new_ = cap_new_ * pop, and this seems to work when we use sum_ and drop avg_.
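A minimal sketch of why the sum and the average of an all_ column can differ even for a single day, assuming a group row aggregates several countries. The table `day`, its values, and the column names `all_new_tests` and `cap_new_tests` are hypothetical, and the per-capita value is computed here simply as total divided by population so that the identity all_new_ = cap_new_ * pop can be checked.

```r
library(dplyr)

# Hypothetical data: two countries belonging to one group, one day.
day <- tibble(
  group = c("G1", "G1"),
  unit = c("AAA", "BBB"),
  pop = c(1000, 3000),
  all_new_tests = c(10, 60)
)

by_group <- day |>
  group_by(group) |>
  summarise(
    pop = sum(pop),
    sum_all_new_tests = sum(all_new_tests),   # group total
    avg_all_new_tests = mean(all_new_tests),  # per-country average
    .groups = "drop"
  ) |>
  # Per-capita value derived from the group total (assumed definition).
  mutate(cap_new_tests = sum_all_new_tests / pop)

print(by_group)
```

Here the sum is 70 and the average is 35, and only the sum satisfies cap_new_tests * pop for the group. With exactly one row per unit and day the two would coincide.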
