Update summary - Githubissues

finddx / FINDCov19Tracker

https://dsbbfinddx.github.io/FINDCov19Tracker/

Update summary #27

Closed benubah closed 2 years ago

benubah commented 2 years ago

To harmonize our data everywhere we need this PR:

1. Filter out countries that have no last update date from the summaries in data_all, as we already do in the tracker. This reduces the size of data_all.csv by more than 1 MB.
```
left_join(country_last_update_info, by = "unit") |>
filter(!is.na(last_update))
```
2. Drop avg_ values and use sum_ values instead. We need sums rather than averages for groups in data_all.csv because we are summarizing over a single day.
```
rename(pos = avg_pos) |>
rename_with(\(x) gsub("^sum_", "", x)) 
```
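For reference, a minimal sketch of both steps on toy data. It assumes dplyr is loaded and R >= 4.1 (for `|>` and `\(x)`). The `summaries` table, its values, and the `sum_all_new_tests` column are made up for illustration; only `unit`, `last_update`, `avg_pos`, `country_last_update_info`, and the `sum_` prefix come from the snippets above.

```r
library(dplyr)

# Hypothetical per-unit summary; all values are invented for illustration.
summaries <- tibble(
  unit = c("A", "B", "C"),
  avg_pos = c(0.10, 0.20, 0.30),
  sum_all_new_tests = c(100, 200, 300)
)

# Hypothetical lookup table; unit "C" is absent, so it gets no last_update.
country_last_update_info <- tibble(
  unit = c("A", "B"),
  last_update = as.Date(c("2022-01-01", "2022-01-02"))
)

result <- summaries |>
  # Step 1: attach last_update and drop units without one.
  left_join(country_last_update_info, by = "unit") |>
  filter(!is.na(last_update)) |>
  # Step 2: rename avg_pos to pos and strip the sum_ prefix from column names.
  rename(pos = avg_pos) |>
  rename_with(\(x) gsub("^sum_", "", x))

result
```

In this toy case unit "C" is dropped, and the remaining columns are `unit`, `pos`, `all_new_tests`, and `last_update`.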

christophsax commented 2 years ago

Thanks @benubah, 1. looks good to me.

Re 2. But shouldn't avg_ and sum_ be identical if we summarize over one day only? I used avg_ because I thought there might be a case where we have multiple entries per day (we shouldn't), and averaging seems more reasonable than summing. If you can confirm the two are the same I am fine with your change; I just want to understand your reasoning.

benubah commented 2 years ago

> Re 2. But shouldn't avg_ and sum_ be identical if we summarize over one day only? I used avg_ because I thought there might be a case where we have multiple entries per day (we shouldn't), and averaging seems more reasonable than summing. If you can confirm the two are the same I am fine with your change; I just want to understand your reasoning.

avg_cap_ and sum_cap_ are the same, while avg_all_ and sum_all_ are not. In the end we need all_new_ = cap_new_ * pop, and this seems to hold when we use sum_ and drop avg_.
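A small sketch of the identity on made-up numbers. It assumes a group's per-capita value is the group total divided by the group population; that is an assumption about the pipeline, not something confirmed in this thread, and the variable names below are hypothetical.

```r
# Two hypothetical countries in one group on a single day.
all_new <- c(100, 300)    # absolute new counts per country
pop     <- c(1000, 1000)  # population per country

# Assumed group-level per-capita value: group total / group population.
cap_new_group <- sum(all_new) / sum(pop)

# Using the sum as the group value satisfies all_new = cap_new * pop.
sum_ok <- isTRUE(all.equal(sum(all_new), cap_new_group * sum(pop)))

# Using the mean as the group value breaks the identity for these numbers.
avg_ok <- isTRUE(all.equal(mean(all_new), cap_new_group * sum(pop)))

sum_ok  # TRUE
avg_ok  # FALSE
```

Under that assumption, the sum is the only group-level value of the two that keeps the absolute and per-capita columns consistent with each other.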