Cumulative tests are the cumsum of the linearly interpolated new tests

findanna commented 2 years ago

@christophsax @benubah the cumulative tests are not the cumulative sum of the 7-day rolling average of new tests, but the cumulative sum of new_tests before the rolling average is applied. See code here.

The last all_cum_tests value should always match with the last cum_tests_orig value.
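
The distinction can be illustrated with a small base-R sketch on made-up numbers. This is not the FINDCov19Tracker code: the interpolation via approx(), the 3-day trailing window (used instead of 7 days to keep the toy series short), and every name other than cum_tests_orig, new_tests and all_cum_tests are assumptions for illustration.

```r
# Toy daily series of reported cumulative tests with gaps (NA).
# All numbers are made up.
cum_tests_orig <- c(100, NA, NA, 160, 200, NA, 260, 300, 330, 400)
days <- seq_along(cum_tests_orig)
reported <- !is.na(cum_tests_orig)

# Linearly interpolate the gaps in the reported cumulative series.
cum_interp <- approx(
  x = days[reported],
  y = cum_tests_orig[reported],
  xout = days
)$y

# New tests per day: starting level first, then daily increments.
new_tests <- c(cum_interp[1], diff(cum_interp))

# Cumulative sum of the un-smoothed new tests.
all_cum_tests <- cumsum(new_tests)

# Trailing 3-day rolling mean of new tests (assumed window, see above).
new_tests_smooth <- as.numeric(
  stats::filter(new_tests, rep(1 / 3, 3), sides = 1)
)

# Cumulative sum of the smoothed series; leading NAs count as zero.
cum_from_smooth <- cumsum(ifelse(is.na(new_tests_smooth), 0, new_tests_smooth))

last_reported <- tail(cum_tests_orig, 1)
matches_unsmoothed <- isTRUE(all.equal(tail(all_cum_tests, 1), last_reported))
matches_smoothed   <- isTRUE(all.equal(tail(cum_from_smooth, 1), last_reported))
```

In this toy series the cumulative sum of the un-smoothed increments reproduces the last reported total, while the cumulative sum of the smoothed series does not.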

christophsax commented 2 years ago

Good spot, thanks.

> The last all_cum_tests value should always match with the last cum_tests_orig value.

@benubah This is something that we can add as an expectation to the tests, too.
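
One way such an expectation might look is sketched below in base R. The helper last_cum_mismatches() is hypothetical and not part of shinyfind. It assumes a data frame with the columns name, time, cum_tests_orig and all_cum_tests as used in this thread, and it treats a missing value in the last row as a mismatch.

```r
# Hypothetical helper: for each country, take the most recent row and
# return those where the two cumulative columns disagree.
last_cum_mismatches <- function(data) {
  # Sort so that the last row per name is the latest date.
  data <- data[order(data$name, data$time), ]
  last_rows <- data[!duplicated(data$name, fromLast = TRUE), ]

  # Exact comparison of counts; NA in either column counts as a mismatch.
  differs <- is.na(last_rows$cum_tests_orig) |
    is.na(last_rows$all_cum_tests) |
    last_rows$cum_tests_orig != last_rows$all_cum_tests

  last_rows[differs, c("name", "time", "cum_tests_orig", "all_cum_tests")]
}

# Made-up data: country "A" matches on its last day, country "B" does not.
toy <- data.frame(
  name = c("A", "A", "B", "B"),
  time = as.Date(c("2022-04-21", "2022-04-22", "2022-04-21", "2022-04-22")),
  cum_tests_orig = c(100, 150, 200, 260),
  all_cum_tests  = c(100, 150, 200, 250)
)

mismatches <- last_cum_mismatches(toy)
```

In a test suite, the expectation would then be that the returned data frame has zero rows.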

benubah commented 2 years ago

This should make more sense. Let's see.

benubah commented 2 years ago

Most countries have equal cum_tests_orig and all_cum_tests, but not all:

remotes::install_github("dsbbfinddx/shinyfind", ref="summarize-tweaks")

library(dplyr)
library(shinyfind)

heartbeat_local <- NA
. <- shinyfind::get_data_all()
data_all <- .$data_all
country_last_update_info <- .$country_last_update_info

data_cum_tests <-
  .$data_all |>
  filter(set == "country") |>
  filter(time == as.Date("2022-04-22")) |>
  select(name, cum_tests_orig, all_cum_tests) |>
  mutate(remark = if_else(cum_tests_orig == all_cum_tests, "EQUAL", "NOT EQUAL"))

print(data_cum_tests)
#> # A tibble: 179 x 4
#>    name                 cum_tests_orig all_cum_tests remark   
#>    <chr>                         <dbl>         <dbl> <chr>    
#>  1 Afghanistan                  938587        938587 EQUAL    
#>  2 Angola                      1499795       1499795 EQUAL    
#>  3 Albania                     1795792       1791230 NOT EQUAL
#>  4 Andorra                      300307        300307 EQUAL    
#>  5 United Arab Emirates      153553806     153553806 EQUAL    
#>  6 Argentina                  35716069      35716069 EQUAL    
#>  7 Armenia                     3026628       3026628 EQUAL    
#>  8 Antigua & Barbuda             16700         16700 EQUAL    
#>  9 Australia                  68626642      68587330 NOT EQUAL
#> 10 Austria                   181007797     181007797 EQUAL    
#> # ... with 169 more rows

Created on 2022-04-22 by the reprex package (v2.0.1)

christophsax commented 2 years ago

Thanks! Perhaps we can repeat with the latest data_all.csv, once it is produced by the latest versions of the workflow and shinyfind.

Can we show the rows with diffs only, and sort them decreasingly by the size of the diff, so that we see the really problematic ones first? abs(log(cum_tests_orig) - log(all_cum_tests)) gives the approximate relative (percentage) difference, at least when the two values are close.
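
As a rough check of how well the log difference tracks the relative difference, here is a small base-R sketch on made-up values (the numbers and variable names are illustrative only):

```r
# Made-up pairs: a 1% gap, a 10% gap and a threefold gap.
orig <- c(1000, 1000, 1000)
all  <- c(1010, 1100, 3000)

# Measure proposed above: absolute difference of the logs.
log_diff <- abs(log(orig) - log(all))

# Plain relative difference, measured against the original value.
rel_diff <- abs(all - orig) / orig

# exp(log_diff) recovers the ratio of the larger to the smaller value.
ratio <- exp(log_diff)

comparison <- data.frame(orig, all, log_diff, rel_diff, ratio)
comparison
```

For the 1% gap the two measures nearly coincide. For the threefold gap the log difference is about 1.1 while the relative difference is 200%, so large values of the measure are better read as log ratios than as percentages. The ordering of the rows is the same either way.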

Two minor things:

- We don't need print() to print.
- heartbeat_local <- NA: do we need that? If not, please remove it. If we do, we should get rid of it in shinyfind::get_data_all(). Not urgent, but we could open an issue for that. The whole heartbeat mechanism is no longer needed, since we use memoise.

benubah commented 2 years ago

> Can we show the rows with diffs only, and sort them decreasingly by the size of the diff, so that we see the really problematic ones first?

The number of rows with a diff seems to differ from day to day.

On 2022-04-22, we have:

library(dplyr)
library(shinyfind)

. <- shinyfind::get_data_all()
data_all <- .$data_all
country_last_update_info <- .$country_last_update_info

data_cum_tests <-  
  data_all |>
  filter(set == "country") |>
  filter(time == as.Date("2022-04-22")) |>
  select(name, cum_tests_orig, all_cum_tests) |>
  filter(cum_tests_orig != all_cum_tests) |>
  mutate(diff = abs(log(cum_tests_orig) - log(all_cum_tests))) |>
  arrange(desc(diff))

data_cum_tests
#> # A tibble: 66 x 4
#>    name              cum_tests_orig all_cum_tests   diff
#>    <chr>                      <dbl>         <dbl>  <dbl>
#>  1 Trinidad & Tobago         233175        693033 1.09  
#>  2 El Salvador              1843224       2432752 0.278 
#>  3 Belgium                 33456470      36618631 0.0903
#>  4 South Sudan               374797        406944 0.0823
#>  5 Thailand                22978475      23754384 0.0332
#>  6 Kazakhstan              11575012      11276018 0.0262
#>  7 Belize                    532846        522306 0.0200
#>  8 Spain                   62986857      63817211 0.0131
#>  9 Japan                   44500652      45075155 0.0128
#> 10 Netherlands             28622957      28947062 0.0113
#> # ... with 56 more rows

Created on 2022-04-27 by the reprex package (v2.0.1)

On 2022-04-25, we have:

library(dplyr)
library(shinyfind)

. <- shinyfind::get_data_all()
data_all <- .$data_all
country_last_update_info <- .$country_last_update_info

data_cum_tests <-  
  data_all |>
  filter(set == "country") |>
  filter(time == as.Date("2022-04-25")) |>
  select(name, cum_tests_orig, all_cum_tests) |>
  filter(cum_tests_orig != all_cum_tests) |>
  mutate(diff = abs(log(cum_tests_orig) - log(all_cum_tests))) |>
  arrange(desc(diff))

data_cum_tests
#> # A tibble: 34 x 4
#>    name          cum_tests_orig all_cum_tests    diff
#>    <chr>                  <dbl>         <dbl>   <dbl>
#>  1 El Salvador          1843224       2432752 0.278  
#>  2 Belgium             33456470      36675197 0.0919 
#>  3 South Sudan           378620        410357 0.0805 
#>  4 Kazakhstan          11575012      11276018 0.0262 
#>  5 Belize                532846        522306 0.0200 
#>  6 Spain               62986857      63817211 0.0131 
#>  7 Japan               44832046      45406549 0.0127 
#>  8 Netherlands         28622957      28947062 0.0113 
#>  9 New Caledonia          42756         42391 0.00857
#> 10 Iran                50708575      50969794 0.00514
#> # ... with 24 more rows

Created on 2022-04-27 by the reprex package (v2.0.1)
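
To compare days without rerunning the pipeline once per date, the mismatches could also be counted for all dates at once. The sketch below uses base R on a made-up stand-in for data_all, so that it runs without shinyfind or dplyr. The toy values are invented; only the column names follow the code above.

```r
# Made-up stand-in for `data_all`; the real table comes from
# shinyfind::get_data_all() and has many more rows and columns.
toy <- data.frame(
  set  = "country",
  name = rep(c("A", "B", "C"), each = 2),
  time = rep(as.Date(c("2022-04-22", "2022-04-25")), times = 3),
  cum_tests_orig = c(100, 120, 200, 230, 300, 330),
  all_cum_tests  = c(100, 120, 210, 230, 310, 340)
)

countries <- toy[toy$set == "country", ]

# Flag rows where the two cumulative columns disagree.
countries$differs <- countries$cum_tests_orig != countries$all_cum_tests

# Number of countries with a mismatch on each date.
n_diff_by_day <- tapply(countries$differs, countries$time, sum)
n_diff_by_day
```

On the real data, the same grouping would show on which dates the number of mismatching countries changes.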

benubah commented 2 years ago

Getting rid of heartbeat and all of check_for_update here: https://github.com/finddx/shinyfind/pull/31/commits/52860443693473b6365724920e204750270e9bbd

finddx / FINDCov19Tracker

Cumulative tests are the cumsum of the linearly interpolated new tests #28