covid19datahub / COVID19

A worldwide epidemiological database for COVID-19 at fine-grained spatial resolution
https://covid19datahub.io
GNU General Public License v3.0

problem with today's data #4

Closed dankelley closed 4 years ago

dankelley commented 4 years ago

Many thanks for this package.

I'm wondering whether I'm missing something, as illustrated by the R script and output given below, which I ran with a copy of COVID19 updated a few minutes ago.

Note the most recent value of confirmed, for example.

I can work around this issue by ignoring today's data if they disagree badly with the data from the day before, but I am pointing this out in case it reveals a problem that you might want to look at. (Or perhaps COVID19 provides a way to skip not-yet-complete data?)

R script

library(COVID19)
old <- world("country")
new <- covid19()
for (country in c("Canada", "United States")) {
    cat("#", country, "\n")
    print(tail(old[old$country == country, ], 3))
    print(tail(new[new$country == country, ], 3))
}

Output


R version 4.0.0 alpha (2020-04-01 r78130)
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(COVID19)
> old <- world("country")
> new <- covid19()
> for (country in c("Canada", "United States")) {
+     cat("#", country, "\n")
+     print(tail(old[old$country == country, ], 3))
+     print(tail(new[new$country == country, ], 3))
+ }
# Canada 
# A tibble: 3 x 21
# Groups:   id [1]
  id    date       deaths confirmed tests recovered  hosp   icu  vent country
  <chr> <date>      <dbl>     <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl> <chr>  
1 CAN   2020-04-13    779     25674     0    107480     0     0     0 Canada 
2 CAN   2020-04-14    899     27029     0    116822     0     0     0 Canada 
3 CAN   2020-04-15      0         8     0      8210     0     0     0 Canada 
# … with 11 more variables: state <lgl>, city <lgl>, lat <dbl>, lng <dbl>,
#   pop <int>, pop_14 <dbl>, pop_15_64 <dbl>, pop_65 <dbl>, pop_age <dbl>,
#   pop_density <dbl>, pop_death_rate <dbl>
# A tibble: 3 x 21
# Groups:   id [1]
  id    date       deaths confirmed tests recovered  hosp   icu  vent country
  <chr> <date>      <dbl>     <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl> <chr>  
1 CAN   2020-04-13    779     25674     0    107480     0     0     0 Canada 
2 CAN   2020-04-14    899     27029     0    116822     0     0     0 Canada 
3 CAN   2020-04-15      0         8     0      8210     0     0     0 Canada 
# … with 11 more variables: state <lgl>, city <lgl>, lat <dbl>, lng <dbl>,
#   pop <int>, pop_14 <dbl>, pop_15_64 <dbl>, pop_65 <dbl>, pop_age <dbl>,
#   pop_density <dbl>, pop_death_rate <dbl>
# United States 
# A tibble: 3 x 21
# Groups:   id [1]
  id    date       deaths confirmed tests recovered  hosp   icu  vent country
  <chr> <date>      <dbl>     <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl> <chr>  
1 USA   2020-04-13  23468    578978     0         0     0     0     0 United…
2 USA   2020-04-14  25770    605948     0         0     0     0     0 United…
3 USA   2020-04-15      0         0     0         0     0     0     0 United…
# … with 11 more variables: state <lgl>, city <lgl>, lat <dbl>, lng <dbl>,
#   pop <int>, pop_14 <dbl>, pop_15_64 <dbl>, pop_65 <dbl>, pop_age <dbl>,
#   pop_density <dbl>, pop_death_rate <dbl>
# A tibble: 3 x 21
# Groups:   id [1]
  id    date       deaths confirmed tests recovered  hosp   icu  vent country
  <chr> <date>      <dbl>     <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl> <chr>  
1 USA   2020-04-13  23468    578978     0         0     0     0     0 United…
2 USA   2020-04-14  25770    605948     0         0     0     0     0 United…
3 USA   2020-04-15      0         0     0         0     0     0     0 United…
# … with 11 more variables: state <lgl>, city <lgl>, lat <dbl>, lng <dbl>,
#   pop <int>, pop_14 <dbl>, pop_15_64 <dbl>, pop_65 <dbl>, pop_age <dbl>,
#   pop_density <dbl>, pop_death_rate <dbl>

eguidotti commented 4 years ago

Thanks for the hint! I just added the start and end arguments. By default, yesterday is skipped, as some observations may be incomplete. Let me know if this works for you and I can close the issue.

dankelley commented 4 years ago

Thanks. I wonder if it would make sense for the default end to be the day before yesterday, as opposed to 2 days ago? And, if you're worried about backward compatibility, you perhaps ought to use today's date. (NOTE: I am not sure about the timezone issue here.)

Please feel free to close, either way.

PS. I sort of solved this in my code that uses covid19(), by computing the standard deviation of the last week (skipping today) and seeing if today differed from yesterday by more than 2 times this standard deviation. Sometimes, the most recent point seems OK, which I assume relates to the timezone of the nation in question.
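
The check described in this PS could be sketched roughly as follows. This is only an illustration, not the code actually used: the function name `latest_is_suspect()`, its arguments, and the sample numbers are made up, and it assumes that "the standard deviation of the last week" means the standard deviation of the seven cumulative values that precede today's.

```r
# Hypothetical helper (not part of the COVID19 package): returns TRUE when the
# newest value of a cumulative series differs from the previous value by more
# than k standard deviations of the preceding `window` values.
latest_is_suspect <- function(x, window = 7, k = 2) {
    n <- length(x)
    if (n < window + 1) return(FALSE)   # too little history to judge
    week <- x[(n - window):(n - 1)]     # the `window` values before today's
    abs(x[n] - x[n - 1]) > k * sd(week)
}

# Made-up cumulative counts, for illustration only.
history <- seq(20000, 27000, by = 1000)
latest_is_suspect(c(history, 8))        # TRUE: today's value collapsed
latest_is_suspect(c(history, 28000))    # FALSE: today's value fits the trend
```

A caller could drop the last row of a country's data whenever this returns TRUE for the confirmed column. The threshold of 2 comes from the description above; the window length is an assumption.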

eguidotti commented 4 years ago

You are right, the default for end should be Sys.Date(). Also, I found a bug in aggregating the data on the last date for not-yet-complete data. Thanks for your help!
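
To make the effect of a start/end window concrete, here is a minimal sketch of date filtering on a data frame with a date column. It is not the package's implementation: `filter_dates()`, its default start, and the sample rows are hypothetical; only `Sys.Date()` and `as.Date()` are base R.

```r
# Hypothetical helper (not the package's code): keep the rows whose date lies
# in [start, end]. `end` defaults to today's date via base R's Sys.Date();
# the default `start` is an arbitrary assumption.
filter_dates <- function(x, start = "2020-01-01", end = Sys.Date()) {
    keep <- x$date >= as.Date(start) & x$date <= as.Date(end)
    x[keep, , drop = FALSE]
}

# Made-up rows, for illustration only.
d <- data.frame(date = as.Date("2020-04-13") + 0:2,
                confirmed = c(10, 20, 0))
filter_dates(d, end = "2020-04-14")   # drops the 2020-04-15 row
```

With end left at its default, every row up to and including today is kept, so an incomplete most-recent day would still need a separate check.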

dankelley commented 4 years ago

Thanks again. By the way, http://emit.phys.ocean.dal.ca/~kelley/covid19/ holds results that use your package to download data. (Previously, I was doing direct downloads from the Johns Hopkins server, but I thought it preferable to direct people to your nice package.) I live in Canada, which explains why I do a sub-national breakdown for that country.

Stay well, and stay safe!

-- Dan.

eguidotti commented 4 years ago

Great to hear that! The project is funded by IVADO, Canada. It will soon appear in this repo. I added your results to the Use Cases section in the README. If you want to contribute to the data collection, don't hesitate to jump on the mission! See how to contribute

Stay safe! -- Emanuele

eguidotti commented 4 years ago

Btw, I realized your link http://emit.phys.ocean.dal.ca/~kelley/covid19/ is using the deprecated world() function. It would be great if you could replace it with covid19(), if it's not too much work of course.

dankelley commented 4 years ago

Um ... is it? I don't see that. (I thought I changed them all.) Maybe you can tell me the file URL you're looking at?

eguidotti commented 4 years ago

I'm reading the URL you sent: http://emit.phys.ocean.dal.ca/~kelley/covid19/

The data are acquired from the Johns Hopkins University Center for Systems Science and Engineering by the world() function provided by the COVID19 R package

Also, I read:

Note: the COVID19 package used to download Canadian provincial data may soon switch the name of the downloading function. A signal that [...]

Maybe just a cache issue?

dankelley commented 4 years ago

Thanks very much. I've fixed that. (I had forgotten to update the .html file, since my main focus was on the .R files, when I changed some country names today to match what covid19() returns as of today.)

Stay well!