possibly hours-old problem with COVID19 'deaths' column

dankelley commented 4 years ago

First, I apologize that this issue is quite long. You can basically see my problem by looking at the code and output blocks at the bottom. I think there may be a problem with COVID19 that did not exist yesterday.

I'm wondering whether something has changed very recently with COVID19, in the deaths column. Below is some code that shows unexpected results. I am not sure whether this is a difficulty in how subset is working, how [ is working, or perhaps in the deaths column. I am not familiar with tibbles, having started using R long before they were invented, so maybe both of my trial methods for extracting data are faulty?

NOTE: I am not querying by ISO codes for countries, because I simply don't know all the codes, whereas I do know the actual names. Also, I'm doing this for nearly 200 countries, and I fear that calling covid19() that many times will be slow.

My confusion points are:

Why do [ and subset give different results?
Why does subset give incorrect results (i.e. the max per country is identical to the max for the world)?
How can [ work so differently for different countries?

As a clue, I am pretty sure the results I am getting this morning are different from those I got yesterday; the previous results were not giving 0 deaths in countries where I know for sure there have been deaths.

The R code

library(COVID19)
d <- covid19(end=Sys.Date()-1)
cat("World:\n    ", max(d$deaths), "deaths\n")
for (country in c("Australia", "Canada", "United Kingdom", "United States")) {
    cat(country, ":\n", sep="")
    sub1 <- subset(d, d$country == country)
    cat("    method 1 reveals ", max(sub1$deaths), "deaths\n")
    sub2 <- d[d$country == country, ]
    cat("    method 2 reveals ", max(sub2$deaths), "deaths\n")
}

gives output

World:
     56259 deaths
Australia:
    method 1 reveals  56259 deaths
    method 2 reveals  0 deaths
Canada:
    method 1 reveals  56259 deaths
    method 2 reveals  0 deaths
United Kingdom:
    method 1 reveals  56259 deaths
    method 2 reveals  21092 deaths
United States:
    method 1 reveals  56259 deaths
    method 2 reveals  56259 deaths

eguidotti commented 4 years ago

This doesn't seem to be related to the package... anyway, here is the solution:

library(COVID19)
d <- covid19(end=Sys.Date()-1)
cat("World:\n    ", max(d$deaths), "deaths\n")
for (country_name in c("Australia", "Canada", "United Kingdom", "United States")) {
  cat(country_name, ":\n", sep="")
  sub1 <- subset(d, country == country_name)
  cat("    method 1 reveals ", max(sub1$deaths), "deaths\n")
  sub2 <- d[d$country == country_name, ]
  cat("    method 2 reveals ", max(sub2$deaths), "deaths\n")
}

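When you run subset(d, d$country == country), the condition is evaluated inside d, so country refers to the column of d, not to the country variable you defined in the loop. The comparison is therefore true for every row with a non-missing country, nothing is filtered out, and method 1 reports the world maximum for each country. Renaming the loop variable (country_name above) avoids the clash. See ?subset for details.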
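
To make the name clash concrete, here is a minimal sketch on a toy data frame. The data values and the names all_rows, only_a and also_a are made up for illustration; they stand in for the covid19() result and are not taken from the package.

```r
# Toy data frame standing in for the covid19() result (values are made up).
d <- data.frame(
  country = c("A", "A", "B", "B"),
  deaths  = c(1, 5, 2, 9),
  stringsAsFactors = FALSE
)

country <- "A"        # loop variable sharing its name with a column of d
country_name <- "A"   # loop variable with a distinct name

# Inside subset(), `country` resolves to the column d$country, so this
# compares the column with itself and keeps every row.
all_rows <- subset(d, d$country == country)

# With a distinct name, the comparison uses the loop variable as intended.
only_a <- subset(d, country == country_name)

# `[` evaluates its index in the calling environment, so `country` here
# is the loop variable and the filter works either way.
also_a <- d[d$country == country, ]

print(nrow(all_rows))        # 4: nothing was filtered out
print(max(all_rows$deaths))  # 9: the overall maximum
print(max(only_a$deaths))    # 5
print(max(also_a$deaths))    # 5
```

This explains why only method 1 changed after the rename: method 2 was already filtering on the loop variable.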

dankelley commented 4 years ago

Thanks. I just ran your suggested code. It updated COVID19, and I got the output below. Do you get similar results? I notice 0 deaths for two countries that have had deaths, and for the US, I get the same as for the world.

I'm sorry to be a bother.

World:
     56259 deaths
Australia:
    method 1 reveals  0 deaths
    method 2 reveals  0 deaths
Canada:
    method 1 reveals  0 deaths
    method 2 reveals  0 deaths
United Kingdom:
    method 1 reveals  21092 deaths
    method 2 reveals  21092 deaths
United States:
    method 1 reveals  56259 deaths
    method 2 reveals  56259 deaths
