covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

US data reports from the future #343

Closed rtwfroody closed 4 years ago

rtwfroody commented 4 years ago

Today is July 20, 9:28pm UTC.

  1. The report shows that there are fewer cases on the 20th than there were on the 19th.
  2. The report has an entry for July 21st, which has not happened yet, even in UTC. (I assume dates are in the local timezone, however.)
      "2020-07-19": {
        "cases": 3755968,
        "deaths": 132918,
        "hospitalized_current": 57247,
        "tested": 45737379,
        "icu_current": 6391,
        "hospitalized": 277378,
        "recovered": 1131121,
        "icu": 12391,
        "growthFactor": 1.02
      },
      "2020-07-20": {
        "cases": 3742236,
        "deaths": 138485,
        "growthFactor": 1
      },
      "2020-07-21": {
        "cases": 3742236,
        "deaths": 138485,
        "growthFactor": 1
      }
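
Both observations can be checked mechanically. The following is only a sketch, not part of the li codebase: `check_timeseries` is a hypothetical helper, and it assumes the report's dates are plain ISO strings mapping to objects with a `cases` field, as in the excerpt above.

```python
from datetime import date, datetime, timezone

def check_timeseries(dates, today):
    """Return (future_dates, decreases) for a {"YYYY-MM-DD": {"cases": n}} mapping."""
    future = []
    decreases = []
    previous = None  # (date string, cases) of the last entry that had a case count
    for day in sorted(dates):  # ISO dates sort chronologically as strings
        if date.fromisoformat(day) > today:
            future.append(day)
        cases = dates[day].get("cases")
        if cases is None:
            continue
        if previous is not None and cases < previous[1]:
            decreases.append((previous[0], day))
        previous = (day, cases)
    return future, decreases

# Values copied from the report excerpt above.
report = {
    "2020-07-19": {"cases": 3755968, "deaths": 132918},
    "2020-07-20": {"cases": 3742236, "deaths": 138485},
    "2020-07-21": {"cases": 3742236, "deaths": 138485},
}
# The report was read on July 20 at 21:28 UTC.
now = datetime(2020, 7, 20, 21, 28, tzinfo=timezone.utc)
future, decreases = check_timeseries(report, now.date())
print(future)     # ['2020-07-21']
print(decreases)  # [('2020-07-19', '2020-07-20')]
```
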
jzohrab commented 4 years ago

Maybe our code is so efficient that it runs in the future.

Thanks very much for this check, @rtwfroody . That future date doesn't make any sense.

A decreasing case count is suspect, but it is also possible. We can only report what the live sources provide. I've seen cases where one source is available on one day but not on another, so we fall back to the next available (lower-priority) source. Under the dates field, there is a dateSources field. Can you report what you see there in this issue?

In the meantime, I'm downloading and looking at the report. Cheers and thank you very much! It's great to have more eyes on the data. jz

jzohrab commented 4 years ago

Which country / state / locationID was this for, @rtwfroody ?

jzohrab commented 4 years ago

Some notes after grepping.

The locationID is US:

$ grep 3742236 -A 10 -B 10000 timeseries-byLocation.json | grep locationID | tail -n 1
    "locationID": "iso1:us",

The dateSources field switches from us-covidtracking to jhu-usa:

$ grep 3742236 -A 10 -B 10 timeseries-byLocation.json 
        "deaths": 132918,
        "hospitalized_current": 57792,
        "tested": 45737379,
        "icu_current": 6391,
        "hospitalized": 277378,
        "recovered": 1131121,
        "icu": 12391,
        "growthFactor": 1.02
      },
      "2020-07-20": {
        "cases": 3742236,
        "deaths": 138485,
        "growthFactor": 1
      },
      "2020-07-21": {
        "cases": 3742236,
        "deaths": 138485,
        "growthFactor": 1
      }
    },
    "dateSources": {
      "2020-01-23..2020-07-19": "us-covidtracking",
      "2020-07-20..2020-07-21": "jhu-usa"
    },

us-covidtracking has been deemed to have a higher priority than jhu-usa (jhu-usa priority = -1, covidtracking priority = 0.5), so if both are present, us-covidtracking wins.
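
To illustrate the rule, here is a minimal sketch. It is not li's actual implementation: only the two priority values come from this comment, and `pick_source` is a hypothetical helper that assumes the highest priority wins among the sources that have data for a given date.

```python
# Priorities as quoted in this comment; the rest is illustrative.
PRIORITY = {"us-covidtracking": 0.5, "jhu-usa": -1}

def pick_source(available):
    """Return the highest-priority source among those with data, or None."""
    if not available:
        return None
    return max(available, key=lambda name: PRIORITY[name])

both = pick_source(["jhu-usa", "us-covidtracking"])
fallback = pick_source(["jhu-usa"])
print(both)      # us-covidtracking
print(fallback)  # jhu-usa
```

The second call is the fallback case: when us-covidtracking has no data for a date, jhu-usa is the only candidate left.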

The dateSources shows that from 2020-01-23 to 2020-07-19, the source "us-covidtracking" was used, but for 07-20 and 07-21 jhu-usa was used. At the moment, this does not make sense to me at all. I checked the source URL for covidtracking, and it does have data for 2020-07-20; however, this may not have been the case at the time my report was run.
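
For anyone checking their own reports, a small sketch of reading that field follows. `source_for` is a hypothetical helper, and it assumes keys are either an inclusive `START..END` range, as in the output above, or a single date (the single-date form is my assumption).

```python
from datetime import date

def source_for(date_sources, day):
    """Return the source used for an ISO date, or None if no range covers it."""
    target = date.fromisoformat(day)
    for key, source in date_sources.items():
        start, _, end = key.partition("..")
        first = date.fromisoformat(start)
        last = date.fromisoformat(end) if end else first  # single-date key
        if first <= target <= last:
            return source
    return None

# Copied from the grep output above.
date_sources = {
    "2020-01-23..2020-07-19": "us-covidtracking",
    "2020-07-20..2020-07-21": "jhu-usa",
}
print(source_for(date_sources, "2020-07-19"))  # us-covidtracking
print(source_for(date_sources, "2020-07-20"))  # jhu-usa
```
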

I also can't explain why it's reported as in the future ... though I expect it has something to do with the timezone and how the jhu-usa data is "dated".

I do see that there is a problem with the way that jhu-usa is getting its data, though: if the requested date is later than the last date in the available data, it simply reports the last available data. E.g., if the last date present in the data is July 14, and we're running the scrape on July 16, it reports July 14 numbers as if they were July 16. This is completely wrong. I've opened https://github.com/covidatlas/li/issues/344 to deal with this.
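
A rough sketch of the difference, for illustration only. Neither function is taken from the jhu-usa scraper; the names are hypothetical and the case count is a placeholder, not real data.

```python
def report_latest_or_last(rows_by_date, requested):
    """Mimics the behaviour described above: fall back to the last known date."""
    last = max(rows_by_date)  # ISO date strings compare chronologically
    if requested > last:
        return rows_by_date[last]  # stale numbers presented under the wrong date
    return rows_by_date.get(requested)

def report_exact(rows_by_date, requested):
    """Stricter variant: no row for the requested date means no data."""
    return rows_by_date.get(requested)

rows = {"2020-07-14": {"cases": 100}}  # placeholder value
stale = report_latest_or_last(rows, "2020-07-16")
strict = report_exact(rows, "2020-07-16")
print(stale)   # {'cases': 100}
print(strict)  # None
```

With the stricter variant, a scrape for July 16 would produce nothing for that date instead of repeating the July 14 row.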

acertas commented 4 years ago

The issue is still present today. There is a 07/28 date in the data, which is tomorrow. Does anyone know what happened? What is the right number?

jzohrab commented 4 years ago

Haven't had time to check yet, it's on the list of things to look at soon. Cheers, jz

jzohrab commented 4 years ago

I've fixed part of this, but will now look at jhu-usa again. Thanks all for the feedback!

jzohrab commented 4 years ago

@rtwfroody , I've just merged PR #365 for jhu-usa in staging and am launching it to production as well. There will likely still be future-dated data in the reports for a few days -- I still can't explain why! -- but then I believe that PR will fix the issue: the future-dated data will be overwritten with the correct data as we progress, until everything is caught up.

jzohrab commented 4 years ago

Hi @rtwfroody, is this still an issue?

rtwfroody commented 4 years ago

Looks like this is fixed for the US. Thank you. I still see the same issue elsewhere, e.g. for France, although maybe that has to do with timezones.

France:

      "2020-08-09": {
        "cases": 222408,
        "deaths": 30202,
        "recovered": 71585,
        "growthFactor": 1
      },
      "2020-08-10": {
        "cases": 222408,
        "deaths": 30202,
        "recovered": 71585,
        "growthFactor": 1
      }

That's captured on Aug 9 16:57 PDT.
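
A quick sketch to test the timezone idea, assuming PDT is UTC-7 and that the France entries are dated in French local time (CEST, UTC+2). Both offsets are hard-coded assumptions.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")   # Pacific Daylight Time
CEST = timezone(timedelta(hours=2), "CEST")  # assumed local zone for France in August

captured = datetime(2020, 8, 9, 16, 57, tzinfo=PDT)
utc_date = captured.astimezone(timezone.utc).date()
france_date = captured.astimezone(CEST).date()
print(utc_date)     # 2020-08-09
print(france_date)  # 2020-08-10
```

So at capture time it was still Aug 9 in UTC but already Aug 10 in France, which would account for a 2020-08-10 entry existing if dates are local. It would not explain why that entry repeats the Aug 9 numbers.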
