Priesemann-Group / covid19_inference

Bayesian python toolbox for inference and forecast of the spread of the Coronavirus
GNU General Public License v3.0

More cleanup data retrieval #13

Closed semohr closed 4 years ago

semohr commented 4 years ago

Fixes #8

that every filter function accepts datetime.datetime objects as begin and end date, and not strings

That the output of the JHU has datetime.datetime objects as index. I think that is what the to-iso function currently does.

  • The output dataframe was transposed somehow; because of that, the __to_iso function and the sums did not work as expected. This should be fixed now.

    that the filter functions always return the new daily cases, and not the cumulative ones. That change would involve calculating the differences in the JHU dataset.

  • Added methods to get the new daily cases, source.getnew* (confirmed, recovered, deaths), computed with df.diff.

The data should then exclude the date date_begin and include the date date_end, so that its length is (date_end - date_begin).days.

  • Not too sure what you mean by this, do you want to calculate the summed cases in the interval [begin_date, end_date]? With a reference to the period of time of the summed up cases (date_end - date_begin) in days? I could create a method that does this.
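The interval convention in the quoted requirement (date_begin excluded, date_end included) can be checked with the standard library alone. This is a minimal sketch; the helper name `dates_in_interval` is hypothetical and is not part of the toolbox.

```python
from datetime import datetime, timedelta
from typing import List


def dates_in_interval(date_begin: datetime, date_end: datetime) -> List[datetime]:
    """Daily dates in (date_begin, date_end]: begin excluded, end included.

    Hypothetical helper for illustration only.
    """
    n_days = (date_end - date_begin).days
    # Start at offset 1 so that date_begin itself is left out.
    return [date_begin + timedelta(days=i) for i in range(1, n_days + 1)]


date_begin = datetime(2020, 3, 1)
date_end = datetime(2020, 3, 10)
dates = dates_in_interval(date_begin, date_end)

# The number of returned dates equals the day difference of the bounds.
assert len(dates) == (date_end - date_begin).days
```

With this convention the first returned date is the day after date_begin, which matches a series of daily differences whose first entry has been dropped.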

Additionally, I added a method to the JHU source that returns a list of all available countries and states.

jdehning commented 4 years ago

Not too sure what you mean by this, do you want to calculate the summed cases in the interval [begin_date, end_date]? With a reference to the period of time of the summed up cases (date_end - date_begin) in days? I could create a method that does this.

I think if you use df.diff, the number of rows of the table decreases by one. Then there is the question of how we deal with that, and also of how we stay reasonably consistent in what the cumulative and the new_* functions return. I would argue that the most sensible option is to exclude the first date from the returned rows: the new cases are simply calculated by taking the difference between neighbouring rows (df.diff), and the first date index is excluded from the results. I don't know exactly what your current implementation does, or how df.diff behaves in that respect. For the cumulative function, I would then also exclude the first row of the results. Otherwise it looks good.
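As a quick check of how df.diff behaves here: pandas keeps the number of rows and fills the first row with NaN, so excluding the first date is an explicit extra step. The sketch below uses a made-up cumulative series; the column name "confirmed" and the numbers are assumptions for illustration, and this is not the toolbox's actual implementation.

```python
import datetime

import pandas as pd

# Made-up cumulative counts with datetime objects as index (illustrative data).
index = pd.date_range(datetime.datetime(2020, 3, 1), periods=5, freq="D")
cumulative = pd.DataFrame({"confirmed": [10, 15, 15, 22, 30]}, index=index)

# diff() keeps the shape; the first row has no predecessor and becomes NaN.
diffed = cumulative.diff()

# Exclude the first date from both results so they cover the same dates.
new_cases = diffed.iloc[1:]
cumulative_trimmed = cumulative.iloc[1:]
```

Dropping the first row from both frames keeps the cumulative and the new-cases output aligned on the same dates, which is the consistency asked for above.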

semohr commented 4 years ago

The current implementation of the new_* functions should be doing exactly what you described. :thumbsup: