DFscript / Covid_19_data_visualisation

Improve covid 19 data transparency for better decision making

Accumulated bar plot shouldn't drop off #19

Closed gooney47 closed 4 years ago

gooney47 commented 4 years ago

There should be no drop in cases in the accumulated plot. Take accumulated Bayern, for example: it peaks at 517 and then drops back down to 325. I'm pretty sure the underlying data_set_accumulated.csv has proper values, but I think they aren't propagated properly.
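The expectation can be written down as a check: an accumulated series must never be lower than it was the day before. This is a minimal sketch, assuming pandas; the helper name `first_drop` and the day labels are hypothetical, and only the values 517 and 325 come from the report above (the other numbers are made up).

```python
import pandas as pd


def first_drop(accumulated: pd.Series):
    """Return the index label of the first value lower than its predecessor, or None."""
    diffs = accumulated.diff()
    drops = diffs[diffs < 0]
    return drops.index[0] if not drops.empty else None


# Toy values: only 517 and 325 come from the report, the rest are made up.
bayern = pd.Series([120, 310, 517, 325], index=["day1", "day2", "day3", "day4"])

print(bayern.is_monotonic_increasing)  # False: the series drops at the end
print(first_drop(bayern))              # day4
```

Running a check like this directly on the CSV and again on the data handed to the plot would show at which stage the drop first appears.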

Duragtan commented 4 years ago

I can now confirm that the data is already flawed when it comes out of the `filter_data_set` function.

```
filter_data_set(acc_new=True)[0]

2020-03-18 3473.106965 811.321207 9 2036
2020-03-19 3571.492214 829.568217 14 2401
2020-03-20 2693.451511 631.157273 19 1874
```

Before that it is somewhat hard to track because the numbers are only being computed from various different entries in that function, but I will keep trying.

Also, I am still puzzled why we have a separate data set for accumulated data in the first place. Shouldn't it be possible to compute the accumulated data from the per-day cases easily enough?

Duragtan commented 4 years ago

I can compute the accumulated cases from the non-accumulated data easily enough:

```
# Load the per-day (non-accumulated) cases
df = read_cases_data(acc_new=False)
# Keep Bayern only and sum all of its rows per day
df_bayern = df[df["country"] == "Bayern"].groupby("timestamp").sum()
# Running total of the daily infections
df_bayern["infected_cum"] = df_bayern["infected"].cumsum()
df_bayern

2020-03-18 00:00:00+00:00 3473.106965 811.321207 ... 517 2168
2020-03-19 00:00:00+00:00 3571.492214 829.568217 ... 467 2635
2020-03-20 00:00:00+00:00 2693.451511 631.157273 ... 325 2960
```

But without knowing how the accumulated-data CSV came into being and why it is there, I am stuck hunting this error.

gooney47 commented 4 years ago

https://github.com/DFscript/Covid_19_data_visualisation/blob/d3011ef0a1c28782f06817541063650035e4abf6/frontend/application.py#L425 This and the following lines show how the CSVs were created.

There is no particular reason why there is an accumulated one. But as I said at the beginning, I'm pretty sure that the error is not in the CSVs but in the processing of them.

gooney47 commented 4 years ago

Now that I look at those groupbys again, I see that there is actually an error: the first grouping is for grouping the counties, and the second should be grouping the countRies.
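For illustration, the intended two-step grouping could look roughly like the sketch below. This is not the code in application.py: the `county` column name, the county labels and all numbers are assumptions made up for the example, and only `country`, `timestamp` and `infected` appear elsewhere in this thread.

```python
import pandas as pd

# Toy per-day data; county names and numbers are made up for illustration.
df = pd.DataFrame({
    "country": ["Bayern", "Bayern", "Bayern", "Bayern", "Hessen", "Hessen"],
    "county": ["A", "B", "A", "B", "C", "C"],
    "timestamp": pd.to_datetime([
        "2020-03-18", "2020-03-18", "2020-03-19",
        "2020-03-19", "2020-03-18", "2020-03-19",
    ]),
    "infected": [5, 3, 2, 4, 1, 6],
})

# Step 1: group by county and accumulate the daily cases in date order.
df = df.sort_values(["county", "timestamp"])
df["infected_cum"] = df.groupby("county")["infected"].cumsum()

# Step 2: group by country (not county) and day, summing the county totals.
per_country = df.groupby(["country", "timestamp"], as_index=False)["infected_cum"].sum()
print(per_country)
```

If the second step grouped by county again, it would leave the per-county rows unmerged instead of producing one row per country and day.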

gooney47 commented 4 years ago

But that shouldn't be the source of the error, since the map circles actually grow consistently. The error occurs only in the bar plot.

Duragtan commented 4 years ago

I have fixed a typo in the generation of the cases CSV files (county -> country) in line 438, and after updating the CSV files with this fix, the problem vanished.

However, some questions remain open for me:

  1. After lines 437-438 the data should imho be sorted 1. by county and 2. by timestamp. Before the fix, the data was sorted by county and timestamp a second time, which should have no effect at all. Sorting that same data by country and timestamp afterwards should move entire "county-blocks" around but not alter the sequence of rows within a county, provided that every county belongs entirely to one country. So the subsequent groupby county and cumsum should be unaffected. But apparently I am missing something here.
  2. Why was only the display of the bar chart distorted but not the bubbles on the map?
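The reasoning in point 1 can be checked on toy data. This is a minimal sketch, assuming pandas, a `county` column, and one row per county and day; the helper `county_cumsum` and all values are made up for illustration and are not taken from application.py.

```python
import pandas as pd

# Toy per-day data; county names and numbers are made up for illustration.
df = pd.DataFrame({
    "country": ["Bayern", "Bayern", "Bayern", "Bayern", "Hessen", "Hessen"],
    "county": ["A", "B", "A", "B", "C", "C"],
    "timestamp": pd.to_datetime([
        "2020-03-18", "2020-03-18", "2020-03-19",
        "2020-03-19", "2020-03-18", "2020-03-19",
    ]),
    "infected": [5, 3, 2, 4, 1, 6],
})


def county_cumsum(frame: pd.DataFrame) -> pd.Series:
    """Per-county running total, re-aligned to the original row labels."""
    return frame.groupby("county")["infected"].cumsum().sort_index()


by_county = df.sort_values(["county", "timestamp"])
by_country = by_county.sort_values(["country", "timestamp"])

# Both orderings keep each county's rows in date order, so the totals agree.
print(county_cumsum(by_county).equals(county_cumsum(by_country)))  # True
```

On this toy data the per-county totals are identical under both sort orders, which is consistent with point 1 and suggests the sort order alone would not explain the drop.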

What I have not tested is whether regenerating the CSV files without first applying the patch (county -> country in line 438) would have fixed the error anyway. So maybe part of my confusion is due to a misconception of cause and effect?! This would, however, still not explain point 2.

Anyway, since the problem disappeared, I will close this bug.

gooney47 commented 4 years ago

The answer to both of your questions is that the way the bar plot is generated from the CSV is inaccurate. This should probably be investigated.