Create "diffs" datasets

ccodwg / CovidTimelineCanada

A definitive dataset for COVID-19 in Canada

https://opencovid.ca/

Create "diffs" datasets #20

Closed jeanpaulrsoucy closed 12 months ago

jeanpaulrsoucy commented 2 years ago

It would be beneficial to have a parallel set of datasets called "diffs" which report the change for each PT/HR since the last time the data were updated. This would be useful now that many regions no longer update daily and, when they do update, the new data are for dates in the past.

This could be incorporated quite simply into the update process. For each dataset, there would be a parallel dataset with the following format:

Most recent date & value for each PT/HR
Previous most recent date & value for each PT/HR
The difference between these two values

When running the update script, a function would check if the newest date for the PT/HR was later than the current most recent date. If so, the current most recent date & value would move into the "previous most recent" slot, and the newest date & value would become the new most recent date & value.
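The update rule described above can be sketched as follows. This is only an illustration in Python, not the project's actual code (which appears to be written in R); the `DiffRecord` class, its field names, and the `update_diff` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DiffRecord:
    """Hypothetical row of a "diffs" dataset for one PT/HR."""
    region: str
    date_current: date
    value_current: int
    date_previous: Optional[date] = None
    value_previous: Optional[int] = None

    @property
    def value_diff(self) -> Optional[int]:
        # The difference is undefined until a previous value exists.
        if self.value_previous is None:
            return None
        return self.value_current - self.value_previous


def update_diff(record: DiffRecord, new_date: date, new_value: int) -> DiffRecord:
    """Rotate current -> previous only when the new date is strictly later."""
    if new_date <= record.date_current:
        return record  # nothing newer: keep the record unchanged
    return DiffRecord(
        region=record.region,
        date_current=new_date,
        value_current=new_value,
        date_previous=record.date_current,
        value_previous=record.value_current,
    )


# Example: a region that reports weekly.
rec = DiffRecord("ON", date(2022, 5, 1), 100)
rec = update_diff(rec, date(2022, 5, 8), 130)
print(rec.value_diff)
```

A freshly created record has no previous value, which is why most fields would stay empty until each region has reported at least twice.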

rogadev commented 2 years ago

One common trend across all provinces that I noticed was that data was not being updated over the weekend. We would tend to see a large spike every Monday for all provinces as numbers came in.

Something tangential that may be helpful is a 7-day trend. This would average out gaps in reporting to paint the picture of how case numbers were trending. I know I would use this in my app if it was available on the /summary response.
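As a rough illustration of the 7-day trend idea, here is a minimal Python sketch of a trailing 7-day mean. It assumes the input is a list of daily new-case counts with zeros on days without a report; the function name `rolling_mean` is hypothetical and nothing here reflects the actual /summary response.

```python
from collections import deque


def rolling_mean(daily_values, window=7):
    """Trailing mean over `window` days; None until a full window is available."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in daily_values:
        if len(buf) == window:
            total -= buf[0]  # value about to fall out of the window
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out


# A week of cases reported in a single spike, then a second spike a week later.
daily = [0, 0, 70, 0, 0, 0, 0, 0, 0, 70]
print(rolling_mean(daily))
```

Because every 7-day window in this example contains exactly one reporting day, the spikes flatten into a constant trend once the first full window is reached.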

jeanpaulrsoucy commented 2 years ago

Potentially unexpected behaviour from diffs dataset, explained below (with potential solution):

The only potentially unexpected behaviour occurs when a dataset (or part of a dataset) is updated but the date does not advance (i.e., only historical information is updated). This sometimes happens with the vaccine datasets from PHAC, as the numbers are only updated every few weeks but old numbers are sometimes revised on off-weeks. For now, the behaviour counts this as a diff with 0 days between dates, which should be interpreted as an update of historical data. The alternative would be to replace only the current data value and simply redo the diff calculation, but it is quite possible that the previous week's value has changed as well, rendering the calculation unreliable. I suppose a potential solution would be to also update the previous value using the new dataset, but this might be a little much.
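To make the "0 days between dates" interpretation concrete, here is a small hedged sketch in Python. The function `describe_diff` and the keys it returns are hypothetical and are not the column names of the actual diffs datasets.

```python
from datetime import date


def describe_diff(date_previous, date_current, value_previous, value_current):
    """Summarize one diff; a 0-day gap is flagged as a historical revision."""
    days = (date_current - date_previous).days
    return {
        "days": days,
        "value_diff": value_current - value_previous,
        "historical_revision": days == 0,
    }


# The date did not advance, but the value changed: an off-week revision.
print(describe_diff(date(2022, 6, 5), date(2022, 6, 5), 5000, 5012))
```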

jeanpaulrsoucy commented 2 years ago

Now that the diffs datasets are added, we have to wait a few weeks for most of the fields to become populated. I must also add documentation of the diffs datasets to the README.

jeanpaulrsoucy commented 1 year ago

The function creating the "diffs" dataset should report the geographic regions at issue when aborting the update:

2: In doTryCatch(return(expr), name, parentenv, handler) :
  data/hr/cases_hr.csv: Geographic units have changed, aborting diff...
3: In doTryCatch(return(expr), name, parentenv, handler) :
  data/pt/cases_pt.csv: Geographic units have changed, aborting diff...
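
One way the warning could name the regions at issue is to compare the old and new sets of geographic units. The Python sketch below is only an illustration: the project's own function appears to be written in R, and the helper names and the extended message format are hypothetical.

```python
def changed_units(old_units, new_units):
    """Return (removed, added) geographic units, each sorted for stable output."""
    old, new = set(old_units), set(new_units)
    return sorted(old - new), sorted(new - old)


def abort_message(path, old_units, new_units):
    """Build a warning naming the changed units, or None if nothing changed."""
    removed, added = changed_units(old_units, new_units)
    if not removed and not added:
        return None
    return (
        f"{path}: Geographic units have changed "
        f"(removed: {removed}; added: {added}), aborting diff..."
    )


# Made-up unit codes, purely for illustration.
print(abort_message("data/pt/cases_pt.csv", ["AB", "BC", "ON"], ["AB", "BC", "QC"]))
```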

This issue can be closed if the above functionality is added.

jeanpaulrsoucy commented 1 year ago

Another bug with the diffs dataset: when a bug or a change in the data source results in the new date being older than the existing date, the diff gives strange results. Perhaps if this occurs, the diff should be recalculated or ignored.
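A guard for this case might look like the Python sketch below. It is an assumption about how the check could be structured, not the project's implementation; the function name `classify_update` and the returned labels are hypothetical.

```python
from datetime import date


def classify_update(current_date, new_date):
    """Decide what to do with an incoming date relative to the stored one."""
    if new_date < current_date:
        # Date moved backwards: likely a source bug or change, so skip the diff.
        return "ignore"
    if new_date == current_date:
        return "revision"  # same date: historical data may have been revised
    return "advance"  # later date: rotate current into previous as usual


print(classify_update(date(2022, 9, 10), date(2022, 9, 3)))
```

Returning a label rather than silently skipping leaves room for either of the options mentioned above, ignoring the diff or triggering a recalculation.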

jeanpaulrsoucy commented 12 months ago

Short of updating mangled diffs after major historical data updates, this issue can be considered completed.