cmu-delphi / www-covidcast

Front end for interactive visualizations powering the COVIDcast website.
https://delphi.cmu.edu/covidcast/
MIT License

"compared to previous week" percentages are high even when absolute change is low #1178

Open duanecmu opened 2 years ago

duanecmu commented 2 years ago

On the COVIDcast dashboard for Allegheny County, current deaths (relative change from 7 days ago) are displayed as a very large percentage change (at the time we took the screenshot, it was a +424.0% change in the number of deaths). @RoniRos suggested that this large number may be confusing: at first glance it appears deaths are dramatically increasing, when the number only went from 0 to 1-2 deaths. It may be less confusing for viewers to see N/A for such small changes.

Go to https://delphi.cmu.edu/covidcast/?region=42003 for Allegheny County or use any other county dashboard.

Included is a screenshot from June 1 for Allegheny County as an example. When deaths moved from 0 to 1-2, the viewer sees a jump by a huge percentage, +424% in this example.

screenshot

Rating scale: 1-2 (minor issue)

krivard commented 2 years ago

This may actually be less a small-counts issue and more a batch-reporting issue, which we already know is common in this dataset. Here's a view of the raw death counts for the same region -- it seems the actual increase in incident deaths between May 23 and May 30 is not 1-2, but 15-20. Depending on the reason for the spike on May 25, this may be a good candidate for the anomalies spreadsheet that feeds annotations in the web visualizations.

It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:

duanecmu commented 2 years ago

Added Roni so he can follow the discussion. @RoniRos

RoniRos commented 2 years ago

This may actually be less a small-counts issue and more a batch-reporting issue, which we already know is common in this dataset. Here's a view of the raw death counts for the same region -- it seems the actual increase in incident deaths between May 23 and May 30 is not 1-2, but 15-20.

Thanks! Based on the raw counts you shared:

In any case, my point is that a percentage change starting from a total 7-day count of 1 is uninformative, and arguably misleading or at least distracting. We can decide not to calculate the percentage change if the previous 7-day total is less than, say, 10. Note that the condition is only on the previous 7-day total (the denominator in the percentage calculation), not the current 7-day total.
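A minimal sketch of that rule, assuming raw 7-day totals are available to whoever computes the change. The function name, the `None` return value, and the default threshold of 10 are illustrative assumptions, not existing COVIDcast code:

```python
from typing import Optional

# Hypothetical threshold, taken from the "say, 10" suggestion above.
MIN_PREVIOUS_TOTAL = 10


def relative_change(previous_total: float, current_total: float,
                    min_previous_total: float = MIN_PREVIOUS_TOTAL) -> Optional[float]:
    """Return the fractional change, or None when the denominator is too small."""
    # Only the previous total (the denominator) is checked;
    # the current total is allowed to be any size.
    if previous_total < min_previous_total:
        return None
    return current_total / previous_total - 1.0


print(relative_change(1, 21))     # None: previous total is below the threshold
print(relative_change(100, 150))  # 0.5, i.e. +50%
```

Because a previous total of 0 is also below the threshold, the guard avoids division by zero as a side effect.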

Depending on the reason for the spike on May 25, this may be a good candidate for the anomalies spreadsheet that feeds annotations in the web visualizations.

True. But note that this is fairly orthogonal to my point. My point would have been the same if in the most recent 7 days, instead of (0,21,0,0,0,0,0), we had, say, (2,4,3,4,3,2,3).
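To illustrate why the two patterns are equivalent for this purpose: both weeks sum to 21, so the week-over-week percentage is identical regardless of how the deaths are distributed across days. The previous 7-day total of 1 used below is a hypothetical value chosen for the example, and `percent_change` is an illustrative helper:

```python
batch_week = (0, 21, 0, 0, 0, 0, 0)   # one batch-reported spike
steady_week = (2, 4, 3, 4, 3, 2, 3)   # same number of deaths spread over the week

previous_total = 1  # hypothetical previous 7-day total


def percent_change(previous: float, current_week: tuple) -> float:
    """Percentage change of the current 7-day total relative to the previous one."""
    return (sum(current_week) / previous - 1) * 100


print(sum(batch_week), sum(steady_week))            # 21 21
print(percent_change(previous_total, batch_week))   # 2000.0
print(percent_change(previous_total, steady_week))  # 2000.0
```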

It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:

  • what our choice to focus on data power users means in this case. +424% is the real actual relative change -- should we not expect data power users to be familiar with the distinction between relative and absolute change? censoring this information hides it from everyone, not just those who may be confused by it. is that fair to people who know what they're looking for?

Actually, I prefer not to censor counts, merely to avoid displaying percentages when they are based on a small-count denominator.

  • if we do go ahead with censoring this information: whether to censor the figure (top row) or change since last week (bottom row) or both

Definitely not censor the figure (top row). That figure is based on the current 7-day total, which may actually be quite large. But even if it's small, I wouldn't censor it.

  • whether population or raw count or both should be the determining factor, and what the thresholds should be

Raw count, and I suggest <10. I don't think population size is very relevant to this issue, except that low-pop counties are more likely to have low raw counts.

  • what to display instead (I recommend against "NA" because we're already using NA here to mean unavailable, as opposed to un-meaningful or confusing)

I agree, and suggest something like "Small Counts", maybe in a two-line, tiny font like the one we use for "per 100k". This will hopefully become recognizable as an icon that means "not calculated because small counts make this value uninformative".
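A rough sketch of how a display layer could keep the two cases apart, with "N/A" reserved for missing data and "Small Counts" for a denominator below the threshold. The function name, the threshold, and the use of raw totals as inputs are assumptions for illustration, not the dashboard's actual code:

```python
from typing import Optional

UNAVAILABLE_LABEL = "N/A"            # already used to mean "data unavailable"
SMALL_COUNTS_LABEL = "Small Counts"  # wording suggested above
MIN_PREVIOUS_TOTAL = 10              # hypothetical threshold


def format_change(previous_total: Optional[float],
                  current_total: Optional[float]) -> str:
    """Format the change since last week, or a label explaining why it is not shown."""
    if previous_total is None or current_total is None:
        return UNAVAILABLE_LABEL
    if previous_total < MIN_PREVIOUS_TOTAL:
        # Percentage would be based on a small denominator, so it is not calculated.
        return SMALL_COUNTS_LABEL
    change = current_total / previous_total - 1.0
    return f"{change * 100:+.1f}%"


print(format_change(200, 150))   # -25.0%
print(format_change(4, 21))      # Small Counts
print(format_change(None, 21))   # N/A
```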

krivard commented 2 years ago

It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:

  • what our choice to focus on data power users means in this case. +424% is the real actual relative change -- should we not expect data power users to be familiar with the distinction between relative and absolute change? censoring this information hides it from everyone, not just those who may be confused by it. is that fair to people who know what they're looking for?

Actually, I prefer not to censor counts, merely to avoid displaying percentages when they are based on a small-count denominator.

I was talking about censoring any information, not just counts. I don't understand how avoiding displaying percentages is different from censoring those percentages. If the distinction is important to you, could you explain?

  • whether population or raw count or both should be the determining factor, and what the thresholds should be

Raw count, and I suggest <10. I don't think population size is very relevant to this issue, except that low-pop counties are more likely to have low raw counts.

I've looked into what it would take to do this, and we have a few options. The change since last week display is based on the covidcast/trend endpoint of the Epidata API, which gives output that looks like this:

# Query: https://api.covidcast.cmu.edu/epidata/covidcast/trend?
#  signal=jhu-csse:deaths_7dav_incidence_prop
#  &geo=nation:*
#  &date=20220612
#  &basis_shift=7
#  &window=20220213-20220613
{
    "geo_type": "nation",
    "geo_value": "us",
    "date": 20220612,
    "value": 0.1152074,
    "basis_date": 20220605,
    "basis_value": 0.080945,
    "basis_trend": "increasing",
    "min_date": 20220603,
    "min_value": 0.0756771,
    "min_trend": "increasing",
    "max_date": 20220213,
    "max_value": 0.7274768,
    "max_trend": "decreasing"
}

The above is taken from the actual query performed by the frontend in determining the "change since last week" for deaths and results in "+42.3%" (=value/basis_value-1). Since it queries 7dav prop and not raw counts, we could:

pinging @sgratzl to weigh in
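For reference, the "+42.3%" figure can be reproduced from the `value` and `basis_value` fields of the trend response quoted above. This is only a check of the arithmetic; the variable names are illustrative and no request is made to the API:

```python
import json

# The two fields of the trend response that enter the calculation.
response_text = '{"value": 0.1152074, "basis_value": 0.080945}'

trend = json.loads(response_text)

# Relative change against the basis date (7 days earlier in the query above).
change = trend["value"] / trend["basis_value"] - 1

print(f"{change * 100:+.1f}%")  # +42.3%
```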

RoniRos commented 2 years ago

Revisiting this issue.

I understand and appreciate the overhead incurred by these solutions. I am not happy about it, but am also not happy about letting "+424.0%" stand; it doesn't reflect well on our system.

Since the covidcast/trend endpoint is meant to calculate and communicate about trends, and going from 1 to 3 is not quite a trend in the way that going from 300 to 900 is, I think your second option is generally the preferred one: trend should know when ratios are based on small counts and are therefore unreliable, and should signal this condition appropriately, maybe using a special non-numeric value.
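One possible shape for this, sketched under several assumptions: that the trend computation has access to raw counts (the example query above uses a 7dav prop signal, so this would need extra data), that a `"small_counts"` string is an acceptable special value, and that a 10% band separates "steady" from a real change. None of these reflect how the endpoint actually classifies trends:

```python
SMALL_COUNTS = "small_counts"  # hypothetical non-numeric marker


def classify_trend(basis_count: float, current_count: float,
                   min_basis_count: float = 10,
                   steady_band: float = 0.1) -> str:
    """Classify a change, flagging ratios built on a small basis count."""
    if basis_count < min_basis_count:
        # The ratio would be unreliable, so report that instead of a direction.
        return SMALL_COUNTS
    change = current_count / basis_count - 1
    if change > steady_band:
        return "increasing"
    if change < -steady_band:
        return "decreasing"
    return "steady"


print(classify_trend(1, 3))      # small_counts
print(classify_trend(300, 900))  # increasing
```

Both examples triple the count, but only the second is reported as a trend.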

I understand this is not trivial to do right. Let's let this issue sleep until we have to revamp related code for other needs, too.