Open duanecmu opened 2 years ago

On the COVIDcast dashboard for Allegheny County, the current deaths value (relative change from 7 days ago) is displayed as a very large percentage change; at the time we took the screenshot it was a +424.0% change in the number of deaths. @RoniRos suggested this large number may be confusing: at first glance it appears deaths are increasing dramatically, when the count only went from 0 to 1-2 deaths. It may be less confusing for viewers to see N/A for such small changes.

Go to https://delphi.cmu.edu/covidcast/?region=42003 for Allegheny County, or use any other county dashboard.

A screenshot from June 1 for Allegheny County is included as an example. When deaths moved from 0 to 1-2, the viewer sees the change jump to a huge number like +424%.

Rating scale 1-2: minor issue
This may actually be less a small-counts issue and more a batch-reporting issue, which we already know is common in this dataset. Here's a view of the raw death counts for the same region -- it seems the actual increase in incident deaths between May 23 and May 30 is not 1-2, but 15-20. Depending on the reason for the spike on May 25, this may be a good candidate for the anomalies spreadsheet that feeds annotations in the web visualizations.
It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:

- what our choice to focus on data power users means in this case. +424% is the actual relative change -- should we not expect data power users to be familiar with the distinction between relative and absolute change? Censoring this information hides it from everyone, not just those who may be confused by it. Is that fair to people who know what they're looking for?
- if we do go ahead with censoring this information: whether to censor the figure (top row), the change since last week (bottom row), or both
- whether population or raw count or both should be the determining factor, and what the thresholds should be
- what to display instead (I recommend against "NA" because we're already using NA here to mean unavailable, as opposed to un-meaningful or confusing)
Added Roni so he can follow the discussion. @RoniRos
> This may actually be less a small-counts issue and more a batch-reporting issue, which we already know is common in this dataset. Here's a view of the raw death counts for the same region -- it seems the actual increase in incident deaths between May 23 and May 30 is not 1-2, but 15-20.
Thanks! Based on the raw counts you shared:
In any case, my point is that a percent change starting from a 7-day total of 1 is uninformative, and arguably misleading or at least distracting. We can decide not to calculate the percent change if the previous 7-day total is less than, say, 10. Note that the condition is only on the previous 7-day total (the denominator in the percent-change calculation), not the current 7-day total.
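A minimal sketch of that rule, assuming it operates on raw 7-day totals (the names and exact threshold here are placeholders, not existing dashboard code):

```python
def percent_change(prev_7day_total, curr_7day_total, min_denominator=10):
    """Relative change between two 7-day totals, or None when the previous
    total (the denominator) is too small to be informative."""
    if prev_7day_total < min_denominator:
        return None  # a ratio over a tiny denominator is misleading; suppress it
    return curr_7day_total / prev_7day_total - 1

# A previous 7-day total of 1 (as in this example) would be suppressed
# instead of rendering as something like +424%:
print(percent_change(1, 5.24))    # None
print(percent_change(300, 900))   # 2.0, i.e. +200%
```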
> Depending on the reason for the spike on May 25, this may be a good candidate for the anomalies spreadsheet that feeds annotations in the web visualizations.
True. But note that this is fairly orthogonal to my point. My point would have been the same if, in the most recent 7 days, instead of (0, 21, 0, 0, 0, 0, 0) we had, say, (2, 4, 3, 4, 3, 2, 3).
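As a quick check (illustrative only; the previous 7-day total of 1 is the one from this example):

```python
prev_week_total = 1  # previous 7-day total from this example

for current_week in [(0, 21, 0, 0, 0, 0, 0), (2, 4, 3, 4, 3, 2, 3)]:
    total = sum(current_week)             # 21 in both cases
    change = total / prev_week_total - 1  # identical ratio either way
    print(current_week, total, f"{change:+.0%}")
```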
> It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:
> - what our choice to focus on data power users means in this case. +424% is the actual relative change -- should we not expect data power users to be familiar with the distinction between relative and absolute change? Censoring this information hides it from everyone, not just those who may be confused by it. Is that fair to people who know what they're looking for?
Actually, I prefer not to censor counts, merely to avoid displaying percentages when they are based on a small-count denominator.
> - if we do go ahead with censoring this information: whether to censor the figure (top row), the change since last week (bottom row), or both
Definitely do not censor the figure (top row). That figure is based on the current 7-day total, which may actually be quite large. But even if it's small, I wouldn't censor it.
> - whether population or raw count or both should be the determining factor, and what the thresholds should be
Raw count, and I suggest <10. I don't think population size is very relevant to this issue, except that low-pop counties are more likely to have low raw counts.
> - what to display instead (I recommend against "NA" because we're already using NA here to mean unavailable, as opposed to un-meaningful or confusing)
I agree, and suggest something like "Small Counts", maybe in a two-line, tiny font like the one we use for "per 100k". This will hopefully become recognizable as an icon that means "not calculated because small counts make this value uninformative".
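A minimal sketch of how the card could render that, assuming the percentage has already been suppressed upstream (function and label names here are placeholders, not actual dashboard code):

```python
def format_relative_change(change):
    """Render the 'change since last week' cell, falling back to the
    small-counts placeholder when the percentage was suppressed upstream."""
    if change is None:
        return "Small Counts"  # analogous to the small "per 100k" label
    return f"{change:+.1%}"

print(format_relative_change(4.24))  # "+424.0%"
print(format_relative_change(None))  # "Small Counts"
```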
> It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:
> - what our choice to focus on data power users means in this case. +424% is the actual relative change -- should we not expect data power users to be familiar with the distinction between relative and absolute change? Censoring this information hides it from everyone, not just those who may be confused by it. Is that fair to people who know what they're looking for?
>
> Actually, I prefer not to censor counts, merely to avoid displaying percentages when they are based on a small-count denominator.
I was talking about censoring any information, not just counts. I don't understand how avoiding displaying percentages is different from censoring those percentages. If the distinction is important to you, could you explain?
> - whether population or raw count or both should be the determining factor, and what the thresholds should be
>
> Raw count, and I suggest <10. I don't think population size is very relevant to this issue, except that low-pop counties are more likely to have low raw counts.
I've looked into what it would take to do this, and we have a few options. The "change since last week" display is based on the `covidcast/trend` endpoint of the Epidata API, which gives output that looks like this:
```
# Query: https://api.covidcast.cmu.edu/epidata/covidcast/trend?
#   signal=jhu-csse:deaths_7dav_incidence_prop
#   &geo=nation:*
#   &date=20220612
#   &basis_shift=7
#   &window=20220213-20220613
{
  "geo_type": "nation",
  "geo_value": "us",
  "date": 20220612,
  "value": 0.1152074,
  "basis_date": 20220605,
  "basis_value": 0.080945,
  "basis_trend": "increasing",
  "min_date": 20220603,
  "min_value": 0.0756771,
  "min_trend": "increasing",
  "max_date": 20220213,
  "max_value": 0.7274768,
  "max_trend": "decreasing"
}
```
The above is taken from the actual query the frontend performs to determine the "change since last week" for deaths, and it results in "+42.3%" (= value / basis_value - 1). Since it queries the 7dav prop signal and not raw counts, we could:

- have `covidcast/trend` suppress or flag responses where the raw-counts signal corresponding to the basis value is below the threshold (a rough sketch follows this list).
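A rough sketch of that option, assuming the server can look up the raw-count 7-day total behind the basis value (the helper, field names, and threshold below are placeholders, not existing Epidata API code):

```python
SMALL_COUNT_THRESHOLD = 10  # cutoff suggested earlier in this thread

def trend_with_small_count_flag(trend, basis_raw_count):
    """Given a covidcast/trend-style response dict and the raw-count 7-day
    total behind its basis value, attach either the relative change or a
    small-count flag instead of a misleading percentage."""
    if basis_raw_count < SMALL_COUNT_THRESHOLD:
        return {**trend, "change": None, "flag": "small_count"}
    return {**trend, "change": trend["value"] / trend["basis_value"] - 1}

# With the nation-level response above (and a hypothetical raw-count basis):
trend = {"value": 0.1152074, "basis_value": 0.080945}
print(trend_with_small_count_flag(trend, basis_raw_count=2500)["change"])  # ~0.423
```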
The complication is how the API relates the requested signal to its raw-counts counterpart: signal relationships currently surface through the `covidcast/meta` endpoint, and correct functioning of Query-Time Computations (JIT) relies on the base name of cases/deaths signals being cumulative. We could store another signal relationship, with loads of options for how to do that -- some would impact additional parts of the frontend through the resulting changes to `covidcast/meta` output, others would require separate code, special-casing logic, etc. Also note that `covidcast/trend` is not a publicly documented endpoint.

Pinging @sgratzl to weigh in.
Revisiting this issue.
I understand and appreciate the overhead incurred by these solutions. I am not happy about it, but am also not happy about letting "+424.0%" stand; it doesn't reflect well on our system.
Since the `covidcast/trend` endpoint is meant to calculate and communicate about trends, and going from 1 to 3 is not quite a trend in the way that going from 300 to 900 is, I think your second option is generally the preferred one: `trend` should know when ratios are based on small counts and are therefore unreliable, and should signal this condition appropriately, maybe using a special non-numeric value.
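For illustration only, one shape such a flagged response could take (none of these field values are produced by the current endpoint; they are hypothetical):

```python
# Hypothetical trend response when the basis is built on small counts:
small_count_trend = {
    "geo_type": "county",
    "geo_value": "42003",          # Allegheny County
    "date": 20220601,
    "value": None,                 # ratio withheld rather than reported as +424%
    "basis_trend": "small_count",  # a special non-numeric marker
}
print(small_count_trend)
```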
I understand this is not trivial to do right. Let's let this issue sleep until we have to revamp related code for other needs, too.