episphere / mortalitytracker

tracking causes of death from CDC data APIs

Combined State totals for excess mortality might be skewed by negative totals #11

Closed djhopkins2 closed 4 years ago

djhopkins2 commented 4 years ago

Looking through the data for each state, I've noticed several (Ohio is a good example) where, when mortality drops below the average, the summed "additional total" drops or even goes negative. Thinking logically about this, that seems undesirable if you want to compare that total with the reported covid totals to see if there's a discrepancy in the reported death counts. Also, when adding this to the US total, any state that decreases the sum for that period will also negate some of the additional deaths in the US total. Obviously, this excess mortality data isn't a direct count of covid-related deaths, but it provides insight into deaths that might have been missed in the official tallies or possible underreporting of deaths.

What I was thinking is that it would be worthwhile to increase the additional deaths sum for each state only when the 2020 deaths are above the average for that time period. Anything below the average would just be ignored instead of subtracted. That way you're capturing abnormal spikes only. Then the individual state totals could be summed to provide the US total. Maybe it's worth providing an extra plot for the above-average-only total if you don't want to lose how the current additional total is displayed.

I am thinking about forking this repo and testing the change just to see how it affects the results. I mostly noticed this issue because some of the earlier states were missing deaths at first, but eventually they got their testing and reporting squared away, and the death data they release now seems to match the excess deaths pretty closely. However, now that their deaths are dropping, and sometimes going below average, it seems to be masking the excess from some of the states that have not been known for their good data reporting.

jonasalmeida commented 4 years ago

Thank you for your thoughts and suggestions @djhopkins2 !

Indeed, your suggestion is spot on: some states are missing data to the point where one has to consider them incomplete. Four of them (NC, CT, PA, ND) are actually removed from the counts. You can change that by clicking on "Include states with incomplete records":

[Screenshot, 2020-06-30]

There is some talk of letting a numerical trigger decide on the optional exclusion. I've added your suggestion to the list being considered.

And yes, the forked tool should work on your side just as is. Please let us know if that is not the case.

cheers!

Jonas

djhopkins2 commented 4 years ago

I may not have been clear in my first post in this issue, sorry. I wasn't talking about states with missing data. I'm particularly calling out Florida and the questions regarding their data (the data scientist who was fired for not tweaking the data while creating their dashboard). They have data in this dataset.

That's beside the point. I noticed an issue with the excess "additional death" cumulative sum. If you look at individual state data, you'll notice that when the actual deaths drop below the historical average, the summation that accumulates the total "additional deaths" adds the negative difference and decreases the total. An example would be Ohio. I'm arguing that this might skew the total, since arguably we're looking for abnormal death spikes above the average, and weeks below the average shouldn't "erase" accumulated abnormal spikes. You're not going to decrease your accumulated total of outlier deaths when a week goes below average.

Ohio shows this: the total goes negative, and if you sum that into the US, it drops the US total by easily 1000.

[image]

Oklahoma, West Virginia, and Puerto Rico are other examples where below-average data points are skewing the "Excess" total.

I found where it looks like this calculation is occurring. I'm tagging it for later in case I decide to fork this and experiment: Excess.js line 264 and https://github.com/episphere/mortalitytracker/blob/1bb9631dfaa6e3004fafd3dbf454875cffb12eee/deathtracker.js#L606 What I'm looking to change is effectively clipping this calculation, so that if dataFor2020ForCause[week] is less than averageForOtherYearsPerWeek[week], the negative difference is not counted and that week's excess is clipped at zero.
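To make the proposed change concrete, here is a minimal sketch of the clipping, not the tracker's actual code. The array names mirror the identifiers mentioned above, but the helper `cumulativeExcess`, the `clipAtZero` flag, the assumption that both series are plain arrays indexed by week, and the numbers themselves are all made up for illustration.

```javascript
// Hypothetical helper: accumulates weekly excess deaths into a running total.
// Assumes both inputs are plain arrays indexed by week (an assumption, not
// necessarily how the tracker stores its data).
function cumulativeExcess(dataFor2020ForCause, averageForOtherYearsPerWeek, clipAtZero) {
  let runningTotal = 0;
  return dataFor2020ForCause.map((deaths, week) => {
    const diff = deaths - averageForOtherYearsPerWeek[week];
    // With clipping, a below-average week contributes 0 instead of a negative value.
    runningTotal += clipAtZero ? Math.max(0, diff) : diff;
    return runningTotal;
  });
}

// Made-up numbers for illustration only (not CDC data).
const dataFor2020ForCause = [100, 130, 120, 90, 85];
const averageForOtherYearsPerWeek = [100, 100, 100, 100, 100];

const unclipped = cumulativeExcess(dataFor2020ForCause, averageForOtherYearsPerWeek, false);
const clipped = cumulativeExcess(dataFor2020ForCause, averageForOtherYearsPerWeek, true);

console.log(unclipped); // [0, 30, 50, 40, 25]
console.log(clipped);   // [0, 30, 50, 50, 50]
```

In the unclipped series the last two below-average weeks pull the running total from 50 back down to 25, which is the "erasing" effect described above; the clipped series holds at 50.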

Also, I want to look at doing this clipping at the state level first and then summing the weekly accumulated totals for each state to arrive at the total for the US. I'm still studying the code to see if this is how the data is structured or not.

djhopkins2 commented 4 years ago

@jonasalmeida I think you closed this issue thinking I was pointing out a different problem... Can you reopen this?

djhopkins2 commented 4 years ago

I studied your code a bit more, and correcting for the negative excess deaths getting into the total is a bit more involved. The code uses the US totals, both for the historical averages and for 2020, to arrive at the US aggregate total excess deaths. I'm going to have to modify the code to look at each jurisdiction, come up with the weekly historical average per jurisdiction, and then only count the excess if it is above average for that week in that jurisdiction. Then I'd take those weekly excursions and sum them for the whole US. What I'm looking to do is at least remove the drops below average that some areas had when people started taking precautions, since those are skewing the US totals.
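The per-jurisdiction approach described above could look roughly like the sketch below. Everything here is an assumption for illustration: the nested `{ jurisdiction: { year: [weekly counts] } }` shape, the function names `weeklyAverage` and `usExcess`, the `clip` and `skipWeeks` options, and the toy numbers. It is not how the tracker currently structures its data.

```javascript
// Hypothetical: average the same week across several prior years.
function weeklyAverage(yearsOfCounts) {
  const weeks = yearsOfCounts[0].length;
  return Array.from({ length: weeks }, (_, w) =>
    yearsOfCounts.reduce((sum, counts) => sum + counts[w], 0) / yearsOfCounts.length
  );
}

// Hypothetical: sum weekly excess over every jurisdiction.
// clip = true counts only weeks above that jurisdiction's own average.
// clip = false lets below-average weeks subtract; by linearity this matches
// computing the excess from summed US totals when every jurisdiction has the
// same years and weeks.
function usExcess(data, currentYear, { clip = true, skipWeeks = 0 } = {}) {
  let total = 0;
  for (const byYear of Object.values(data)) {
    const history = Object.keys(byYear)
      .filter((year) => year !== String(currentYear))
      .map((year) => byYear[year]);
    const average = weeklyAverage(history);
    byYear[currentYear].forEach((deaths, week) => {
      if (week < skipWeeks) return; // optionally ignore the earliest weeks
      const diff = deaths - average[week];
      total += clip ? Math.max(0, diff) : diff;
    });
  }
  return total;
}

// Toy data, not real counts: one jurisdiction dips below average, one spikes.
const toyData = {
  StateA: { 2018: [100, 100, 100], 2019: [100, 100, 100], 2020: [90, 95, 100] },
  StateB: { 2018: [200, 200, 200], 2019: [200, 200, 200], 2020: [200, 260, 300] },
};

const clippedTotal = usExcess(toyData, 2020, { clip: true });
const unclippedTotal = usExcess(toyData, 2020, { clip: false });
console.log(clippedTotal);   // 160
console.log(unclippedTotal); // 145
```

In the toy data, StateA's below-average weeks remove 15 deaths from the unclipped US total, while the clipped total keeps the full 160 from StateB's spike. Passing `skipWeeks: 1` would drop the first week of the year from the sum.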

I took the raw data, recreated the calculation that the current mortality tracker uses, and applied my fix to it as well, and I'm finding ~20,000 extra deaths that were getting negated. I even excluded the first week of this year, since there was a large spike in all of the data that was likely not from covid.

[image]

One final note: this seems, in some ways, to fix the issue with partial state data, since I didn't exclude any state from the data when calculating the total.