Percentiles for multi-select are inaccurate

jakemsnyder commented 7 years ago

Right now, when a user selects multiple counties, it averages the percentiles to get the new value. Mathematically, this is incorrect. We need to recalculate based on the estimates.

Per SVI documentation, CDC uses the Excel function PERCENTRANK.INC on the corresponding EP field with 4 significant digits. Unfortunately, (most of) the EP fields are themselves percentages.

To be 100% accurate, we will need to go back to the estimate field, recalculate the percentage for that multi-county/tract selection, then calculate the percentile using that new percentage as it compares to the rest of the counties or tracts. The EP field calculation is also in the documentation above.
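To illustrate why averaging goes wrong, here is a minimal Python sketch. The two counties and their numbers are made up (not real SVI values); only the field names E_POV and E_TOTPOP come from the SVI. It compares averaging the per-county percentages with recalculating the percentage from the summed estimates. The same distortion carries over when percentiles are averaged.

```python
# Hypothetical estimates for two selected counties (not real SVI values).
counties = [
    {"E_POV": 500, "E_TOTPOP": 1_000},     # 50% below poverty, small county
    {"E_POV": 1_000, "E_TOTPOP": 99_000},  # about 1% below poverty, large county
]

# Averaging the per-county percentages ignores how big each county is.
averaged = sum(c["E_POV"] / c["E_TOTPOP"] for c in counties) / len(counties)

# Recalculating from the estimates: sum numerators and denominators first.
recalculated = sum(c["E_POV"] for c in counties) / sum(c["E_TOTPOP"] for c in counties)

print(f"averaged: {averaged:.4f}, recalculated: {recalculated:.4f}")
```

With these numbers the averaged value is roughly 0.26 while the recalculated value is 0.015, because the small county dominates the plain average.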

Including this data will not be possible at the tract level, as our mbtiles file is already at the maximum file size. We could do this at the county level though.

At the tract level, we will still be able to do this for some fields (EP_PCI is the estimate, not the percentage, so we can still aggregate it accurately). For some fields we just need to multiply by the population estimate to get the correct estimate. But there will be some fields for which we cannot calculate an accurate percentile (e.g. EP_CROWD, which requires the estimated housing units as the denominator).

I'll work on identifying which fields we can calculate with the data we already have (and how to do so), and which fields we cannot calculate accurately with our current data.

rmcarder commented 7 years ago

Oh, right, we only have the percentile data in the tracts and not the raw rate. The concept of an averaged percentile doesn't make sense at all, actually. You would have to treat all the counties selected as one, create a weighted average of the raw rates of the ones selected, and then recalculate percentile values to see where your block of selected counties stood relative to a group not including them as individual counties.

tingaloo commented 7 years ago

Would this be a simple manipulation of the data passed within updateSidebar or something more involved? I'm not too familiar with how the sidebar values are evaluated from the original data, but I could give it a shot.

rmcarder commented 7 years ago

I think more involved, and to a considerable extent. The sidebar values are percentiles. So for example, a .97 in the minority category means that the area has a higher percentage of minorities than 97% of the other counties/tracts in the country. Multi-select would greatly complicate this because not only would you have to go back to the raw values of the indicator and calculate an average value weighted by population, but you would also have to remove the counties you had selected from the pool you are comparing to in determining percentile. Aside from the complexity of making this calculation, it seems pretty intensive to be doing for a sidebar that needs to update just about instantly. Jake, do you agree? How about we put this off until another conversation with Zach, explain it to him, and see what he would like us to do given these constraints?

jakemsnyder commented 7 years ago

I think for most of the metrics (at least half), it's a simple recalculation. If updateSidebar can do a percentile ranking, we should be able to get most of it relatively easily, but we will need to do a separate calculation for each metric.

As for removing the selected counties from the percentile ranking, I think we can keep them in. If we consider the selection as a new county, it's just adding one more to the list. Since we round to the nearest whole percentile, it will be close enough.
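A rough Python sketch of why keeping the selected counties in the pool should be close enough. Everything here is hypothetical: the 1,000 evenly spread rates, the two selected counties, and the combined rate are made up for illustration, and `percentile` is just a helper name.

```python
def percentile(value, pool):
    """(Rank - 1) / (N - 1), with the value itself counted as one member of the list."""
    values = sorted(pool + [value])
    rank = values.index(value) + 1  # 1-based rank; ties take the lowest rank
    return (rank - 1) / (len(values) - 1)

# Hypothetical rates for 1,000 counties, evenly spread between 0 and 1.
rates = [i / 1000 for i in range(1000)]
selected = [rates[100], rates[900]]  # the two counties the user picked
combined_rate = 0.4255               # hypothetical recalculated rate for the selection

# Option 1: keep the selected counties in the pool and add the selection as a new county.
kept_in = percentile(combined_rate, rates)

# Option 2: remove the selected counties from the pool first.
removed = percentile(combined_rate, [r for r in rates if r not in selected])

print(round(kept_in * 100), round(removed * 100))
```

In this example the two options differ only in the third decimal place, so they round to the same whole percentile. With a much smaller pool or a very large selection the gap could be bigger.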

I'm going to do some analysis this weekend to see exactly what calculation we would need to do for each metric to accomplish this and will report back. And just as an FYI, it involves going back to the SVI documentation to see how they calculated their metrics.

jakemsnyder commented 7 years ago

I started going through the data to see if we could calculate it ourselves, but the numbers don't line up. Even a metric as straightforward as Below Poverty does not line up 1 to 1. I'm thinking that, since we are only using this for a multi-select, we can just do our best guess. Maybe include a disclaimer that the percentiles for multi-select are estimated and not official?

With that in mind, the calculations below are a best guess using what's in the SVI, and should not be considered the true calculation for the values.

Percentiles calculated as: (Rank - 1) / (N - 1), using 4 significant digits
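A minimal Python sketch of this formula. It is not a reimplementation of Excel's PERCENTRANK.INC; the tie handling (ties share the lowest rank) and rounding to four decimal places as a stand-in for "4 significant digits" are assumptions, and `percentile_rank` is a hypothetical helper name.

```python
def percentile_rank(values, x):
    """Percentile of x within values as (Rank - 1) / (N - 1).

    Rank is the 1-based position of x in ascending order; ties share the
    lowest rank. Assumes x is one of the values and that N > 1.
    """
    rank = sum(v < x for v in values) + 1
    return round((rank - 1) / (len(values) - 1), 4)

# Made-up rates, just to exercise the function.
rates = [0.05, 0.10, 0.15, 0.20, 0.25]
lowest = percentile_rank(rates, 0.05)
middle = percentile_rank(rates, 0.15)
highest = percentile_rank(rates, 0.25)
print(lowest, middle, highest)
```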

The following calculations give the value we need for each county/tract as well as for the new selection. Once we have these values, we rank all of them (including the selection's) and use the rank of the selection to calculate its percentile.
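The steps above could be sketched in Python roughly as follows. The function names and the four counties are hypothetical; the field names are from the SVI. The selection is treated as one more county added to the list, and ties take the lowest rank (an assumption).

```python
def ratio(rows, num, den):
    """Sum the numerator and denominator estimates over rows, then divide."""
    return sum(r[num] for r in rows) / sum(r[den] for r in rows)

def selection_percentile(all_rows, selected_rows, num, den):
    # 1. Value for every individual county/tract.
    values = [ratio([r], num, den) for r in all_rows]
    # 2. Value for the selection, recalculated from the estimates.
    new_value = ratio(selected_rows, num, den)
    # 3. Treat the selection as one more county and rank it.
    values.append(new_value)
    rank = sum(v < new_value for v in values) + 1
    # 4. Percentile as (Rank - 1) / (N - 1).
    return round((rank - 1) / (len(values) - 1), 4)

# Made-up counties with rates 0.10, 0.20, 0.30 and 0.40.
counties = [
    {"E_POV": 10, "E_TOTPOP": 100},
    {"E_POV": 40, "E_TOTPOP": 200},
    {"E_POV": 90, "E_TOTPOP": 300},
    {"E_POV": 160, "E_TOTPOP": 400},
]
result = selection_percentile(counties, counties[:2], "E_POV", "E_TOTPOP")
print(result)
```

Here the first two counties combine to 50 / 300, which sits between the 0.10 and 0.20 counties in the ranking.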

Below Poverty

E_POV/E_TOTPOP

Unemployed

This value relies on persons aged 16 years or older, so the true denominator is not in the SVI. E_UNEMP/E_TOTPOP

Income

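E_PCI = EP_PCI = mean income. For a selection, take the population-weighted mean: sum(E_PCI * E_TOTPOP for each selected county/tract) / sum(E_TOTPOP for each selected county/tract). Note that this percentile should go from lowest to highest, as a lower value indicates greater vulnerability.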
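A small Python sketch of the income case, under two assumptions: the combined value is the population-weighted mean of E_PCI, and "lowest to highest" means the lowest income gets the highest percentile. The helper names and all numbers are hypothetical.

```python
def combined_pci(rows):
    """Population-weighted mean of E_PCI over the selected counties/tracts."""
    total_income = sum(r["E_PCI"] * r["E_TOTPOP"] for r in rows)
    return total_income / sum(r["E_TOTPOP"] for r in rows)

def income_percentile(values, x):
    """Reversed ranking: rank 1 goes to the highest income, so the
    lowest income ends up with the highest percentile."""
    rank = sum(v > x for v in values) + 1
    return (rank - 1) / (len(values) - 1)

# Made-up selection of two counties.
selection = [
    {"E_PCI": 20_000, "E_TOTPOP": 1_000},
    {"E_PCI": 40_000, "E_TOTPOP": 3_000},
]
pci = combined_pci(selection)  # (20,000,000 + 120,000,000) / 4,000

# Made-up list of incomes that already includes the selection's value.
incomes = [20_000, 30_000, pci, 50_000, 60_000]
pct = income_percentile(incomes, pci)
print(pci, pct)
```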

No HS Diploma

This value relies on persons aged 25 years or older. The denominator is not in the SVI. E_NOHSDP/E_TOTPOP

Aged 65 or Over

E_AGE65/E_TOTPOP

Aged 17 or Younger

E_AGE17/E_TOTPOP

Civilian with a Disability

This value relies on persons aged 5 years or older. The denominator is not in the SVI. E_DISABL/E_TOTPOP

Single-Parent Household

E_SNGPNT/E_HH

Minority

E_MINRTY/E_TOTPOP

Speak English less than Well

This value relies on persons aged 5 years or older. The denominator is not in the SVI. E_LIMENG/E_TOTPOP

Multi-unit Structure

E_MUNIT/E_HU

Mobile Homes

E_MOBILE/E_HU

Crowding

The denominator is occupied housing units. Not sure if this is the same as E_HU. E_CROWD/E_HU

No Vehicle

E_NOVEH/E_HH
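For reference, the ratio metrics above could be collected into one lookup table so that a single function handles all of them. This is only a sketch: `METRICS` and `combined_value` are hypothetical names, the example rows are made up, and Income is left out because it is a weighted mean rather than a ratio. As noted above, several of these denominators are approximations.

```python
# (numerator field, denominator field) for each ratio metric, as listed above.
METRICS = {
    "Below Poverty": ("E_POV", "E_TOTPOP"),
    "Unemployed": ("E_UNEMP", "E_TOTPOP"),          # true denominator: aged 16+
    "No HS Diploma": ("E_NOHSDP", "E_TOTPOP"),      # true denominator: aged 25+
    "Aged 65 or Over": ("E_AGE65", "E_TOTPOP"),
    "Aged 17 or Younger": ("E_AGE17", "E_TOTPOP"),
    "Civilian with a Disability": ("E_DISABL", "E_TOTPOP"),    # true denominator: aged 5+
    "Single-Parent Household": ("E_SNGPNT", "E_HH"),
    "Minority": ("E_MINRTY", "E_TOTPOP"),
    "Speak English less than Well": ("E_LIMENG", "E_TOTPOP"),  # true denominator: aged 5+
    "Multi-unit Structure": ("E_MUNIT", "E_HU"),
    "Mobile Homes": ("E_MOBILE", "E_HU"),
    "Crowding": ("E_CROWD", "E_HU"),                # true denominator: occupied housing units
    "No Vehicle": ("E_NOVEH", "E_HH"),
}

def combined_value(metric, rows):
    """Best-guess value of a metric for a group of counties/tracts."""
    num, den = METRICS[metric]
    return sum(r[num] for r in rows) / sum(r[den] for r in rows)

# Made-up rows, just to show the call.
rows = [
    {"E_MOBILE": 50, "E_HU": 1_000},
    {"E_MOBILE": 150, "E_HU": 3_000},
]
mobile = combined_value("Mobile Homes", rows)
print(mobile)
```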

Appendix A in this doc gave additional insight not in the SVI documentation.
