ebmdatalab / datalab-pandas

Useful pandas library stuff
MIT License

Decile changes for sro #31

Closed · LFISHER7 closed this 11 months ago

LFISHER7 commented 1 year ago

OpenSAFELY SRO work uses the deciles chart reusable action, which in turn uses the decile chart function here.

When showing the outer percentiles, I was hitting a bug whereby some percentiles were repeated. This is a proposed fix, but there may be better ways to handle this.

In passing, this also refactors the function that computes the deciles, in line with previous SRO work by @iaindillingham.

iaindillingham commented 1 year ago

Thanks for this @LFISHER7. Could you point to some examples of the bug? It seems odd that some percentiles were repeated; the context would help me understand your proposed fix.

LFISHER7 commented 1 year ago

Sorry for the delay in responding to this! I've introduced a failing test to highlight the issue and suggested a fix.

ccunningham101 commented 1 year ago

Taking a look at this, I think it results from floating point precision with np.arange, e.g.:

>>> np.arange(0.1, 1.0, 0.1)[6]
0.7000000000000001

The resulting percentiles are:

>>> df["percentile"][14]
0.060000000000000005
>>> df["percentile"][15]
0.06999999999999999

Truncated to ints, they are both 6:

>>> int(first * 100)
6
>>> int(second * 100)
6

If round had been used instead, they stay distinct:

>>> round(100 * first)
6
>>> round(100 * second)
7
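The truncation-versus-rounding behaviour can be checked with a small self-contained snippet; the two percentile values are taken from the REPL output above:

```python
import numpy as np

# np.arange accumulates floating-point error in its steps,
# so the 7th element is not exactly 0.7.
values = np.arange(0.1, 1.0, 0.1)
assert values[6] != 0.7  # it is 0.7000000000000001

# The two adjacent percentile values quoted from df["percentile"].
first = 0.060000000000000005
second = 0.06999999999999999

# Truncating with int() collapses both onto 6...
assert int(first * 100) == 6
assert int(second * 100) == 6

# ...whereas round() keeps them distinct.
assert round(first * 100) == 6
assert round(second * 100) == 7
```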

Another solution could be to do the rounding when creating the arrays

deciles = np.arange(1, 10) / 10
bottom_percentiles = np.arange(1, 10) / 100
top_percentiles = np.arange(91, 100) / 100
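Dividing integer arrays once, as above, sidesteps the accumulation problem: each element is a single correctly rounded IEEE division, so it equals the float you would get from the decimal literal. A quick check:

```python
import numpy as np

# Integer numerators divided once: no accumulated stepping error.
deciles = np.arange(1, 10) / 10
bottom_percentiles = np.arange(1, 10) / 100
top_percentiles = np.arange(91, 100) / 100

# Each element matches the nearest-double literal exactly...
assert deciles[6] == 0.7
assert bottom_percentiles[6] == 0.07
assert top_percentiles[0] == 0.91

# ...unlike the accumulated steps produced by np.arange with a float step.
assert np.arange(0.1, 1.0, 0.1)[6] != 0.7
```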

ccunningham101 commented 1 year ago

And maybe the most readable/understandable solution would be to avoid np.arange altogether:

deciles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
bottom_percentiles = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]
top_percentiles = [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99]
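For illustration, a hypothetical sketch (not the repo's actual decile chart function; the column names and example data are made up) of how explicit quantile lists like these can be fed to pandas and labelled with round() so adjacent percentiles can never collide:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one value per practice per month.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": ["2023-01"] * 100 + ["2023-02"] * 100,
    "value": rng.random(200),
})

deciles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
bottom_percentiles = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]
top_percentiles = [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99]
quantiles = bottom_percentiles + deciles + top_percentiles

# groupby().quantile() returns a MultiIndex Series: (month, quantile) -> value.
result = df.groupby("month")["value"].quantile(quantiles).reset_index()
result.columns = ["month", "quantile", "value"]

# Label with round(), not int(), so e.g. 0.06999... becomes 7 rather than 6.
result["percentile"] = (result["quantile"] * 100).round().astype(int)
```

With this labelling, each month gets all 27 distinct percentile labels (1 to 9, every tenth from 10 to 90, then 91 to 99), which is exactly the property the failing test in this PR checks for.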

inglesp commented 11 months ago

Closing in favour of #37.