joseepoirier opened 1 year ago
Hey Josée, thanks for checking with me on this. The report looks good, but it made me realize an issue that I had also encountered with PHL. Basically, if you have very few of a particular outcome, say FN, then it's possible to draw re-samples without any FNs. From these re-samples you would estimate a recall of 100%, which we know is not actually a possible value, since the original sample does contain at least one FN. This can lead to incorrect statements; for example, the report now says "At least some activities are expected to be implemented ahead of 89 - 100% of shocks", but we know from our sample that the upper bound isn't really 100%.
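To make this concrete, here's a toy sketch of the failure mode (made-up counts, not our actual data): with 1 FN among 20 outcomes, a large share of re-samples miss the FN entirely and the bootstrap distribution piles up at exactly 100%.

```r
set.seed(1)
outcomes <- c(rep("TP", 19), "FN")  # made-up: 1 FN among 20 outcomes

recalls <- replicate(10000, {
  resample <- sample(outcomes, replace = TRUE)  # standard bootstrap draw
  sum(resample == "TP") / length(resample)      # recall = TP / (TP + FN)
})

# About (19/20)^20, i.e. roughly 36%, of re-samples contain no FN at all,
# so the bootstrap distribution has a big spike at exactly 1 ...
mean(recalls == 1)
# ... and the upper bound of the 95% percentile interval lands on 100%.
quantile(recalls, c(0.025, 0.975))
```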
Therefore, I've now updated the bootstrap algorithm to check that at least one of each outcome is present in the re-sample, and otherwise re-draw. However, I'm on the fence about whether this is really the right thing to do. On the one hand, it's just excluding values that we know are impossible. On the other hand, it's perhaps not capturing the true sample variability, and it's the report wording that is the real issue. I'm leaning towards the former, but I'm too tired now to make a good decision, I think; will update you tomorrow!
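Roughly, the updated step looks like this (a simplified sketch of the idea, not the exact code):

```r
# Keep re-drawing until every outcome class present in the original
# sample appears at least once in the re-sample.
resample_with_all_outcomes <- function(outcomes) {
  required <- unique(outcomes)
  repeat {
    resample <- sample(outcomes, replace = TRUE)
    if (all(required %in% resample)) {
      return(resample)
    }
  }
}
```

With the toy counts above, the re-sampled recalls can then never hit exactly 100%, since every re-sample is forced to contain the FN.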
Hi Monica, very good point and solution (or beginning of one!)
In this case the performance would be degrading with the 2023 methodology because we started at... a 100% detection rate, which never felt accurate to report. I'm also not sure whether we want to require a non-zero count for one category or for all of them. Maybe we work through this together, unless you reach clarity on your own? It can wait until you are back, but it will be time-sensitive when you return. I've already flagged to colleagues at HQ that the performance degrades. It'd be okay to correct the metric values with them if we re-run this differently, but we need to feel confident in whatever figures we report (with or without requiring at least 1 of each category, for example) no later than the first week of January.
Thank you!
Hey Josée, this has taken me down quite the rabbit hole; below I'm collecting some of the links I've found along the way. But first, to address your point:
> In this case the performance would be degrading with the 2023 methodology because we started at... a 100% detection rate, which never felt accurate to report.
True, without having any FNs in the sample, we have no way to rule out 100% as the upper bound. I would frame it as: now we have more (or different) information, so the confidence interval has shifted.
Anyway, back to the issue: this answer makes me think that what I proposed is not a good idea. However, we may need to re-evaluate how we're computing the CIs in general; some info I found is below:
Honestly, I'm really at the limit of my stats knowledge here. I want to post this question on Stack Exchange, but I'll wait until I'm back at work so that I can reply quickly to any responses.
> but we need to feel confident in whatever figures we report (with or without requiring at least 1 of each category, for example) no later than the first week of January.
Maybe we can just stick with the original method for now, as it's the most conservative (it creates the widest CIs). I mean, I don't think it's necessarily wrong to have 100% in the CI; it just feels off the way we report it now.
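For comparison, here are the two 95% percentile intervals side by side on the same made-up counts as above (again, not our actual data):

```r
set.seed(1)
outcomes <- c(rep("TP", 19), "FN")  # made-up: 1 FN among 20 outcomes

# One bootstrap recall, with or without the all-outcomes constraint.
boot_recall <- function(constrained) {
  repeat {
    rs <- sample(outcomes, replace = TRUE)
    if (!constrained || all(unique(outcomes) %in% rs)) {
      return(sum(rs == "TP") / length(rs))
    }
  }
}

quantile(replicate(10000, boot_recall(FALSE)), c(0.025, 0.975))  # original: upper bound hits 100%
quantile(replicate(10000, boot_recall(TRUE)), c(0.025, 0.975))   # constrained: upper bound stays below 100%
```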
Stellar Monica, thank you so much for the thoughtful response. I'm 100% with you on all this.
For now, the stats haven't been communicated externally; they've only been mentioned to 2 HQ colleagues as a heads-up for a conversation about them. I mentioned degradation because the CIs shifted in the "wrong" direction (i.e. worse FAs and MIS), which for now I consider enough to bring stakeholders in on a conversation about the impact of the trigger changes. They're both on leave for the holidays anyway, so we'll pick this back up when you return with the full final figures.
Super interesting conversation!
Generated the report for the 2023 metrics. Copied the original Rmd and changed the paths to read in the new data and to save the plots in a separate folder.
@turnerm Not much to review, I think, but making sure another set of eyes sees this so we can merge.