OCHA-DAP / pa-anticipatory-action

Code and documentation for analytical work on OCHA Anticipatory Action pilots.
GNU General Public License v3.0

Ner 2023 perf check #323

Open joseepoirier opened 1 year ago

joseepoirier commented 1 year ago

Generated the report for the 2023 metrics. Copied the original Rmd and changed paths to reflect the new data being read in and to save the plots in a separate folder.

@turnerm Not much to review I think but making sure another set of eyes is seeing this so we can merge.

turnerm commented 1 year ago

Hey Josée, thanks for checking with me on this. The report looks good, but it made me realize an issue that I had also encountered with PHL. Basically, if you have very few of a particular outcome, say FN, then it's possible to draw re-samples without any FNs. From these re-samples you would estimate a recall of 100% -- which we know is not actually a possible value, since the original sample does contain at least one FN. This can lead to incorrect statements: for example, the report now says "At least some activities are expected to be implemented ahead of 89 - 100% of shocks", but we know from our sample that the upper bound isn't really 100%.
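A minimal sketch of how this happens, using made-up numbers rather than the actual pilot data (the project code is in R, but the mechanics are the same in Python). With 9 hypothetical TPs and a single FN, a large share of re-samples happen to contain no FN at all, so the percentile upper bound for recall collapses to exactly 100%:

```python
import numpy as np

# Hypothetical outcome labels for 10 monitored events: 9 detected shocks (TP)
# and 1 missed shock (FN). Illustrative only -- not the pilot data.
outcomes = np.array(["TP"] * 9 + ["FN"])

rng = np.random.default_rng(42)
n_boot = 10_000
recalls = np.empty(n_boot)
for i in range(n_boot):
    # Standard bootstrap: re-sample the outcomes with replacement.
    sample = rng.choice(outcomes, size=outcomes.size, replace=True)
    tp = np.sum(sample == "TP")
    fn = np.sum(sample == "FN")
    recalls[i] = tp / (tp + fn)

# Percentile-method 95% CI.
lo, hi = np.percentile(recalls, [2.5, 97.5])
# With a single FN, roughly (9/10)^10 ~ 35% of re-samples contain no FN,
# so the 97.5th percentile lands on recall = 1.0.
print(f"95% percentile CI for recall: [{lo:.2f}, {hi:.2f}]")
```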

Therefore, I've now updated the bootstrap algorithm to check and make sure that at least one of each outcome is present in the re-sample, and otherwise re-draw. However, I'm on the fence about whether this is really the correct thing to do. On the one hand, it's just excluding metrics that we know are false. But on the other hand, it's perhaps not capturing the true sample variability, and the report wording is the real issue. I'm leaning towards the former, but too tired now to make a good decision I think -- will update you tomorrow!
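A rough Python sketch of that "check and re-draw" rule on the same kind of made-up outcome list (an illustration of the idea, not the code in the Rmd). Since every accepted re-sample must keep the single FN, recall can never reach 1.0 and the upper bound drops to 0.9:

```python
import numpy as np

def constrained_bootstrap(outcomes, n_boot=10_000, seed=0):
    """Percentile-style bootstrap that rejects any re-sample missing one of
    the outcome categories present in the original sample. A sketch of the
    'check and re-draw' rule, not the actual report code."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    required = set(np.unique(outcomes))
    recalls = []
    while len(recalls) < n_boot:
        sample = rng.choice(outcomes, size=outcomes.size, replace=True)
        if set(np.unique(sample)) != required:
            continue  # re-draw: at least one category is absent
        tp = np.sum(sample == "TP")
        fn = np.sum(sample == "FN")
        recalls.append(tp / (tp + fn))
    return np.percentile(recalls, [2.5, 97.5])

# Same hypothetical sample as before: 9 TPs, 1 FN.
lo, hi = constrained_bootstrap(["TP"] * 9 + ["FN"])
print(f"95% CI for recall with re-draws: [{lo:.2f}, {hi:.2f}]")
```

The trade-off discussed above is visible here: the interval no longer includes the impossible 100%, but the rejection step means the re-samples are no longer a plain bootstrap of the data.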

joseepoirier commented 1 year ago

Hi Monica, very good point and solution (or beginning of one!)

In this case the performance would be degrading with the 2023 methodology, because we started at... 100% detection rate, which never felt accurate to report. I'm also not sure whether we want to require a non-zero count for one category or for all of them. Maybe we work through this together, unless you reach clarity on your own? It can wait until you are back, but it will be time-sensitive when you return. I've already flagged to colleagues at HQ that the performance degrades. It'd be okay to correct the metric values with them if we re-run this differently, but we need to feel confident in whatever figures we report (with or without requiring at least 1 of each category, for example) no later than the first week of Jan.

Thank you!

turnerm commented 1 year ago

Hey Josée, this has taken me down quite the rabbit-hole, below I'm collecting some of the links I've found along the way. But first to address your point:

> In this case the performance would be degrading with the 2023 methodology because we started at.. 100% detection rate, which never felt accurate to report.

True, without having any FNs in the sample, we have no way to rule out 100% as the upper bound. I would frame it as: now we have more (or different) information, so the confidence interval has shifted.

Anyway, back to the issue: this answer makes me think that what I proposed is not a good idea. However, we may need to re-evaluate how we're computing the CIs in general; some info I found below:

  1. Through my reading I discovered that you're not supposed to use the percentile method to compute bootstrapped CIs, oops. I suspect that if we use a different method that e.g. computes a symmetric interval, we would avoid the 100% issue. However, I have to look more into what is valid for our dataset. Here is a paper comparing different methods.
  2. Stratified bootstrapping: e.g. we could separately sample years that have shocks vs. those that do not. I suspect that would lead to fewer re-samples missing FNs.
  3. Finally, we could use the jackknife, which is recommended for small samples. I'm not sure how exactly the CI is computed though, or whether it would solve the problem.
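For what it's worth, a numpy sketch of the mechanics of options 2 and 3 on made-up yearly data (names, counts, and the 1.96 normal-approximation interval for the jackknife are all illustrative assumptions, not the project's method):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-year records (illustrative only, not the pilot data):
# shock years: 1 = detected (TP), 0 = missed (FN); quiet years: 1 = false alarm.
shock_years = np.array([1] * 9 + [0])
quiet_years = np.array([1] * 2 + [0] * 18)

# Option 2: stratified bootstrap -- re-sample the two strata separately, so
# every re-sample keeps the original mix of 10 shock and 20 quiet years.
n_boot = 10_000
recall = np.empty(n_boot)
far = np.empty(n_boot)
for i in range(n_boot):
    s = rng.choice(shock_years, size=shock_years.size, replace=True)
    q = rng.choice(quiet_years, size=quiet_years.size, replace=True)
    recall[i] = s.mean()  # TP / (TP + FN), from the shock stratum only
    far[i] = q.mean()     # false-alarm rate, from the quiet stratum only
recall_ci = np.percentile(recall, [2.5, 97.5])

# Option 3: jackknife -- leave one shock year out at a time and use the
# spread of the leave-one-out estimates for a normal-approximation CI.
n = shock_years.size
loo = np.array([np.delete(shock_years, i).mean() for i in range(n)])
theta = shock_years.mean()
se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
jack_ci = (theta - 1.96 * se, theta + 1.96 * se)
```

One caveat worth noting: on data like this the normal-approximation jackknife interval can extend past 1.0 near the boundary, so it would not automatically resolve the reporting issue either.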

Honestly, I'm really at the limit of my stats knowledge here, and I want to post this question on Stack Exchange -- but once I'm back at work, so that I can reply quickly to any responses.

> but we need to feel confident in whatever figures (with or without requiring at least 1 of each category for example) no later than first week of Jan.

Maybe we can just stick with the original method for now, as it's the most conservative (it creates the widest CIs). I mean, I don't think it's necessarily wrong to have 100% in the CI; it just feels off when we report it that way.

joseepoirier commented 1 year ago

Stellar Monica, thank you so much for the thoughtful response. I'm 100% with you on all this.

For now the stats have not been communicated externally, only mentioned to 2 HQ colleagues as a heads-up for a conversation about them. I mentioned degradation because the CIs shifted in the "wrong" direction (i.e. worse FAs and MIS), which for now I consider enough to bring stakeholders in on a conversation about the impact of the trigger changes. They're both on leave for the holidays anyway, so we'll pick this back up when you return with the full final figures.

caldwellst commented 1 year ago

Super interesting conversation!