OCHA-DAP / pa-anticipatory-action

Code and documentation for analytical work on OCHA Anticipatory Action pilots.
GNU General Public License v3.0

Chad forecast performance - rule of three? #319

Closed caldwellst closed 1 year ago

caldwellst commented 1 year ago

Hello @turnerm!

We are revisiting the possibility of calculating performance statistics on the predictive trigger for Chad. This trigger is based on 4 activation points from March to June, using the July-August-September seasonal forecast. We only had 5 years of historical data, and in this period there was no severe shock and the threshold was never met. I re-read some of the materials on the rule of three and wanted to get your thoughts on applying it here, as I have a few concerns about how we might use it to fill out the full trigger performance card:

I think we've discussed this previously with Tinka and it was decided not to calculate any metrics. However, I wanted to put this here so we have it in writing, and to start using better documentation for our discussions!

turnerm commented 1 year ago

Hm so are you saying that the statistics we have for Chad are just 5 TNs? Yikes!

For using the rule of 3 in Niger, a couple of things were different:

Will think about it more though!

caldwellst commented 1 year ago

So, was trying to access the paper but couldn't. I'm thinking today, the problem is that the lack of TP or FN is not probabilistic at all but deterministic since there were no true positives in the validation dataset, right? We couldn't consider each year's validation as a Bernoulli trial, basically, since the probability was 0 they would occur. Hmm anyway fun to think about it, let's keep the conversation going.

turnerm commented 1 year ago

> I'm thinking today, the problem is that the lack of TP or FN is not probabilistic at all but deterministic since there were no true positives in the validation dataset, right? We couldn't consider each year's validation as a Bernoulli trial, basically, since the probability was 0 they would occur

I'm not totally sure I understand what you mean, but let me know if this is along the lines of what you're saying:

For the case of Niger, the Bernoulli trial was set up with success being (TP, TN, FP) and failure being (FN), which allowed us to get the CI of the FN frequency. However, since we only have TNs for Chad, we would need to define (TN) as a success, and (TP, FP, FN) as a failure. So any frequency we get would contain three metrics lumped together and not be very helpful to inform the framework. Hope that makes sense.
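As a minimal sketch of this Niger-style setup (illustrative only, not the actual Niger code; the sample size `n_years = 10` is hypothetical), the rule of three gives an approximate 95% upper bound of 3/n on the failure frequency when no failures were observed:

```python
# Hypothetical sketch of the Niger-style rule of three: if 0 FNs were
# observed in n_years validation years, the approximate 95% upper bound
# on the FN frequency is 3 / n_years. The sample size here is illustrative.

def rule_of_three_upper(n: int) -> float:
    """Approximate 95% upper bound on the failure probability,
    given 0 failures observed in n Bernoulli trials."""
    return 3 / n

n_years = 10                      # hypothetical validation sample size
fn_upper = rule_of_three_upper(n_years)
print(fn_upper)                   # 0.3, i.e. FN frequency <= 30%
```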

caldwellst commented 1 year ago

Ooh nice, I like that formulation, it really helps. What I'm saying, though, is that in the case of Chad we only have TNs, but there was no trial for TP or FN, because there were 0 shocks in the period. There was no probability those would ever occur, so the setup would solely be for TN as a success and FP as a failure. So we could only get a frequency for that metric and none of the others. Does that make sense?

turnerm commented 1 year ago

> What I'm saying though, is in the case of Chad, we only have TNs, but there was no trial for TP or FN, because there were 0 shocks in the period. There was no probability those would ever occur, so the setup would solely be for TN as a success, and FP as a failure. So we could only get frequency for that metric and none of the others. Does that make sense?

I think for the application of the rule of 3 I would see it a bit differently, because my understanding is that it's supposed to apply for future trials, which in our case would be additional years, in which shocks could have occurred.

Since our sample contains no shocks, you could make the assumption that they never occur in which case you could compute FPs, which I think is consistent with what you're describing.

caldwellst commented 1 year ago

So, been thinking about it more, and I am not actually sure about bunching up TP, TN, and FP and then calculating FN for instance. I think the rule of three is applied for things like trials of patients where we expect there to be a population-level prevalence (or some statistic) we want to measure through this trial, assuming probability is equal across the n in the trial. But there is literally 0 probability of a TP or FN if there was no shock.

I would think if we apply the rule of three in our work, we should calculate the rule of three for TN and FP together as success/failure trials and TP and FN together as success/failure trials. Essentially considering the years where a shock did or did not occur as separate populations. This means our bounds are going to be wider than they are now, but I think this is the correct approach.

I don't think we can say anything about the probability of TP or FN for Chad. I think we can use rule of three to estimate the FP/TN confidence bounds. I also would not be confident in using the rule of three approach on the consecutive trigger points because they are not independent, but we could hand wave that if really necessary to get bounds for the FP/TN at the trigger level (I would still vote not to).
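A minimal sketch of this two-population idea (the counts are illustrative; only the 5 no-shock years match the Chad record): apply the rule of three separately to the no-shock years (TN vs FP) and the shock years (TP vs FN):

```python
# Sketch of treating shock and non-shock years as separate populations.
# Counts are illustrative; Chad has 5 no-shock years and 0 shock years.

def rule_of_three(n: int):
    """~95% upper bound (3/n) on the failure frequency given 0 observed
    failures in n trials; None when there were no trials at all."""
    return 3 / n if n > 0 else None

n_no_shock_years = 5   # all TNs in the Chad record
n_shock_years = 0      # no shocks observed, so no TP/FN trials

fp_upper = rule_of_three(n_no_shock_years)  # bound on FP frequency
fn_upper = rule_of_three(n_shock_years)     # no trials, so no bound

print(fp_upper, fn_upper)  # 0.6 None
```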

joseepoirier commented 1 year ago

Hi both, great to see this discussion happen in more depth (and being documented!) It might help to have a live discussion about it to clear up the direction and pros/cons of approaches we come up with. (And then continue the back-and-forth in GH issues.)

I wanted to step back for a moment and ask:

  1. The first limitation seems to come from the short forecast history. Can we get around that (ie did we consider other seasonal products with longer history? I think IRI's own flexible forecasts are now available and do have longer history.)
  2. How do we define an event (ie a "drought" year)? Do we have a history longer than 5 years / with more than 0 events? In other words, what can we share about drought frequency and severity in the country beyond the strict constraints of the framework? Is there a superset that is relevant to examine as a way of inferring anything about the subset of shocks targeted by AA?
  3. As Monica alludes to, we are looking to estimate the likelihood of activation, hits, and false alarms once we start monitoring the trigger (ie in the future.) Setting aside the current approach, what else might we be able to give stakeholders to help them evaluate cost/benefit of doing AA if those metrics were not calculable?
joseepoirier commented 1 year ago

> I'm thinking today, the problem is that the lack of TP or FN is not probabilistic at all but deterministic since there were no true positives in the validation dataset, right? We couldn't consider each year's validation as a Bernoulli trial, basically, since the probability was 0 they would occur
>
> I'm not totally sure I understand what you mean, but let me know if this is along the lines of what you're saying:
>
> For the case of Niger, the Bernoulli trial was set up as success being (TP, TN, FP) and failure being (FN), which allowed us to get the CI of FN frequency. However since we only have TNs for Chad, we need to define (TN) as a success, and (TP, FP, FN) as a failure. So any frequency we get would contain three metrics lumped together and not be very helpful to inform the framework. Hope that makes sense.

Also reluctant to use FN as a success since 1) it is the type of event we are targeting with AA, and 2) FNs are essentially the absence of something rather than confirmatory evidence.

turnerm commented 1 year ago

Sorry, a bit of a wall of text below! First responding to @caldwellst:

> So, been thinking about it more, and I am not actually sure about bunching up TP, TN, and FP and then calculating FN for instance.

Just to clarify as the above is not what I meant: rather, because our sample only contains TN, I'm suggesting that we define TN as a success, and define failure as [not TN]. We would then be calculating the confidence on the frequency of [not TN]. But more about that below:

> I think the rule of three is applied for things like trials of patients where we expect there to be a population-level prevalence (or some statistic) we want to measure through this trial, assuming probability is equal across the n in the trial. **But there is literally 0 probability of a TP or FN if there was no shock.**

Emphasis is mine -- I think that is the main point of confusion here (one that I also had) which I will attempt to clarify.

What confused me initially is the fact that our assumed underlying population (TP, TN, FP, FN) actually depends on two other distributions: the occurrence of a shock, and the performance of the model. However, thinking about this problem made more sense to me once I realized that you don't actually need to worry about these underlying drivers, and can simply consider the (TP, TN, FP, FN) population on its own.

I am assuming here that we believe shocks to be possible in our population, even though there are none in our sample. Under this assumption, there is absolutely a possibility of TP / FN in a future year.

Furthermore, while the rule of three was developed for clinical trials, the derivation is quite general and assumes a binomial distribution, in which each Bernoulli trial has only two possible outcomes, usually labelled "success" and "failure" (but this can be defined however you want). And we can definitely cast our problem into a Bernoulli one: for Niger, we defined success = [not an FN], and failure = FN. For Chad, we have that success = TN, and failure = [not a TN]. We can then further define [not a TN] however we want to. If we set it to just (FP) then we are saying that we don't think it's possible for this shock to ever occur, or that the results only apply to years where there is no shock.

If we set it to (TP, FP, FN) then we are assuming that our underlying population contains these members, and thus implicitly that we believe shocks can occur in the future. Taking this further and actually applying the rule of three, for our Chad sample we could place a 95% confidence upper limit on the frequency of [not a TN] at 3/5 = 60%. Which means that, for example, next year, we have at most a 60% chance (95% confidence) of either a TP, FP or FN occurring.
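The 60% figure above can be checked directly (a trivial computation, shown only to make the framing concrete):

```python
# Check of the bound quoted above: 5 validation years, all TNs, with
# success = TN and failure = [not a TN] (i.e. TP, FP or FN).

n_years = 5
upper_not_tn = 3 / n_years  # approximate 95% upper bound
print(upper_not_tn)         # 0.6: at most ~60% chance per year of TP/FP/FN
```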

I hope this all makes sense -- we can also chat bilaterally tomorrow if you'd like!

> I would think if we apply the rule of three in our work, we should calculate the rule of three for TN and FP together as success/failure trials and TP and FN together as success/failure trials. This means our bounds are going to be wider than they are now, but I think this is the correct approach

We could do this, but then we would be computing something slightly different -- for example in the first case, it would be: what is the probability of a false alarm given that there is no shock that year.

> I don't think we can say anything about the probability of TP or FN for Chad. I think we can use rule of three to estimate the FP/TN confidence bounds.

Our confidence limit would be on [not FN], so we just need to define that.

> I also would not be confident in using the rule of three approach on the consecutive trigger points because they are not independent, but we could hand wave that if really necessary to get bounds for the FP/TN at the trigger level (I would still vote not to).

Agree with your vote!

@joseepoirier really good points, will start to think about them!

> Also reluctant to use FN as a success

As I mentioned in my diatribe above, the "success" and "failure" labels are just names for defining p in the Bernoulli trial, it doesn't have to literally mean those things. How we define them is set by our assumptions on the underlying population.

caldwellst commented 1 year ago

Let's have a chat this morning!

I just don't believe we have an underlying population with a single probability p. If we set success as (TP, FP, FN), do we really expect p to be the same for years where there is a shock (and we can have a TP or FN) or not a shock? I just don't think we can consider them a single population with comparable Bernoulli trials.

Going to the clinical trial example, imagine if the rule of three was applied for exactly these metrics, but for a cancer detection test. They know the actual status of each person in the trial. Let's say it had no false positives in the population. Would they really calculate the confidence bounds for false positives using the entire trial participants, or just for those without cancer where there's a meaningful probability p for a false positive? This is how I'm thinking about our problem. I don't think they can be considered 1 unique population because even if we lump metrics together, I do not think p is the same across the two groups.

turnerm commented 1 year ago

Just a quick response to the above and tying it into what we discussed when we spoke:

> I just don't believe we have an underlying population with a single probability p. If we set success as (TP, FP, FN), do we really expect p to be the same for years where there is a shock (and we can have a TP or FN) or not a shock? I just don't think we can consider them a single population with comparable Bernoulli trials.

We don't know a priori in any given year if there will be a shock, and my proposal is to make a statement for this general case. That being said, I think what you are getting at is that we can also say something for years when we know there won't be a shock, and indeed the probabilities are different in the sense that the underlying population changes once you consider only this case (only TN or FP possible).

Going back to the example I made above, we could also say that for Chad, in the years where no shock takes place, there is at most a 60% chance (95% confidence) of a false activation.

> Going to the clinical trial example, imagine if the rule of three was applied for exactly these metrics, but for a cancer detection test. They know the actual status of each person in the trial. Let's say it had no false positives in the population. Would they really calculate the confidence bounds for false positives using the entire trial participants, or just for those without cancer where there's a meaningful probability p for a false positive? This is how I'm thinking about our problem. I don't think they can be considered 1 unique population because even if we lump metrics together, I do not think p is the same across the two groups.

Similar to the above, I think it depends on what you want to report. Do you want to provide the FPR for applying the test to the general population, or to people who don't have cancer? In general since when using the test in the real world you wouldn't know a priori who has cancer, I would say that you would ideally want to report the former (which incidentally would not be possible to compute given the results of the trial). (And by the way, terrible drug trial design Seth! :joy: )

caldwellst commented 1 year ago

Hahaha don't blame my terrible clinical trial design!

Obviously in the real world we don't know a priori who has cancer, but I think all assessments of cancer detection techniques measure their performance using specificity, sensitivity, etc. because these present meaningful statistics. The same was true during COVID, where test performance metrics were communicated to the general population as, for instance, the likelihood of a false positive if you didn't have the virus. I would say that we want to provide the FPR for the people who don't have cancer, and specificity is the standard metric for reporting that.

However, to note, all of our performance reporting follows the above. We are not reporting across all years together. We report precision and recall, with specific denominators of total activations or total shocks, not all years. If we use the rule of three to assess across all years, I think we need to change the reporting in the template and our explanation.

image

turnerm commented 1 year ago

Oh shoot Seth, your last message about the definition of specificity made me realize that the population size (the n in 3/n) actually drops out if you compute any metrics with a denominator :woman_facepalming:. Sorry, I should have realized this earlier, I think I actually went through this calculation for Niger and then forgot.

This means that my whole hangup about defining the underlying population is only relevant if you're quoting the raw frequency, which we are not, oops. Since we are solely interested in metrics, these only depend on their respective sub-populations, as you've been saying, and is now painfully obvious to me. Sorry again!

turnerm commented 1 year ago

Revisiting this before our team meeting, and realized that I was wrong again! :sweat_smile: 3/n is just an approximation; using the exact formula, $p \leq 1 - (1 - \text{CI})^{1/n}$, n doesn't drop out.

But luckily, I think it still doesn't matter too much, at least if we are sticking to Chad. Because if the upper limit for the frequency of [TP, FP, FN] is 3/n, then we can say that the upper limit of the frequency of TP is also 3/n. So we can put a bound on the FAR without the caveats that I was specifying.

caldwellst commented 1 year ago

Okay interesting. Would you mind just writing out the pseudocode for how you calculated it and I can convert? Or if easily understandable just a link to where you've done it for Niger!

castledan commented 1 year ago

Very interesting discussion, guys. I finally had the time to read everything.

I agree with your conclusions: I think we can estimate a confidence interval for the frequency of FP only for years where shocks do not occur.

If our experiment is a Bernoulli trial, with the following definitions:

- success: a TN occurs
- failure: a TP, FP or FN occurs

we can assume that the corresponding probabilities do not change in time, and use the rule of three (or its equivalent) to estimate the confidence interval for the event failure, hence the composite probability of TP, FP and FN.

Let's now consider instead that our Bernoulli trial is defined only over years where shocks do not occur:

- success: a TN occurs
- failure: an FP occurs

We can still assume that the corresponding probabilities do not change in time, and use the rule of three (or its equivalent) to estimate the confidence interval for the event failure, hence the probability of FP, for years where shocks do not occur.

The only thing I could not understand is what Monica refers to in her last two messages. @turnerm, what do you mean by the population size "dropping out"?

turnerm commented 1 year ago

> Would you mind just writing out the pseudocode for how you calculated it and I can convert?

> The only thing I could not understand is what Monica refers to in her last two messages. @turnerm, what do you mean with the population size "dropping out"?

I think I can answer these both:

Let's say you have a cancer detection sample of size 1000 with 100 TP, 800 TN, 100 FN, and no FP. You want to estimate the maximum FPR, so you first need to get the frequency of FPs using the rule of three. The question is, which $n$ do you use: 1000 to consider the full population, or 800 (the number of TNs) to only consider people without cancer?

The point that I was trying to make is that it doesn't really matter, because once you plug the numbers into the FPR equation, the $n$ drops out. If you have FPR = FP / (FP + TN), then in order to plug in your estimate of FP, you need to convert the frequency that you estimated from the rule of three $p_{\text{FP}}=3/n$ to a number, which means multiplying by $n$. Or alternatively, you could keep FP as a frequency and convert TN to a frequency by dividing by $n$. Either way, the $n$ will drop out of the FPR equation, and thus you get FPR = 3 / (3 + TN).
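This cancellation can be checked numerically, using the hypothetical cancer-screening counts above (1000 people: 100 TP, 800 TN, 100 FN, 0 FP):

```python
# Why n drops out of the FPR bound: converting the rule-of-three
# frequency 3/n back to a count multiplies by n, so n cancels and
# the bound is the same whichever n you start from.

tn = 800  # TNs in the hypothetical screening sample

for n in (1000, 800):        # full sample vs. cancer-free subset
    fp_count = (3 / n) * n   # always 3, whichever n you pick
    fpr = fp_count / (fp_count + tn)
    assert abs(fpr - 3 / (3 + tn)) < 1e-12

print(3 / (3 + tn))  # upper bound on FPR, roughly 0.0037
```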

However, the exact form of the rule of 3, assuming you have a sample with all successes, is: $$p \leq 1 - (1 - \text{CI})^{1/n}$$ where $p$ is the probability of a failure, $n$ is the sample size, and $\text{CI}$ is the confidence level. (With a CI of 95% you can Taylor expand to get $3/n$ on the RHS). So my second point was that in principle the FPR result should actually depend on the choice of $n$, but in practice the exact number would change very little.
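A quick comparison of the exact bound and the 3/n approximation (the sample sizes are illustrative):

```python
# Exact vs. approximate rule of three. Exact: p <= 1 - (1 - cl)**(1/n)
# for confidence level cl; approximation: 3/n.

def exact_upper(n, cl=0.95):
    return 1 - (1 - cl) ** (1 / n)

def approx_upper(n):
    return 3 / n

for n in (5, 50, 500):
    print(n, round(exact_upper(n), 4), round(approx_upper(n), 4))
# For small n the two differ noticeably (n = 5: ~0.45 vs 0.60); for large n
# they converge, since 1 - 0.05**(1/n) ~ -ln(0.05)/n ~ 3/n.
```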

caldwellst commented 1 year ago

Thanks Monica, all clear! Without even delving into what the differences might be between using $3/n$ and the full formula, I think what's clear from your example is that we basically require that at least 1 of the 2 quantities be observed in our sample. Hence why we cannot calculate any of the statistics for Chad: we don't have any shocks or activations in the dataset.

castledan commented 1 year ago

Thanks Monica, that is clear.

turnerm commented 1 year ago

> I think without even delving into what those differences might be using $3/n$ vs. the full formula, I think what's clear from your example is that we basically require that at least 1 of the 2 quantities be observed in our sample. Hence why we cannot calculate any of the statistics for Chad because we don't have any shocks or activations in the dataset.

Sorry to keep beating this dead horse, but I just wanted to clarify that I think we could still put a limit on the FPR (also referring back to this comment) since in the case of Chad we have TN, and from the above, FPR $\leq$ 3 / (3 + TN) $\lesssim$ 38%. But not sure if it's worth reporting this one metric.
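For reference, the ~38% figure comes from plugging Chad's five TNs into the expression above:

```python
# Chad: TN = 5, no FPs observed, so the 95% upper bound on the FPR is
# 3 / (3 + TN) = 3/8 = 37.5%, i.e. the ~38% quoted above.

tn_chad = 5
fpr_upper = 3 / (3 + tn_chad)
print(fpr_upper)  # 0.375
```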

caldwellst commented 1 year ago

No, it's good to be clear! I think for me, tying these ends together, the issue is that we are not talking about the FPR as defined by FPR = FP / (FP + TN) in our trigger reports, according to the template developed for Niger. We report the false positives as a percent of total activations. Thus the statistic we report is FP / (FP + TP), which is 1 - precision (the false discovery rate). This is the first row of bars in the plot below: precision, and in red, 1 - precision.

image

The bottom bar presents recall and its complement, i.e. TP / (TP + FN) and FN / (FN + TP).
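A small sketch of these top-level metrics (the counts and the helper name are illustrative, not the actual reporting code):

```python
# Illustrative computation of the metrics the report template presents:
# precision/recall and their complements, rather than FPR = FP / (FP + TN).

def report_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)  # share of activations that were hits
    recall = tp / (tp + fn)     # share of shocks that were caught
    return {
        "precision": precision,
        "1 - precision": 1 - precision,  # false activations / activations
        "recall": recall,
        "1 - recall": 1 - recall,        # missed shocks / shocks
    }

print(report_metrics(tp=3, fp=1, fn=2))
```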

This is why I was saying we wouldn't report anything for Chad, because we don't present the FPR in the way you've defined above. But I do agree we theoretically could. However, I would rather stick to not presenting anything given the complete uncertainty and our desire to present a cohesive set of metrics across our pilots.

Note that the top-level metrics in the report are the same (precision/recall and their complements).

image
turnerm commented 1 year ago

Ah yes that's all clear, agreed!

caldwellst commented 1 year ago

We didn't apply it for Chad. Final project report created in #321