INSP-RH / pifpaf

Estimation of the Population Impact Fraction and Potential Impact Fraction
GNU General Public License v3.0

On problem 3 #7

Closed fcudhea closed 7 years ago

fcudhea commented 8 years ago

I agree the choice of distribution matters for calculating the PAF, and there is definitely an advantage to the empirical method, where you don't have to worry about that. But there is no need to oversell it with a strange scenario where a researcher uses a Poisson distribution when the data are uniform (who would do such a thing? If they know what a Poisson distribution is, surely they would recognize that the data don't fit it?)! I was actually surprised the bias was only 10%!

No need to do this if the point of the document is to clarify things for me, but I do think this exercise of seeing how sensitive PAFs are to different exposure distribution assumptions is interesting. It would be useful for comparing similar distributions that reasonable researchers can disagree on (for example log-normal, gamma and Weibull, which are all skewed and positive, with a skewed beta as the true distribution). You could also add the normal and/or truncated normal, since that would probably be the default distribution for the unthinking or rushed researcher.

I remain somewhat unconvinced that mis-specifying distribution would create so much bias if, say, the true distribution was right-skewed beta and the assumed distribution is gamma, or if the true distribution is truncated log-normal and used distribution is gamma.
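One way to probe the sensitivity question raised above is a small simulation. The sketch below is illustrative only and settles nothing in general. It assumes a relative risk of the form RR(x) = exp(theta * x) with a made-up theta = 0.1, a scaled right-skewed Beta(2, 5) as the "true" exposure, and a gamma fitted by the method of moments as the assumed distribution. The function names are hypothetical and are not part of the pifpaf package; the PAF is taken as 1 - 1/E[RR(X)].

```python
import numpy as np


def paf_empirical(x, theta):
    """Empirical PAF: 1 - 1 / mean(RR(x)), with RR(x) = exp(theta * x) (assumed form)."""
    return 1.0 - 1.0 / np.mean(np.exp(theta * x))


def paf_gamma_moments(x, theta):
    """PAF when the exposure is assumed gamma, fitted by the method of moments.

    Uses the gamma moment generating function:
    E[exp(theta * X)] = (1 - theta * scale) ** (-shape), valid for theta * scale < 1.
    """
    mean, var = np.mean(x), np.var(x, ddof=1)
    shape, scale = mean ** 2 / var, var / mean
    if theta * scale >= 1.0:
        raise ValueError("E[RR] is infinite for this theta under the fitted gamma")
    expected_rr = (1.0 - theta * scale) ** (-shape)
    return 1.0 - 1.0 / expected_rr


rng = np.random.default_rng(2017)
theta = 0.1  # hypothetical log relative risk per unit of exposure

# "True" exposure: a right-skewed beta rescaled to the interval [0, 10].
population = 10.0 * rng.beta(2.0, 5.0, size=1_000_000)
paf_true = paf_empirical(population, theta)  # large-sample reference value

# A survey-sized sample from the same distribution.
sample = 10.0 * rng.beta(2.0, 5.0, size=2_000)
paf_emp = paf_empirical(sample, theta)
paf_gam = paf_gamma_moments(sample, theta)

print(f"reference PAF:        {paf_true:.4f}")
print(f"empirical PAF:        {paf_emp:.4f}")
print(f"gamma (moments) PAF:  {paf_gam:.4f}")
```

Swapping in other pairs (truncated log-normal vs. gamma, normal vs. skewed beta) or a steeper theta only requires changing the sampling lines, which makes it a cheap way to check how much a given misspecification matters for a given relative risk function.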

fcudhea commented 8 years ago

One more point, something that I did not think of till recently: I think it's important to note that when using the empirical PAF, you are still making an assumption. That assumption is that the observed distribution from your survey reflects the true distribution of the population. This may not necessarily be the case for certain survey types like a single short-term recall, where extreme intakes will most likely be more common than in the "true" intake distribution (e.g., you just happen to have 12 servings of soda that day, but that is not normal for you and your true intake for the year is actually 3 servings per day). This isn't really a disadvantage relative to making distributional assumptions, since there is no data-driven way (that I know of) to adjust for something like this, and people making assumptions on distributions will most likely just be fitting the data they have to said distributions anyway. I just wanted to point out that it is still possible for the distribution used in the empirical method to not reflect the real distribution (but in such a way that it cannot be solved by statistical methods)!

RodrigoZepeda commented 7 years ago

> That assumption is that the observed distribution from your survey reflects the true distribution of the population.

Yup. And that is the greatest assumption in any statistical method: that the data somehow reflect the reality you want. I don't think there is anything we can do about it, do you?

> I remain somewhat unconvinced that mis-specifying distribution would create so much bias if, say, the true distribution was right-skewed beta and the assumed distribution is gamma, or if the true distribution is truncated log-normal and used distribution is gamma.

Maybe not. There is no way to say. If you assume your distributions are parametric you can always estimate the bias by hand (or by Mathematica or SAGE). However there is an infinite number of distributions. We know that the "true" distribution will be discrete (the population in the world is finite). One can always do the math; the problem is finding the right counterexample (which can take several days).

It might be as you argue (that, say, choosing the normal might yield a robust method) and it's a pretty interesting question: developing robust methods for the PAF (as the median is robust for the mean under the Normal assumption). We have thought about it but we don't have the resources to pursue that right now; if you do, please let us know what you find!

fcudhea commented 7 years ago

> Yup. And that is the greatest assumption in any statistical method: that the data somehow reflect the reality you want. I don't think there is anything we can do about it, do you?

None that are truly satisfactory, in my opinion. But we can be cautious about what kind of data to use for the method. If, say, we conclude that a single short-term recall does not reflect the true population distribution of mean annual intakes (observed standard deviation will be higher than "true" distribution's standard deviation), maybe using empirical method (or fitting data as is to distribution and using traditional method for that matter) is not the best approach and one could do better by using some ad-hoc probably-not-satisfactory method to shrink the data (or standard deviation estimate) to account for known bias. You could argue that it's still better to do a purely objective analysis and just be aware of potential biases rather than make muddled assumptions to control for bias too. I just wanted to point it out because I hadn't considered till relatively recently how much survey type can potentially weaken an analysis.
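The recall concern above can be illustrated with a toy simulation. Everything here is an assumption made for illustration: usual (long-run) intake is drawn from a gamma, a single-day recall is modeled as a Poisson count around each person's usual intake, the relative risk is RR(x) = exp(theta * x) with a made-up theta, and the "shrinkage" is one possible ad hoc correction that rescales the recall data so that their variance matches an estimate of the between-person variance. It is not a method from the pifpaf package.

```python
import numpy as np


def empirical_paf(x, theta):
    """Empirical PAF: 1 - 1 / mean(RR(x)), with RR(x) = exp(theta * x) (assumed form)."""
    return 1.0 - 1.0 / np.mean(np.exp(theta * x))


rng = np.random.default_rng(7)
theta = 0.1   # hypothetical log relative risk per serving
n = 20_000    # hypothetical survey size

# Usual intake (servings/day averaged over the year), unobserved in practice.
usual = rng.gamma(shape=3.0, scale=1.0, size=n)

# Single 24-hour recall: right on average, but with extra day-to-day noise.
one_day = rng.poisson(usual).astype(float)

# Ad hoc shrinkage toward the mean. Under the Poisson assumption the average
# within-person variance equals the mean, so between-person variance is
# estimated as (observed variance - observed mean).
mean_obs = one_day.mean()
var_obs = one_day.var(ddof=1)
var_between = max(var_obs - mean_obs, 0.0)
shrink = np.sqrt(var_between / var_obs)
shrunk = mean_obs + shrink * (one_day - mean_obs)

paf_usual = empirical_paf(usual, theta)     # what we would like to estimate
paf_recall = empirical_paf(one_day, theta)  # empirical method on raw recall data
paf_shrunk = empirical_paf(shrunk, theta)   # empirical method after shrinkage

print(f"PAF from usual intake:   {paf_usual:.4f}")
print(f"PAF from one-day recall: {paf_recall:.4f}")
print(f"PAF after shrinkage:     {paf_shrunk:.4f}")
```

Because this relative risk is convex, extra spread in the observed exposure pushes the mean relative risk up, so the raw recall data tend to overstate the PAF in this setup. The correction only works here because the within-person noise model is known by construction, which is exactly the kind of muddled assumption the comment above warns about.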

> Maybe not. There is no way to say. If you assume your distributions are parametric you can always estimate the bias by hand (or by Mathematica or SAGE). However there is an infinite number of distributions. We know that the "true" distribution will be discrete (the population in the world is finite). One can always do the math; the problem is finding the right counterexample (which can take several days).
>
> It might be as you argue (that, say, choosing the normal might yield a robust method) and it's a pretty interesting question: developing robust methods for the PAF (as the median is robust for the mean under the Normal assumption). We have thought about it but we don't have the resources to pursue that right now; if you do, please let us know what you find!

As you might tell from my month-and-a-half-late response, I don't have the time either. However, I need to correct you on one thing. I do NOT think the normal is robust! If anything, I think it's fairly sensitive to specification, and that it is the gamma that is robust for practical purposes. As for developing robust methods for the PAF, I think your empirical method is exactly that. The only question (that neither of us has time to explore T.T) is how much more robust it is than current methods.

The original point for this thread was mainly on the salesmanship aspect of the example provided. Will a researcher think it's worth switching methods from that example? For me, the scenario seemed too unrealistic so I didn't buy it (but this is just me).

RodrigoZepeda commented 7 years ago

> maybe using empirical method (or fitting data as is to distribution and using traditional method for that matter) is not the best approach and one could do better by using some ad-hoc probably-not-satisfactory method to shrink the data (or standard deviation estimate) to account for known bias

To me the best approach (idea for paper in 2020) is to use a Bayesian Empirical Method which could account for this bias. However, as that is currently out of our reach (at least of what our bosses want us to do) we are left with (I believe) three choices (feel free to add more):

  1. A parametric frequentist analysis
  2. A parametric Bayesian analysis
  3. The empirical method

Notice that for 1. and 2. it is necessary to either know the exact distribution of the data or to devise a statistical procedure to infer the distribution (which might end up being empirical too if choosing, say, a kernel).

In particular, for the Bayesian analysis what is always known is the a priori distribution of either the exposure or the PAF. However, if the exposure also has covariates, researchers have to establish the joint distribution, which can be difficult in practice.

Assuming all of this has been done the critique is the same that we have cited from (the Bayesian) Berger: choosing the a priori distribution for the variable of interest can strongly skew the result and that is why a priori distributions are chosen in the parameter space (hierarchical models).

All of this could be done in the presence of complete data (as you have suggested). We have compared by simulation results from the "accurate" Bayesian method and the empirical method and the difference in the results is negligible (< 0.001) so the empirical seems at least as accurate.

However, studies such as GBD only have the exposure distribution for the US (from NHANES) and assume the distribution is the same for the other 99 countries! If the Bayesian approach were taken, one should either create a Bayesian hierarchical model for each country using the complete data from that country, or a global hierarchical Bayesian model with geographic information. The information necessary for those models is not available (and it seems from the GBD article that it wasn't even available to the authors).

In that case where the Bayesian method is not possible I think the (maybe only?) approach is the empirical one. What do you think?

RodrigoZepeda commented 7 years ago

> For me, the scenario seemed too unrealistic so I didn't buy it (but this is just me).

Thanks! We appreciate your input on the salesmanship of the method. We have changed the example and we'll probably send it to you by the end of the month.

I think I am speaking for both Dalia and myself in saying that we are desperate to find a way to sell the method. Any input you can give us is greatly appreciated.

RodrigoZepeda commented 7 years ago

After giving more thought to this, the Bayesian analysis should not be a problem provided we have an a posteriori distribution.

The main reason I am skeptical about the current method is that only an a priori distribution is used, and the new information is not used to create an a posteriori distribution (in the usual Bayesian sense, using the likelihood).

What the method does is update the a priori with the parameters of the a posteriori, but the mathematical reasoning behind this is not clear (at least to me). Can this be generalized? (I think not, as not all parametric families are conjugate.)

It might well be the case that for this problem the results are practically the same. But the possibility that they are not, and the fact that there is no proof either way, is what troubles me.
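For contrast, here is a minimal sketch of an update "in the usual Bayesian sense" in a case where conjugacy does hold. It is a toy setup chosen for illustration, not the method under discussion: exposure is assumed Poisson with rate lam, the prior on lam is a Gamma(1, 1) picked arbitrarily, and the relative risk is RR(x) = exp(theta * x) with a made-up theta. Under those assumptions E[RR] = exp(lam * (exp(theta) - 1)), so each posterior draw of lam gives a draw of the PAF.

```python
import numpy as np

rng = np.random.default_rng(42)
theta = 0.1  # hypothetical log relative risk per unit of exposure

# Hypothetical survey data: Poisson-distributed exposure counts.
x = rng.poisson(3.0, size=500)

# Prior on the Poisson rate: Gamma(shape, rate), chosen arbitrarily here.
prior_shape, prior_rate = 1.0, 1.0

# Conjugate update: prior times Poisson likelihood is again a gamma.
post_shape = prior_shape + x.sum()
post_rate = prior_rate + x.size

# Posterior draws of the rate (NumPy parameterizes the gamma by scale = 1 / rate).
lam_draws = rng.gamma(shape=post_shape, scale=1.0 / post_rate, size=10_000)

# For Poisson exposure, E[exp(theta * X)] = exp(lam * (exp(theta) - 1)),
# so PAF = 1 - 1 / E[RR] = 1 - exp(-lam * (exp(theta) - 1)).
paf_draws = 1.0 - np.exp(-lam_draws * (np.exp(theta) - 1.0))

# Empirical PAF computed from the same data, for comparison.
paf_emp = 1.0 - 1.0 / np.mean(np.exp(theta * x))

low, high = np.percentile(paf_draws, [2.5, 97.5])
print(f"posterior mean PAF: {paf_draws.mean():.4f} (95% interval {low:.4f} to {high:.4f})")
print(f"empirical PAF:      {paf_emp:.4f}")
```

The closed-form posterior exists here only because the gamma prior is conjugate to the Poisson likelihood. For a non-conjugate pair the posterior would need numerical integration or MCMC, which is the generalization problem raised above.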

RodrigoZepeda commented 7 years ago

Could