Open yifudiao opened 7 years ago
@yifudiao - so sorry for the delay! Sampling without replacement has a different set of assumptions, and it definitely changes the math. Changing that line will make the prevalence estimate inaccurate. You would have to change your estimator for Horvitz-Thompson to Hansen-Hurwitz.
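To illustrate why the sampling scheme matters, here is a minimal simulation sketch. It is not ml_sampler's code: the population, the `scores` used to bias the draw, and the helper names `hansen_hurwitz_prevalence` and `simulate` are all hypothetical. It applies the same with-replacement (Hansen-Hurwitz style) estimator to draws made with and without replacement and compares the average estimate against the true prevalence.

```python
import numpy as np

def hansen_hurwitz_prevalence(labels, probs, population_size):
    """Prevalence estimate for draws made WITH replacement.

    labels: 0/1 review outcome of each drawn record
    probs:  single-draw selection probability of each drawn record
    """
    labels = np.asarray(labels, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # mean(y_i / p_i) estimates the population total; divide by N for prevalence.
    return np.mean(labels / probs) / population_size

rng = np.random.default_rng(0)
N = 1000
y = (rng.random(N) < 0.1).astype(float)   # hypothetical ground-truth labels
scores = np.where(y == 1, 5.0, 1.0)       # hypothetical bias toward positives
p = scores / scores.sum()                 # single-draw selection probabilities
true_prevalence = y.mean()

def simulate(replace, n=200, trials=1000):
    """Average the estimate over many repeated samples."""
    estimates = []
    for _ in range(trials):
        idx = rng.choice(N, size=n, replace=replace, p=p)
        estimates.append(hansen_hurwitz_prevalence(y[idx], p[idx], N))
    return float(np.mean(estimates))

with_replacement = simulate(replace=True)
without_replacement = simulate(replace=False)

print(f"true prevalence:          {true_prevalence:.4f}")
print(f"mean estimate, replace=T: {with_replacement:.4f}")
print(f"mean estimate, replace=F: {without_replacement:.4f}")
```

In this toy setup the heavily weighted positives get used up when drawing without replacement, so the unchanged estimator tends to underestimate prevalence, while the with-replacement average stays close to the true value.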
@spencebeecher Thanks for the reply!
ml_sampler currently implements the Hansen-Hurwitz estimator (according to a code comment). Do you mean changing the estimator from Hansen-Hurwitz to Horvitz-Thompson?
After I do that, do I need to change the estimated_variance() and estimated_confidence_interval() functions?
If you want to do that you would have to re-work the math (including confidence interval estimates). Alternatively, you could take a larger sample but de-dupe records when you review them (then apply that review to all records with the same ID).
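The de-dupe alternative could look roughly like the sketch below. Everything here is an assumption for illustration: the population, the selection probabilities `p`, and the `review` function (a stand-in for manual review) are hypothetical, and the estimate line is a generic with-replacement estimator rather than ml_sampler's own functions. The point is that each unique record is reviewed once, but its label is counted once per draw.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
record_ids = np.arange(N)

# Hypothetical single-draw selection probabilities (must sum to 1).
p = rng.random(N)
p /= p.sum()

# Take a larger sample WITH replacement; duplicates are expected.
draws = rng.choice(record_ids, size=300, replace=True, p=p)

# Review each unique record only once.
unique_ids = np.unique(draws)

def review(record_id):
    """Hypothetical stand-in for a human reviewer; returns a 0/1 label."""
    return int(record_id % 10 == 0)

labels_by_id = {int(i): review(int(i)) for i in unique_ids}

# Apply each review to every draw with the same ID, duplicates included,
# so the sample still looks like independent draws to the estimator.
labels = np.array([labels_by_id[int(i)] for i in draws], dtype=float)

# With-replacement estimate over ALL draws, not just the unique ones.
estimate = np.mean(labels / p[draws]) / N

print(f"draws: {len(draws)}, records reviewed: {len(unique_ids)}")
print(f"estimated prevalence: {estimate:.4f}")
```

The review cost scales with the number of unique records, while the math keeps the with-replacement assumptions intact.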
https://github.com/facebookincubator/ml_sampler/blob/bbed79cf0926ea7957a99ff99cd5b84b9c933662/ml_sampler/biased_sample.py#L57
The argument `replace=True` in `np.random.choice` makes the samples non-unique. Is this intended to make each draw independent? If I change it to `replace=False`, is the prevalence estimate still accurate?