facebookarchive / ml_sampler

Model assisted random sampling.

Why is biased sampling not returning unique samples #3

Open yifudiao opened 7 years ago

yifudiao commented 7 years ago

https://github.com/facebookincubator/ml_sampler/blob/bbed79cf0926ea7957a99ff99cd5b84b9c933662/ml_sampler/biased_sample.py#L57

The argument replace=True in np.random.choice means the samples are not unique. Is this intended, so that each draw is independent?

If I change it to replace=False, is the prevalence estimate still accurate?
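For readers unfamiliar with the behavior in question, here is a minimal sketch of how the `replace` flag affects uniqueness. It uses NumPy's `Generator.choice` rather than the legacy `np.random.choice` call in the linked line, and the weights, seed, and population of five IDs are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the illustration
# Hypothetical per-record selection probabilities (must sum to 1).
weights = np.array([0.6, 0.2, 0.1, 0.05, 0.05])

# With replacement: each draw is independent, so IDs can repeat.
with_replacement = rng.choice(5, size=20, replace=True, p=weights)

# Without replacement: every drawn ID is distinct, but the draws are
# no longer independent of one another.
without_replacement = rng.choice(5, size=5, replace=False, p=weights)

print(len(set(with_replacement.tolist())), "distinct IDs in 20 draws")
print(len(set(without_replacement.tolist())), "distinct IDs in 5 draws")
```

Twenty draws from five IDs must repeat at least one ID, which is the non-uniqueness described above.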

spencebeecher commented 7 years ago

@yifudiao - so sorry for the delay! Sampling without replacement has a different set of assumptions, and it definitely changes the math. Changing that line will make the prevalence estimate inaccurate. You would have to change your estimator from Horvitz-Thompson to Hansen-Hurwitz.

yifudiao commented 7 years ago

@spencebeecher Thanks for the reply!

ml_sampler currently implements the Hansen-Hurwitz estimator (according to the code comment). Do you mean changing the estimator from Hansen-Hurwitz to Horvitz-Thompson?

After I do that, do I need to change the estimated_variance() and estimated_confidence_interval() functions?
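As background for this question, the textbook Hansen-Hurwitz estimator for draws made with replacement can be sketched as below. This is not ml_sampler's code: the function name `hansen_hurwitz` and its arguments are hypothetical, and `probs` is assumed to hold each drawn record's single-draw selection probability (summing to 1 over the whole population).

```python
import numpy as np

def hansen_hurwitz(labels, probs, population_size):
    """Prevalence estimate and its estimated variance for n >= 2 draws
    made WITH replacement (duplicate draws are kept as separate rows)."""
    labels = np.asarray(labels, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n = labels.size
    # Each draw gives its own unbiased estimate of the prevalence.
    per_draw = labels / probs / population_size
    estimate = per_draw.mean()
    # Independence of the draws is what justifies this variance formula.
    variance = per_draw.var(ddof=1) / n
    return estimate, variance

# Uniform probabilities reduce to the plain sample mean.
print(hansen_hurwitz([1, 0, 1, 0], [0.25] * 4, population_size=4))
```

Both the estimate and the variance rely on the draws being independent, so a switch to replace=False would affect the variance and confidence interval calculations as well as the point estimate.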

spencebeecher commented 7 years ago

If you want to do that you would have to re-work the math (including confidence interval estimates). Alternatively, you could take a larger sample but de-dupe records when you review them (then apply that review to all records with the same ID).
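The de-dupe alternative might look roughly like this. It is a sketch under assumptions: `review_unique_then_expand` and `review_fn` are hypothetical names, with `review_fn` standing in for whatever manual review step produces a label for one record ID.

```python
def review_unique_then_expand(sampled_ids, review_fn):
    """Review each distinct ID once, then copy its label to every draw.

    sampled_ids: IDs drawn with replacement (may contain repeats).
    review_fn:   hypothetical callable mapping one ID to its label.
    Returns one label per draw, in the original draw order.
    """
    sampled_ids = list(sampled_ids)
    # One review per distinct record, however many times it was drawn.
    labels_by_id = {record_id: review_fn(record_id)
                    for record_id in set(sampled_ids)}
    # Keep every draw, duplicates included, so the with-replacement
    # estimator still sees the full sample.
    return [labels_by_id[record_id] for record_id in sampled_ids]

# Example: five draws, three distinct records to review.
truth = {1: 0, 2: 1, 3: 1}
print(review_unique_then_expand([3, 1, 3, 2, 1], truth.get))
```

The returned list has one entry per draw, so it can be passed unchanged to the existing with-replacement estimator while the reviewer only sees each record once.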