chanedwin / pydistinct

Statistical estimators to predict a population's cardinality https://pydistinct.readthedocs.io/
MIT License
20 stars 1 forks source link

uniform sampling required for horvitz_thompson_estimator ? #4

Open momo54 opened 11 months ago

momo54 commented 11 months ago

It is not clear if uniform sampling is required for horvitz_thompson_estimator, or if just independant draws is enough ??

chanedwin commented 10 months ago

hi @momo54! I just caught this - let me know if you're still interested in an answer and I'll investigate, a react is enough (has been awhile since I digged into the theory) cheers

momo54 commented 10 months ago

Yes i'm still interested. Uniform sampling can be costly compared to independant draws...

chanedwin commented 10 months ago

hi @momo54 ! got a chance to revisit the paper and all the theoretical assumptions. Just to make sure I got your definitions right, I am defining

These estimators are specifically meant to address the issue of skewed distributions and thus violate the principle of uniform distribution. If your population is uniform - you can use good-turing estimator to get a very decent cardinality estimation as per https://math.stackexchange.com/questions/75758/estimate-the-size-of-a-set-from-which-a-sample-has-been-equiprobably-drawn and https://arxiv.org/pdf/1508.06216.pdf . all these estimators in the package are designed to handle skew by modelling the underlying skewness in distribution in some way and then handling the skewness to provide a better estimation. The tough part is because each estimator has different assumptions about the underlying data, it's hard to pick exactly the right one.

chanedwin commented 10 months ago

For example Chao's estimator uses f1/f2 ratio (singletons to doubletons) to figure out what the "slope" of the skewness looks like - https://palaeo-electronica.org/2011_1/238/estimate.htm

momo54 commented 10 months ago

in the paper Peter Haas et al. (1996), it is written "Both the new and existing estimators described in this section are based on a sample of n tuples selected randomly and uniformly from R, without replacement; we call such a sample a simple random sample." My question is : does simple random sample is required for all estimators, for just independant draws is enough, especially for horvitz_thompson_estimator ?

chanedwin commented 10 months ago

so I think what Peter Haas et al. (1996) is trying to say is - the population has N distinct objects, and each object has a property d that is an element of a set D, and we uniformly and randomly pick one. This is not a uniform distribution of the property d because d can be highly skewed.

chanedwin commented 10 months ago

both ideas are somewhat different concepts in my opinion - one is a method of sampling (uniformly and randomly sampling), while another is an assumption about how each sample is done (independent draws)