Open momo54 opened 11 months ago
hi @momo54! I just caught this - let me know if you're still interested in an answer and I'll investigate, a react is enough (has been awhile since I digged into the theory) cheers
Yes i'm still interested. Uniform sampling can be costly compared to independant draws...
hi @momo54 ! got a chance to revisit the paper and all the theoretical assumptions. Just to make sure I got your definitions right, I am defining
These estimators are specifically meant to address the issue of skewed distributions and thus violate the principle of uniform distribution. If your population is uniform - you can use good-turing estimator to get a very decent cardinality estimation as per https://math.stackexchange.com/questions/75758/estimate-the-size-of-a-set-from-which-a-sample-has-been-equiprobably-drawn and https://arxiv.org/pdf/1508.06216.pdf . all these estimators in the package are designed to handle skew by modelling the underlying skewness in distribution in some way and then handling the skewness to provide a better estimation. The tough part is because each estimator has different assumptions about the underlying data, it's hard to pick exactly the right one.
For example Chao's estimator uses f1/f2 ratio (singletons to doubletons) to figure out what the "slope" of the skewness looks like - https://palaeo-electronica.org/2011_1/238/estimate.htm
in the paper Peter Haas et al. (1996), it is written "Both the new and existing estimators described in this section are based on a sample of n tuples selected randomly and uniformly from R, without replacement; we call such a sample a simple random sample." My question is : does simple random sample is required for all estimators, for just independant draws is enough, especially for horvitz_thompson_estimator ?
so I think what Peter Haas et al. (1996) is trying to say is - the population has N distinct objects, and each object has a property d that is an element of a set D, and we uniformly and randomly pick one. This is not a uniform distribution of the property d because d can be highly skewed.
both ideas are somewhat different concepts in my opinion - one is a method of sampling (uniformly and randomly sampling), while another is an assumption about how each sample is done (independent draws)
It is not clear if uniform sampling is required for horvitz_thompson_estimator, or if just independant draws is enough ??