Sampling terminology is incorrect

skolenik commented 1 year ago

I am a sampling statistician. Your terminology is not compatible with any of the existing sampling books.

What you call complete random sampling is hard to name, but the term you came up with is not used in sampling work. What you call simple random sampling is called Poisson sampling. The definition of simple random sampling is that all samples of size n from the finite population of size N have equal probabilities of selection. Equal probabilities of selection of individual units is not a sign of SRS; when I taught sampling to grad students, an exercise I used to have in the sampling section of the survey statistics class was to came up with multiple designs that are equal probability but not SRS (1 for C -- you need to know at least one to understand what SRS is vs. what it isn't, 3 designs for B, 5 different designs for A).

I understand that you are deep in development with a lot of parts that have already been documented, exposed, put on the website, published, etc. Still... this is fundamentals.

A standard introductory reference on sampling is Lohr (3rd edition 2022). The next level is probably Kish (1965). The graduate level textbooks are Valliant Dever Kreuter (2018) accompanied by library(PracTools), and Tille 2006 accompanied by library(sampling) written by his student.

A technical problem that you have is that the unequal probability sampling you use is not a proper probability proportional to size (PPS) sample: you can simulate and see that probabilities of selection are not maintained. The problem is with the underlying base::sample(); proper procedures are implemented in library(sampling) by Alina Matei and Yves Tille, e.g. sampling::UPmaxentropy().

# example from Tille (2006)
set.seed(25)
# sum of probabilities == expected sample size == 3L
uneq_p = c(0.07,0.17,0.41,0.61,0.83,0.91)
sim_uneq_p_base <- as.data.frame(matrix(rep(0,6*20000),ncol=6))
for (k in 1:nrow(sim_uneq_p_base)) {
  this <- sample.int(n=length(uneq_p),size=sum(uneq_p),prob=uneq_p)
  sim_uneq_p_base[k,this] <- 1
}
colSums(sim_uneq_p_base)/nrow(sim_uneq_p_base)
uneq_p
# done right
sim_uneq_p_done_right <- as.data.frame(matrix(rep(0,6*20000),ncol=6))
for (k in 1:nrow(sim_uneq_p_done_right)) {
   sim_uneq_p_done_right[k,] <- sampling::UPmaxentropy(uneq_p)
}
colSums(sim_uneq_p_done_right)/nrow(sim_uneq_p_done_right)
uneq_p

nfultz commented 1 year ago

Just wanted to say thank you for your note.

I'm not entirely sure / don't remember where the names came from, they were already there when I worked on it in 2017 or so.

I think one of the desires was to use the same words for both random sampling and for random assignment when the procedures are analogous, but maybe that's not correct either, they probably just have different words. I'm pretty sure the assignment functions were written first.

I also think that if the sampling package does the more correct thing and has roughly the same API, there's a good argument to just use that and retire the old randomizr package, but I'm not very familiar with the sampling package myself. I don't have any skin in the game, just a thought.

EDIT:

I just checked Gerber and Green, pretty sure they started with simple and complete random assignment from chapter 2, then added functions using that nomenclature for sampling.

macartan commented 1 year ago

@neal yes that's where this terminology comes from here;
@stas it is indeed awkward that the complete_rs as implemented here is simple random sampling

I believe randomizr does not currently attempt a fixed sample size when unequal probabilities are specified (prob_unit cannot vary in complete_rs); however simple_rs does appear to get the right distribution with your example:

> replicate(1000000, 
+           randomizr::simple_rs(6, prob_unit = c(0.07,0.17,0.41,0.61,0.83,0.91))) |>
+     apply(1, mean) |> round(3)
[1] 0.070 0.171 0.410 0.610 0.831 0.910

so not clear to me there is a problem here for what's advertized

@stas thanks for the references and suggestions

DeclareDesign / randomizr

Sampling terminology is incorrect #99