<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
gen_data
--------------------------------------------------------------------------------
This section generates a fake dataset for testing the code.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
```{r gen_data}
## set seed
set.seed(50)
## Number of locations to select from
n <- 20
## Prefix
prefix <- "location "
## Suffix
suffix <- seq_len(n)
## Combine to create basic cluster selection dataset
clusters <- data.frame(location_name = paste0(prefix, suffix),
location_population = sample(1000:25000, n, replace = TRUE))
```

<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
read_data
--------------------------------------------------------------------------------
This section imports your actual location and population data.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
```{r read_data, warning = FALSE, message = FALSE}
### Read in location and population data ---------------------------------------------------------------
## Excel file ------------------------------------------------------------------
## read in location data sheet
# clusters <- rio::import(here::here("03 Sampling files", "cluster_data.xlsx"),
#                         na = ".")
```

<!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
identify_clusters
--------------------------------------------------------------------------------
This section specifies or calculates the following:
- the total population in the survey area
- the number of clusters for the survey
- the sampling interval, which is the total population divided by the number of clusters in the survey
- the random starting point
These figures are then combined in a for loop to obtain the list of clusters to be surveyed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
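
The selection rule described in the comment above can be illustrated on a small worked example. This is a minimal sketch, not part of the template: the `toy` data frame, its populations, and the fixed starting point of 120 are all made up for illustration, and it uses base R's `findInterval()` instead of the for loop used in the chunk that follows. Each selection point is assigned to the first location whose cumulative population reaches it.

```r
## Hypothetical toy data: five locations with made-up populations
toy <- data.frame(location_name = paste0("location ", 1:5),
                  location_population = c(100, 300, 50, 250, 300))

## Fixed values so the example is reproducible without a seed
toy_cluster_number <- 4
toy_total_pop <- sum(toy$location_population)
toy_interval <- round(toy_total_pop / toy_cluster_number)
## Assumed starting point (in practice drawn at random from 1:toy_interval)
toy_start <- 120

## Selection points: start, start + interval, start + 2 * interval, ...
selection_points <- toy_start + toy_interval * (seq_len(toy_cluster_number) - 1)

## Cumulative population, as in the template
toy$cum_sum <- cumsum(toy$location_population)

## Index of the first location whose cumulative population is greater than
## or equal to each selection point
hit <- findInterval(selection_points, toy$cum_sum, left.open = TRUE) + 1

## Number of clusters allocated to each location
toy$number_clusters <- tabulate(hit, nbins = nrow(toy))
toy
```

With these made-up numbers the interval is 250, the selection points are 120, 370, 620 and 870, and the clusters per location come out as 0, 2, 0, 1, 1. Large locations can receive more than one cluster and small ones can receive none, which is what selection with probability proportional to size is expected to do.
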
```{r identify_clusters}
## Set seed to ensure the random start remains the same each time
set.seed(50)
## Calculate total population
total_pop <- sum(clusters$location_population, na.rm = TRUE)
## Calculate cumulative sum of the population
clusters$cum_sum <- cumsum(clusters$location_population)
## Specify the number of clusters
cluster_number <- 10
## Calculate sampling interval and round it to the nearest whole number
sampling_interval <- round(total_pop / cluster_number, digits = 0)
## Select a random starting point between 1 and the sampling interval
random_start <- sample(1:sampling_interval, 1)
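## Create empty columns to be filled in by the loop
clusters$number_clusters <- NA_integer_
clusters$cum_clusters <- NA_integer_
## This for loop will identify the locations to survey
for (i in seq_along(clusters$cum_sum)) {
  if (i == 1) {
    ## Number of selection points falling within the first location
    clusters$number_clusters[i] <- as.integer((clusters$cum_sum[i] - random_start) / sampling_interval + 1)
    ## Running total of clusters allocated so far
    clusters$cum_clusters[i] <- clusters$number_clusters[i]
  } else {
    ## Selection points reached so far, minus those already allocated
    clusters$number_clusters[i] <- as.integer((clusters$cum_sum[i] - random_start) / sampling_interval + 1 - clusters$cum_clusters[i - 1])
    clusters$cum_clusters[i] <- clusters$number_clusters[i] + clusters$cum_clusters[i - 1]
  }
}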
```

At MSF, we have an Excel tool that supports identification of clusters with probability proportional to size, but this can also be done in R.
This is a first attempt at doing so, with a sample dataset included for testing purposes. I've validated it using two datasets: one previously used in MSF activities and one from this WHO document: https://www.who.int/tb/advisory_bodies/impact_measurement_taskforce/meetings/prevalence_survey/psws_probability_prop_size_bierrenbach.pdf
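
When checking the output against another tool, one quick sanity check is that the clusters allocated across all locations add up to the number requested. Because the sampling interval is rounded, the allocated total could in principle differ from the requested number for some combinations of total population and random start. The sketch below is an assumption-laden illustration: `check_pps_allocation()` is a hypothetical helper that is not part of this template, and `example_allocation` is made-up data standing in for the `clusters` data frame.

```r
## Hypothetical helper, not part of the original template:
## warns if the allocated total differs from the requested number of clusters
## and returns only the locations that received at least one cluster
check_pps_allocation <- function(clusters, cluster_number) {
  allocated <- sum(clusters$number_clusters)
  if (allocated != cluster_number) {
    warning("Allocated ", allocated, " clusters but expected ", cluster_number)
  }
  clusters[clusters$number_clusters > 0, c("location_name", "number_clusters")]
}

## Made-up allocation for illustration only
example_allocation <- data.frame(
  location_name = paste0("location ", 1:5),
  number_clusters = c(0L, 2L, 0L, 1L, 1L)
)

selected <- check_pps_allocation(example_allocation, cluster_number = 4)
selected
```

Running the same helper on the `clusters` data frame with `cluster_number` would list the locations to visit and how many clusters to draw in each.
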