appliedepi / epiRhandbook_eng

The repository for the English version of the Epidemiologist R Handbook

Add to the survey section a component on selection of clusters with probability proportional to size #88

Open pbkeating opened 2 years ago

pbkeating commented 2 years ago

At MSF, we have an Excel tool that supports the selection of clusters with probability proportional to size, but this can also be done in R.

Here is a first attempt at doing this, with a sample dataset included for testing purposes. I've validated it using two datasets: one previously used in MSF activities, and one from this WHO document: https://www.who.int/tb/advisory_bodies/impact_measurement_taskforce/meetings/prevalence_survey/psws_probability_prop_size_bierrenbach.pdf

gen_data
--------------------------------------------------------------------------------
This section is for generating a fake dataset to test out the code.

```{r gen_data}
## Set seed so the fake dataset is reproducible
set.seed(50)

## Number of locations to select from
n <- 20

## Prefix
prefix <- "location "

## Suffix (1, 2, ..., n)
suffix <- seq_len(n)

## Combine to create basic cluster selection dataset
clusters <- data.frame(location_name = paste0(prefix, suffix),
                       location_population = sample(1000:25000, n, replace = TRUE))
```
read_data
--------------------------------------------------------------------------------
This section is for importing your actual location and population data.

```{r read_data, warning = FALSE, message = FALSE}

### Read in location and population data ---------------------------------------

## Excel file ------------------------------------------------------------------
## read in location data sheet (uncomment and edit the path to use your own file)
# clusters <- rio::import(here::here("03 Sampling files", "cluster_data.xlsx"),
#                         na = ".")
```
identify_clusters
--------------------------------------------------------------------------------
This section specifies or calculates the following:
- the total population in the survey area
- the number of clusters for the survey
- the sampling interval, which is the total population divided by the number of clusters in the survey
- the random starting point

These figures are combined in a for loop to identify the clusters to be surveyed.
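As a rough illustration of the steps listed above, the sketch below works through a tiny example that can be checked by hand. The five population figures, the object names `example`, `selection_points` and `hit`, and the fixed `random_start` of 120 are all assumptions made for illustration. It uses `findInterval()` rather than a loop, so it is an alternative way of expressing the same idea, not the code posted in this issue.

```r
## Hypothetical population figures for five locations (not real data)
example <- data.frame(
  location_name       = paste0("location ", 1:5),
  location_population = c(100, 300, 50, 250, 300),
  stringsAsFactors    = FALSE
)

## Number of clusters and sampling interval (1000 / 4 = 250)
cluster_number    <- 4
sampling_interval <- sum(example$location_population) / cluster_number

## Fixed here so the example can be followed by hand;
## in practice this would be drawn with sample(1:sampling_interval, 1)
random_start <- 120

## Selection points: the random start, then one point every sampling interval
selection_points <- random_start + sampling_interval * (seq_len(cluster_number) - 1)

## A point belongs to the first location whose cumulative population reaches it
cum_sum <- cumsum(example$location_population)
hit     <- findInterval(selection_points, cum_sum, left.open = TRUE) + 1

## Count how many selection points fall in each location
example$number_clusters <- tabulate(hit, nbins = nrow(example))
example
```

With these assumed figures the selection points are 120, 370, 620 and 870, so location 2 receives two clusters, locations 4 and 5 receive one each, and locations 1 and 3 receive none. Larger locations cover more of the cumulative range and are therefore more likely to contain a selection point.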
```{r identify_clusters}
## Set seed to ensure the random start remains the same each time
set.seed(50)

## Calculate total population
total_pop <- sum(clusters$location_population, na.rm = TRUE)

## Calculate cumulative sum of the population
clusters$cum_sum <- cumsum(clusters$location_population)

## Specify the number of clusters
cluster_number <- 10

## Calculate sampling interval and round it to the nearest whole number
sampling_interval <- round(total_pop / cluster_number, digits = 0)

## Select a random starting point between 1 and the sampling interval
random_start <- sample(1:sampling_interval, 1)

## Create empty columns to hold the results of the loop
clusters$number_clusters <- NA_integer_
clusters$cum_clusters    <- NA_integer_

## This for loop will identify the locations to survey
## number_clusters = clusters allocated to each location
## cum_clusters    = running total of clusters allocated so far
for (i in seq_len(nrow(clusters))) {
  if (i == 1) {
    clusters$number_clusters[i] <- as.integer((clusters$cum_sum[i] - random_start) / sampling_interval + 1)
    clusters$cum_clusters[i] <- clusters$number_clusters[i]
  } else {
    clusters$number_clusters[i] <- as.integer(((clusters$cum_sum[i] - random_start) / sampling_interval + 1) - clusters$cum_clusters[i - 1])
    clusters$cum_clusters[i] <- clusters$number_clusters[i] + clusters$cum_clusters[i - 1]
  }
}
```
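
Once each location has a `number_clusters` value, a list of the clusters to survey can be pulled out with a few lines of base R. The sketch below is one possible way to do this. It uses a small hypothetical data frame, `clusters_example`, with made-up values standing in for the loop output, and the object names `selected` and `cluster_list` are assumptions for illustration.

```r
## Hypothetical loop output for five locations (illustrative values only)
clusters_example <- data.frame(
  location_name    = paste0("location ", 1:5),
  number_clusters  = c(0L, 2L, 0L, 1L, 1L),
  stringsAsFactors = FALSE
)

## Keep only the locations that received at least one cluster
selected <- clusters_example[clusters_example$number_clusters > 0, ]

## One row per cluster, repeating locations that were selected more than once
cluster_list <- selected[rep(seq_len(nrow(selected)), selected$number_clusters),
                         "location_name", drop = FALSE]
rownames(cluster_list) <- NULL

## Check the total against the planned number of clusters (4 in this example)
sum(selected$number_clusters)

cluster_list
```

Comparing the total with the planned number of clusters is a useful check, because rounding the sampling interval can leave the total one cluster short or over.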
aspina7 commented 2 years ago

@AlexandreBlake just for info, in case you get round to sampling

AlexandreBlake commented 2 years ago

Thanks @pbkeating! I was planning to generate data for a sampling frame at some point. One less thing to do.

aspina7 commented 2 months ago

for https://github.com/appliedepi/epiRhandbook_eng/pull/102