WorldHealthOrganization / godata-r-reports

R scripts connected to Go.Data API
MIT License

R Reporting - API Scripts - functions to create loops for batched import of records #1

Open sarahollis opened 3 years ago

sarahollis commented 3 years ago

To avoid API timeouts, the scripts need to be adapted to loop through records in pre-defined chunk sizes. Example of the rationale below:

chunk_size <- 10000
base_url <- "url"
chunk_num <- 0
data_instance <- pull_data_func() # first chunk of records
data_out <- data_instance

while (nrow(data_instance) == chunk_size) {
  # a full chunk came back, so there may be more records; skip past what we already have
  chunk_num <- chunk_num + chunk_size
  temp_url <- paste0(base_url, "?$skip=", chunk_num, "&$limit=", chunk_size)
  data_instance <- get(temp_url) # placeholder for the actual GET call
  data_out <- bind_rows(data_out, data_instance)
}

jamesfuller-cdc commented 3 years ago

In the James branch, published 8/4/2021, the approach is as follows:

  1. get total number of records using the 'count' API endpoint
  2. create an empty data frame
  3. define your 'batch size'
  4. using a 'while' loop, iterate through and download records in batches, then append those new records to the existing data frame. The GET call must incorporate both limit and skip filters.

Remaining issues: this still times out with extremely large datasets. The test instance has over 1 million follow-up records, and the iterative process still times out after 800 or 900 thousand records. The resulting error code is 524 (a gateway timeout), and I've reached out to Clarisoft via JIRA to get some help.
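One possible workaround for the 524 timeouts, pending a fix from Clarisoft, is to retry each failed batch with exponential backoff. A rough sketch, assuming httr is loaded and that url, outbreak_id, and get_access_token() are defined as in the scripts below; the helper name and retry policy are illustrative, not part of the repo:

# Hypothetical helper: retry one batched GET with exponential backoff on server errors.
get_batch_with_retry <- function(endpoint, skip, limit, max_tries = 5) {
  for (try_n in seq_len(max_tries)) {
    resp <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/", endpoint,
                       "/?filter={%22limit%22:", format(limit, scientific = FALSE),
                       ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                add_headers(Authorization = paste("Bearer", get_access_token())))
    if (status_code(resp) < 500) return(resp) # success or client error: stop retrying
    message("Got ", status_code(resp), "; retrying in ", 2^try_n, " seconds")
    Sys.sleep(2^try_n) # back off before the next attempt
  }
  stop("Batch starting at ", skip, " failed after ", max_tries, " attempts")
}

Each batch would then be fetched via get_batch_with_retry("cases", skip, batch_size) instead of a bare GET, so a single 524 does not abort the whole import.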

###################################################################################################
# GET CASES
###################################################################################################

# required packages (assumed): httr for API calls, jsonlite for parsing, dplyr/tibble for data handling;
# get_access_token() is defined elsewhere in these scripts
library(httr)
library(jsonlite)
library(dplyr)

#get total number of cases
cases_n <- GET(paste0(url,"api/outbreaks/",outbreak_id,"/cases/count"), 
               add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
  content(as="text") %>% fromJSON(flatten=TRUE) %>% unlist() %>% unname()

#Import cases in batches
cases <- tibble()
batch_size <- 50000 # number of records to import per iteration
skip <- 0
while (skip < cases_n) {
  message("********************************")
  message(paste0("Importing records ", format(skip+1, scientific = FALSE), " to ", format(skip+batch_size, scientific = FALSE)))
  cases.i <- GET(paste0(url,"api/outbreaks/",outbreak_id,"/cases",
                      "/?filter={%22limit%22:",format(batch_size, scientific = FALSE),",%22skip%22:",format(skip, scientific = FALSE),"}"), 
               add_headers(Authorization = paste("Bearer", get_access_token(), sep = " "))) %>%
    content(as='text') %>%
    fromJSON( flatten=TRUE) %>%
    as_tibble()
  message(paste0("Imported ", format(nrow(cases.i), scientific = FALSE)," records"))
  cases <- cases %>% bind_rows(cases.i)
  skip <- skip + batch_size
  message(paste0("Data Frame now has ", format(nrow(cases), scientific = FALSE), " records"))
  rm(cases.i)
}
rm(batch_size, skip, cases_n)
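Since the same loop is needed for contacts and follow-ups, the pattern above could be factored into one function parameterized by endpoint. A sketch under the same assumptions (url, outbreak_id, and get_access_token() in scope, httr/jsonlite/dplyr loaded); the function name import_batched is hypothetical:

# Sketch: generic batched importer for any outbreak sub-endpoint (cases, contacts, follow-ups)
import_batched <- function(endpoint, batch_size = 50000) {
  # total record count from the endpoint's /count route
  n <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/", endpoint, "/count"),
           add_headers(Authorization = paste("Bearer", get_access_token()))) %>%
    content(as = "text") %>% fromJSON(flatten = TRUE) %>% unlist() %>% unname()
  out <- tibble()
  skip <- 0
  while (skip < n) {
    # request one batch using the limit/skip filter, then append it
    batch <- GET(paste0(url, "api/outbreaks/", outbreak_id, "/", endpoint,
                        "/?filter={%22limit%22:", format(batch_size, scientific = FALSE),
                        ",%22skip%22:", format(skip, scientific = FALSE), "}"),
                 add_headers(Authorization = paste("Bearer", get_access_token()))) %>%
      content(as = "text") %>% fromJSON(flatten = TRUE) %>% as_tibble()
    out <- bind_rows(out, batch)
    skip <- skip + batch_size
  }
  out
}

The cases block above would then reduce to cases <- import_batched("cases"), and the same call covers the other record types.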