MarkEdmondson1234 / searchConsoleR

R interface with Google Search Console API v3, including Search Analytics.
http://code.markedmondson.me/searchConsoleR/

Idea - rate limiting #68

Open Leszek-Sieminski opened 3 years ago

Leszek-Sieminski commented 3 years ago

Hi Mark! Thanks a lot for your awesome package!

Actually I'm not experiencing a bug per se, but I might have an idea for improvement, namely rate limiting.

What goes wrong

I wrote wrapper functions around the {searchConsoleR} function search_analytics(). Unfortunately, despite my efforts, they sometimes exceed the API limits because walk_data = "byBatch" issues multiple requests (I provide an example below).

This is nasty because it depends solely on the website whose data I download: the bigger the website, the more data -> the more data, the more batch-level requests -> errors.

I believe the script exceeds the API limits because of the batching by 25k rows - there is no mechanism that slows down the requests at that level. My own functions have some very primitive rate limiting via Sys.sleep(x), but that operates at a different level and won't help (unless I use extremely big values, which would make downloading very inefficient...).

So I began wondering about using rate limiting (for example https://github.com/tarakc02/ratelimitr) inside the package's functions. This is not something I can address without editing your code at the search_analytics() level, so I thought I would ask first - what do you think about implementing some form of rate limiting?
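
To illustrate, something like this is roughly what I have in mind - just a sketch wrapping the exported function from the outside with {ratelimitr} (the rate values below are placeholders, not the real quota):

library(ratelimitr)
library(searchConsoleR)

# placeholder limit: at most 5 calls per 10 seconds - the real quota may differ
search_analytics_limited <- limit_rate(
  search_analytics,
  rate(n = 5, period = 10)
)

# search_analytics_limited() is then called exactly like search_analytics(),
# but excess calls wait instead of hitting the API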

Steps to reproduce the problem

gsc_download_1_day <- function(
  url_website, 
  date_downloaded, 
  dim1 = NULL, 
  dim2 = NULL) {

  gsc_data <- searchConsoleR::search_analytics(
    siteURL   = url_website,
    startDate = date_downloaded,
    endDate   = date_downloaded,
    dimensions = c("date", 
                   "device", 
                   dim1, 
                   dim2), 
    searchType = c("web"), 
    rowLimit   = 50000,
    walk_data  = c("byBatch")
  )

  Sys.sleep(3)
  return(gsc_data)
}

gsc_download_period <- function(
  url_website, 
  vec_of_dates, 
  dim1 = NULL, 
  dim2 = NULL) {

  x <- purrr::map_dfr(
    .x = vec_of_dates,
    .f = gsc_download_1_day,
    url_website = url_website,
    dim1 = dim1, 
    dim2 = dim2)

  return(x)
}

# some url
url <- "https://example.com"

# dates for which I download the data
vec_of_dates <- c("2020-01-01", ..., "2021-05-09")

df_gsc_data <- gsc_download_period(
      url_website     = url,
      vec_of_dates    = vec_of_dates,
      dim1            = "query",
      dim2            = NULL)

Expected output

# finished data frame "df_gsc_data"

Actual output

# (...)  
# Fetching search analytics for url: some_url dates: 2020-03-11 2020-03-11 dimensions: date device query dimensionFilterExp:  searchType: web aggregationType: auto
# Batching data via method: byBatch
# With rowLimit set to 1e+05 will need up to [5] API calls
# Page [1] of max [5] API calls
# Downloaded 25000 rows
# Page [2] of max [5] API calls
# Downloaded 16804 rows
# Page [3] of max [5] API calls
# Fetching search analytics for url: some_url dates: 2020-03-12 2020-03-12 dimensions: date device query dimensionFilterExp:  searchType: web aggregationType: auto
# Batching data via method: byBatch
# With rowLimit set to 1e+05 will need up to [5] API calls
# Page [1] of max [5] API calls
# ℹ 2021-05-13 09:13:21 > Request Status Code:  403
# Error: API returned: Search Analytics load quota exceeded. Learn about usage limits: https://developers.google.com/webmaster-tools/v3/limits.

Session Info

R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] reticulate_1.18      data.table_1.13.6    doParallel_1.0.16    iterators_1.0.13     foreach_1.5.1       
 [6] glue_1.4.2           dplyr_1.0.2          RMySQL_0.10.21       DBI_1.1.0            odbc_1.3.0          
[11] purrr_0.3.4          googleAuthR_1.3.1    searchConsoleR_0.4.0 futile.logger_1.4.3 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           compiler_4.0.5       pillar_1.4.7         formatR_1.7          futile.options_1.0.1
 [6] tools_4.0.5          digest_0.6.27        bit_4.0.4            lattice_0.20-41      jsonlite_1.7.2      
[11] memoise_1.1.0        gargle_0.5.0         lifecycle_0.2.0      tibble_3.0.4         pkgconfig_2.0.3     
[16] rlang_0.4.10         Matrix_1.3-2         cli_2.2.0            curl_4.3             httr_1.4.2          
[21] askpass_1.1          rappdirs_0.3.3       generics_0.1.0       fs_1.5.0             vctrs_0.3.6         
[26] hms_0.5.3            grid_4.0.5           bit64_4.0.5          tidyselect_1.1.0     R6_2.5.0            
[31] fansi_0.4.1          lambda.r_1.2.4       blob_1.2.1           magrittr_2.0.1       codetools_0.2-18    
[36] ellipsis_0.3.1       assertthat_0.2.1     renv_0.13.1          stringi_1.5.3        openssl_1.4.3       
[41] crayon_1.3.4        
Leszek-Sieminski commented 3 years ago

Hi! I took the liberty of rewriting your function to include my primitive rate-limiting approach (search for "experimental addition" in the code). So far it seems to be working, but I think a more sophisticated approach would certainly speed things up.

gsc_low_level_download <- function(
  siteURL,
  startDate = Sys.Date() - 93,
  endDate = Sys.Date() - 3,
  dimensions = NULL,
  searchType = c("web", "video", "image"),
  dimensionFilterExp = NULL,
  aggregationType = c("auto", "byPage", "byProperty"),
  rowLimit = 1000,
  prettyNames = TRUE,
  walk_data = c("byBatch", "byDate", "none"))
{
  if (!googleAuthR::gar_has_token()) {
    stop("Not authenticated. Run scr_auth()", call. = FALSE)
  }
  searchType <- match.arg(searchType)
  aggregationType <- match.arg(aggregationType)
  walk_data <- match.arg(walk_data)
  startDate <- as.character(startDate)
  endDate <- as.character(endDate)
  message("Fetching search analytics for ", paste("url:",
                                                  siteURL, "dates:", startDate, endDate, "dimensions:",
                                                  paste(dimensions, collapse = " ", sep = ";"), "dimensionFilterExp:",
                                                  paste(dimensionFilterExp, collapse = " ", sep = ";"),
                                                  "searchType:", searchType, "aggregationType:", aggregationType))
  siteURL <- searchConsoleR:::check.Url(siteURL, reserved = T)
  if (any(is.na(as.Date(startDate, "%Y-%m-%d")), is.na(as.Date(endDate,
                                                               "%Y-%m-%d")))) {
    stop("dates not in correct %Y-%m-%d format. Got these:",
         startDate, " - ", endDate)
  }
  if (any(as.Date(startDate, "%Y-%m-%d") > Sys.Date() - 3,
          as.Date(endDate, "%Y-%m-%d") > Sys.Date() - 3)) {
    warning("Search Analytics usually not available within 3 days (96 hrs) of today(",
            Sys.Date(), "). Got:", startDate, " - ", endDate)
  }
  if (!is.null(dimensions) && !dimensions %in% c("date", "country",
                                                 "device", "page", "query", "searchAppearance")) {
    stop("dimension must be NULL or one or more of 'date','country', 'device', 'page', 'query', 'searchAppearance'.\n         Got this: ",
         paste(dimensions, sep = ", "))
  }
  if (!searchType %in% c("web", "image", "video")) {
    stop("searchType not one of \"web\",\"image\",\"video\".  Got this: ",
         searchType)
  }
  if (!aggregationType %in% c("auto", "byPage", "byProperty")) {
    stop("aggregationType not one of \"auto\",\"byPage\",\"byProperty\". Got this: ",
         aggregationType)
  }
  if (aggregationType %in% c("byProperty") && "page" %in%
      dimensions) {
    stop("Can't aggregate byProperty and include page in dimensions.")
  }
  if (walk_data == "byDate") {
    message("Batching data via method: ", walk_data)
    message("Will fetch up to 25000 rows per day")
    rowLimit <- 25000
  }
  else if (walk_data == "byBatch") {
    if (rowLimit > 25000) {
      message("Batching data via method: ", walk_data)
      message("With rowLimit set to ", rowLimit, " will need up to [",
              (rowLimit%/%25000) + 1, "] API calls")
      rowLimit0 <- rowLimit
      rowLimit <- 25000
    }
    else {
      walk_data <- "none"
    }
  }
  parsedDimFilterGroup <- lapply(dimensionFilterExp, searchConsoleR:::parseDimFilterGroup)
  body <- list(
    startDate = startDate,
    endDate = endDate,
    dimensions = as.list(dimensions),
    searchType = searchType,
    dimensionFilterGroups = list(list(
      groupType = "and",
      filters = parsedDimFilterGroup)),
    aggregationType = aggregationType,
    rowLimit = rowLimit)

  search_analytics_g <- googleAuthR::gar_api_generator(
    "https://www.googleapis.com/webmasters/v3/",
    "POST", path_args = list(sites = "siteURL", searchAnalytics = "query"),
    data_parse_function = searchConsoleR:::parse_search_analytics)

  options(googleAuthR.batch_endpoint = "https://www.googleapis.com/batch/webmasters/v3")
  if (walk_data == "byDate") {
    if (!"date" %in% dimensions) {
      warning("To walk data per date requires 'date' to be one of the dimensions. Adding it")
      dimensions <- c("date", dimensions)
    }
    walk_vector <- seq(as.Date(startDate), as.Date(endDate), 1)

    out <- googleAuthR::gar_batch_walk(search_analytics_g,
                                       walk_vector = walk_vector,
                                       gar_paths   = list(sites = siteURL),
                                       body_walk   = c("startDate", "endDate"),
                                       the_body    = body,
                                       batch_size  = 1,
                                       dim         = dimensions)
  }
  else if (walk_data == "byBatch") {
    walk_vector <- seq(0, rowLimit0, 25000)
    do_it <- TRUE
    i <- 1
    pages <- list()
    while (do_it) {
      message("Page [", i, "] of max [", length(walk_vector),
              "] API calls")
      this_body <- utils::modifyList(body, list(startRow = walk_vector[i]))
      this_page <- search_analytics_g(the_body = this_body,
                                      list(sites = siteURL), dim = dimensions)

      # experimental addition ###########################################################################################
      Sys.sleep(3)
      # #################################################################################################################

      if (all(is.na(this_page[[1]]))) {
        do_it <- FALSE
      }
      else {
        message("Downloaded ", nrow(this_page), " rows")
        pages <- rbind(pages, this_page)
      }
      i <- i + 1
      if (i > length(walk_vector)) {
        do_it <- FALSE
      }
    }
    out <- pages
  }
  else {
    out <- search_analytics_g(the_body = body, path_arguments = list(sites = siteURL),
                              dim = dimensions)
  }
  out
}
MarkEdmondson1234 commented 3 years ago

Thanks, this is great. I have noticed some rate limiting recently too, which I think is a new thing. I'd like to get an idea of exactly what the limits are so the pauses aren't longer than necessary - other Google APIs have limits like 1,000 requests per 100 seconds, but I can't find any documentation for the Search Console one.
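
Until we know the exact numbers, something like this might tide things over - just a sketch (not part of the package) that only backs off when the quota error from your log actually appears; the wait times are guesses:

# retry a call with a growing pause whenever a quota error is returned
with_quota_retry <- function(fun, max_tries = 5, base_wait = 10) {
  last_error <- NULL
  for (i in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (!grepl("quota", conditionMessage(result), ignore.case = TRUE)) {
      stop(result)  # not a quota problem - rethrow immediately
    }
    last_error <- result
    message("Quota hit, waiting ", base_wait * i, "s before retry ", i, "/", max_tries)
    Sys.sleep(base_wait * i)
  }
  stop(last_error)
}

# hypothetical usage with the wrapper from the first post:
# df <- with_quota_retry(function() gsc_download_1_day(url, "2020-03-12", dim1 = "query"))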

Leszek-Sieminski commented 3 years ago

I didn't know much about it until recently either, especially in R. I found {ratelimitr} by accident a month ago or so :) I've never used it so far, but it looks promising.

Regarding the GSC API quotas, I've found something like this for v3:

Per-site quota (calls querying the same site):

  • 50 QPS
  • 1,200 QPM

Per-user quota (calls made by the same user):

  • 50 QPS
  • 1,200 QPM

Per-project quota (calls made using the same Developer Console key):

  • 100,000,000 QPD

Source: https://developers.google.com/webmaster-tools/search-console-api-original/v3/limits

I think it should be possible to implement some form of rate limiting for the per-site and per-user quotas, as those two are the ones the R code explicitly hits.
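
For example, {ratelimitr} accepts several rate() rules on one function, so both documented quotas could be encoded at once - just a sketch using the figures above:

library(ratelimitr)

# 50 QPS and 1,200 QPM from the v3 limits page, applied together
sa_limited <- limit_rate(
  searchConsoleR::search_analytics,
  rate(n = 50,   period = 1),
  rate(n = 1200, period = 60)
)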

On the other hand, this would require some elasticity and control for the user, because it's entirely possible that a user's quota is being consumed by two different tools: one in R and one outside it (so the limiting wouldn't work properly).

Still, I believe a "greedy mode" that assumes there are no other such processes would prove very useful for users, so hopefully you will consider it worth trying :)
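
Something like an opt-in argument could leave that decision with the user - a hypothetical sketch (the rate_per_minute argument and the wrapper are mine, nothing that exists in the package):

# build a possibly-limited copy of search_analytics();
# rate_per_minute = NULL means "greedy mode" - no limiting at all
make_search_analytics <- function(rate_per_minute = NULL) {
  f <- searchConsoleR::search_analytics
  if (is.null(rate_per_minute)) {
    return(f)
  }
  ratelimitr::limit_rate(f, ratelimitr::rate(n = rate_per_minute, period = 60))
}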

MarkEdmondson1234 commented 3 years ago

Hmm, but the limiting I've seen is a lot more restrictive than the figures given above - it says the per-user and per-site quota limit is 20 API calls per second. I've sometimes had to add a good 10 seconds between some of my scheduled calls.

We are way under the global per-project quota (I see ~200,000 requests in the last 30 days), so I don't think people using their own client.id will help either.

The errors look to have increased since May 2nd - before then, peaks of 0.6 queries per second were fine, but after that date more queries are failing.

[Screenshot 2021-05-19 at 08:35:49: queries-per-second chart]
Leszek-Sieminski commented 3 years ago

OK, this is indeed very strange. If your credentials had leaked, I think it would show up in the traffic plot, so it looks more like a problem on Google's infrastructure side. Still, it would be nice if they issued a warning, or at least informed users that there is an issue and that they are addressing it.

MarkEdmondson1234 commented 3 years ago

Yes, it looks like something changed on May 2nd; I've flagged it up to them. It's not a case of leaked credentials - all users use the credentials above that come with the package (unless they go out of their way to specify their own via googleAuthR::gar_set_client()). That's not usually an issue, since the GCP project isn't connected to any paid resources or billing account, and until recently there were no problems with quotas.
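
For anyone who does want to use their own project, it looks roughly like this before authenticating (the file path and scope here are only illustrative):

# hypothetical example: point googleAuthR at your own OAuth client, then authenticate
googleAuthR::gar_set_client(
  json   = "path/to/my-gcp-client.json",
  scopes = "https://www.googleapis.com/auth/webmasters"
)
searchConsoleR::scr_auth()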

Leszek-Sieminski commented 3 years ago

From my observations, the errors start to appear above roughly 0.10 - 0.13 queries per second.
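
That works out to roughly one request every 10 seconds (1 / 0.10), so as a crude stopgap the pause in the batch loop could be raised from Sys.sleep(3) to something like:

Sys.sleep(10)  # ~1 / 0.10 qps, based only on my observation above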