Leszek-Sieminski opened this issue 3 years ago
Hi! I took the liberty of rewriting your function to include my primitive rate-limiting approach (search for "experimental addition" in the code). So far it seems to work, but I think a more sophisticated approach would certainly speed things up.
gsc_low_level_download <- function(
siteURL,
startDate = Sys.Date() - 93,
endDate = Sys.Date() - 3,
dimensions = NULL,
searchType = c("web", "video", "image"),
dimensionFilterExp = NULL,
aggregationType = c("auto", "byPage", "byProperty"),
rowLimit = 1000,
prettyNames = TRUE,
walk_data = c("byBatch", "byDate", "none"))
{
if (!googleAuthR::gar_has_token()) {
stop("Not authenticated. Run scr_auth()", call. = FALSE)
}
searchType <- match.arg(searchType)
aggregationType <- match.arg(aggregationType)
walk_data <- match.arg(walk_data)
startDate <- as.character(startDate)
endDate <- as.character(endDate)
message("Fetching search analytics for ", paste("url:",
siteURL, "dates:", startDate, endDate, "dimensions:",
paste(dimensions, collapse = " ", sep = ";"), "dimensionFilterExp:",
paste(dimensionFilterExp, collapse = " ", sep = ";"),
"searchType:", searchType, "aggregationType:", aggregationType))
siteURL <- searchConsoleR:::check.Url(siteURL, reserved = T)
if (any(is.na(as.Date(startDate, "%Y-%m-%d")), is.na(as.Date(endDate,
"%Y-%m-%d")))) {
stop("dates not in correct %Y-%m-%d format. Got these:",
startDate, " - ", endDate)
}
if (any(as.Date(startDate, "%Y-%m-%d") > Sys.Date() - 3,
as.Date(endDate, "%Y-%m-%d") > Sys.Date() - 3)) {
warning("Search Analytics usually not available within 3 days (96 hrs) of today(",
Sys.Date(), "). Got:", startDate, " - ", endDate)
}
    if (!is.null(dimensions) && !all(dimensions %in% c("date", "country",
        "device", "page", "query", "searchAppearance"))) {
        stop("dimension must be NULL or one or more of 'date','country', 'device', 'page', 'query', 'searchAppearance'.\n Got this: ",
            paste(dimensions, collapse = ", "))
    }
if (!searchType %in% c("web", "image", "video")) {
stop("searchType not one of \"web\",\"image\",\"video\". Got this: ",
searchType)
}
if (!aggregationType %in% c("auto", "byPage", "byProperty")) {
stop("aggregationType not one of \"auto\",\"byPage\",\"byProperty\". Got this: ",
aggregationType)
}
if (aggregationType %in% c("byProperty") && "page" %in%
dimensions) {
stop("Can't aggregate byProperty and include page in dimensions.")
}
if (walk_data == "byDate") {
message("Batching data via method: ", walk_data)
message("Will fetch up to 25000 rows per day")
rowLimit <- 25000
}
else if (walk_data == "byBatch") {
if (rowLimit > 25000) {
message("Batching data via method: ", walk_data)
message("With rowLimit set to ", rowLimit, " will need up to [",
(rowLimit%/%25000) + 1, "] API calls")
rowLimit0 <- rowLimit
rowLimit <- 25000
}
else {
walk_data <- "none"
}
}
parsedDimFilterGroup <- lapply(dimensionFilterExp, searchConsoleR:::parseDimFilterGroup)
body <- list(
startDate = startDate,
endDate = endDate,
dimensions = as.list(dimensions),
searchType = searchType,
dimensionFilterGroups = list(list(
groupType = "and",
filters = parsedDimFilterGroup)),
aggregationType = aggregationType,
rowLimit = rowLimit)
search_analytics_g <- googleAuthR::gar_api_generator(
"https://www.googleapis.com/webmasters/v3/",
"POST", path_args = list(sites = "siteURL", searchAnalytics = "query"),
data_parse_function = searchConsoleR:::parse_search_analytics)
options(googleAuthR.batch_endpoint = "https://www.googleapis.com/batch/webmasters/v3")
if (walk_data == "byDate") {
if (!"date" %in% dimensions) {
warning("To walk data per date requires 'date' to be one of the dimensions. Adding it")
dimensions <- c("date", dimensions)
}
walk_vector <- seq(as.Date(startDate), as.Date(endDate), 1)
out <- googleAuthR::gar_batch_walk(search_analytics_g, walk_vector = walk_vector,
gar_paths = list(sites = siteURL), body_walk = c("startDate",
"endDate"), the_body = body, batch_size = 1,
dim = dimensions)
}
else if (walk_data == "byBatch") {
walk_vector <- seq(0, rowLimit0, 25000)
do_it <- TRUE
i <- 1
pages <- list()
while (do_it) {
message("Page [", i, "] of max [", length(walk_vector),
"] API calls")
this_body <- utils::modifyList(body, list(startRow = walk_vector[i]))
this_page <- search_analytics_g(the_body = this_body,
list(sites = siteURL), dim = dimensions)
            # experimental addition: crude rate limiting - pause between successive batch requests ############
            Sys.sleep(3)
            # #################################################################################################
if (all(is.na(this_page[[1]]))) {
do_it <- FALSE
}
else {
message("Downloaded ", nrow(this_page), " rows")
pages <- rbind(pages, this_page)
}
i <- i + 1
if (i > length(walk_vector)) {
do_it <- FALSE
}
}
out <- pages
}
else {
out <- search_analytics_g(the_body = body, path_arguments = list(sites = siteURL),
dim = dimensions)
}
out
}
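For reference, this is roughly how I call it - the property URL, dimensions and row limit below are just placeholders for illustration, not my real setup:

library(searchConsoleR)

# authenticate first, since the function itself only checks for an existing token
scr_auth()

df_gsc_data <- gsc_low_level_download(
    siteURL    = "https://www.example.com/",
    dimensions = c("date", "page", "query"),
    walk_data  = "byBatch",
    rowLimit   = 100000   # > 25000, so the byBatch loop makes several API calls
)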
Thanks, this is great. I have noticed some rate limiting recently too, which I think is a new thing. I'd like to get an idea of exactly what the limits are, so the pauses aren't longer than they need to be - other Google APIs have limits like 1,000 requests per 100 seconds, but I can't find any documentation for the Search Console one.
I didn't know much about it until recently either, especially in R. I found this package by accident a month ago or something :) I haven't used it yet, but it looks promising.
Regarding the GSC API quotas, I've found something like this for v3:
Per-site quota (calls querying the same site):
- 50 QPS
- 1,200 QPM
Per-user quota (calls made by the same user):
- 50 QPS
- 1,200 QPM
Per-project quota (calls made using the same Developer Console key):
- 100,000,000 QPD
Source: https://developers.google.com/webmaster-tools/search-console-api-original/v3/limits
I think it should be possible to implement some form of rate limiting for the per-site and per-user quotas, as those two are the ones the R code explicitly exercises.
On the other hand, this would require some form of elasticity and control for the user, because it's entirely possible that a user's quota is being consumed by two different tools: one in R and one outside it (so the limiting wouldn't work properly).
Still, I believe a "greedy mode" that assumes there are no other such processes would prove very useful for users, so hopefully you'll consider it worth trying :)
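For example, a minimal sketch of that "greedy mode" with {ratelimitr}, assuming the documented per-user quota of 1,200 QPM (i.e. 20 queries per second) and that no other tool is consuming it - the wrapper name gsc_query_limited is made up for illustration:

library(ratelimitr)
library(searchConsoleR)

# stay a little below the documented 20 queries/second (1,200 QPM / 60)
gsc_query_limited <- limit_rate(
    search_analytics,
    rate(n = 15, period = 1)   # at most 15 calls per rolling second
)

# used exactly like search_analytics(); calls over the limit simply wait
# df <- gsc_query_limited("https://www.example.com/", dimensions = "query")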
Hmm, but the limiting I've seen is a lot more restrictive than the numbers above - it says the per-user and per-site quota limit is 20 API calls per second. I've had to add a good 10 seconds between some of my scheduled calls.
We are way under the global per-project quota (I see ~200,000 requests in the last 30 days), so I don't think people using their own client.id will help either.
The errors look to have increased recently, from May 2nd - before then, peaks of 0.6 queries per second were fine, but after that date more queries are failing.
OK, this is indeed very strange. If your credentials had leaked, it would show up in the traffic plot, I think, so it looks more like a problem on Google's infrastructure side. Still, it would be nice if they issued a warning, or at least informed users that there is an issue and that they are addressing it.
Yes, it looks like something changed from May 2nd; I've flagged it up to them. It's not a case of leaked credentials - all users use the credentials above that come with the package (unless they go out of their way to specify their own via googleAuthR::gar_set_client()), but that's not usually an issue, since the GCP project isn't connected to any paid resources or billing account, and up until recently there was no issue with quotas.
From my observation, the errors start to appear above roughly 0.10-0.13 queries per second.
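If ~0.10-0.13 queries per second really is the practical ceiling, then calls need to be roughly 8-10 seconds apart. A minimal sketch of how the byBatch loop could enforce that instead of a fixed Sys.sleep(3) - make_throttle is a made-up helper, and the 10-second gap is just 1 / 0.10 from the observation above:

# returns a function that sleeps just long enough to keep successive
# calls at least `min_gap` seconds apart
make_throttle <- function(min_gap = 10) {
    last <- Sys.time() - min_gap
    function() {
        elapsed <- as.numeric(difftime(Sys.time(), last, units = "secs"))
        if (elapsed < min_gap) Sys.sleep(min_gap - elapsed)
        last <<- Sys.time()
    }
}

throttle <- make_throttle(10)
# call throttle() just before each search_analytics_g() request in the loop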
Hi Mark! Thanks a lot for your awesome package!
Actually, I'm not experiencing a bug per se, but I might have an idea for an improvement, namely rate limiting.
What goes wrong
I wrote wrapper functions around the {searchConsoleR} function search_analytics(). Unfortunately, despite my efforts, they sometimes exceed the API limits because "byBatch" mode issues multiple requests (I provide an example below). This is nasty because it depends solely on the website whose data I download: the bigger the website, the more data; the more data, the more requests at the batch level; and eventually, errors.
I believe the script exceeds the API limits because of the batching by 25k rows - there's no mechanism that slows down the requests at that level. My own functions have some very primitive rate limiting via Sys.sleep(x), but that operates at a different level and won't help (unless I use extremely large values, which would make downloading very inefficient...).
So I began wondering about using a rate limiter (for example this one: https://github.com/tarakc02/ratelimitr) inside the package's function. This is not something I can address without editing your code at the search_analytics() level, so I thought I would ask first: what do you think about implementing some form of rate limiting?
Steps to reproduce the problem
Expected output
# finished data frame "df_gsc_data"
Actual output
Session Info