Leszek-Sieminski opened this issue 6 years ago (status: Open)
Thanks, I think this may be an issue where the date sequence only works for date ranges longer than one day, as byDate breaks up the API calls into one call per day. There should be nothing to gain by combining a one-day date range with byDate anyway.
That's interesting. I tried the same code as above with walk_data = c("byBatch"). The result is the same, so it shouldn't be specific to "byDate".
This one-day range is there because the process tries to find missing data in the database and download it. It downloads data one day at a time, as I cannot be sure the missing data will always fall in a single contiguous period rather than on "random" dates.
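Roughly, the backfill does this (a simplified sketch; missing_dates stands in for the vector of days found absent in the database, and address for my real site URL):
library(searchConsoleR)
# missing_dates: a Date vector of days absent from the database (placeholder;
# the real vector comes from comparing the database against a calendar)
results <- lapply(missing_dates, function(d) {
  search_analytics(siteURL    = address,
                   startDate  = d,
                   endDate    = d,     # one-day range: start == end
                   dimensions = c("date", "device", "page", "query"),
                   searchType = "web",
                   rowLimit   = 20000,
                   prettyNames = FALSE,
                   aggregationType = "auto",
                   walk_data  = c("byBatch"))
})
sc_missing <- do.call(rbind, results)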
Nevertheless, I tried downloading the data for a whole period to check whether that solves the problem: from 2017-03-17 to 2017-03-19.
> sc_data_4 <- searchConsoleR::search_analytics(
+ siteURL = address,
+ startDate = "2017-03-17",
+ endDate = "2017-03-19",
+ dimensions = c("date",
+ "device",
+ "page",
+ "query"),
+ searchType = 'web',
+ rowLimit = 20000,
+ prettyNames = FALSE,
+ aggregationType = "auto",
+ walk_data = NULL)
Fetching search analytics for url: XXX dates: 2017-03-17 2017-03-19 dimensions: date device page query dimensionFilterExp: searchType: web aggregationType: auto
Batching data via method: byBatch
With rowLimit set to 20000 will need up to [5] API calls
2018-07-17 10:25:36> Request #: 0 : 5000 : 10000
2018-07-17 10:25:42> Request #: 15000 : 20000
Warning message:
No data found for supplied dates - returning NA
However, it does not return a similar number of rows. A quick check with table() returns:
> table(sc_data_4$date)
2017-03-17 2017-03-18 2017-03-19
3715 447 3151
Today is the last day I can check it (2017-03-17 is the oldest date available in the new Search Console), but as I can see in Search Console there are more than 999 rows of data (queries with clicks & impressions), so this looks like a mistake. I also tried different rowLimits and different date ranges containing 2017-03-18, but it always returns rubbish (<500 rows); a sketch of that check follows below. Any advice on how to avoid such problems?
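The check in question, roughly (a sketch; address again stands in for the site URL I cannot share):
for (lim in c(5000, 10000, 20000)) {
  d <- searchConsoleR::search_analytics(
    siteURL    = address,
    startDate  = "2017-03-17",
    endDate    = "2017-03-19",
    dimensions = c("date", "device", "page", "query"),
    searchType = "web",
    rowLimit   = lim,
    prettyNames = FALSE,
    aggregationType = "auto")
  print(table(d$date))   # compare per-date row counts across rowLimit settings
}
Whatever the rowLimit, the 2017-03-18 count stays suspiciously low.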
Hi @MarkEdmondson1234 - I think I'm having the same issue as @Leszek-Sieminski-PM had, or a similar one.
Hopefully you can spot the problem on my side.
#library(googleAuthR)
library(searchConsoleR)
## Authorize script with Google Developer Console.
options("searchConsoleR.client_id" = "XXX")
options("searchConsoleR.client_secret" = "XXX")
## data in Search Console lags a few days behind, so stay well behind today
## this pulls a ~30-day window, from 65 to 35 days ago
start <- Sys.Date() - 65
end <- Sys.Date() - 35
## set website to your URL including http://
website <- "https://www.domain.com"
## what to download; choose between date, query, page, device, country
download_dimensions <- c('date','page','query')
scr_auth()
## this is the query to the search console API
searchquery <- search_analytics(siteURL = website,
startDate = start,
endDate = end,
dimensions = download_dimensions,
walk_data = c("byDate"))
## Specify Output filepath
filepath <-"J:/SearchConsole/Exports/"
## filename will be set like searchconsoledata_2016-02-08 (.csv will be added in the next step)
filename <- paste("searchconsoledata", start, sep = "_")
## this is the full filepath + filename with .csv
output <- paste(filepath, filename, ".csv", sep = "")
## this writes the search query report to the full filepath and filename; row.names = FALSE omits dataframe row numbers
write.csv(searchquery, output, row.names = FALSE)
## Complete
Fetching search analytics for url: https://www.domain.com dates: 2018-08-01 2018-08-31 dimensions: date page query dimensionFilterExp: searchType: web aggregationType: auto
Batching data via method: byDate
Will fetch up to 25000 rows per day
2018-10-05 14:34:35> Request #: 2018-08-01
Error in if (s[length(s)] == "") s <- s[-length(s)] :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In split_vector(r, index) : No index found
2: In split_vector(x, index, remove_splits = FALSE) : No index found
3: In split_vector(x, index, remove_splits = FALSE) : No index found
4: In split_vector(x, index, remove_splits = FALSE) : No index found
> traceback()
14: split_vector(x, index, remove_splits = FALSE)
13: unlist(split_vector(x, index, remove_splits = FALSE))
12: FUN(X[[i]], ...)
11: lapply(responses, function(x) {
index <- c(1:2)
unlist(split_vector(x, index, remove_splits = FALSE))
})
10: parseBatchResponse(req)
9: gar_batch(fl, ..., batch_endpoint = batch_endpoint)
8: FUN(X[[i]], ...)
7: lapply(limit_batch, function(y) {
if (length(limit_batch) > 1)
myMessage("Request #: ", paste(y, collapse = " : "),
level = 3)
fl <- lapply(y, function(x) {
pars_walk_list <- lapply(pars_walk, function(z) z = x)
names(pars_walk_list) <- pars_walk
path_walk_list <- lapply(path_walk, function(z) z = x)
names(path_walk_list) <- path_walk
body_walk_list <- lapply(body_walk, function(z) z = x)
names(body_walk_list) <- body_walk
if (length(pars_walk) > 0)
gar_pars <- modifyList(gar_pars, pars_walk_list)
if (length(path_walk) > 0)
gar_paths <- modifyList(gar_paths, path_walk_list)
if (length(body_walk) > 0)
the_body <- modifyList(the_body, body_walk_list)
f(pars_arguments = gar_pars, path_arguments = gar_paths,
the_body = the_body, batch = TRUE)
})
names(fl) <- as.character(y)
batch_data <- gar_batch(fl, ..., batch_endpoint = batch_endpoint)
if (!is.null(batch_function)) {
batch_data <- batch_function(batch_data)
}
batch_data
})
6: googleAuthR::gar_batch_walk(search_analytics_g, walk_vector = walk_vector,
gar_paths = list(sites = siteURL), body_walk = c("startDate",
"endDate"), the_body = body, batch_size = 1, dim = dimensions)
5: search_analytics(siteURL = website, startDate = start, endDate = end,
dimensions = download_dimensions, walk_data = c("byDate")) at google_search_console.R#23
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("J:/SearchConsole/google_search_console.R",
echo = TRUE)
Latest searchConsoleR and googleAuthR from GitHub, latest R version:
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] googleAuthR_0.6.3.9001 searchConsoleR_0.3.0.9000 timeDate_3043.102
loaded via a namespace (and not attached):
[1] httr_1.3.1 compiler_3.5.1 magrittr_1.5 R6_2.3.0 assertthat_0.2.0 tools_3.5.1
[7] curl_3.2 memoise_1.1.0 stringi_1.1.7 stringr_1.3.1 jsonlite_1.5 digest_0.6.17
[13] openssl_1.0.2
Sorry for closing the issue, that was a misclick.
@kirchnerto I later discovered that my database is missing 46 days of data from the last 16 months because of this issue. The problem does not appear if you use Python instead (for example: https://moz.com/blog/how-to-get-search-console-data-api-python)
It seems to me that the problem is in the googleAuthR helper function.
@Leszek-Sieminski-PM Thanks for the tip! I still hope @MarkEdmondson1234 has a clue why this isn't working. For now I went back to older versions of searchConsoleR and googleAuthR and extract the data on a day-by-day basis without batching. That way I can get the 25,000 rows I need.
I'll take a look; if I can reproduce it myself it's a lot easier. I think it's to do with some days not having data, so it needs to fail more gracefully. Does that sound possible?
As I understand this issue, better error handling would be nice, but the real problem is that the data is present and available both in the UI and through the API (I downloaded it with both PHP and Python just to check); R somehow cannot download some dates.
It looks like it downloads the data, but the merge fails.
The original issue looks like it predates the raise from 5,000 to 25,000 rows in the API response:
Fetching search analytics for url: 'XXX' dates: 2017-03-17 2017-03-17 dimensions: date device page query dimensionFilterExp: searchType: web aggregationType: auto
Batching data via method: byBatch
With rowLimit set to 6000 will need up to [2] API calls
2018-07-12 15:17:29> Batch API limited to [3] calls at once.
I suppose the issue should repeat though if you put in a rowLimit of 26000?
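Something like this should trigger it, if so (a sketch; address stands in for the redacted site URL):
sc_repro <- search_analytics(siteURL    = address,
                             startDate  = "2017-03-18",
                             endDate    = "2017-03-18",
                             dimensions = c("date", "device", "page", "query"),
                             searchType = "web",
                             rowLimit   = 26000,   # past the new 25,000 per-call limit
                             walk_data  = c("byBatch"))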
Hmm, so the error arises here, when parsing the batched responses' metadata, not the data itself:
responses_meta <- lapply(responses, function(x){
index <- c(1:2)
unlist(split_vector(x, index, remove_splits = FALSE))
})
The API calls are sent through Google's batching service, which lets you send many calls at once for a faster response, e.g. it should now fetch 75,000 rows per API call. The batch response is then split back into the separate API responses; in this case, however, no header information is being passed back, perhaps because those responses contain no data at all.
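For what it's worth, the parsing could skip empty parts instead of failing, something like this (a sketch only, not what the package currently does; split_vector() is googleAuthR's internal helper, as in the snippet above):
responses_meta <- lapply(responses, function(x) {
  if (length(x) < 2) return(NULL)   # empty batch part: no header lines to parse
  index <- c(1:2)
  unlist(split_vector(x, index, remove_splits = FALSE))
})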
If you can install the latest version of googleAuthR now, it has better error messaging, and on failure it will write the batch response to an RDS file if you have options("googleAuthR.verbose" = 2) set. You can open that file with readRDS() and examine the object to see why the parsing is failing; I guess they are empty responses. Anyhow, please see if you can repeat the error and then make the .rds object available to me, or print out its output (it will be large for a big fetch, so please edit it down if possible).
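I.e. something like this (the path here is just a placeholder; the error message prints the real location):
options(googleAuthR.verbose = 2)  # set before running the failing fetch
batch_resp <- readRDS("C:/Temp/saved_batch_response.rds")  # placeholder path
str(batch_resp, max.level = 2)    # inspect the structure; look for empty responses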
Hi @MarkEdmondson1234 - thanks for the fast response!
I installed the latest versions of searchConsoleR and googleAuthR and ran the script again with options("googleAuthR.verbose" = 2):
> scr_auth()
2018-10-05 21:32:21>
options(googleAuthR.scopes.selected=c('https://www.googleapis.com/auth/webmasters'))
options(googleAuthR.client_id='858905045851-3beqpmsufml9d7v5d1pr74m9lnbueak2.apps.googleusercontent.com')
options(googleAuthR.client_secret=' bnmF6C-ScpSR68knbGrHBQrS')
options(googleAuthR.webapp.client_id='858905045851-iuv6uhh34fqmkvh4rq31l7bpolskdo7h.apps.googleusercontent.com')
options(googleAuthR.webapp.client_secret=' rFTWVq6oMu5ZgYd9e3sYu2tm')
Scopes: https://www.googleapis.com/auth/webmasters
App key: 858905045851-3beqpmsufml9d7v5d1pr74m9lnbueak2.apps.googleusercontent.com
Method: new_token
> ## this is the query to the search console API
> searchquery <- search_analytics(siteURL = website,
+ startDate = st .... [TRUNCATED]
Fetching search analytics for url: https://www.domain.de dates: 2018-08-01 2018-08-31 dimensions: date page query dimensionFilterExp: searchType: web aggregationType: auto
Batching data via method: byDate
Will fetch up to 25000 rows per day
2018-10-05 21:32:22> Batch API limited to [1] calls at once.
2018-10-05 21:32:22> Request #: 2018-08-01
2018-10-05 21:32:22> Token exists.
2018-10-05 21:32:22> Constructing batch request URL for: /webmasters/v3/sites/https%3A%2F%2Fwww.domain.de/searchAnalytics/query
2018-10-05 21:32:22> Making Batch API call
Error in value[[3L]](cond) :
Error with batch response - writing response to C:\Temp\RtmpyYqTyu\file298c37f072af.rds
I attached the .rds file to this comment. Yeah, it's possible the output is empty somehow. Hope this helps with debugging. file298c37f072af.zip
Ok, well that's weird, the file works when I do it. Hmm, I hope it's not a Windows thing.
@MarkEdmondson1234 What do you mean by saying "the file works when I do it"?
One thing I noticed: the request only crashes when a lot of dimensions are set, which leads to a lot of processing, I guess. When using just the dimensions "date" and "query" or "date" and "page" it works fine, but it crashes when I want all 3 dimensions. FYI: the request returns ~20k rows when requesting only one day. Any ideas on this?
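To illustrate (a sketch; website, start and end are as in my script above):
works_1 <- search_analytics(website, start, end, c("date", "query"), walk_data = c("byDate"))          # fine
works_2 <- search_analytics(website, start, end, c("date", "page"),  walk_data = c("byDate"))          # fine
crashes <- search_analytics(website, start, end, c("date", "page", "query"), walk_data = c("byDate"))  # errors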
@MarkEdmondson1234 Anything new on this problem?
When I loaded the .rds file it parsed without error on my machine. I don't know why it would work for me and not you, unless it's a Windows-specific problem I can't easily test (I really hope it is not that). I need to be able to reproduce the problem to have a hope of fixing it. The amount of data should not be a problem unless you are running on a very, very small machine; what is your RAM?
@MarkEdmondson1234 I'm running on Windows 10 with an Intel i5 and 8GB of RAM - should be enough I guess ;)
@MarkEdmondson1234 I'm running on Windows 10, Intel i5, 8GB RAM (for development), but I opened the issue after discovering missing data in downloads that ran on a server (Debian, 32 GB RAM). So it probably isn't related to the OS; not sure about the RAM.
That should be plenty. Sorry, I have no clue at the moment, as it's working in my test suite and locally.
Hello @MarkEdmondson1234, I have been experiencing the same issue as described above. I am also running R on Windows (inside RStudio), and I believe I can confirm that this is a Windows-specific problem.
I have tried lowering the value of rowLimit below 5,000, and also tried setting walk_data to "byDate" rather than "byBatch". In every case, my exports would end up failing with the same error as described by the OP.
However, since you mentioned that this could be a Windows-specific problem, I tried running the exact same scripts using the rocker/verse image in Docker, and there you go: I never got any error and am now able to export all the data I need!
I hope this helps. Many thanks for your work.
Thanks @flopont, that's very helpful. I will look to update using the latest googleAuthR tools, which may help solve this.
Hi! This is my first issue, so sorry for any mistakes or missing info. I'll be glad to provide further details.
What goes wrong
First of all, I'm afraid this error might not be fully reproducible, and I'm sorry for that. I have a set of dates and want to use them to download Search Console data (in a loop). Real examples:
Everything seems fine for all dates when I download with rowLimit <= 5000 and walk_data = c("byBatch"). Increasing rowLimit above 5000 on "2017-03-17" works perfectly fine.
Unfortunately, increasing rowLimit on "2017-03-18" produces an error:
Error in if (s[length(s)] == "") s <- s[-length(s)]
It's strange, because I checked the data manually in Search Console and the dates producing this error seem normal: there is data for each one of them. I suppose this might somehow be connected to this particular website, but I cannot provide its address or tokens.
Code
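A sketch of the call, reconstructed from the console output quoted earlier in the thread (address stands in for the withheld site URL):
sc_data <- searchConsoleR::search_analytics(
  siteURL    = address,         # withheld
  startDate  = "2017-03-18",
  endDate    = "2017-03-18",
  dimensions = c("date", "device", "page", "query"),
  searchType = "web",
  rowLimit   = 20000,           # anything above 5000 triggers the error
  prettyNames = FALSE,
  aggregationType = "auto",
  walk_data  = c("byBatch"))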
Actual output
authentication
no problem ("2017-03-17" and rowLimit above 5000)
still no problem (changed date to "2017-03-18" and decreased rowLimit to 5000)
problem ("2017-03-18" and rowLimit > 5000)
Traceback
Session Info
In the beginning I used the current CRAN versions of googleAuthR and searchConsoleR. Changing to the GitHub versions didn't solve the problem.