PMassicotte / gtrendsR

R functions to perform and display Google Trends queries

Inconsistent results when scraping the same timerange, place, keywords multiple times #428

Open annika-stechemesser opened 1 year ago

annika-stechemesser commented 1 year ago

Hello,

I used gtrendsR to query two search terms joined with an "or" (covid+corona) for one location (geo code US-CT-533) over the time range 2020-04-01 to 2021-07-01. Here is my line of code:

local_trends=gtrends(keyword='covid+corona',geo=local_geo,time ="2020-04-01 2021-07-01")$interest_over_time

I ran it multiple times and got different results every time (see plot below). How is this possible, given that none of the parameters changed and the time range is in the past? Also, none of the versions I got from gtrends exactly matches the data I see in the browser when I enter the same inputs in the search.

Can you explain what is going on here and advise me how to get the correct data?

Thanks very much!

image

eddelbuettel commented 1 year ago

~I think we have seen this before and it is explained as 'well they reserve the right to answer that way' as what we hit is not a fully defined API :-/ Maybe Google subsamples, and you found a query that shows that?~ Edit: Never mind!

But I better let @PMassicotte chime in...

PMassicotte commented 1 year ago

That is strange, I cannot reproduce the problem on my side.

library(gtrendsR)
library(ggplot2)

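# Run the same query six times in a row and collect the results.
runs <- 1:6
l <- vector("list", length(runs))

for (i in runs) {
  df <- gtrends(
    keyword = "covid+corona",
    geo = "US",
    time = "2020-04-01 2021-07-01"
  )$interest_over_time
  df$run <- paste("Run#", i)  # label each run for the plot legend
  l[[i]] <- df
}

# Stack all runs into one data frame and overlay them in a single plot.
df <- do.call(rbind, l)

ggplot(df, aes(x = date, y = hits, color = run)) +
  geom_line()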

Created on 2022-08-29 with reprex v2.0.2

PMassicotte commented 1 year ago

New try using your exact same GEO code:

library(gtrendsR)
library(ggplot2)

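# Same experiment as before, but with the geo code from the report.
runs <- 1:6
l <- vector("list", length(runs))

for (i in runs) {
  df <- gtrends(
    keyword = "covid+corona",
    geo = "US-CT-533",
    time = "2020-04-01 2021-07-01"
  )$interest_over_time
  df$run <- paste("Run#", i)  # label each run for the plot legend
  l[[i]] <- df
}

# Stack all runs into one data frame and overlay them in a single plot.
df <- do.call(rbind, l)

ggplot(df, aes(x = date, y = hits, color = run)) +
  geom_line()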

Created on 2022-08-29 with reprex v2.0.2

PMassicotte commented 1 year ago

@annika-stechemesser Can you try my code and see if you have the same results?

annika-stechemesser commented 1 year ago

If I run your code and loop through multiple scrapes without waiting, they match up. However, my graph looks slightly different from yours. I ran my various scrapes with a larger time delay between them, maybe that's it? I will try to run them spread out over a few hours and see what I get. Thanks a lot for the help!

image

JBleher commented 1 year ago

Google provides the following information: https://support.google.com/trends/answer/4365533?hl=en According to Google, there are two types of samples one can access:

  1. “Real-time data is a sample covering the last seven days.”
  2. “Non-real-time data is a separate sample from real-time data and goes as far back as 2004 and up to 36 hours before your search.”

Appendix B of https://www.sciencedirect.com/science/article/abs/pii/S2452306221001210 may be an interesting read as well.

JBleher commented 1 year ago

Also, the Medium article by Simon Rogers is telling: https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8

Our hypothesis is that samples from the full Google Trends dataset are not retaken for each query. However, we suspect that the sample taken from the full dataset could be based on an in-memory database somewhere on a Google Trends server instance, so that queries to Google Trends can be processed faster. If different IP addresses are routed to different instances, there might be different in-memory samples that give different results. Also, if instances are shut down or renewed, or if the routing of traffic changes, the in-memory database may have to be resampled from the full Google Trends dataset. We therefore assume that the result from Google Trends does not depend on the IP address per se. More precisely, we think it depends on the instance your query is routed to. This would also explain the inconsistencies in Google Trends data reported across time by Behnen, Kessler, Kruse, Schoenmakers, Zerr, and Gómez (2020), since in modern cloud services instances are scaled up and down dynamically, depending on traffic.

annika-stechemesser commented 1 year ago

Thank you @JBleher, these comments have been really helpful. Running the code with a ~24h break gave different time series (see below). The same run in a non-delayed loop still gives the same data. I am not sure what to do with that statistically, but it does not seem to be a problem with gtrendsR. Thank you!

image

JBleher commented 1 year ago

On a positive note, the time series you are querying seems to be based on enough search volume that the variation induced by different samples is rather small.
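One way to put a number on that sample-to-sample variation is to stack several pulls of the same query and summarise them per date. The following is only a minimal sketch under assumptions: `summarise_runs` is a hypothetical helper (not part of gtrendsR), it expects a data frame with `date` and `hits` columns like the stacked `df` in the loops above, and it treats a `"<1"` entry in `hits` as 0, which is an arbitrary choice.

```r
# Hypothetical helper: summarise repeated pulls of the same query.
# Input: data frame with one row per (date, run) and columns `date`, `hits`.
# Output: one row per date with the mean, standard deviation and run count.
summarise_runs <- function(df) {
  hits <- df$hits
  # Assumption: `hits` may arrive as character when values such as "<1"
  # occur; here "<1" is mapped to 0 before converting to numeric.
  if (!is.numeric(hits)) {
    hits <- as.numeric(sub("^<1$", "0", as.character(hits)))
  }

  # aggregate() returns the three statistics as a matrix column named `x`.
  stats <- aggregate(
    hits,
    by = list(date = df$date),
    FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))
  )

  # Flatten the matrix column into ordinary columns: date, mean, sd, n.
  data.frame(date = stats$date, stats$x)
}
```

Applied to pulls collected on different days, the `mean` column could serve as a smoothed series and the `sd` column shows where the samples disagree most. Whether averaging is appropriate for a given analysis is a separate statistical question.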

annika-stechemesser commented 1 year ago

Do you have any advice on how to force getting into a new batch? I tried changing IP address and deleting cookies manually in the browser but none of that worked so far. I would just like to see a bunch of variation for my request but am pretty unclear on how to get it... Does the cookie-URL parameter have anything to do with it? Thank you!

JBleher commented 1 year ago

You may be able to use different servers in different locations. Lists of free proxy servers that you could use can be found on the internet. Or you could use the Tor network. However, be aware that other people may be using the same servers to circumvent rate limits. So you will still need to slow down the requests and add some try-catch logic to handle potentially empty data sets...
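The "slow down and try-catch" part could look roughly like the sketch below. It is not gtrendsR functionality: `retry_fetch`, `max_tries` and `wait_seconds` are hypothetical names, the default wait time is a guess, and the request itself is passed in as a function so the retry logic stays independent of how the query is made.

```r
# Hypothetical helper: call `fetch()` until it returns a non-empty result.
# Errors (e.g. from rate limiting) and empty results both trigger a retry.
retry_fetch <- function(fetch, max_tries = 5, wait_seconds = 60) {
  for (attempt in seq_len(max_tries)) {
    # Turn any error into NULL so the loop can continue.
    result <- tryCatch(fetch(), error = function(e) NULL)

    # NROW(NULL) is 0, so NULL and zero-row data frames count as empty.
    if (!is.null(result) && NROW(result) > 0) {
      return(result)
    }

    # Pause before the next attempt, but not after the last one.
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds)
    }
  }
  NULL  # every attempt failed or came back empty
}

# Example call (not run here, it needs network access and gtrendsR):
# df <- retry_fetch(function() {
#   gtrendsR::gtrends(
#     keyword = "covid+corona",
#     geo = "US-CT-533",
#     time = "2020-04-01 2021-07-01"
#   )$interest_over_time
# })
```

A `NULL` return value means no usable data came back, so the caller should check for it before binding runs together.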