PMassicotte / gtrendsR

R functions to perform and display Google Trends queries

Using gtrendsR for daily hits #412

Open marcyshieh opened 2 years ago

marcyshieh commented 2 years ago

Using the gtrendsR package and a modified version of Alex Dyachenko’s tutorial, I’ve been trying to query estimated Google Trends daily hits. I noticed that my modified version of Alex’s code doesn’t allow me to stop mid-month. In my modified version, all the days past the last day of the previous month show up as NA. Is there a way to resolve the issue?

In essence, I am trying to replicate the steps in this Medium article, but for an arbitrary date range rather than whole months.

Here's some replication code and the sample.xlsx file.

# daily estimates 

library(gtrendsR)
library(tidyverse)
library(lubridate)
library(readxl)
library(here)
library(stringr)
library(curl)

get_daily_gtrend <- function(keyword = c('Taylor Swift', 'Kim Kardashian'), geo = 'US', from = '2004-01-01', to = '2004-11-02') {
  if (ymd(to) >= floor_date(Sys.Date(), 'month')) {
    to <- floor_date(ymd(to), 'month') - days(1)

    if (to < from) {
      stop("Specifying \'to\' date in the current month is not allowed")
    }
  }

  aggregated_data <- gtrends(keyword = keyword, geo = geo, time = paste(from, to))
  if(is.null(aggregated_data$interest_over_time)) {
    print('There is no data in Google Trends!')
    return()
  }

  mult_m <- aggregated_data$interest_over_time %>%
    mutate(hits = as.integer(ifelse(hits == '<1', '0', hits))) %>%
    group_by(month = floor_date(date, 'month'), keyword) %>%
    dplyr::summarise(hits = sum(hits)) %>%
    ungroup() %>%
    mutate(ym = format(month, '%Y-%m'),
           mult = hits / max(hits)) %>%
    dplyr::select(month, ym, keyword, mult) %>%
    as_tibble()

  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'), 
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))

  raw_trends_m <- tibble()

  for (i in seq(1, nrow(pm), 1)) {
    curr <- gtrends(keyword, geo = geo, time = paste(pm$s[i], pm$e[i]))
    if(is.null(curr$interest_over_time)) next
    print(paste('for', pm$s[i], pm$e[i], 'retrieved', nrow(curr$interest_over_time), 'days of data (all keywords)'))
    raw_trends_m <- rbind(raw_trends_m,
                          curr$interest_over_time)
  }

  trend_m <- raw_trends_m %>%
    dplyr::select(date, keyword, hits) %>%
    mutate(ym = format(date, '%Y-%m'),
           hits = as.integer(ifelse(hits == '<1', '0', hits))) %>%
    as_tibble()

  trend_res <- trend_m %>%
    left_join(mult_m, by = c('ym', 'keyword')) %>%
    mutate(est_hits = hits * mult) %>%
    dplyr::select(date, keyword, est_hits) %>%
    as_tibble() %>%
    mutate(date = as.Date(date))

  return(trend_res)
}

all <- read_excel("sample.xlsx")

all$Name <- trimws(all$Name)

all <- distinct(all)

all$surname <- str_extract(all$Name, '[^ ]+$')

all$surname <- trimws(all$surname)

all_j <- all %>%
  dplyr::select(Year, Folder) %>%
  distinct()

#####

cand2004 <- all %>% 
  arrange(Folder, str_count(Name, "\\w+"), nchar(Name)) %>%
  group_by(Folder, Year) %>%
  mutate(order = row_number()) %>%
  ungroup() 

cand2004 <- cand2004 %>%
  dplyr::select(Year, Folder, Name, order) %>%
  distinct() %>%
  separate(Folder, c("state", "name"), sep="\\-", extra = "merge")

cand2004_grp1 <- cand2004 %>%
  filter(Year == 2004, order == 1)

cand2004_grp1a <- split(cand2004_grp1,rep(1:20,each=5))

l <- cand2004_grp1a$`1` %>% dplyr::pull(Name) 

l <- as.list(unique(l))

r <- tibble()

for(k in l) {
  r <- r %>%
    rbind(get_daily_gtrend(keyword = k, geo = 'US', from = '2004-01-01', to = '2004-11-02'))
}

r %>% view()

JBleher commented 2 years ago

The problem is how you loop over the dates. You can only download daily data for at most 270 days.
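As a rough illustration of that limit, the span of the range in the question can be checked before querying. This is only a sketch: the 270-day threshold is the figure quoted in this comment, and the variable names are made up for the example.

```r
# Date range used in the question above.
from <- as.Date("2004-01-01")
to <- as.Date("2004-11-02")

# Assumption: 270 is the daily-data limit quoted in this comment.
max_daily_days <- 270

# Number of calendar days in the range, counting both endpoints.
span_days <- as.integer(to - from) + 1

# TRUE means the range is too long to be fetched as one daily query.
span_days > max_daily_days
```

For this range the check is `TRUE`, so a single query would not return daily values.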

The code you are using builds one query for each month:

  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'), 
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))
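One way to keep the monthly windows inside the requested range is to clamp the first start date to `from` and the last end date to `to`. The following is a minimal sketch, assuming lubridate is available; `month_windows` is a hypothetical helper, not part of gtrendsR, and it does not address the comparability of values across windows.

```r
library(lubridate)

# Hypothetical helper: one query window per calendar month, with the first
# start clamped to `from` and the last end clamped to `to`.
month_windows <- function(from, to) {
  from <- ymd(from)
  to <- ymd(to)
  # First day of every month touched by the range.
  starts <- seq(floor_date(from, "month"), floor_date(to, "month"), by = "month")
  # Last day of each of those months.
  ends <- starts + months(1) - days(1)
  # Clamp so that no window reaches outside [from, to].
  starts[starts < from] <- from
  ends[ends > to] <- to
  data.frame(s = starts, e = ends)
}

pm <- month_windows("2004-01-01", "2004-11-02")
tail(pm, 2)
```

With the dates from the question, the last window runs from 2004-11-01 to 2004-11-02 instead of to the end of November.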

Also note that this approach makes the resulting time series hardly useful, because the queries are not comparable over time. Each query's daily hits are standardized within the time frame that was downloaded, and you are stitching those separately standardized series together.
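A toy example may make the comparability problem concrete. The numbers below are invented for illustration and are not Google Trends data; the rescaling to a per-window maximum of 100 is a simplified stand-in for how each query is standardized to its own time frame.

```r
# Invented raw search volumes for four days, split into two query windows.
raw <- c(10, 20, 40, 80)
window <- c(1, 1, 2, 2)

# Rescale each window separately so that its own maximum becomes 100.
scaled <- ave(raw, window, FUN = function(x) 100 * x / max(x))
scaled
# → 50 100 50 100
```

Both windows end up looking identical after scaling, even though the raw volume in the second window is four times higher, so values stitched together from different windows cannot be compared directly.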

See our paper: https://www.sciencedirect.com/science/article/abs/pii/S2452306221001210

This has nothing to do with the package; it is just how your code is written.

PMassicotte commented 2 years ago

This is also what Google Trends returns:

https://trends.google.com/trends/explore?date=2019-12-31%202020-11-01&geo=US&q=charles%20jones
