cjbarrie / academictwitteR

Repo for the academictwitteR package, used to query the Twitter Academic Research Product Track v2 API endpoint.

[BUG] Reply/Mention queries fail inexplicably for certain accounts & time periods #311

Closed. schliebs closed this issue 2 years ago.

schliebs commented 2 years ago

Describe the bug

When querying replies to or mentions of certain highly replied-to accounts (e.g. @mfa_russia), I get the error message "too many errors". Note that this does not happen for other accounts (e.g. "to:JoeBiden" or reply_to = "JoeBiden") or for shorter time periods. The error also does not occur when running the same API requests manually, so it must be a package-level issue. Below the reproducible examples, I have included manual implementations of the same queries for which the error does not occur (examples 1b and 2b).

Expected Behavior

Return the replies/mentions as usual. I ran the exact same queries through the API manually and got only 200 responses (no error status codes), so this seems to be something happening within the academictwitteR package.

Steps To Reproduce

Example 1 (Reply to):

library(tidyverse)
library(academictwitteR)
library(lubridate)

bearer_token = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

xx <-
  get_all_tweets(
    reply_to = "RussianEmbassy",
    start_tweets = "2022-03-01T00:00:00Z",
    end_tweets   =  "2022-03-09T23:59:59Z",
    data_path = "test/",
    n = Inf,
    bearer_token = bearer_token
  )
#> Warning: Tweets will be bound in local memory as well as stored as JSONs.
#> query:   (to:RussianEmbassy)
#> Error in make_query(url = endpoint_url, params = params, bearer_token = bearer_token, : Too many errors.

Created on 2022-03-18 by the reprex package (v2.0.1)

Example 2 (mentions):

library(tidyverse)
library(academictwitteR)
library(lubridate)

bearer_token = "xxxxxxx"

xx <-
  get_all_tweets(
    query = "@mfa_russia",
    start_tweets = "2022-02-01T00:00:00Z",
    end_tweets   =  "2022-03-09T23:59:59Z",
    data_path = "test/",
    n = Inf,
    bearer_token = bearer_token
  )
#> Warning: Tweets will be bound in local memory as well as stored as JSONs.
#> query:  @mfa_russia
#> Error in make_query(url = endpoint_url, params = params, bearer_token = bearer_token, : Too many errors.

Created on 2022-03-18 by the reprex package (v2.0.1)

Example 1b (working manual implementation):

bearer_token = "xxxxxxxxxxxx"
url <- "https://api.twitter.com/2/tweets/search/all"

headers = c(
  `Authorization` = sprintf('Bearer %s', bearer_token)
)

params <- list(`start_time` = "2022-02-01T00:00:00Z",
               `end_time` = "2022-03-09T23:59:59Z",
               `query` = " (to:RussianEmbassy)")

httr::GET(url, 
          httr::add_headers(.headers=headers), 
          query = params)
#> Response [https://api.twitter.com/2/tweets/search/all?start_time=2022-02-01T00%3A00%3A00Z&end_time=2022-03-09T23%3A59%3A59Z&query=%20%28to%3ARussianEmbassy%29]
#>   Date: 2022-03-18 20:39
#>   Status: 200
#>   Content-Type: application/json; charset=utf-8
#>   Size: 2.21 kB

Created on 2022-03-18 by the reprex package (v2.0.1)

Example 2b (working manual implementation):

bearer_token = "xxxxxxxxxxxxxxxx"
url <- "https://api.twitter.com/2/tweets/search/all"

headers = c(
  `Authorization` = sprintf('Bearer %s', bearer_token)
)

params <- list(`start_time` = "2022-02-01T00:00:00Z",
               `end_time` = "2022-03-09T23:59:59Z",
               `query` = " @mfa_russia")

httr::GET(url, 
          httr::add_headers(.headers=headers), 
          query = params)
#> Response [https://api.twitter.com/2/tweets/search/all?start_time=2022-02-01T00%3A00%3A00Z&end_time=2022-03-09T23%3A59%3A59Z&query=%20%40mfa_russia]
#>   Date: 2022-03-18 20:41
#>   Status: 200
#>   Content-Type: application/json; charset=utf-8
#>   Size: 2.08 kB

Created on 2022-03-18 by the reprex package (v2.0.1)

Environment

R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] lubridate_1.8.0       academictwitteR_0.3.1
 [3] forcats_0.5.1         stringr_1.4.0
 [5] dplyr_1.0.8           purrr_0.3.4
 [7] readr_2.1.2           tidyr_1.2.0
 [9] tibble_3.1.6          ggplot2_3.3.5
[11] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8       cellranger_1.1.0 pillar_1.7.0
 [4] compiler_4.1.3   dbplyr_2.1.1     tools_4.1.3
 [7] bit_4.0.4        jsonlite_1.8.0   lifecycle_1.0.1
[10] gtable_0.3.0     pkgconfig_2.0.3  rlang_1.0.2
[13] reprex_2.0.1     rstudioapi_0.13  DBI_1.1.2
[16] cli_3.2.0        curl_4.3.2       parallel_4.1.3
[19] haven_2.4.3      xml2_1.3.3       withr_2.5.0
[22] httr_1.4.2       fs_1.5.2         generics_0.1.2
[25] vctrs_0.3.8      hms_1.1.1        bit64_4.0.5
[28] grid_4.1.3       tidyselect_1.1.2 glue_1.6.2
[31] R6_2.5.1         fansi_1.0.2      readxl_1.3.1
[34] vroom_1.5.7      tzdb_0.2.0       modelr_0.1.8
[37] magrittr_2.0.2   usethis_2.1.5    backports_1.4.1
[40] scales_1.1.1     ellipsis_0.3.2   rvest_1.0.2
[43] assertthat_0.2.1 colorspace_2.0-2 utf8_1.2.2
[46] stringi_1.7.6    munsell_0.5.0    broom_0.7.12
[49] crayon_1.5.0

Anything else?

No response

chainsawriot commented 2 years ago

@schliebs Thanks for reporting this. I can give a plausible explanation.

I ran this query:

xx <-
  get_all_tweets(
    query = "RussianEmbassy",
    start_tweets = "2022-03-01T00:00:00Z",
    end_tweets   =  "2022-03-09T23:59:59Z",
    n = 100)

But I ran it with debug(academictwitteR:::make_query) active and inspected the response code of each individual request. On this Sunday afternoon it failed around 60% of the time, and in those failed cases the response code was 503 (overcapacity). Because 503 is an accepted code, academictwitteR retries the request; once it hit max_error (default: 4 tries), it gave "Too many errors".

https://github.com/cjbarrie/academictwitteR/blob/2809432aaea388e7bb016a1f15f24787e8d05586/R/utils.R#L15-L30
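
For illustration, the retry behaviour in the linked utils.R works roughly like this (a simplified sketch, not the exact source; the function name, parsing, and sleep interval are illustrative):

make_query_sketch <- function(url, params, bearer_token, max_error = 4) {
  # Sketch: a non-200 "accepted" status such as 503 is retried, and after
  # max_error consecutive failures the function aborts with "Too many errors".
  errors <- 0
  repeat {
    r <- httr::GET(url,
                   httr::add_headers(Authorization = paste("Bearer", bearer_token)),
                   query = params)
    if (httr::status_code(r) == 200) {
      return(httr::content(r, as = "parsed"))
    }
    errors <- errors + 1            # e.g. 503: overcapacity
    if (errors >= max_error) {
      stop("Too many errors.")
    }
    Sys.sleep(5)                    # back off before retrying
  }
}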

Queries related to the invasion are more likely to fail; for relatively harmless queries such as #ichbinhanna, the response is always 200. This is just a hypothesis: Twitter restricts its API capacity based on the query.

Whether or not this counts as "a package-level issue" is of course debatable (one can blame the four-strikes rule, but that is not the root cause). In the meantime, please try breaking the query into smaller pieces, as in the sketch below. At a time like this, anything on the internet can easily be over capacity.
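
For example, a day-by-day loop over the same period (a rough sketch; the folder name and the bind_tweets() step are just one way to do it):

library(academictwitteR)

# Sketch: query one day at a time so a failure only affects a single day.
days <- as.character(seq(as.Date("2022-03-01"), as.Date("2022-03-09"), by = "day"))
for (d in days) {
  get_all_tweets(
    reply_to     = "RussianEmbassy",
    start_tweets = paste0(d, "T00:00:00Z"),
    end_tweets   = paste0(d, "T23:59:59Z"),
    data_path    = "test/",     # JSONs accumulate here across iterations
    bind_tweets  = FALSE,       # combine afterwards with bind_tweets("test/")
    n            = Inf
  )
}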

chainsawriot commented 2 years ago

@schliebs One way to increase the success rate is to reduce page_n from the default 500 to 100.

xx <-
  get_all_tweets(
    reply_to = "RussianEmbassy",
    start_tweets = "2022-03-01T00:00:00Z",
    end_tweets   =  "2022-03-09T23:59:59Z",
    n = Inf,
    page_n = 100)

It will be more than five times slower, though: each request then returns at most 100 tweets instead of 500.