cjbarrie / academictwitteR

Repo for academictwitteR package to query the Twitter Academic Research Product Track v2 API endpoint.

Can't query for an exact phrase #235

Closed ophiryotam closed 3 years ago

ophiryotam commented 3 years ago


Describe the bug

Hi there. When I search with a multi-word query and want the exact phrase, I get all tweets containing any of the words. For example, running the following returns tweets that contain only the word "goals" rather than the exact phrase. Thanks!

get_all_tweets(query = "goals of care", start_tweets = "2020-08-01T00:00:00Z", end_tweets = "2021-08-10T00:00:00Z", bearer_token = bearer_token, data_path = mypath, n = Inf, bind_tweets = FALSE)

Expected Behavior

I wanted tweets containing the exact phrase "goals of care", but instead got tweets containing only "goals" or "care".

Steps To Reproduce

get_all_tweets(query = "goals of care", start_tweets = "2020-08-01T00:00:00Z", end_tweets = "2021-08-10T00:00:00Z", bearer_token = bearer_token, data_path = mypath, n = Inf, bind_tweets = FALSE)

Environment

R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] tidytext_0.2.3 lubridate_1.7.4 forcats_0.4.0 stringr_1.4.0 dplyr_1.0.6
[6] purrr_0.3.3 tidyr_1.0.2 tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
[11] ldatuning_1.0.0 topicmodels_0.2-9 quanteda_1.5.2 xlsx_0.6.1 readr_1.3.1
[16] academictwitteR_0.2.1

loaded via a namespace (and not attached): [1] httr_1.4.1 jsonlite_1.6 modelr_0.1.6 RcppParallel_4.4.4 assertthat_0.2.1 stats4_3.6.1
[7] xlsxjars_0.6.1 cellranger_1.1.0 slam_0.1-46 pillar_1.6.2 backports_1.1.5 lattice_0.20-38
[13] glue_1.4.2 RColorBrewer_1.1-2 rvest_0.3.5 colorspace_1.4-1 Matrix_1.2-17 tm_0.7-6
[19] pkgconfig_2.0.3 broom_0.5.5 haven_2.2.0 scales_1.1.0 generics_0.0.2 usethis_2.0.1
[25] ellipsis_0.3.2 withr_2.4.2 lazyeval_0.2.2 NLP_0.2-0 cli_2.5.0 magrittr_1.5
[31] crayon_1.3.4 readxl_1.3.1 tokenizers_0.2.1 janeaustenr_0.1.5 stopwords_1.0 fs_1.3.2
[37] fansi_0.4.0 SnowballC_0.6.0 nlme_3.1-140 xml2_1.2.2 tools_3.6.1 data.table_1.12.6
[43] hms_0.5.2 lifecycle_1.0.0 munsell_0.5.0 reprex_0.3.0 compiler_3.6.1 rlang_0.4.11
[49] grid_3.6.1 rstudioapi_0.11 gtable_0.3.0 curl_4.3 DBI_1.1.0 R6_2.4.1
[55] utf8_1.1.4 fastmatch_1.1-0 cld2_1.2.1 modeltools_0.2-22 rJava_0.9-11 stringi_1.4.3
[61] parallel_3.6.1 Rcpp_1.0.3 vctrs_0.3.8 spacyr_1.2 wordcloud_2.6 dbplyr_1.4.2
[67] tidyselect_1.1.1

Anything else?

No response

chainsawriot commented 3 years ago

@ophiryotam #224
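For readers landing here, the workaround that the second example later in this thread relies on is to put escaped double quotes inside the query string, so that the quotes reach the API as part of the query. A minimal base R sketch of what that string looks like (the commented-out call assumes `bearer_token` and `mypath` are defined as in the report above):

```r
# Without inner quotes, the words are treated as separate keywords
# rather than as one phrase.
loose_query <- "goals of care"

# Escaped double quotes make the quotes part of the query string itself.
exact_query <- "\"goals of care\""

# A single-quoted R string avoids the backslashes and has the same value.
same_value <- identical(exact_query, '"goals of care"')

# cat() prints the string as it is sent, without R's escaping.
cat(exact_query, "\n")

# Assumes bearer_token and mypath are defined as in the report above:
# get_all_tweets(query = exact_query,
#                start_tweets = "2020-08-01T00:00:00Z",
#                end_tweets = "2021-08-10T00:00:00Z",
#                bearer_token = bearer_token, data_path = mypath,
#                n = Inf, bind_tweets = FALSE)
```

Printing with `cat()` rather than `print()` is a quick way to check that the quotes are really inside the string before spending API quota on the request.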

chainsawriot commented 3 years ago

@cjbarrie Maybe we should add this to the README or whatnot.

cjbarrie commented 3 years ago

@chainsawriot yep I'll get on this. I feel like this could also be a case where we could add an argument for exact_phrase or something too, which would wrap the character vector in escaped quotes. I know we haven't included arguments like this for other similar cases, but I think this is a bit different from problems caused by misunderstandings of AND and OR logic. Plus, escaped quotes are ugly and error prone, and it'd be nice to hide them under the hood!

cjbarrie commented 3 years ago

Have now done this in #242, which adds an exact_phrase parameter.

chainsawriot commented 3 years ago

@cjbarrie Thanks for implementing the feature. I am in the process of writing tests for your added feature and I am afraid the feature is not well-tested.

Consider a slightly more advanced example in the README. Suppose I want to search for the exact phrase "Black Lives Matter" among retweets of "@ACLU". The current implementation generates this query: \"Black Lives Matter (retweets_of:ACLU)\", so (retweets_of:ACLU) becomes part of the "exact phrase". This query will surely give no results.

The problem comes from the order in which "exact_phrase" is handled in build_query. I believe it should be handled first, not fifth. But please let me know what you think; maybe I have misunderstood something.

require(academictwitteR)
#> Loading required package: academictwitteR
tweets1 <-
  get_all_tweets(
    query = "Black Lives Matter",
    retweets_of = "ACLU",
    exact_phrase = TRUE, start_tweets = "2021-01-04T00:00:00Z", 
    end_tweets = "2021-01-04T00:45:00Z", 
    n = Inf)
#> Warning: Recommended to specify a data path in order to mitigate data loss when
#> ingesting large amounts of data.
#> Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
#> available in local memory if assigned to an object.
#> query:  "Black Lives Matter (retweets_of:ACLU)" 
#> Total pages queried: 1 (tweets captured this page: 0).
#> This is the last page for "Black Lives Matter (retweets_of:ACLU)" : finishing collection.

tweets2 <- get_all_tweets(query = "\"Black Lives Matter\"", retweets_of = "ACLU",
                          start_tweets = "2021-01-04T00:00:00Z", end_tweets = "2021-01-04T00:45:00Z", 
                          n = Inf)
#> Warning: Recommended to specify a data path in order to mitigate data loss when
#> ingesting large amounts of data.

#> Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
#> available in local memory if assigned to an object.
#> query:  "Black Lives Matter" (retweets_of:ACLU) 
#> Total pages queried: 1 (tweets captured this page: 110).
#> This is the last page for "Black Lives Matter" (retweets_of:ACLU) : finishing collection.

nrow(tweets1)
#> [1] 0
nrow(tweets2)
#> [1] 110

testthat::expect_true(nrow(tweets1) > 0)
#> Error: nrow(tweets1) > 0 is not TRUE
#> 
#> `actual`:   FALSE
#> `expected`: TRUE
testthat::expect_true(nrow(tweets2) > 0)

build_query(query = "Black Lives Matter", exact_phrase = TRUE, retweets_of = "ACLU")
#> [1] "\"Black Lives Matter (retweets_of:ACLU)\""

Created on 2021-10-17 by the reprex package (v2.0.0)
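The ordering problem can also be illustrated without calling the API. The sketch below uses two hypothetical helpers, `quote_phrase()` and `add_retweets_of()`, which are simplified stand-ins and not the package's actual build_query internals. It only shows why the quoting step has to run before other operators are appended.

```r
# Hypothetical stand-ins for illustration, not academictwitteR internals.
quote_phrase <- function(query) {
  paste0("\"", query, "\"")
}
add_retweets_of <- function(query, user) {
  paste0(query, " (retweets_of:", user, ")")
}

# Quoting last pulls the operator inside the phrase,
# which matches the build_query output reported above.
wrong <- quote_phrase(add_retweets_of("Black Lives Matter", "ACLU"))

# Quoting first keeps the operator outside the phrase,
# which matches the hand-written query that returned 110 tweets.
right <- add_retweets_of(quote_phrase("Black Lives Matter"), "ACLU")

cat(wrong, "\n")
cat(right, "\n")
```

A string comparison like this could serve as an offline unit test for the query builder, since it needs no bearer token.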

cjbarrie commented 3 years ago

@chainsawriot you are right. Thank you for spotting this. I have implemented a change in #247