cjbarrie / academictwitteR

Repo for academictwitteR package to query the Twitter Academic Research Product Track v2 API endpoint.

[FR] Auto-Splitting long Queries #321

Open TimBMK opened 2 years ago

TimBMK commented 2 years ago

Describe the solution you'd like

For large numbers of e.g. user_ids or conversation_ids that need to be passed to the Twitter API, simple loops with only one ID per API call are highly inefficient. Combining multiple IDs into a single query speeds the process up significantly, but doing so is cumbersome because queries must stay within the 1024-character limit. Splitting the IDs into query chunks by hand is impractical for large datasets, and IDs can vary significantly in length (think old vs. new accounts). It would be very convenient if academictwitteR took over the splitting of the queries, either as a dedicated function or as standard functionality in get_all_tweets() (and similar functions).
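
To illustrate the scale of the problem (build_query() is academictwitteR's own query builder; the user names below are made up):

library(academictwitteR)

user_ids <- sprintf("example_user_%04d", 1:500)  # hypothetical account names
query <- build_query(users = user_ids)           # joined as (from:a OR from:b OR ...)
nchar(query)                                     # several thousand characters, far over the 1024 limit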

A while ago, I wrote myself a very makeshift convenience function to split the IDs into query chunks and pass them to the Twitter API. I'm sure there's a more elegant way to do it (I could imagine functionality that automatically picks optimal batch sizes rather than setting them by hand; see the sketch below). Do you think that'd be a helpful addition for one of the next releases, or is there already something similar in the works?
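
A rough sketch of what automatic batch sizing could look like: greedily pack IDs into groups whose OR-joined query stays under the character limit. This is only an illustration, not existing academictwitteR functionality; split_ids and the per-ID overhead estimate are made up:

split_ids <- function(ids, limit = 1024, overhead = 12) {
  # greedily pack IDs so each group's OR-joined query stays under `limit`;
  # `overhead` is a rough per-ID allowance for operators like "from:" / " OR "
  batches <- list()
  current <- character(0)
  current_len <- 0
  for (id in ids) {
    id_len <- nchar(id) + overhead
    if (current_len + id_len > limit && length(current) > 0) {
      batches[[length(batches) + 1]] <- current
      current <- character(0)
      current_len <- 0
    }
    current <- c(current, id)
    current_len <- current_len + id_len
  }
  if (length(current) > 0) batches[[length(batches) + 1]] <- current
  batches
}

# usage sketch: one query per packed batch
# for (b in split_ids(conversation_ids)) {
#   query <- build_query(conversation_id = b)
#   ...
# }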

Anything else?

Here's the makeshift function I wrote a while back:


query_splitter <- function(input, type = c("users", "conversations", "tweets"),
                           batch, batchsize = 1024,
                           start_tweets, end_tweets, n,
                           bind = FALSE, data_path, bearer_token) {

  require(academictwitteR)
  require(stringr)

  type <- match.arg(type)

  # make index
  data <- data.frame(input = input, index = seq_along(input),
                     stringsAsFactors = FALSE)

  # checkup loop (are the queries too long?)
  for (i in seq(1, length(data$input), batch)) {

    lookup <- na.omit(data$input[i:(i + (batch - 1))])

    if (type == "users") {
      query <- build_query(users = str_trim(lookup)) # str_trim just in case
      if (nchar(query) > batchsize) {
        stop(paste("Query too long:", nchar(query), "characters. Adjust batch size."))
      }
    }

    if (type == "conversations") {
      query <- build_query(conversation_id = str_trim(lookup))

      # currently, academictwitteR's behaviour differs for conversation_ids when
      # building the query, hence the different evaluation. Might need fixing
      # later (if changed). This only emulates the query-build behaviour in
      # academictwitteR to check the length; it has no influence on the actual query.
      full_query <- paste0("(", paste(query, collapse = " OR "), ")")
      if (nchar(full_query) > batchsize) {
        stop(paste("Query too long:", nchar(full_query), "characters. Adjust batch size."))
      }
    }

    if (type == "tweets") {
      # only sketched: builds an ids= string for the tweet-lookup endpoint;
      # this type is not dispatched in the retrieval loop below, since
      # get_all_tweets() queries the search endpoint, not the lookup endpoint
      query <- paste0("ids=", paste(lookup, collapse = ","))
      if (nchar(query) > batchsize) {
        stop(paste("Query too long:", nchar(query), "characters. Adjust batch size."))
      }
    }

  }

  # actual loop
  for (i in seq(1, length(data$input), batch)) {

    # progress indicator
    cat(paste("\nBatch", ceiling(i / batch), "/", ceiling(length(data$input) / batch),
              "\nInput row numbers", i, "to", (i + (batch - 1)), "\n"))

    lookup <- na.omit(data$input[i:(i + (batch - 1))])

    if (type == "users") {
      query <- build_query(users = str_trim(lookup)) # str_trim just in case
    }

    if (type == "conversations") {
      query <- build_query(conversation_id = str_trim(lookup))
    }

    if (type == "tweets") {
      stop("type = 'tweets' is not implemented for retrieval yet")
    }

    # try() so a failed batch does not abort the remaining batches
    try(
      get_all_tweets(query = query, start_tweets = start_tweets,
                     end_tweets = end_tweets, n = n,
                     data_path = data_path, bind_tweets = FALSE,
                     bearer_token = bearer_token)
    )
  }

  # binding (if asked for). Note that this binds all .jsons in the specified
  # folder, not only the output of the split query. This is only a convenience option!
  if (isTRUE(bind)) {
    output <- bind_tweets(data_path = data_path, output_format = "tidy")
    return(output)
  }

}

### Test it
test_ids <- c("1504044853512052740", "1504044835799457796", "1504044817315033092", 
"1504044815675269120", "1504044789758574593", "1503397457870344192", 
"1504044722846932996", "1504044699107180550", "1504044623374831627", 
"1504044583977725953", "1504044529195888649", "1504044507456802816", 
"1504044385067016196", "1503430145897635844", "1504044362501607424", 
"1504044357585936387", "1503998516791848960", "1504044298349826049", 
"1504044296839872512", "1504044253206523908", "1504044175444033541", 
"1504044160906584064", "1504034436215717891", "1504044034507091970", 
"1504044019059470347", "1504044000097050626", "1503691679177617410", 
"1504043942387535879", "1504043923748102145", "1504043919138607106", 
"1504043900238979084", "1504043860183371778", "1504043857104846850", 
"1504043803031842819", "1504043736489238534", "1504043705681985538", 
"1504043699042504713", "1504043609837953031", "1504043557560147968", 
"1504043521782829057", "1504043498495950849", "1504037624356483072", 
"1504043490153488398", "1504043433886900227", "1504043404686155776", 
"1504043357336702980", "1504043340345524227", "1504043296280227842", 
"1503985668569108481", "1504043041384022023", "1504042992394457090", 
"1504042943446929412", "1504042817525592064", "1504042805135556610", 
"1504042740425838598", "1504042734369353730", "1504042728052731911", 
"1504042721434087433", "1504042683786010627", "1504042625556525056"
)

test <- query_splitter(
  test_ids,
  batch = 20,
  type = "conversations",
  start_tweets = "2022-02-24T00:00:00Z",
  end_tweets = "2022-04-24T00:00:00Z",
  n = Inf,
  data_path = "data/conversations",
  bearer_token = bearer_token,
  bind = TRUE
)