[FR] Add "reply_count" column to the output of bind_tweets(data_path = "tweetdata", output_format = "tidy")

Jacobzwj commented 2 years ago

Describe the solution you'd like

Thanks for this wonderful package!

I have a question about the tidy format output:

We can get the tidy format with bind_tweets(data_path = , output_format = "tidy"), but the result only contains "retweet_count", "like_count" and "quote_count" (see the following picture). I cannot find "reply_count", which is also a very useful column for researchers, so I am raising this issue. Is there a way to get "reply_count" in the tidy output_format?

Thanks again, Jacob

Anything else?

chainsawriot commented 2 years ago

@Qiegao1994 Thanks for raising this. The tidy format is opinionated, so it doesn't include reply_count by default. You can see the relevant code here:

https://github.com/cjbarrie/academictwitteR/blob/5ed91fff98294f2d554d74aa1a911dc23d62c684/R/bind_tweets.R#L101-L104

If you need that data now, I recommend also binding the data in the 'raw' format and then joining the reply count back to the tidy data frame.

require(academictwitteR)
#> Loading required package: academictwitteR
require(tidyverse)
#> Loading required package: tidyverse
temp_path <- academictwitteR:::.gen_random_dir()
get_all_tweets("data @twitterdev", start_tweets = "2020-12-01T00:00:00Z", end_tweets = "2021-01-01T00:00:00Z", is_retweet = FALSE, data_path = temp_path, bind_tweets = FALSE, verbose = FALSE)
x <- bind_tweets(temp_path, output_format = "tidy")
y <- bind_tweets(temp_path, output_format = "raw")
y$tweet.public_metrics.reply_count
#> # A tibble: 7 × 2
#>   tweet_id             data
#>   <chr>               <int>
#> 1 1339351148050796544     0
#> 2 1339037915280654337     0
#> 3 1338309611946991621     0
#> 4 1338221705358143489     0
#> 5 1336432113244114944     0
#> 6 1334620555668963328     0
#> 7 1334596974318903297     1
y$tweet.public_metrics.reply_count %>% rename(reply_count = "data") %>% left_join(x, by = "tweet_id")
#> # A tibble: 7 × 32
#>   tweet_id     reply_count user_username text  created_at conversation_id source
#>   <chr>              <int> <chr>         <chr> <chr>      <chr>           <chr> 
#> 1 13393511480…           0 shreerangp    "@Tw… 2020-12-1… 13393502089421… Twitt…
#> 2 13390379152…           0 SawavehVezhh… "@ge… 2020-12-1… 13390379152806… Twitt…
#> 3 13383096119…           0 graylanj      "@gg… 2020-12-1… 13381473230970… Twitt…
#> 4 13382217053…           0 chRSBGREEN    "Tha… 2020-12-1… 13380293740298… Twitt…
#> 5 13364321132…           0 Tuhung16      "@Th… 2020-12-0… 13338020140613… Twitt…
#> 6 13346205556…           0 Brodie9992    "&lt… 2020-12-0… 13346205556689… Twitt…
#> 7 13345969743…           1 jamie_maguir… "@Tw… 2020-12-0… 13345644888848… Twitt…
#> # … with 25 more variables: lang <chr>, in_reply_to_user_id <chr>,
#> #   possibly_sensitive <lgl>, author_id <chr>, user_name <chr>,
#> #   user_verified <lgl>, user_profile_image_url <chr>, user_description <chr>,
#> #   user_url <chr>, user_location <chr>, user_created_at <chr>,
#> #   user_protected <lgl>, user_pinned_tweet_id <chr>, retweet_count <int>,
#> #   like_count <int>, quote_count <int>, user_tweet_count <int>,
#> #   user_list_count <int>, user_followers_count <int>, …

Created on 2022-02-22 by the reprex package (v2.0.1)
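
If the column order of the tidy data frame should be preserved, the join can also be written the other way round. The following is only a minimal sketch on made-up toy data: `tidy_tweets`, `raw_reply_counts` and `add_reply_count()` are hypothetical names standing in for `x`, `y$tweet.public_metrics.reply_count` and a small helper. None of them is part of academictwitteR.

```r
library(dplyr)

# Toy stand-ins for the two bind_tweets() outputs (values are invented).
tidy_tweets <- tibble(
  tweet_id = c("1", "2", "3"),
  text = c("a", "b", "c"),
  retweet_count = c(0L, 2L, 1L)
)
raw_reply_counts <- tibble(
  tweet_id = c("1", "2", "3"),
  data = c(0L, 0L, 1L)
)

# Hypothetical helper: attach reply_count to the tidy data frame.
add_reply_count <- function(tidy, reply_counts) {
  counts <- reply_counts %>%
    rename(reply_count = "data") %>%
    # Guard against duplicated ids so the join cannot multiply rows.
    distinct(tweet_id, .keep_all = TRUE)
  # Joining from the tidy side keeps its columns first and its row order.
  left_join(tidy, counts, by = "tweet_id")
}

result <- add_reply_count(tidy_tweets, raw_reply_counts)
```

With real data, the call would presumably be `add_reply_count(x, y$tweet.public_metrics.reply_count)`.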

Jacobzwj commented 2 years ago

Thanks for the help! I finally got "reply_count" in a way similar to what you suggested. Thanks a lot!

As a user, I would still appreciate it if "reply_count" became a default column of the tidy format in a future version of this package. The solution you provided is useful, but it is time-consuming with "big data". For example, I have around 500 files to pass to bind_tweets(). Running bind_tweets() twice, once with "tidy" and once with "raw", and then merging the results took about twice as long. With reply_count added, the columns in the tidy version would be enough for sociology research.
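
One way to avoid the second bind_tweets() pass might be to read only the reply counts straight from the stored JSON files. This is a rough sketch under assumptions, not verified package behaviour: it assumes the tweet files in data_path are named `data_*.json` and hold an array of tweet objects with a nested `public_metrics` field, which is worth checking against your own files. `read_reply_counts()` is a hypothetical helper, and the toy file below is invented so the example can run on its own.

```r
library(jsonlite)

# Build a toy data directory; file name and JSON layout are assumptions.
data_path <- tempfile("tweetdata")
dir.create(data_path)
toy_tweets <- list(
  list(id = "1", text = "first",
       public_metrics = list(retweet_count = 0, reply_count = 0,
                             like_count = 2, quote_count = 0)),
  list(id = "2", text = "second",
       public_metrics = list(retweet_count = 1, reply_count = 3,
                             like_count = 5, quote_count = 0))
)
write_json(toy_tweets, file.path(data_path, "data_1.json"), auto_unbox = TRUE)

# Hypothetical helper: read only tweet ids and reply counts from each file.
# Assumes every matching file contains at least one tweet.
read_reply_counts <- function(path) {
  files <- list.files(path, pattern = "^data_.*\\.json$", full.names = TRUE)
  per_file <- lapply(files, function(f) {
    tweets <- fromJSON(f)  # nested objects become nested data frames
    data.frame(
      tweet_id = tweets$id,
      reply_count = as.integer(tweets$public_metrics$reply_count),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, per_file)
}

reply_counts <- read_reply_counts(data_path)
```

The resulting data frame could then be joined to the tidy output by "tweet_id", in the same way as the raw reply counts.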

Thanks again for your prompt reply and help!

Best, Jacob

cjbarrie / academictwitteR

[FR] Add "reply_count" column to the output of bind_tweets(data_path = "tweetdata", output_format = "tidy") #294
