cjbarrie / academictwitteR

Repo for academictwitteR package to query the Twitter Academic Research Product Track v2 API endpoint.
Other
272 stars 59 forks source link

tweets_lookup & error tracking for failed tweets #260

Closed TimBMK closed 2 years ago

TimBMK commented 2 years ago

Added a function to access the Tweets Lookup endpoint of the Twitter API v2, tweets_lookup. This is especially useful when looking to rehydrate tweet datasets. In order to track errors in the process (i.e., which and why tweets could not be rehydrated), optional error tracking was implemented into the get_tweets() and df_to_json() was implemented. Currently, this option is only implemented into tweets_lookup, but may prove useful for other applications

chainsawriot commented 2 years ago

@TimBMK Thanks so much for the contribution! I think this definitely needs some unit tests. But I will take care of this.

But in the meantime, could you provide us with your ORCID (for the DESCRIPTION)?

TimBMK commented 2 years ago

Sure thing! I've written the functions for a project and figured I might as well contribute them. Sorry if it's not as clean as it could be - I'm somewhat new to using git. Here's my ORCID: https://orcid.org/0000-0002-2852-2690

chainsawriot commented 2 years ago

@TimBMK Not bad for a novice! Keep it up!

Once again, thanks for your contribution and I will definitely add your ORCID to the DESCRIPTION.

As there are actually 2 features (tweets_lookup and error tracking), I will merge your PR to two different feature branches and I will merge them to the Master individually.

Also, I want to ask you: As all functions in this package use the naming format of "verb_noun", would you mind I rename the function tweets_lookup to the "verb_noun" format? Would it be nice for the function to be called hydrate_tweets (get_tweets is already an internal function and the name overlaps a lot with get_all_tweets)?

TimBMK commented 2 years ago

Yes, I was thinking about the naming conventions also. I wanted to avoid producing name conflicts with rtweet's lookup_tweets function. For consistency with the Twitter naming scheme, I went with tweets_lookup (as they call it in their API documentation). However, I'm open to naming suggestions. While hydration is technically not the only function of the tweets lookup endpoint, it is probably the most common in an academic context. So feel free to rename it accordingly!

cjbarrie commented 2 years ago

Thank you very much, @TimBMK! This is very useful contribution

TimBMK commented 2 years ago

Found a bug that would cause the function to fail when bind_tweets == F. See latest commit in my branch

chainsawriot commented 2 years ago

@TimBMK You see that I pushed a commit to your Master? I think I will incorporate your fix on bind_tweet and push it to your Master.

There might be other bugs (thus we need Unit Tests).

TimBMK commented 2 years ago

Yes, feel free. This was one obvious one that needed fixing, especially since the bind_tweets becomes a bottleneck when rehydrating very large datasets

chainsawriot commented 2 years ago

@TimBMK As you might already know, I have separated your PR into two parts. So let's focus on the hydrate part first. (We'll deal with the error capture part later. okay?)

I've added the documentation, some test cases, fixed a few bugs, and made the behaviors of the function similar to get_all_tweets (e.g. the function does work when bind_tweets is FALSE and data_path is NULL; and TONS OF WARNINGS when verbose is TRUE). I think it should be mergeable. But before I merge, please give it a try because I might have misinterpreted your intentions.

It's in your Master und mit freundlichen Grüßen aus Mannheim. Vielen Dank!

TimBMK commented 2 years ago

I fixed a bug that would lead to the function failing when verbose and bind_tweets==F. Other than that, we could consider cleaning the verbose a little, since the tweets endpoint functions slightly different. That is, each call is batched into 100 tweets (the maximum the endpoint allows), so it never returns more than one page. (Correct me here if there's a way to change that - it might speed up the function!) So the page info (Total Pages queried..., This is the last page for...) is somewhat redundant - which is why I aimed to replace it with new messages reporting on tweets retrieved. But I don't want to interfere with your reporting conventions, so let me know what you think is best here?

Gruß zurück aus Berlin und immer gerne!

chainsawriot commented 2 years ago

@TimBMK Fair enough, I'm actually not a fan of the verbosity of the package (thus I actively enforce the verbose option). But this is not "my package" but "our package".

Another option that we can experiment with is to use a progress bar. Like bind_tweets

https://github.com/cjbarrie/academictwitteR/blob/352e359bc69dc23148be33e27f71065610d316a2/R/bind_tweets.R#L41-L57

For a function that is mostly used for hydrating a large number of tweet ids, probably a minimalistic progress bar is less verbose (even when verbose is TRUE).

chainsawriot commented 2 years ago

It's less verbose now (in your master).

devtools::load_all("~/dev/academictwitteR")
#> ℹ Loading academictwitteR
emptydir <- academictwitteR:::.gen_random_dir()
fff <- readRDS("~/dev/academictwitteR/tests/testdata/fff_de.RDS")
manyf <- c(fff, fff, fff)

res <- hydrate_tweets(manyf, verbose = TRUE, bind_tweets = FALSE, data_path = emptydir)
#> Batch 1 out of 4 : ids 1266876474440761346 to 1266814311298629640 
#> Batch 2 out of 4 : ids 1266813392293150721 to 1266819115236364290 
#> Batch 3 out of 4 : ids 1266818671952957440 to 1266821913831235584 
#> Batch 4 out of 4 : ids 1266821749464858632 to 1266810962985848833
unlink(emptydir, recursive = TRUE)

Created on 2021-12-17 by the reprex package (v2.0.1)

chainsawriot commented 2 years ago

I've also made an experimental branch using the progress bar (along n_batches). I think it looks cleaner. But please let me know how you think @TimBMK

devtools::load_all("~/dev/academictwitteR")
#> ℹ Loading academictwitteR
emptydir <- academictwitteR:::.gen_random_dir()
fff <- readRDS("~/dev/academictwitteR/tests/testdata/fff_de.RDS")
manyf <- c(fff, fff, fff)

res <- hydrate_tweets(manyf, verbose = TRUE, bind_tweets = FALSE, data_path = emptydir)
#> Hydrating... 
#> ================================================================================
unlink(emptydir, recursive = TRUE)

Created on 2021-12-17 by the reprex package (v2.0.1)

TimBMK commented 2 years ago

I'm not generally opposed to verbosity, I just think in this instance it does not convey meaningful information. So I like the minimal verbosity apporach you implemented, thanks! While I like the progress bar (and was thinking of implementing it myself), I found it extremely helpful in practice to have a printout of the tweet_ids that are being run. So that if the script fails for whatever reason, you know where to pick up without having to search through the data retrieved. I've proposed a change to the experimental branch to make the progress bar an option seperate from verbose, so users can choose. However, I haven't found an option yet to make them work well together. Any ideas on that matter? Alternatively, we could fit the verbose option with "progress_bar" or "ids" options to let users choose

chainsawriot commented 2 years ago

OK then, I will merge your master branch. We can deal with the progress bar for the future.