cjbarrie / academictwitteR

Repo for academictwitteR package to query the Twitter Academic Research Product Track v2 API endpoint.

[BUG] bind_tweets(): 'Column `id` doesn't exist.' with empty data_.json #304

Open TimBMK opened 2 years ago

TimBMK commented 2 years ago


Describe the bug

As soon as there is a .json file without an ID in its filename ("data_.json") in the data path passed to bind_tweets(), the function fails with an error when set to the "tidy" output format. Generating the "raw" format, however, works fine. The following error occurs:

Error in `stop_subscript()`:
! Can't rename columns that don't exist.
x Column `id` doesn't exist.
Backtrace:
  1. academictwitteR::bind_tweets(data_path = "data/2017", output_format = "tidy")
  9. dplyr:::rename.data.frame(., pki = tidyselect::all_of(pkicol))
 10. tidyselect::eval_rename(expr(c(...)), .data)
 11. tidyselect:::rename_impl(...)
 12. tidyselect:::eval_select_impl(...)
 21. tidyselect:::vars_select_eval(...)
 22. tidyselect:::walk_data_tree(expr, data_mask, context_mask, error_call)
 23. tidyselect:::eval_c(expr, data_mask, context_mask)
 24. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
 25. tidyselect:::walk_data_tree(new, data_mask, context_mask)
 26. tidyselect:::as_indices_sel_impl(...)
 27. tidyselect:::as_indices_impl(x, vars, call = call, strict = strict)
 28. tidyselect:::chr_as_locations(x, vars, call = call)
 29. vctrs::vec_as_location(x, n = length(vars), names = vars)
 30. vctrs `<fn>`()
 31. vctrs:::stop_subscript_oob(...)
 32. vctrs:::stop_subscript(...)
Run `rlang::last_trace()` to see the full context.

The data_.json is usually an empty file, but it seems to get generated whenever native academictwitteR functions do not return any Twitter data (empty pages). The last three times I used get_user_timeline(), I ended up with these empty files. Deleting the data_.json file fixes the error. Furthermore, I believe the problem only started occurring after I updated academictwitteR to 0.3.1; I don't think it occurred under 0.2.1.

Expected Behavior

I would suggest some sort of failsafe that automatically skips .json files without an ID in their filename, as they seem to be empty anyway.
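In the meantime, a workaround is possible outside the package. A minimal sketch, assuming the only offending file is the ID-less data_.json (as in my case):

# Hypothetical pre-cleaning step: drop the ID-less data_.json, which is
# empty in my runs, before calling bind_tweets().
bad <- file.path("data/test", "data_.json")
if (file.exists(bad)) file.remove(bad)

data <- bind_tweets(data_path = "data/test", output_format = "tidy")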

Steps To Reproduce

users <- c("303234771", "2821282972", "84803032", "154096311", "2615232002", "37776042", "2282315483", "405599246", "1060861584938057728", "85161049")

get_user_timeline(x = users,
                  start_tweets = "2017-04-01T00:00:00Z",
                  end_tweets = "2017-06-01T00:00:00Z",
                  bearer_token = bearer_token,
                  n = 3200,
                  data_path = "data/test",
                  bind_tweets = FALSE)

list.files("data/test")
[1] "data_.json"                    "data_848204306566320128.json"  "data_848950153520218113.json"  "users_.json"                   "users_848204306566320128.json"
[6] "users_848950153520218113.json"

data <- bind_tweets(data_path = "data/test", output_format = "tidy")

data_raw <- bind_tweets(data_path = "data/test", output_format = "raw")

Environment

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] academictwitteR_0.3.1

loaded via a namespace (and not attached):
 [1] fansi_1.0.2      assertthat_0.2.1 utf8_1.2.2       crayon_1.5.0     dplyr_1.0.8      R6_2.5.1         jsonlite_1.8.0   DBI_1.1.2        lifecycle_1.0.1  magrittr_2.0.2  
[11] pillar_1.7.0     rlang_1.0.1      cli_3.2.0        rstudioapi_0.13  fs_1.5.2         vctrs_0.3.8      generics_0.1.2   ellipsis_0.3.2   tools_4.1.2      glue_1.6.2      
[21] purrr_0.3.4      compiler_4.1.2   pkgconfig_2.0.3  tidyselect_1.1.2 tibble_3.1.6     usethis_2.1.5

Anything else?

Possibly related to #218

chainsawriot commented 2 years ago

@TimBMK Thanks for reporting the bug. I can reproduce this.

require(academictwitteR)
#> Loading required package: academictwitteR
users <- c("303234771", "2821282972", "84803032", "154096311", "2615232002", "37776042", "2282315483", "405599246", "1060861584938057728", "85161049")

tempdir <- academictwitteR:::.gen_random_dir()

get_user_timeline(x = users,
                  start_tweets = "2017-04-01T00:00:00Z",
                  end_tweets = "2017-06-01T00:00:00Z",
                  n = 3200,
                  data_path = tempdir,
                  bind_tweets = FALSE,
                  verbose = FALSE)
#> data frame with 0 columns and 0 rows

list.files(tempdir)
#> [1] "data_.json"                    "data_848204306566320128.json" 
#> [3] "data_848950153520218113.json"  "query"                        
#> [5] "users_.json"                   "users_848204306566320128.json"
#> [7] "users_848950153520218113.json"
data <- bind_tweets(data_path = tempdir, output_format = "tidy")
#> Error in `stop_subscript()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `id` doesn't exist.
data_raw <- bind_tweets(data_path = tempdir, output_format = "raw")

Created on 2022-03-10 by the reprex package (v2.0.1)

There are actually two issues here:

  1. get_user_timeline() shouldn't generate those empty json files in the first place.
  2. bind_tweets() can't handle those empty json files (a possible guard is sketched below).
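A minimal sketch of such a guard for the second issue (this is not the package's actual internal code; it just skips zero-byte or tweet-less files before binding):

read_tweet_file <- function(path) {
  if (file.size(path) == 0) return(NULL)  # e.g. the empty data_.json above
  parsed <- jsonlite::fromJSON(path)
  if (NROW(parsed) == 0) return(NULL)     # parses, but contains no tweets
  parsed
}

files <- list.files(tempdir, pattern = "^data_", full.names = TRUE)
tweets <- Filter(Negate(is.null), lapply(files, read_tweet_file))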
chainsawriot commented 2 years ago

@TimBMK I will keep this issue focused on the second problem only, and open another issue for the first one.

psalmuel19 commented 2 years ago

Hello,

This worked for me: batch_four <- bind_tweets('data', user = FALSE, verbose = TRUE, output_format = "raw")

but when trying to convert to csv with: write.csv(batch_four, 'batch_4.csv')

I get this error: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 538, 519, 575, 392, 190, 1, 282, 603, 111

TimBMK commented 2 years ago

@psalmuel19 this is unrelated to the issue mentioned above, as it is clearly caused by write.csv() rather than bind_tweets(). I suspect the nested lists in the raw data format are causing the problem. Try unnesting batch_four or use output_format = "tidy" when binding the tweets; a quick way to inspect the structure is shown below. If the issue persists, please open a separate issue.
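For instance (a hedged illustration; batch_four is the raw-format object from your call above), the raw output is a named list of tibbles with differing row counts, which is exactly what write.csv() cannot coerce into a single data frame:

str(batch_four, max.level = 1)  # one tibble per type of returned API object
sapply(batch_four, nrow)        # differing row counts, matching the error message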

psalmuel19 commented 2 years ago

@TimBMK I should have mentioned that I did that and got the error below:

batch_four <- bind_tweets('data', user = FALSE, verbose = TRUE, output_format = "tidy")
Error in `chr_as_locations()`:
! Can't rename columns that don't exist.
✖ Column `id` doesn't exist.

While searching for a solution, I came across the output_format = "raw" code. It worked in binding but I now can't convert to csv. Any suggestions please?

TimBMK commented 2 years ago

As mentioned in the original post, the easiest fix to get the tidy format to work is to go into the data folder and manually delete the empty "data_.json" files. With those gone, the error about the non-existent `id` column no longer comes up.

The raw format does not output a data frame but a list of tibbles (a type of data frame) of different lengths, each containing different information (this is what the API originally returns). If you are set on using the raw format, you will have to decide which information you want to export to .csv.

If you look at the structure of the raw data object (batch_four in your case), it is relatively self-explanatory what each tibble contains; an easy way to check is names(batch_four). To export the data, write out the tibbles by referencing them explicitly, e.g. write.csv(batch_four$tweet.main, file = "batch_4.csv"). tweet.main contains the main information of the tweet; additional information (e.g. metrics) needs to be matched in separately. You can use dplyr's left_join() for this, with the tweet_id as the key for matching (a sketch follows at the end of this comment).

As mentioned above, however, removing the problematic files by hand will enable the tidy format, which gives you all the relevant data in a neat, ready-made format.
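For illustration, a hedged sketch of that join (the tweet.public_metrics element name and the tweet_id key are assumptions here; check names(batch_four) for the exact names in your object):

library(dplyr)

# Join the per-tweet metrics onto the main tweet tibble by tweet id,
# then export one flat table to csv.
out <- left_join(batch_four$tweet.main,
                 batch_four$tweet.public_metrics,
                 by = "tweet_id")
write.csv(out, file = "batch_4.csv", row.names = FALSE)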