cjbarrie / academictwitteR

Repo for academictwitteR package to query the Twitter Academic Research Product Track v2 API endpoint.
Other
272 stars 59 forks source link

[BUG] Diminishing data on get_user_following() & get_user_followers() #323

Open TimBMK opened 2 years ago

TimBMK commented 2 years ago

Please confirm the following

Describe the bug

Both functions get_user_following() and get_user_followers() at some point start returning less datapoints (i.e. follower/followee) for users where there should be considerably more. Eventually, this comes down to 1 datapoint. This, however, is not an actual datapoint: it is a row with all NULL/NA values except the from_id. That is, eventually, no actual data is returned from the endpoint, but the function states it as a returned datapoint (which makes the bug hard to notice). This seems to only start occuring after the rate limit has been hit more than once. This may suggest a problem with the API side. However, I did not find any statements on rate limits other than the 15 lookups per 15 minutes stated here.

Expected Behavior

When looking up larger numbers of users' followers/followees through the respective functions, I would expect them to consistently return the correct data. If additional rate limits need to be adherred, I would expect the function to do this in line with the "sleep" behaviour already implemented. If this is not possible, I would expect the function to throw an error, rather than silently returning no data.

Steps To Reproduce

test_users <- c("827090742162100224", "1683845126", "1377117206", "1073605033", 
"551802475", "262730721", "69814084", "53617577", "47151012", 
"38272947", "24674518", "19924848", "910913583168573440", "347792540", 
"308415277", "127620350", "46085533", "18933321", "1340469592666324992", 
"1235523759597068288", "1201838123841404929", "1187686813168820224", 
"1181598578508214272", "1159457070124605442", "1159072277746593792", 
"1148916458983964673", "1135510177292197888", "1132260527378489345", 
"1127961248493129728", "1125751445205262336", "1113004743733993472", 
"1096063189249331201", "1095023450790547457", "1085599614122762242", 
"1085494381405237251", "1083371289786634240", "1082925763316367360", 
"1071665423140249600", "1070312229139025920", "1061873979760263168", 
"1052524172717375488", "1042109887604563968", "1040160799208161280", 
"1020297087970435072", "1019846605044936705", "1009457791554281473", 
"1009025255753601024", "1006099793888841729", "1003656299589132291", 
"1001837201276235777")

get_user_following(test_users, bearer_token = bearer_token)

For me, the problem started occuring about halfway through the second chunk (after the first rest). Notice how user 18933321 returns only 111 instead of its actual 1.076 users. Afterwards, we get only one (empty) datapoint per user. This behaviour might vary per use, but I can trace the exact same pattern in another log. Here's the log for the above example:

Processing 827090742162100224
Total data points:  1000 
Total data points:  1246 
This is the last page for  827090742162100224 : finishing collection. 
Processing 1683845126
Total data points:  2152 
This is the last page for  1683845126 : finishing collection. 
Processing 1377117206
Total data points:  2463 
This is the last page for  1377117206 : finishing collection. 
Processing 1073605033
Total data points:  2490 
This is the last page for  1073605033 : finishing collection. 
Processing 551802475
Total data points:  2491 
This is the last page for  551802475 : finishing collection. 
Processing 262730721
Total data points:  2626 
This is the last page for  262730721 : finishing collection. 
Processing 69814084
Total data points:  2637 
This is the last page for  69814084 : finishing collection. 
Processing 53617577
Rate limit reached. Rate limit will reset at 2022-05-24 15:20:40 
Sleeping for 438 seconds. 
==========================================================================================================================================================================================Total data points:  2664 
This is the last page for  53617577 : finishing collection. 
Processing 47151012
Total data points:  2665 
This is the last page for  47151012 : finishing collection. 
Processing 38272947
Total data points:  2699 
This is the last page for  38272947 : finishing collection. 
Processing 24674518
Total data points:  3527 
This is the last page for  24674518 : finishing collection. 
Processing 19924848
Total data points:  3528 
This is the last page for  19924848 : finishing collection. 
Processing 910913583168573440
Total data points:  3907 
This is the last page for  910913583168573440 : finishing collection. 
Processing 347792540
Total data points:  4506 
This is the last page for  347792540 : finishing collection. 
Processing 308415277
Total data points:  4512 
This is the last page for  308415277 : finishing collection. 
Processing 127620350
Total data points:  4818 
This is the last page for  127620350 : finishing collection. 
Processing 46085533
Total data points:  5533 
This is the last page for  46085533 : finishing collection. 
Processing 18933321
Total data points:  6533 
Total data points:  6564 
This is the last page for  18933321 : finishing collection. 
Processing 1340469592666324992
Total data points:  6565 
This is the last page for  1340469592666324992 : finishing collection. 
Processing 1235523759597068288
Total data points:  6566 
This is the last page for  1235523759597068288 : finishing collection. 
Processing 1201838123841404929
Total data points:  6567 
This is the last page for  1201838123841404929 : finishing collection. 
Processing 1187686813168820224
Rate limit reached. Rate limit will reset at 2022-05-24 15:35:51 
Sleeping for 862 seconds. 

The exact same issue occurs with get_user_followers():

get_user_followers(test_users, bearer_token = bearer_token)
Processing 827090742162100224
Total data points:  1000 
Total data points:  2000 
Total data points:  3000 
Total data points:  4000 
Total data points:  5000 
Total data points:  5221 
This is the last page for  827090742162100224 : finishing collection. 
Processing 1683845126
Total data points:  6203 
This is the last page for  1683845126 : finishing collection. 
Processing 1377117206
Total data points:  6304 
This is the last page for  1377117206 : finishing collection. 
Processing 1073605033
Total data points:  6460 
This is the last page for  1073605033 : finishing collection. 
Processing 551802475
Total data points:  6461 
This is the last page for  551802475 : finishing collection. 
Processing 262730721
Total data points:  6509 
This is the last page for  262730721 : finishing collection. 
Processing 69814084
Total data points:  6809 
This is the last page for  69814084 : finishing collection. 
Processing 53617577
Total data points:  6966 
This is the last page for  53617577 : finishing collection. 
Processing 47151012
Total data points:  7041 
This is the last page for  47151012 : finishing collection. 
Processing 38272947
Total data points:  7325 
This is the last page for  38272947 : finishing collection. 
Processing 24674518
Rate limit reached. Rate limit will reset at 2022-05-24 15:45:01 
Sleeping for 856 seconds. 
==========================================================================================================================================================================================Total data points:  8325 
Total data points:  9007 
This is the last page for  24674518 : finishing collection. 
Processing 19924848
Total data points:  9008 
This is the last page for  19924848 : finishing collection. 
Processing 910913583168573440
Total data points:  9009 
This is the last page for  910913583168573440 : finishing collection. 
Processing 347792540
Total data points:  9401 
This is the last page for  347792540 : finishing collection. 
Processing 308415277
Total data points:  9588 
This is the last page for  308415277 : finishing collection. 
Processing 127620350
Total data points:  9997 
This is the last page for  127620350 : finishing collection. 
Processing 46085533
Total data points:  10997 
Total data points:  11997 
Total data points:  12248 
This is the last page for  46085533 : finishing collection. 
Processing 18933321
Total data points:  12795 
This is the last page for  18933321 : finishing collection. 
Processing 1340469592666324992
Total data points:  12796 
This is the last page for  1340469592666324992 : finishing collection. 
Processing 1235523759597068288
Total data points:  12797 
This is the last page for  1235523759597068288 : finishing collection. 
Processing 1201838123841404929
Total data points:  12798 
This is the last page for  1201838123841404929 : finishing collection. 
Processing 1187686813168820224
Total data points:  12799 
This is the last page for  1187686813168820224 : finishing collection. 
Processing 1181598578508214272
Rate limit reached. Rate limit will reset at 2022-05-24 16:00:21 
Sleeping for 858 seconds. 

Environment

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                    LC_TIME=German_Germany.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] igraph_1.3.1          academictwitteR_0.3.1 forcats_0.5.1         stringr_1.4.0         dplyr_1.0.9           purrr_0.3.4           readr_2.1.2           tidyr_1.2.0          
 [9] tibble_3.1.7          ggplot2_3.3.6         tidyverse_1.3.1      

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.2 xfun_0.30        haven_2.5.0      colorspace_2.0-3 vctrs_0.4.1      generics_0.1.2   usethis_2.1.5    htmltools_0.5.2  yaml_2.3.5       utf8_1.2.2      
[11] rlang_1.0.2      pillar_1.7.0     glue_1.6.2       withr_2.5.0      DBI_1.1.2        dbplyr_2.1.1     modelr_0.1.8     readxl_1.4.0     lifecycle_1.0.1  munsell_0.5.0   
[21] gtable_0.3.0     cellranger_1.1.0 rvest_1.0.2      evaluate_0.15    knitr_1.39       tzdb_0.3.0       fastmap_1.1.0    curl_4.3.2       fansi_1.0.3      broom_0.8.0     
[31] scales_1.2.0     backports_1.4.1  jsonlite_1.8.0   fs_1.5.2         hms_1.1.1        digest_0.6.29    stringi_1.7.6    grid_4.2.0       cli_3.3.0        tools_4.2.0     
[41] magrittr_2.0.3   crayon_1.5.1     pkgconfig_2.0.3  ellipsis_0.3.2   xml2_1.3.3       reprex_2.0.1     lubridate_1.8.0  assertthat_0.2.1 rmarkdown_2.14   httr_1.4.3      
[51] rstudioapi_0.13  R6_2.5.1         compiler_4.2.0  

Anything else?

This seems to be the same issue as in #187. However, pinning the problem down to the second half of the second batch might be helpful in tracking the problem down. Let me know if there's anything I can do to help solve the issue / stress test more. Slightly lost as to what may cause that problem atm. Especially since the functions fails completely after reaching the first rate limit (rather than only failing for a number of requests until the rate limit recovers). Furthermore, the fact that it starts returning less data before failing completely suggest there may be an additional rate limit at play, limiting the returns rather than only the requests?

TimBMK commented 2 years ago

I've done some additional testing / troubleshooting. The issue may be connected to the Rate Limits being handled incorrectly by get_user_edges(). I put in an additional 30 minute rest timer every 15 users. Note how the function starts failing (only one datapoint returned for a number of accounts) after the sleep timers and bounces back after the additional 30min rest:

 ========== Get Followers ========== 
Processing 827090742162100224
Total data points:  1000 
Total data points:  2000 
Total data points:  3000 
Total data points:  4000 
Total data points:  5000 
Total data points:  5223 
This is the last page for  827090742162100224 : finishing collection. 
Processing 1683845126
Total data points:  6206 
This is the last page for  1683845126 : finishing collection. 
Processing 1377117206
Total data points:  6307 
This is the last page for  1377117206 : finishing collection. 
Processing 1073605033
Total data points:  6463 
This is the last page for  1073605033 : finishing collection. 
Processing 551802475
Total data points:  6464 
This is the last page for  551802475 : finishing collection. 
Processing 262730721
Total data points:  6512 
This is the last page for  262730721 : finishing collection. 
Processing 69814084
Total data points:  6812 
This is the last page for  69814084 : finishing collection. 
Processing 53617577
Total data points:  6969 
This is the last page for  53617577 : finishing collection. 
Processing 47151012
Total data points:  7044 
This is the last page for  47151012 : finishing collection. 
Processing 38272947
Total data points:  7328 
This is the last page for  38272947 : finishing collection. 
Processing 24674518
Rate limit reached. Rate limit will reset at 2022-05-25 12:47:56 
Sleeping for 859 seconds. 
================================================================================Total data points:  8328 
Total data points:  9010 
This is the last page for  24674518 : finishing collection. 
Processing 19924848
Total data points:  9011 
This is the last page for  19924848 : finishing collection. 
Processing 910913583168573440
Total data points:  9012 
This is the last page for  910913583168573440 : finishing collection. 
Processing 347792540
Total data points:  9404 
This is the last page for  347792540 : finishing collection. 
Processing 308415277
Total data points:  9591 
This is the last page for  308415277 : finishing collection. 

 ==== Rest 30 Minutes ==== 
Processing 127620350
Total data points:  1000 
Total data points:  2000 
Total data points:  3000 
Total data points:  4000 
Total data points:  5000 
Total data points:  6000 
Total data points:  7000 
Total data points:  8000 
Total data points:  9000 
Total data points:  10000 
Total data points:  11000 
Total data points:  12000 
Total data points:  13000 
Total data points:  14000 
Total data points:  14176 
This is the last page for  127620350 : finishing collection. 
Processing 46085533
Rate limit reached. Rate limit will reset at 2022-05-25 13:33:18 
Sleeping for 850 seconds. 
================================================================================Total data points:  15176 
Total data points:  15683 
This is the last page for  46085533 : finishing collection. 
Processing 18933321
Total data points:  16400 
This is the last page for  18933321 : finishing collection. 
Processing 1340469592666324992
Total data points:  16401 
This is the last page for  1340469592666324992 : finishing collection. 
Processing 1235523759597068288
Total data points:  16402 
This is the last page for  1235523759597068288 : finishing collection. 
Processing 1201838123841404929
Total data points:  16403 
This is the last page for  1201838123841404929 : finishing collection. 
Processing 1187686813168820224
Total data points:  16404 
This is the last page for  1187686813168820224 : finishing collection. 
Processing 1181598578508214272
Total data points:  16405 
This is the last page for  1181598578508214272 : finishing collection. 
Processing 1159457070124605442
Total data points:  16406 
This is the last page for  1159457070124605442 : finishing collection. 
Processing 1159072277746593792
Total data points:  16407 
This is the last page for  1159072277746593792 : finishing collection. 
Processing 1148916458983964673
Total data points:  16408 
This is the last page for  1148916458983964673 : finishing collection. 
Processing 1135510177292197888
Total data points:  16409 
This is the last page for  1135510177292197888 : finishing collection. 
Processing 1132260527378489345
Total data points:  16410 
This is the last page for  1132260527378489345 : finishing collection. 
Processing 1127961248493129728
Total data points:  16411 
This is the last page for  1127961248493129728 : finishing collection. 
Processing 1125751445205262336
Total data points:  16412 
This is the last page for  1125751445205262336 : finishing collection. 

 ==== Rest 30 Minutes ==== 
Processing 1113004743733993472
Total data points:  476 
This is the last page for  1113004743733993472 : finishing collection. 
Processing 1096063189249331201
Total data points:  1476 
Total data points:  2476 
Total data points:  2561 
This is the last page for  1096063189249331201 : finishing collection. 
Processing 1095023450790547457
Total data points:  2677 
This is the last page for  1095023450790547457 : finishing collection. 
Processing 1085599614122762242
Total data points:  2678 
This is the last page for  1085599614122762242 : finishing collection. 
Processing 1085494381405237251
Total data points:  2822 
This is the last page for  1085494381405237251 : finishing collection. 
Processing 1083371289786634240
Total data points:  2823 
This is the last page for  1083371289786634240 : finishing collection. 
Processing 1082925763316367360
Total data points:  2831 
This is the last page for  1082925763316367360 : finishing collection. 
Processing 1071665423140249600
Total data points:  2869 
This is the last page for  1071665423140249600 : finishing collection. 
Processing 1070312229139025920
Total data points:  3004 
This is the last page for  1070312229139025920 : finishing collection. 
Processing 1061873979760263168
Total data points:  3365 
This is the last page for  1061873979760263168 : finishing collection. 
Processing 1052524172717375488
Total data points:  3554 
This is the last page for  1052524172717375488 : finishing collection. 
Processing 1042109887604563968
Total data points:  3641 
This is the last page for  1042109887604563968 : finishing collection. 
Processing 1040160799208161280
Total data points:  4641 
Rate limit reached. Rate limit will reset at 2022-05-25 14:18:59 
Sleeping for 859 seconds. 
TimBMK commented 2 years ago

I've run some more checks. For one, the rate limit is being hit what seems to be 15 pages, not 15 users. This is rather unclear in both the Twitter API and package Docs. However, the actual problem starts as soon as the rate limit was hit and collection is resumed after the sleep timer. Strangely, though, this is not immediately the case. In my tests, after hitting the rate limit and sleeping, the current ID lookup concludes, the next lookup works fine and only then does it start silently returning empty data.

See this log where I added an extra 15 minute grace period after every 5 accounts:

 ==== Rest 15 Minutes ==== 
Processing 127620350
Total data points:  1000 
Total data points:  2000 
Total data points:  3000 
Total data points:  4000 
Total data points:  5000 
Total data points:  6000 
Total data points:  7000 
Total data points:  8000 
Total data points:  9000 
Total data points:  10000 
Total data points:  11000 
Total data points:  12000 
Total data points:  13000 
Total data points:  14000 
Total data points:  14167 
This is the last page for  127620350 : finishing collection. 
Processing 46085533
Rate limit reached. Rate limit will reset at 2022-06-03 12:21:38 
Sleeping for 849 seconds. 
================================================================================Total data points:  15167 
Total data points:  15652 
This is the last page for  46085533 : finishing collection. 
Processing 18933321
Total data points:  16358 
This is the last page for  18933321 : finishing collection. 
Processing 1340469592666324992
Total data points:  16359 
This is the last page for  1340469592666324992 : finishing collection. 
Processing 1235523759597068288
Total data points:  16360 
This is the last page for  1235523759597068288 : finishing collection. 

I am, however, at quite a loss what may cause this behaviour. Maybe the function does not track the right rate limit? That is, does .check_reset() need to distinguish between different rate limits or does it always fetch the correct limit to pass on to .trigger_sleep()?

TimBMK commented 2 years ago

After additional tests, I can confirm that data returns remain unreliable even if data is returned. In many cases, you do not get all followers of a given user ID. I'm using a workaround now, confirming the number of followers pulled witht the number stated in get_user_profile and re-running if necessary. Doing this I noticed that the data returned seems relatively reliable if calls are wrapped in a for-loop, i.e. when checking every ID with a single call rather than in batches. Using a for-loop to loop through the vector of user IDs might therefore be a valid solution for this issue.

Until it is resolved, however, I would recommend putting these functions in hiatus or at least put a warnig label on them, as results can be grossly misleading.

cjbarrie commented 2 years ago

Thank you for this, @TimBMK and apologies for slow comms. I've been away on holiday. I'll add this to TODOs for next release