digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0

Export rate limits and gap size per dataset #150

Open ErikBorra opened 8 years ago

ErikBorra commented 8 years ago

TCAT currently notifies the user when rate limits are encountered. Rate limits are reported per IP address from which tweets are captured. It'd be great if we could export per query bin and per time period how many tweets were (likely) rate limited for that bin. This would give an indication of the robustness of a data set in a specific period.

Per period we'd thus need to look at a) the nr of tweets captured on the IP address, b) the nr of tweets rate limited on the IP address, and c) the nr of tweets in a specific query bin (served by a specific IP address). This way we can deduce d) the share of tweets in a specific query bin w.r.t. the total nr of tweets captured, and e) calculate the share of tweets which are likely rate limited for a specific bin during that period.
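To make the arithmetic in (a) through (e) concrete, here is a minimal sketch. It assumes that rate-limited tweets are spread across bins in proportion to each bin's share of the tweets captured on that IP address, which is only an approximation. The function and parameter names are hypothetical and not part of TCAT.

```python
def estimate_bin_rate_limited(ip_captured, ip_rate_limited, bin_captured):
    """Estimate how many tweets a query bin likely lost to rate limiting.

    ip_captured     -- (a) tweets captured on the IP address in the period
    ip_rate_limited -- (b) tweets reported as rate limited on that IP address
    bin_captured    -- (c) tweets captured in the query bin in the period

    Returns (share, estimated_missed):
    share            -- (d) the bin's share of all tweets captured on the IP
    estimated_missed -- (e) rate-limited tweets attributed to the bin,
                        ASSUMING losses are proportional to that share
    """
    if ip_captured == 0:
        # Nothing was captured, so no share can be computed for this period.
        return 0.0, 0.0
    share = bin_captured / ip_captured
    estimated_missed = bin_captured * ip_rate_limited / ip_captured
    return share, estimated_missed


# Hypothetical numbers for one period on one IP address.
share, missed = estimate_bin_rate_limited(
    ip_captured=10000, ip_rate_limited=500, bin_captured=2000
)
```

The proportionality assumption is the weak point: a bin whose keywords spike during a rate-limited period probably loses more than its average share.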

ErikBorra commented 8 years ago

Also, it'd be great if we could indicate in that export at what time the server did not capture anything due to e.g. power or network outages.
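One possible way to flag such outages, sketched under assumptions: take the timestamps of all tweets captured on a server (across every bin, since a single quiet bin is not an outage) and report intervals in which nothing arrived for longer than a threshold. The function name and the five-minute default are hypothetical choices, not TCAT behaviour.

```python
from datetime import datetime, timedelta


def find_capture_gaps(timestamps, min_gap=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs where no tweet was captured
    for at least min_gap. Input order does not matter."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each timestamp with its successor.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier >= min_gap:
            gaps.append((earlier, later))
    return gaps


# Hypothetical capture times with a 29-minute silence in the middle.
t0 = datetime(2016, 1, 1, 12, 0)
times = [t0, t0 + timedelta(minutes=1),
         t0 + timedelta(minutes=30), t0 + timedelta(minutes=31)]
gaps = find_capture_gaps(times)
```

A sensible threshold depends on the server's normal volume; on a low-traffic server a long silence may simply mean that no matching tweets were posted.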

danielcarter commented 8 years ago

I think this would be very useful. If possible, it would be great to know not just how many tweets were limited for each bin but also to have some indication of what time periods were most limited -- maybe similar to the volume graph on the analysis page. I think researchers I work with would want to know 1) to what extent a bin was rate limited and 2) whether any important dates were affected.

Also, it would be really nice if this updated to reflect any filters applied on the analysis page -- so, if I subset a bin and use that as my data set, I can get relevant rate limiting information.

ErikBorra commented 8 years ago

Thanks for your feedback. Good ideas!

dentoir commented 7 years ago

Dear all,

Today I stumbled upon some unexpected output from the Twitter API concerning the rate limit information that Twitter provides via the streaming API (basically, the values reported in the response field suddenly change sharply). After some Googling, this appears to be unspecified behaviour on the API side. The issue is best described by others: https://twittercommunity.com/t/why-are-track-values-in-limit-notices-out-of-order-and-how-to-interpret-them/35729

If you study the thread above, the bottom line is that this makes it very difficult to create precise estimates. While the current TCAT export module appears to produce sane and realistic estimates for occasionally rate limited servers (such as 80% of our own servers), the figures are inflated for heavily rate limited servers. Because the Twitter API documentation does not match the observed output, I've made the module invisible on the analysis panel for now.

As a side note: this is unrelated to the TCAT issues with timezones or gaps that have existed in the past; it only pertains to rate limit info. Suggestions are welcome. We could consider building a severely stripped-down version of the rate limit status information that is basically boolean: was there some rate limit indication during some period X?

Cheers,

Emile
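The boolean variant suggested above could look roughly like the sketch below. It assumes only that the arrival time of each limit notice is recorded, and it deliberately ignores the counts inside the notices, since those are the values that appear unreliable. All names here are hypothetical and not taken from TCAT.

```python
from datetime import datetime, timedelta


def rate_limit_flags(notice_times, start, end, period=timedelta(hours=1)):
    """For each period in [start, end), report whether at least one
    rate limit notice arrived. Returns a list of (period_start, bool)."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    flags = []
    current = start
    while current < end:
        following = current + period
        # True if any notice falls inside this half-open period.
        limited = any(current <= t < following for t in notice_times)
        flags.append((current, limited))
        current = following
    return flags


# Hypothetical notice arrival times on one server.
notices = [datetime(2016, 1, 1, 12, 5),
           datetime(2016, 1, 1, 12, 40),
           datetime(2016, 1, 1, 14, 1)]
flags = rate_limit_flags(notices,
                         datetime(2016, 1, 1, 12),
                         datetime(2016, 1, 1, 15))
```

This tells a researcher which periods were affected at all, but says nothing about how many tweets were missed in them.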