We discussed this issue again. After reviewing the API limits, we will use the "recent search" endpoint (GET /2/tweets/search/recent), which allows us to run 60 queries per 15 minutes per user.
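If the cron job paces its own calls, the simplest approach is to spread requests evenly over the 15-minute window. A minimal sketch (the helper and constants below are ours, not part of the API):

```python
import time

REQUESTS_PER_WINDOW = 60          # per-user limit cited above
WINDOW_SECONDS = 15 * 60
MIN_INTERVAL = WINDOW_SECONDS / REQUESTS_PER_WINDOW  # one call every 15 s

def paced(calls):
    """Run zero-argument callables in order, sleeping between them so the
    60-requests-per-15-minutes limit is never exceeded."""
    for call in calls:
        yield call()
        time.sleep(MIN_INTERVAL)
```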
To search for tweets, we will use search operators to run a number of queries of the form:

from:XXX (url:YYY OR url:ZZZ OR ...)

where XXX is the screen name of the participant and YYY, ZZZ, etc. are NewsGuard domains. Since each query is limited to 512 characters, we cannot put all domains in a single query. Instead, we will break the list of domains into multiple batches so that the character length of each batch fits within the 512-character limit, and run one query per batch. This should cover most if not all of NewsGuard. (If there are too many domains, we could filter out non-US news domains.)
Things to consider and find out:
For example, if max results per participant is 200 and we have 10 batches of domains, then we would set max results per request to 20. We could keep pagination tokens for all batches and request the next page, cycling through batches until we have collected 200 tweets.
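A rough sketch of that round-robin pagination. Here search_recent(query, max_results, next_token) is a hypothetical wrapper around the API call, assumed to return a (tweets, next_token) pair:

```python
def collect_tweets(queries, per_request, target, search_recent):
    """Cycle through the batch queries, keeping one pagination token per
    batch, until `target` tweets are collected or every batch is exhausted."""
    tokens = {q: None for q in queries}
    active = set(queries)
    collected = []
    while active and len(collected) < target:
        for query in queries:
            if query not in active:
                continue
            tweets, next_token = search_recent(query, per_request, tokens[query])
            collected.extend(tweets)
            if next_token is None:
                active.discard(query)   # this batch has no more pages
            else:
                tokens[query] = next_token
            if len(collected) >= target:
                break
    return collected[:target]

# With 10 batches and a 200-tweet target, per_request would be 20.
```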
We have been working on this issue. We have a preliminary cron job script ready. We found out the following:
The API does not allow the max_results parameter to be less than 10. So it's possible that for some participants we might not be able to cover all of the NewsGuard domains, but that's fine. For training, as long as we get 100 or 200 NewsGuard tweets, it's fine. This would be a problem if the purpose is to analyze whether a participant engages with non-trustworthy domains (for eligibility), but in that case, maybe we can check the non-trustworthy domains first.
Done.
See offline notes. For the purpose of this pilot, we will rely on Twitter API basic access with its 10k cap. We will only search NG tweets (max 100 results) for each participant. That would give us ~100 participants, which is enough for the pilot.
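For reference, a single capped request could look roughly like this. The endpoint and the query/max_results parameters are from the v2 recent-search API; the function name and the bearer-token environment variable are our assumptions:

```python
import os
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def search_ng_tweets(query, bearer_token=None):
    """One recent-search request capped at 100 results per participant.
    At the basic-access cap of 10k tweets, 100 results per participant
    budgets for roughly 100 participants."""
    token = bearer_token or os.environ["TWITTER_BEARER_TOKEN"]  # assumed env var
    resp = requests.get(
        SEARCH_URL,
        headers={"Authorization": f"Bearer {token}"},
        params={"query": query, "max_results": 100},
    )
    resp.raise_for_status()
    return resp.json().get("data", [])
```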
Issues we need to discuss