glciampaglia commented 1 year ago

~~See offline notes.~~

For the purpose of this pilot, we will rely on Twitter API Basic access, with its 10k-tweet cap. For each participant, we will only search for NewsGuard (NG) tweets, with a maximum of 100 results. That budget covers ~100 participants (10,000 / 100), which is enough for the pilot.

Issues we need to discuss

[x] The query string in the Search API is limited to 512 characters, which is much shorter than the rules allowed by the academic filtered stream.
[x] This means we may need to split each participant's search into multiple queries, and make sure we stay within the 100-tweet cap per participant.
[x] It is not possible to search for favorites; we can only fetch tweets and retweets (posted, as opposed to liked).

glciampaglia commented 1 year ago

We discussed this issue again. After reviewing the API limits, we will use the "recent search" endpoint (GET /2/tweets/search/recent), which allows 60 queries per 15 minutes per user.

To search for tweets, we will use search operators to perform a number of queries of the form:

from:XXX (url:YYY OR url:ZZZ OR ... )

Where XXX is the screen name of the participant, and YYY, ZZZ, etc. are NewsGuard domains. Since each query is limited to 512 characters, we cannot fit all domains in a single query. Instead, we will split the list of domains into multiple batches, so that the query built from each batch fits within the 512-character limit, and run one query per batch. This should be enough to cover most, if not all, of NewsGuard. (If there are too many domains, we could filter out non-US news domains.)
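The batching described above could be sketched as follows. This is only an illustration, not the script we will run: `build_queries` and its parameters are hypothetical names, and the domains in the usage note are placeholders rather than real NewsGuard entries.

```python
MAX_QUERY_LEN = 512  # query length limit discussed in this thread


def render(handle, terms):
    """Render one query of the form: from:XXX (url:YYY OR url:ZZZ ...)."""
    return f"from:{handle} (" + " OR ".join(terms) + ")"


def build_queries(handle, domains, max_len=MAX_QUERY_LEN):
    """Greedily pack domains into queries that each fit within max_len."""
    queries = []
    batch = []
    for domain in domains:
        term = f"url:{domain}"
        if len(render(handle, [term])) > max_len:
            # A single domain that cannot fit on its own is an input error.
            raise ValueError(f"domain too long for one query: {domain}")
        if len(render(handle, batch + [term])) <= max_len:
            batch.append(term)
        else:
            # Current batch is full: emit it and start a new one.
            queries.append(render(handle, batch))
            batch = [term]
    if batch:
        queries.append(render(handle, batch))
    return queries
```

For instance, `len(build_queries("someuser", domains))` gives the number of queries needed for one participant. Because the `from:` prefix is part of every query, that number depends on the length of the participant's handle.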

Things to consider and find out:

[x] How many queries are needed to cover all of NewsGuard? If fewer than 60, we can run this as part of the Qualtrics task; otherwise, we need to move it to a cron job.
[x] To make sure we fetch engagements for all domains, we will need to limit the number of results per request so that we have enough tweet usage left to cover all batches.
[x] Do we need to make additional requests to fully hydrate second-level tweets?

For example, if the maximum number of results per participant is 200 and we have 10 batches of domains, then we would set max results per request to 200 / 10 = 20. We could keep the pagination tokens for all batches and request the next page of each, cycling through the batches until we have collected 200 tweets.
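A rough sketch of that cycling scheme, under some assumptions: `fetch_page` is a hypothetical callable (not part of any library) that takes a query, a page size, and a pagination token, and returns a list of tweets plus the next token (or `None` when there are no more pages). In practice it would wrap a call to the recent search endpoint, passing the page size as `max_results` and the token as `next_token`. The page size is clamped to the 10 to 100 range that `max_results` accepts.

```python
from collections import deque


def collect_round_robin(queries, fetch_page, max_total=200,
                        min_page_size=10, max_page_size=100):
    """Cycle through the batch queries, one page at a time, up to max_total tweets."""
    # Split the per-participant budget evenly across batches, within API bounds.
    per_query = max_total // max(1, len(queries))
    page_size = max(min_page_size, min(max_page_size, per_query))

    # Each entry is (query, pagination token); None means "first page".
    pending = deque((query, None) for query in queries)
    tweets = []
    while pending and len(tweets) < max_total:
        query, token = pending.popleft()
        page, next_token = fetch_page(query, page_size, token)
        tweets.extend(page)
        if next_token is not None:
            # More pages remain for this batch: revisit it after the others.
            pending.append((query, next_token))
    return tweets[:max_total]
```

With 10 batches and `max_total=200`, this requests 20 tweets per batch on the first pass, and only goes back for further pages if the first pass returned fewer than 200 tweets overall.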

saumyabhadani95 commented 1 year ago

We have been working on this issue. We have a preliminary cron job script ready. We found out the following:

More than 60 queries are needed to cover all of NewsGuard. The exact number depends on the length of the participant's Twitter handle, but even with a handle of length 1, more than 60 queries are needed. So we need to run this as a cron job.
The number of queries is unfortunately very high, because NewsGuard contains a lot of domains and the 512-character limit is very short. The average number of queries is more than 1000, and the max_results parameter cannot be set below 10. So for some participants we may reach the tweet cap before covering all of the NewsGuard domains, but that's fine: for training, it is enough to get 100 or 200 NewsGuard tweets. It would be a problem if the purpose were to analyze whether a participant engages with non-trustworthy domains (for eligibility), but in that case we could check the non-trustworthy domains first.
We don't need to make additional requests to fully hydrate second-level tweets. Unlike the home timeline JSON object, the object returned by the search API contains all the needed URLs in the first-level tweet.
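If we do end up checking the non-trustworthy domains first, one simple option is to reorder the domain list before splitting it into batches. The following is a minimal sketch, not what the cron job currently does: `prioritize_domains` is a hypothetical helper, and the set of non-trustworthy domains is assumed to be available from the NewsGuard data.

```python
def prioritize_domains(domains, priority):
    """Return domains with those in `priority` first, keeping the original order otherwise."""
    priority = set(priority)
    first = [d for d in domains if d in priority]
    rest = [d for d in domains if d not in priority]
    return first + rest
```

Since the batches are queried in order, the non-trustworthy domains would then be searched before the tweet budget for the participant runs out.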

saumyabhadani95 commented 1 year ago

Done.

CSDL-UMD / Rockwell

Fetch user engagements with v2 search #191
