CSDL-UMD / Rockwell

Rockwell uses the Twitter authentication workflow to render a Twitter-like feed in order to collect information about users' interactions with their feeds. It also has an attention check feature to ensure that users are paying attention to their feeds and not simply scrolling through with the intent of finishing quickly.

Fetch user engagements with v2 search #191

Closed glciampaglia closed 1 year ago

glciampaglia commented 1 year ago

See offline notes.

For the purpose of this pilot, we will rely on the Twitter API basic access with its 10k cap. We will only search NewsGuard (NG) tweets (max 100 results) for each participant. At 100 results per participant, the 10k cap covers ~100 participants, which is enough for the pilot.

Issues we need to discuss

glciampaglia commented 1 year ago

We discussed this issue again. After reviewing the API limits, we will use the "recent search" endpoint (GET /2/tweets/search/recent), which allows 60 queries per 15 minutes per user.

To search for tweets, we will use search operators to perform a number of queries of the form:

from:XXX (url:YYY OR url:ZZZ OR ... )

Here XXX is the screen name of the participant, and YYY, ZZZ, etc. are NewsGuard domains. Since each query is limited to 512 characters, we cannot put all domains in a single query. Instead, we will split the list of domains into batches so that each full query (the from: clause plus the batch of domains) fits within the 512-character limit, and run one query per batch. This should be enough to cover most if not all of NewsGuard. (If there are too many domains, we could filter out non-US news domains.)
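The batching described above could be sketched roughly as follows. This is only an illustration, not the script used in the project: build_queries and its parameters are hypothetical names, the domains are placeholders, and the query shape is copied from the form given in this comment.

```python
def build_queries(screen_name, domains, max_len=512):
    """Greedily pack domains into queries that each fit within max_len characters.

    Hypothetical helper: the query shape follows the form in this thread,
    from:XXX (url:YYY OR url:ZZZ OR ...).
    """
    prefix, suffix = f"from:{screen_name} (", ")"

    def render(terms):
        return prefix + " OR ".join(terms) + suffix

    queries, batch = [], []
    for domain in domains:
        term = f"url:{domain}"
        if len(render([term])) > max_len:
            raise ValueError(f"domain too long for a single query: {domain}")
        # Close the current batch if adding this term would exceed the limit.
        if batch and len(render(batch + [term])) > max_len:
            queries.append(render(batch))
            batch = []
        batch.append(term)
    if batch:
        queries.append(render(batch))
    return queries


if __name__ == "__main__":
    # Placeholder domains, not actual NewsGuard entries.
    demo_domains = [f"site{i}.example.com" for i in range(100)]
    demo_queries = build_queries("someuser", demo_domains)
    print(len(demo_queries), max(len(q) for q in demo_queries))
```

Because the from: clause is part of every query, a longer screen name leaves less room for domains, so the number of batches varies per participant.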

Things to consider and find out:

For example, if max results per participant is 200 and we have 10 batches of domains, then we would set max results per request to 20. We could keep the pagination tokens for all batches and request the next page of each, cycling through the batches until we have collected 200 tweets.
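One way to picture the cycling is the sketch below. It assumes a caller-supplied fetch_page(query, next_token, max_results) function that returns a page of tweets plus the next pagination token (or None when the batch is exhausted); that function and the name collect_round_robin are hypothetical and stand in for the actual API call.

```python
from collections import deque


def collect_round_robin(queries, fetch_page, total_cap=200, per_request=20):
    """Cycle through query batches, one page at a time, until total_cap tweets are collected.

    fetch_page is a hypothetical callable: (query, next_token, max_results)
    -> (list_of_tweets, next_token_or_None).
    """
    collected = []
    # Each entry pairs a query with the pagination token for its next page.
    pending = deque((query, None) for query in queries)
    while pending and len(collected) < total_cap:
        query, token = pending.popleft()
        tweets, next_token = fetch_page(query, token, per_request)
        # Never keep more than the per-participant cap.
        collected.extend(tweets[: total_cap - len(collected)])
        if next_token is not None:
            # This batch has more pages, so put it back at the end of the cycle.
            pending.append((query, next_token))
    return collected
```

With 10 batches and 20 results per request, the first pass over the batches already reaches the cap of 200 if every batch returns a full page; later passes only happen when some batches come back short.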

saumyabhadani95 commented 1 year ago

We have been working on this issue. We have a preliminary cron job script ready. We found out the following:

  1. More than 60 queries will be needed to query all of NewsGuard. The exact number of queries depends on the length of the participant's Twitter handle, but even if that length is 1, more than 60 queries will be needed. So we need to run this as a cron job.
  2. The number of queries is unfortunately very high because NewsGuard has a lot of domains and the 512-character limit is very short. The average number of queries is more than 1000. We also can't set the max_results parameter to less than 10. So for some participants we might not be able to cover all of the NewsGuard domains, but that's fine. For training, as long as we get 100 or 200 NewsGuard tweets, it's fine. This would be a problem if the purpose were to analyze whether a participant engages with non-trustworthy domains (for eligibility). But in that case, maybe we can check the non-trustworthy domains first.
  3. We don't need to make additional requests to fully hydrate second-level tweets. Unlike the home timeline JSON object, the object returned by the search API contains all the needed URLs in the first-level tweet.
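To give a rough sense of why point 1 calls for a cron job, the arithmetic can be sketched as below. The function name plan_windows is hypothetical, and the defaults are the 60 queries per 15 minutes quoted earlier in this thread; this is not the actual cron script.

```python
import math


def plan_windows(num_queries, per_window=60, window_minutes=15):
    """Return (number of rate-limit windows, minimum minutes until the last window opens)."""
    windows = math.ceil(num_queries / per_window)
    # The first window starts immediately; each later one waits a full window.
    wait_minutes = max(windows - 1, 0) * window_minutes
    return windows, wait_minutes


if __name__ == "__main__":
    print(plan_windows(1000))
```

Under these assumptions, 1000 queries for a single participant span 17 windows, which means at least 240 minutes of waiting between the first and the last window.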
saumyabhadani95 commented 1 year ago

Done.