glciampaglia closed this issue 2 years ago
We discussed this issue; we have a first implementation in Node.js. We need to use the v1.1 home timeline endpoint with both `exclude_replies` and `include_entities` set to true, `trim_user` set to false, and `count` set to 200. In total we want to get up to 800 tweets, so we will need to make at least 4 requests per user. Requests after the first one also need to include the `since_id` parameter.
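The request parameters above can be sketched as follows (a minimal Node.js sketch: the helper name and the example tweet ID are hypothetical, and the actual HTTP call to the API is omitted):

```javascript
// Builds the query parameters for one v1.1 statuses/home_timeline request,
// per the settings discussed above. `buildHomeTimelineParams` is an
// illustrative name, not the project's actual function.
function buildHomeTimelineParams(sinceId) {
  const params = {
    exclude_replies: true,
    include_entities: true,
    trim_user: false,
    count: 200,
  };
  if (sinceId !== undefined) {
    params.since_id = sinceId; // only on requests after the first
  }
  return params;
}

// Up to 800 tweets => at least 4 requests of 200 each.
const first = buildHomeTimelineParams();
const next = buildHomeTimelineParams('1234567890'); // hypothetical tweet ID
```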
We have implemented the v1.1 endpoint request based on the above.
For the MTurk pilot, we will compute the stats needed for eligibility, and then dump the tweets into a JSON file keyed by the Twitter `user_id` of the MTurkers (we also want to save in the JSON the MTurk worker / HIT / assignment IDs from Qualtrics).
For the YouGov survey, instead of dumping to a file, we want to insert the retrieved tweets into the DB so that they can be used for ranking in the Rockwell part. We need to make sure that tweet collection from the home timeline endpoint is fast: ideally it should take no longer than 10 seconds, to avoid attrition later on. (This will likely require a separate issue.)
The endpoint for the MTurk pilot is almost ready.
To discuss: is there a requirement that the endpoint runs in "real time"? Right now the wait is about 3 s, but adding the user tweet timeline will definitely increase the waiting time.
We discussed the issue at last week's bi-weekly meeting and decided to split the whole process into three separate calls made from Qualtrics. This means that the code that makes the requests to the Twitter API will be split into three separate functions.
The endpoint is now working as three separate calls. Regarding the redirects, we figured out how to retrieve the response URL, so that can be implemented too. The other thing still missing is the I/O. We will try to see if using a separate endpoint just for I/O solves the issue. Another alternative would be to write each tweet on a separate line, which would avoid one big I/O operation all at once. The only difference is that the file would be in JSON Lines format instead.
We finally fixed the issue with slow file I/O; the endpoint now takes only a few seconds to count all the tweets / likes. We still have a problem with resolving URLs from shortening services (e.g. cnn.it): we are getting `undefined`, perhaps due to an incomplete network request. As a result, the counts are not accurate. Once we are able to fix this last hurdle, we should be able to close this issue.
Addendum: we would also like to return descriptive error messages. For the timeout case, we can ask users to please come back in 15 minutes.
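A sketch of the error-message mapping. The 15-minute window matches Twitter's rate-limit reset (HTTP 429); the function name, the other status codes handled, and the message wording are all assumptions:

```javascript
// Maps an HTTP status from the Twitter API to a user-facing message.
function describeError(status) {
  if (status === 429) {
    // Rate limited: Twitter's v1.1 limits reset in 15-minute windows.
    return 'We are being rate-limited by Twitter. Please come back in 15 minutes.';
  }
  if (status === 401) {
    return 'We could not authorize access to your Twitter account.';
  }
  return 'Something went wrong while fetching your timeline. Please try again.';
}
```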
We tested the endpoint live with Brendan and discovered that the paginator on the user tweet timeline was attempting to fetch all tweets in Brendan's timeline. This caused the instance to become unresponsive, and it had to be rebooted. We fixed this issue by adding limits on the paginator.
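The actual limit values are not shown in the thread; the cap below is a sketch with a simulated paginator, using the 800-tweet figure discussed earlier as a placeholder. The real code presumably iterates an async paginator from the Twitter client:

```javascript
const MAX_TWEETS = 800; // placeholder cap, matching the figure discussed above

// Simulated paginator standing in for the user tweet timeline.
function* fakePaginator(total) {
  for (let i = 0; i < total; i++) yield { id: String(i) };
}

// Collects tweets from a paginator but stops at a hard cap, so an
// unbounded timeline can no longer make the instance unresponsive.
function collectWithLimit(paginator, maxTweets = MAX_TWEETS) {
  const tweets = [];
  for (const tweet of paginator) {
    tweets.push(tweet);
    if (tweets.length >= maxTweets) break; // the fix: hard cap on pages fetched
  }
  return tweets;
}
```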
We still need to resolve shortened URLs, and once that is done we can close the issue.
Limits have been added. The code for unshortening URLs still does not work, so for now we will close the issue and revisit it if we find a lot of tweets with shortened URLs.
We need an endpoint that can be queried and returns true/false based on the following:
First, we need to estimate the XXXs via the MTurk pilot, so at first the endpoint should just collect these data.
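The eligibility check could take the shape below. The actual thresholds (the "XXXs" above) are still to be estimated from the MTurk pilot, so every number here is a placeholder, and the stat names are assumptions:

```javascript
// Placeholder thresholds: to be replaced by estimates from the MTurk pilot.
const THRESHOLDS = { minTweets: 10, minLikes: 5 };

// Returns true/false: is this user's activity above the thresholds?
// `stats` holds counts computed from the collected home timeline.
function isEligible(stats, thresholds = THRESHOLDS) {
  return stats.tweetCount >= thresholds.minTweets &&
         stats.likeCount >= thresholds.minLikes;
}
```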