glciampaglia closed this issue 2 years ago
We discussed this issue; we have a first implementation in Node.js. We need to use the v1.1 home timeline endpoint with both `exclude_replies` and `include_entities` set to true, `trim_user` set to false, and `count` set to 200. In total we want to get up to 800 tweets, so we will need to make at least 4 requests per user. Requests after the first one also need to include the `since_id` parameter.
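The request parameters above can be sketched as follows (a minimal Node.js sketch: the helper name and the example tweet ID are hypothetical, and the actual HTTP call to the API is omitted):

```javascript
// Builds the query parameters for one v1.1 statuses/home_timeline request,
// per the settings discussed above. `buildHomeTimelineParams` is an
// illustrative name, not the project's actual function.
function buildHomeTimelineParams(sinceId) {
  const params = {
    exclude_replies: true,
    include_entities: true,
    trim_user: false,
    count: 200,
  };
  if (sinceId !== undefined) {
    params.since_id = sinceId; // only on requests after the first
  }
  return params;
}

// Up to 800 tweets => at least 4 requests of 200 each.
const first = buildHomeTimelineParams();
const next = buildHomeTimelineParams('1234567890'); // hypothetical tweet ID
```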
We have implemented the v1.1 endpoint request based on the above.
For the MTurk pilot, we will compute the stats needed for eligibility, and then dump the tweets into a JSON file keyed by the Twitter `user_id` of the MTurkers (we also want to save in the JSON the MTurk worker / HIT / assignment IDs from Qualtrics).
For the YouGov survey, instead of dumping to a file, we want to insert the retrieved tweets into the DB so that they can be used for ranking in the Rockwell part. We need to make sure that tweet collection from the home timeline endpoint is fast: ideally it should take no longer than 10 seconds, to avoid attrition later on. (This will likely require a separate issue.)
The endpoint for the MTurk pilot is almost ready.
To discuss: is there a requirement that the endpoint runs in "real time"? Right now the wait is about 3 s, but adding the user tweet timeline will definitely increase the waiting time.
We discussed the issue at last week's bi-weekly meeting and decided to split the whole process into three separate calls made from Qualtrics. This means that the code that makes the requests to the Twitter API will be split into three separate functions.
The endpoint is now working as three separate calls. Regarding the redirects, we figured out how to retrieve the response URL, so that can be implemented too. The other thing still missing is the I/O. We will try to see if using a separate endpoint just for I/O solves the issue. Another alternative would be to write each tweet on a separate line, which would avoid one big I/O operation all at once. The only difference is that the file would be in JSON Lines format instead.
We finally fixed the issue with slow file I/O; the endpoint now takes only a few seconds to count all the tweets / likes. We still have a problem with resolving URLs from shortening services (e.g. cnn.it): we are getting `undefined`, perhaps due to an incomplete network request. As a result, the counts are not accurate. Once we are able to fix this last hurdle, we should be able to close this issue.
Addendum: we would also like to return descriptive error messages. For the timeout case, we can ask users to please come back in 15 minutes.
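A sketch of the error-message mapping. The 15-minute window matches Twitter's rate-limit reset (HTTP 429); the function name, the other status codes handled, and the message wording are all assumptions:

```javascript
// Maps an HTTP status from the Twitter API to a user-facing message.
function describeError(status) {
  if (status === 429) {
    // Rate limited: Twitter's v1.1 limits reset in 15-minute windows.
    return 'We are being rate-limited by Twitter. Please come back in 15 minutes.';
  }
  if (status === 401) {
    return 'We could not authorize access to your Twitter account.';
  }
  return 'Something went wrong while fetching your timeline. Please try again.';
}
```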
We tested the endpoint live with Brendan and discovered that the paginator on the user tweet timeline was attempting to fetch all tweets in Brendan's timeline. This caused the instance to become unresponsive, and it had to be rebooted. We fixed this issue by adding limits on the paginator.
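The actual limit values are not shown in the thread; the cap below is a sketch with a simulated paginator, using the 800-tweet figure discussed earlier as a placeholder. The real code presumably iterates an async paginator from the Twitter client:

```javascript
const MAX_TWEETS = 800; // placeholder cap, matching the figure discussed above

// Simulated paginator standing in for the user tweet timeline.
function* fakePaginator(total) {
  for (let i = 0; i < total; i++) yield { id: String(i) };
}

// Collects tweets from a paginator but stops at a hard cap, so an
// unbounded timeline can no longer make the instance unresponsive.
function collectWithLimit(paginator, maxTweets = MAX_TWEETS) {
  const tweets = [];
  for (const tweet of paginator) {
    tweets.push(tweet);
    if (tweets.length >= maxTweets) break; // the fix: hard cap on pages fetched
  }
  return tweets;
}
```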
We still need to resolve shortened URLs, and once that is done we can close the issue.
Limits have been added. The code for unshortening URLs still does not work, so for now we will close the issue and revisit it if we find a lot of tweets with shortened URLs.
We need an endpoint that can be queried and returns true/false based on the following:
First, we need to estimate the XXXs via the MTurk pilot, so at first the endpoint should just collect these data.
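The eligibility check could take the shape below. The actual thresholds (the "XXXs" above) are still to be estimated from the MTurk pilot, so every number here is a placeholder, and the stat names are assumptions:

```javascript
// Placeholder thresholds: to be replaced by estimates from the MTurk pilot.
const THRESHOLDS = { minTweets: 10, minLikes: 5 };

// Returns true/false: is this user's activity above the thresholds?
// `stats` holds counts computed from the collected home timeline.
function isEligible(stats, thresholds = THRESHOLDS) {
  return stats.tweetCount >= thresholds.minTweets &&
         stats.likeCount >= thresholds.minLikes;
}
```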