digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0

Entities for 'retweeted_status' tweets are truncated compared to the tweet text #363

Closed brendam closed 5 years ago

brendam commented 5 years ago

We’ve just noticed that for button retweets you are only passing through those entities (hashtags, mentions, media) that are in the retweet’s ‘entities’ field. This is described in the comment here: https://github.com/digitalmethodsinitiative/dmi-tcat/blob/775002cfb01532c1eaf685969285204da9c18283/capture/common/functions.php#L1686-L1692

(I think you may have meant “>140” here, not “>280”.)

As a result, for retweets >140 characters, the hashtags, mentions, etc. towards the end of the tweet are not captured in the data TCAT produces.

You are using the 'full_text' from the 'retweeted_status': https://github.com/digitalmethodsinitiative/dmi-tcat/blob/775002cfb01532c1eaf685969285204da9c18283/capture/common/functions.php#L1675

It seems that after constructing this full retweet text, TCAT should also use the 'entities' from within the ‘retweeted_status’ object in the JSON. Otherwise the entities recorded by TCAT do not match the text you create at line 1675.

Additionally, however, there should also be a further entity for the ‘RT @user’ inserted at the start of the tweet – this could be taken from the first user_mention in the retweet’s top level 'entities' object. (The first user_mention in the retweet’s entities object will always reference the retweeted user.)

dentoir commented 5 years ago

Hi @brendam

Thanks for posting. The comment reads 280 characters because it describes what should happen when an original tweet is just below the 280-character limit but, as a retweet (with RT @someuser: prepended to it), exceeds that limit. I believe the API then returns an object in which the final entity (hashtag, mention, etc.) that falls beyond the 280-character limit is not present in the main hierarchy but is present in the retweet hierarchy. This text-reconstruction behavior is old, and I believe it was intended to make what was stored closely mimic what the end user saw in the UI.

Just to be sure, are you describing the above issue (related to retweets increasing the length of the tweet and the entity being at the borderline), or are you flagging a broader issue? If the latter, could you add a few lines of code to illustrate how you think this should be processed?

Best,

Emile

Snurb commented 5 years ago

Hi @dentoir,

@brendam and I worked together on this to determine TCAT's behaviour in relation to long button retweets. It might be worth working through a sample tweet to explain this. I'm using the mobile version of the tweet URLs here because they better demonstrate the issue - if you click on the links below, what's displayed better approximates the JSON payload:

  1. Here's a button retweet: https://mobile.twitter.com/Biggy1883/status/1123108949677318144. In the JSON as well as in the mobile display, this is truncated to

RT @1constitution: 'Speakers' for a new Federation @LindaBurneyMP Elder, Speaker Federation House @SenatorDodson Elder, Speaker Senate Tr…

The JSON payload for this retweet is attached here as tweet_1123108949677318144.txt.

  2. Here's the original tweet that it retweets: https://mobile.twitter.com/1constitution/status/1120823903985618945. This shows the full text (267 characters, with additional hashtags, @mentions, and media in the 140-280 character range):

'Speakers' for a new Federation @LindaBurneyMP Elder, Speaker Federation House @SenatorDodson Elder, Speaker Senate Truth-Telling in Politics #Aunty #reconciliation #republic #repatriation @billshortenmp #fromtheheart @tanya_plibersek #manifesto #firstwoman #2o2o https://t.co/7S0piUuxHl

  3. Here's how TCAT processes the JSON, in capture/common/functions.php:

a. The JSON contains a retweeted_status object, so the if clause on line 1664 is triggered: we're dealing with a retweet.

b. The JSON does not contain a retweeted_status.text entity, so the first if/then option on line 1673 is triggered: $retweet_text = $data["retweeted_status"]["full_text"];. This means $retweet_text now contains the full text of the original tweet, quoted in 2. above.

c. Following the if/then clauses, on line 1684 TCAT constructs $store_text, from RT @1constitution: + $retweet_text. The full text of this is:

RT @1constitution: 'Speakers' for a new Federation @LindaBurneyMP Elder, Speaker Federation House @SenatorDodson Elder, Speaker Senate Truth-Telling in Politics #Aunty #reconciliation #republic #repatriation @billshortenmp #fromtheheart @tanya_plibersek #manifesto #firstwoman #2o2o https://t.co/7S0piUuxHl

d. However, and this is the problem we're highlighting: on lines 1802-3 TCAT populates $this->user_mentions and $this->hashtags from the JSON's entities.user_mentions and entities.hashtags. This means it draws from the entities object of the (truncated) retweet, rather than from the entities of the (untruncated) tweet being retweeted. (TCAT does the same with URLs and media, but the code here is a little more complicated, so I'm not going into detail on those.)

This means the reconstituted retweet from 3.c. above is assigned only the following entities from the JSON - no hashtags, URLs, or media, and only two of four mentions in the original tweet (plus a mention of the user being retweeted):

'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': '1constitution',
    'name': 'Skutch',
    'id': 1414001184,
    'id_str': '1414001184',
    'indices': [3, 17]},
   {'screen_name': 'LindaBurneyMP',
    'name': 'Linda Burney MP',
    'id': 157861165,
    'id_str': '157861165',
    'indices': [51, 65]},
   {'screen_name': 'SenatorDodson',
    'name': 'Patrick Dodson',
    'id': 954263782070484992,
    'id_str': '954263782070484992',
    'indices': [99, 113]}],
  'urls': []},

But it should be assigned the following entities from retweeted_status.entities instead:

'entities': {'hashtags': [{'text': 'Aunty', 'indices': [144, 150]},
    {'text': 'reconciliation', 'indices': [151, 166]},
    {'text': 'republic', 'indices': [167, 176]},
    {'text': 'repatriation', 'indices': [177, 190]},
    {'text': 'fromtheheart', 'indices': [207, 220]},
    {'text': 'manifesto', 'indices': [239, 249]},
    {'text': 'firstwoman', 'indices': [250, 261]},
    {'text': '2o2o', 'indices': [262, 267]}],
   'symbols': [],
   'user_mentions': [{'screen_name': 'LindaBurneyMP',
     'name': 'Linda Burney MP',
     'id': 157861165,
     'id_str': '157861165',
     'indices': [32, 46]},
    {'screen_name': 'SenatorDodson',
     'name': 'Patrick Dodson',
     'id': 954263782070484992,
     'id_str': '954263782070484992',
     'indices': [80, 94]},
    {'screen_name': 'billshortenmp',
     'name': 'Bill Shorten',
     'id': 137198586,
     'id_str': '137198586',
     'indices': [192, 206]},
    {'screen_name': 'tanya_plibersek',
     'name': 'Tanya Plibersek',
     'id': 307755781,
     'id_str': '307755781',
     'indices': [221, 237]}],
   'urls': [],
   'media': [{'id': 1120823897375383553,
     'id_str': '1120823897375383553',
     'indices': [268, 291],
     'media_url': 'http://pbs.twimg.com/media/D433ZXLUIAEuwnH.jpg',
     'media_url_https': 'https://pbs.twimg.com/media/D433ZXLUIAEuwnH.jpg',
     'url': 'https://t.co/7S0piUuxHl',
     'display_url': 'pic.twitter.com/7S0piUuxHl',
     'expanded_url': 'https://twitter.com/1constitution/status/1120823903985618945/photo/1',
     'type': 'photo',
     'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
      'large': {'w': 1500, 'h': 845, 'resize': 'fit'},
      'medium': {'w': 1200, 'h': 676, 'resize': 'fit'},
      'small': {'w': 680, 'h': 383, 'resize': 'fit'}}}]},

Only this set of entities, from retweeted_status, contains all the hashtags, mentions, URLs, etc. that appear in the original tweet being retweeted.
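To make the gap concrete, here is a minimal Python sketch (not TCAT's PHP) that lists which hashtags and mentions appear only in retweeted_status.entities. The helper name missing_entities is hypothetical, and the payload is trimmed to just the fields the comparison needs, with values taken from the two entity dumps above.

```python
def missing_entities(retweet, kind, key):
    """Return the `key` values present in retweeted_status.entities[kind]
    but absent from the retweet's own top-level entities[kind]."""
    own = {e[key] for e in retweet.get("entities", {}).get(kind, [])}
    full = retweet["retweeted_status"].get("entities", {}).get(kind, [])
    return [e[key] for e in full if e[key] not in own]


# Trimmed version of the payload quoted above (only the fields used here).
retweet = {
    "entities": {
        "hashtags": [],
        "user_mentions": [
            {"screen_name": "1constitution"},
            {"screen_name": "LindaBurneyMP"},
            {"screen_name": "SenatorDodson"},
        ],
    },
    "retweeted_status": {
        "entities": {
            "hashtags": [{"text": t} for t in (
                "Aunty", "reconciliation", "republic", "repatriation",
                "fromtheheart", "manifesto", "firstwoman", "2o2o")],
            "user_mentions": [{"screen_name": s} for s in (
                "LindaBurneyMP", "SenatorDodson",
                "billshortenmp", "tanya_plibersek")],
        },
    },
}

lost_hashtags = missing_entities(retweet, "hashtags", "text")
lost_mentions = missing_entities(retweet, "user_mentions", "screen_name")
print(lost_hashtags)  # all eight hashtags of the original tweet
print(lost_mentions)  # ['billshortenmp', 'tanya_plibersek']
```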

I hope this makes sense.

And just to be clear, here's why this is important: a retweet like the one in our example might have been collected because we're tracking on a hashtag or keyword that occurs only in the 140-280 range of the tweet - in our case for example because it mentions @billshortenmp or contains the hashtag #fromtheheart.

However, because of the way TCAT currently processes the retweet's JSON, these mentions or hashtags are never associated with the retweet, because they occur only in retweeted_status.entities and not in the retweet's own entities.

This means that we collect the tweet, but the datasets we export from TCAT and use in our analysis (usually by left-joining the fullExport table with the mentionsExport and hashtagExport tables) show this tweet to be a false positive: according to the TCAT data, it contains neither a mention of @billshortenmp nor the hashtag #fromtheheart. Any count of hashtags or mentions across the full dataset that is based on the mentions and hashtags data TCAT generates will therefore underestimate the true volume, potentially by up to 50% (because the retweet's entities as the Twitter API delivers them cover only the first 140 characters of the retweet, while each retweeted post may be up to 280 characters long).
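The false-positive effect can be illustrated with a small Python sketch of such a left join. The row layout (id, tweet_id, hashtag) is a hypothetical simplification and not the real column set of TCAT's exports; the hashtag values come from the entity dumps above.

```python
def left_join(tweets, hashtag_rows):
    """Left-join tweets to hashtag rows on tweet id.
    Tweets without any hashtag row are kept, with hashtag set to None."""
    joined = []
    for tweet in tweets:
        matches = [h for h in hashtag_rows if h["tweet_id"] == tweet["id"]]
        if not matches:
            joined.append({**tweet, "hashtag": None})
        for h in matches:
            joined.append({**tweet, "hashtag": h["hashtag"]})
    return joined


TWEET_ID = 1123108949677318144
tweets = [{"id": TWEET_ID}]

# Rows built from the retweet's own (truncated) entities: hashtags is empty.
rows_from_retweet = []
# Rows built from retweeted_status.entities instead.
rows_from_original = [
    {"tweet_id": TWEET_ID, "hashtag": t}
    for t in ("Aunty", "reconciliation", "republic", "repatriation",
              "fromtheheart", "manifesto", "firstwoman", "2o2o")
]

current = left_join(tweets, rows_from_retweet)
proposed = left_join(tweets, rows_from_original)

# With the truncated entities, the tracked hashtag never shows up in the join.
print(any(r["hashtag"] == "fromtheheart" for r in current))   # False
print(any(r["hashtag"] == "fromtheheart" for r in proposed))  # True
```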

The solution to this will be to use the retweeted_status.entities / retweeted_status.extended_entities objects, as appropriate - I think this would need to happen somewhere around line 1684, where you construct the full retweet text for $store_text.

(For the mentions, the construction of $store_text complicates this a little more, because in addition to the mentions in the retweeted post, the constructed $store_text also contains a mention (in the form of a retweet) of the original poster being retweeted. This should always be the first array item in the retweet's entities.user_mentions, so that item should be added to retweeted_status.entities.user_mentions to generate a list of all mentions in $store_text.)
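As a rough Python sketch of that merge (not the PHP that TCAT would need): take every entity type from retweeted_status.entities and prepend the first top-level mention to user_mentions. The function name entities_for_store_text is hypothetical, the sample payload is trimmed to screen names and hashtag text, and extended_entities and character indices are deliberately left out.

```python
def entities_for_store_text(tweet):
    """Build entity lists matching 'RT @user: ' + the full retweeted text."""
    if "retweeted_status" not in tweet:
        # Not a retweet: the tweet's own entities already match its text.
        return tweet.get("entities", {})
    original = tweet["retweeted_status"].get("entities", {})
    merged = {kind: list(items) for kind, items in original.items()}
    own_mentions = tweet.get("entities", {}).get("user_mentions", [])
    # Assumption from the thread: the first top-level mention is always
    # the retweeted user, i.e. the '@user' in the 'RT @user: ' prefix.
    merged["user_mentions"] = own_mentions[:1] + merged.get("user_mentions", [])
    return merged


sample = {
    "entities": {
        "hashtags": [],
        "user_mentions": [
            {"screen_name": "1constitution"},
            {"screen_name": "LindaBurneyMP"},
            {"screen_name": "SenatorDodson"},
        ],
    },
    "retweeted_status": {
        "entities": {
            "hashtags": [{"text": "Aunty"}, {"text": "fromtheheart"}],  # subset
            "user_mentions": [
                {"screen_name": "LindaBurneyMP"},
                {"screen_name": "SenatorDodson"},
                {"screen_name": "billshortenmp"},
                {"screen_name": "tanya_plibersek"},
            ],
        },
    },
}

merged = entities_for_store_text(sample)
mention_names = [m["screen_name"] for m in merged["user_mentions"]]
print(mention_names)
```

For the sample tweet this yields the retweeted user followed by the four mentions of the original tweet, with no duplicates, because only the first top-level mention is carried over.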

Sorry about the long description of this issue, but I wanted to make sure the issue and its repercussions are clear...

Axel Bruns

brendam commented 5 years ago

Hello @dentoir,

I haven't used PHP for a while, so you should treat my code as pseudocode rather than something that will work; in particular, I suspect it doesn't use the right way to add the extra user_mention to the front of the user_mention JSON. Here are my suggested code changes in an issue363 branch in my repository (not tested!):

https://github.com/brendam/dmi-tcat/commit/c3b6d4b1ae15617dc12cea2e8e7469966e04ba21

@Snurb pointed out that for completeness, the other thing that the code should do is shift the start and end character counts for all hashtags, mentions, URLs, etc. in the retweeted status up by the length of the RT @user: that's being inserted in front of the $retweet_text in this line:

https://github.com/digitalmethodsinitiative/dmi-tcat/blob/775002cfb01532c1eaf685969285204da9c18283/capture/common/functions.php#L1684

The position information doesn't get exported in the CSV we use, so it isn't important to us. Is it being used elsewhere?
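For reference, the index shift could look like the following Python sketch (illustrative only, not the PHP in the linked commit; shift_indices is a hypothetical helper). It can be checked against the sample payload: the prefix "RT @1constitution: " is 19 characters long, and shifting the retweeted_status indices for @LindaBurneyMP ([32, 46]) and @SenatorDodson ([80, 94]) by 19 gives [51, 65] and [99, 113], which are the values in the retweet's own entities. A PHP version would presumably need a multibyte-aware length function for the prefix, on the assumption that the indices count characters rather than bytes.

```python
import copy


def shift_indices(entities, offset):
    """Return a copy of `entities` with every [start, end] pair moved by `offset`."""
    shifted = copy.deepcopy(entities)
    for items in shifted.values():
        for item in items:
            if "indices" in item:
                start, end = item["indices"]
                item["indices"] = [start + offset, end + offset]
    return shifted


prefix = "RT @1constitution: "

# Subset of retweeted_status.entities from the sample tweet.
original = {
    "hashtags": [{"text": "Aunty", "indices": [144, 150]}],
    "user_mentions": [
        {"screen_name": "LindaBurneyMP", "indices": [32, 46]},
        {"screen_name": "SenatorDodson", "indices": [80, 94]},
    ],
}

shifted = shift_indices(original, len(prefix))
print(shifted["user_mentions"][0]["indices"])  # [51, 65]
print(shifted["user_mentions"][1]["indices"])  # [99, 113]
```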

Brenda.

dentoir commented 5 years ago

Hi @brendam & @Snurb

Thanks for the elaborate diagnosis, I will be looking into this tomorrow!

dentoir commented 5 years ago

Hi @brendam thanks for your code, I've modified the user_mentions concatenation syntax and pushed it directly to master after some testing. I'd appreciate further testing on your side very much.

I've added a comment about the character indexes, but that segment of JSON is not processed by TCAT at the moment and basically discarded.

dentoir commented 5 years ago

I've been on holiday and see there is no activity on this issue. I'll close it if you think the fix is sufficient.