DocNow / twarc

A command line tool (and Python library) for archiving Twitter JSON
https://twarc-project.readthedocs.io
MIT License

How to get only native retweets with a search filter? #593

Closed numeroteca closed 2 years ago

numeroteca commented 2 years ago

I am trying to get all the tweets (and their native retweets) posted by a news media account. I want to measure the impact of their tweets over time. I expect to produce this type of list of tweets to make data visualizations:

| author | text | timestamp |
|---|---|---|
| @newsmedia | Original tweet by newsmedia | - |
| @oneuser | RT @newsmedia Original tweet by newsmedia | - |
| @otheruser | RT @newsmedia Original tweet by newsmedia | - |
| @newsmedia | Another Original tweet by newsmedia | - |
| @onemoreuser | RT @newsmedia Another Original tweet by newsmedia | - |

So I need a filter that returns the original tweets by the news media account and the native retweets of those tweets.

This is the query I am trying; it is not ready yet. I still need to add a filter to return only native retweets (like -filter:nativeretweets) and to include the original tweets by the news media user:

twarc2 search 'url:"twitter.com/eldiarioes"' --exclude-replies --start-time 2018-03-21T00:00:01 --end-time 2018-03-21T23:59:59 --archive > 2018-03-21_eldiario_rts.json

igorbrigadir commented 2 years ago

The full list of operators for the v2 API is here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list

Unfortunately, the operators are not consistent with the web interface: for -filter:nativeretweets, the closest equivalent operator is is:retweet.

To get what you want, the best query would be:

(from:newsmedia OR retweets_of:newsmedia)

You will also get tweets that @newsmedia retweets, which you may not want. You can filter those out yourself, or, to get only original tweets, try:

((from:newsmedia -is:retweet) OR retweets_of:newsmedia)

Quote tweets count as original tweets and are unaffected by -is:retweet or retweets_of: as far as I can tell (I haven't double-checked this). See also the note on quote tweets: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#quote-tweets

--exclude-replies can also be expressed in the query itself as -is:reply, which is probably the better option because it is more explicit and predictable. You may not want -is:reply exactly where --exclude-replies appends it, so I would leave out the command line option and put the operator in the query where it belongs.

To deal with quote tweets, you need the url operator. To get every time someone quote tweets an original @newsmedia tweet, use url:"https://twitter.com/newsmedia/status", because quote tweets are just ordinary tweets that contain a permalink URL to another tweet.

So the full query:

((from:newsmedia -is:retweet -is:reply) OR retweets_of:newsmedia OR url:"https://twitter.com/newsmedia/status")

This will get all original newsmedia tweets (excluding newsmedia's own retweets and replies), plus every time someone else retweets or quotes any of their tweets.
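If you need the same query for several accounts, a small helper can assemble it. This is only a sketch: `build_impact_query` and its `include_quotes` parameter are made-up names for illustration, not part of twarc, and the quote tweet clause uses the `status` permalink form described above.

```python
def build_impact_query(screen_name: str, include_quotes: bool = True) -> str:
    """Compose the search query discussed above for one account.

    Hypothetical helper for illustration; it only builds a string and
    does not call twarc or the Twitter API.
    """
    handle = screen_name.lstrip("@")
    clauses = [
        # Original tweets by the account: no retweets, no replies.
        f"(from:{handle} -is:retweet -is:reply)",
        # Native retweets of the account's tweets by anyone else.
        f"retweets_of:{handle}",
    ]
    if include_quotes:
        # Quote tweets are ordinary tweets containing a permalink URL.
        clauses.append(f'url:"https://twitter.com/{handle}/status"')
    # Outer parentheses keep the grouping explicit if more terms are added.
    return "(" + " OR ".join(clauses) + ")"


query = build_impact_query("@newsmedia")
print(query)
```

The printed string can be passed as the query argument of twarc2 search or twarc2 counts, wrapped in single quotes in the shell because it contains double quotes.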

Hope that helps!

numeroteca commented 2 years ago

Thanks for the quick response @igorbrigadir.

For my research I do want the tweets retweeted by the newsmedia. The purpose is to compare what media are tweeting (and how much) regarding a scandal to what the general public is tweeting.

These are the comparative queries I put together:

Tweets by eldiario (9 tweets total):
twarc2 search --archive "from:eldiarioes" --start-time 2018-03-21T05:55:01 --end-time 2018-03-21T07:01:59 eldiarioes_00.json

Tweets by eldiario OR retweets of eldiario (954 tweets total):
twarc2 search --archive "from:eldiarioes OR retweets_of:eldiarioes" --start-time 2018-03-21T05:55:01 --end-time 2018-03-21T07:01:59 eldiarioes_01.json

Tweets by eldiario OR retweets of eldiario OR tweets that are retweets and have links to eldiario tweets (this is redundant and provides the same result, 954 tweets total):
twarc2 search --archive "from:eldiarioes OR retweets_of:eldiarioes OR url:twitter.com/newsmedia/status is:retweet" --start-time 2018-03-21T05:55:01 --end-time 2018-03-21T07:01:59 eldiarioes_02.json

Tweets by eldiario OR retweets of eldiario OR tweets that are retweets and have links to eldiario's Twitter account (this is more generic and gathers a few more tweets, 1014 tweets total):
twarc2 search --archive "from:eldiarioes OR url:twitter.com/eldiarioes is:retweet OR retweets_of:eldiarioes" --start-time 2018-03-21T05:55:01 --end-time 2018-03-21T07:01:59 eldiarioes_03.json

What I am still trying to find out is whether this query will return the retweets of tweets retweeted by newsmedia. I'd love to have this:

| author | text | timestamp |
|---|---|---|
| @newsmedia | RT @anotheruser blablabla | - |
| @oneuser | RT @newsmedia RT @anotheruser blablabla | - |
| @otheruser | RT @newsmedia RT @anotheruser blablabla | - |
| @onemoreuser | RT @newsmedia RT @anotheruser blablabla | - |

since a retweet has its own id and timestamp. But I guess I will get no mention of the first RT by newsmedia, and the result will be:

| author | text | timestamp |
|---|---|---|
| @newsmedia | RT @anotheruser blablabla | 00:45:12 |
| @oneuser | RT @anotheruser blablabla | 00:45:34 |
| @otheruser | RT @anotheruser blablabla | 00:45:39 |
| @onemoreuser | RT @anotheruser blablabla | 00:45:45 |

Is this last assumption correct? I'd expect to get the first RT by newsmedia (45:12) and all the RT of that RT. I'll come back with the results.

igorbrigadir commented 2 years ago

> For my research I do want the tweets retweeted by the newsmedia.

Ah ok, in that case it's better to get all tweets, and not specify is:retweet at all.

This query

"from:eldiarioes OR retweets_of:eldiarioes OR url:twitter.com/newsmedia/status is:retweet"

is ambiguous because of operator precedence. It may break and give you weird results if you add other operators, because you are mixing implicit AND and explicit OR operators. Any time you have more than one operator, use parentheses (). As written, it reads like: "from:eldiarioes OR retweets_of:eldiarioes OR url:twitter.com/newsmedia/status AND is:retweet"

The same goes for this one:

from:eldiarioes OR url:twitter.com/eldiarioes is:retweet OR retweets_of:eldiarioes

You are missing () parentheses, so it reads like from:eldiarioes OR url:twitter.com/eldiarioes AND is:retweet OR retweets_of:eldiarioes

The broadest query would be:

(from:eldiarioes OR retweets_of:eldiarioes OR url:"https://twitter.com/eldiarioes")

Unfortunately, it is not possible to get "retweets of retweets". The API and Twitter do not provide this data, so it cannot be known directly. You can only infer it later, perhaps by analyzing the friend / follower network, but that is an entirely different (and very non-trivial) problem. And yes, retweets only ever contain the original author and the retweeting account, never anything that came in between.
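You can see this in the collected data: every retweet points at the original tweet id, whoever's timeline it was retweeted from. Below is a rough sketch, assuming flattened tweet objects with `author.username`, `created_at` and `referenced_tweets` fields (the v2 field names as I understand them). The sample records, ids and dates are invented to mirror the table in the question, and `retweet_rows` is a hypothetical helper.

```python
def retweet_rows(tweets):
    """Return author / timestamp / retweeted-id rows, oldest first.

    Assumes each tweet is a dict shaped like a flattened v2 tweet object.
    """
    rows = []
    for tweet in tweets:
        retweeted_ids = [
            ref["id"]
            for ref in tweet.get("referenced_tweets", [])
            if ref["type"] == "retweeted"
        ]
        rows.append(
            {
                "author": "@" + tweet["author"]["username"],
                "created_at": tweet["created_at"],
                # None means the tweet is not a native retweet.
                "retweet_of": retweeted_ids[0] if retweeted_ids else None,
            }
        )
    # ISO 8601 timestamps in the same format sort correctly as strings.
    return sorted(rows, key=lambda row: row["created_at"])


# Invented sample: tweet "100" is the original by @anotheruser.
sample = [
    {"id": "203", "author": {"username": "otheruser"},
     "created_at": "2018-03-21T00:45:39.000Z",
     "referenced_tweets": [{"type": "retweeted", "id": "100"}]},
    {"id": "201", "author": {"username": "newsmedia"},
     "created_at": "2018-03-21T00:45:12.000Z",
     "referenced_tweets": [{"type": "retweeted", "id": "100"}]},
    {"id": "202", "author": {"username": "oneuser"},
     "created_at": "2018-03-21T00:45:34.000Z",
     "referenced_tweets": [{"type": "retweeted", "id": "100"}]},
]

for row in retweet_rows(sample):
    print(row["author"], row["created_at"], "retweet of", row["retweet_of"])
```

All three rows reference tweet 100, and none references newsmedia's retweet (id 201), so the only hint about who retweeted via newsmedia is the ordering of the timestamps.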

BTW, twarc2 counts is great for testing:

twarc2 counts --archive '(from:eldiarioes OR retweets_of:eldiarioes OR url:"https://twitter.com/eldiarioes")' --start-time "2018-03-21T05:55:01" --end-time "2018-03-21T07:01:59" --text

Total Tweets: 1,142

numeroteca commented 2 years ago

So the parentheses make everything work together. For a given query xxx OR yyy OR zzz AND jjj, how are the terms grouped to decide what to query?

a. (xxx OR yyy OR zzz) AND jjj, or b. (xxx OR yyy) OR (zzz AND jjj), or something else?

Indeed twarc2 counts is awesome!

SamHames commented 2 years ago

In that case, you would get the second: Twitter applies the AND operator first, as specified in the docs: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#order-of-operations

If in doubt it's best to throw in brackets to make it explicit and more readable.

(This is also something that changed from V1.1 to V2, the OR used to have priority)
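To see how a flat query groups under the AND-first rule, here is a minimal sketch that adds the implied brackets. `make_grouping_explicit` is a hypothetical helper, not something twarc provides, and it assumes a flat query: no existing parentheses and no quoted phrases containing spaces.

```python
def make_grouping_explicit(query: str) -> str:
    """Add the parentheses implied by AND binding tighter than OR.

    Assumes a flat query: terms separated by spaces (implicit AND),
    an optional literal AND, and OR between groups.
    """
    groups = [[]]
    for token in query.split():
        if token == "OR":
            groups.append([])          # OR starts a new group
        elif token != "AND":           # a literal AND is the same as a space
            groups[-1].append(token)
    if any(not group for group in groups):
        raise ValueError("OR needs a term on both sides")
    rendered = [
        "(" + " ".join(group) + ")" if len(group) > 1 else group[0]
        for group in groups
    ]
    return " OR ".join(rendered)


print(make_grouping_explicit("xxx OR yyy OR zzz AND jjj"))
print(make_grouping_explicit(
    "from:eldiarioes OR url:twitter.com/eldiarioes is:retweet OR retweets_of:eldiarioes"
))
```

The first call prints xxx OR yyy OR (zzz jjj), which is option b from the question. The second shows that is:retweet only applies to the url: term in the earlier query.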

edsu commented 2 years ago

Can this be closed?

numeroteca commented 2 years ago

Yes, I managed to do it with your advice. Thanks again!

For the record, I used these scripts, working just with twarc2 counts, to download the number of tweets for each news media outlet I am studying. I realized I didn't need the actual tweets (though in parallel I am analyzing the tweets to check these results).

All tweets by news media:
twarc2 counts --archive '(from:elconfidencial)' --start-time 2018-03-20T00:00:00 --end-time 2018-05-01T23:59:59 --csv --granularity day > newsmedia.csv

All tweets by news media and search topic:
twarc2 counts --archive '(from:elconfidencial cifuentes)' --start-time 2018-03-20T00:00:00 --end-time 2018-05-01T23:59:59 --csv --granularity day > newsmedia_topic.csv

All tweets by news media and their RT:
twarc2 counts --archive '(from:elconfidencial OR retweets_of:elconfidencial)' --start-time 2018-03-20T00:00:00 --end-time 2018-05-01T23:59:59 --csv --granularity day > newsmedia-RT.csv

All tweets by news media and their RT and search topic:
twarc2 counts --archive '( (from:elconfidencial OR retweets_of:elconfidencial) cifuentes)' --start-time 2018-03-20T00:00:00 --end-time 2018-05-01T23:59:59 --csv --granularity day > newsmedia-RT_topic.csv
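The actual processing was done in R. As a rough Python sketch of the same idea, the daily share of topic tweets can be computed from two of these CSV files. The column names `start` and `day_count` are an assumption about what twarc2 counts --csv --granularity day writes, so check the header of your own files. The numbers below are invented, and `read_daily_counts` and `topic_share` are hypothetical helpers.

```python
import csv
import io


def read_daily_counts(handle, count_column="day_count"):
    """Map each period start to its tweet count from a counts CSV."""
    return {row["start"]: int(row[count_column]) for row in csv.DictReader(handle)}


def topic_share(all_counts, topic_counts):
    """Percentage of each day's tweets that match the topic query."""
    shares = {}
    for day, total in all_counts.items():
        topic = topic_counts.get(day, 0)
        # Avoid dividing by zero on days without any tweets.
        shares[day] = 100.0 * topic / total if total else 0.0
    return shares


# Invented stand-ins for newsmedia.csv and newsmedia_topic.csv.
all_csv = io.StringIO(
    "start,end,day_count\n"
    "2018-03-20T00:00:00.000Z,2018-03-21T00:00:00.000Z,40\n"
    "2018-03-21T00:00:00.000Z,2018-03-22T00:00:00.000Z,0\n"
)
topic_csv = io.StringIO(
    "start,end,day_count\n"
    "2018-03-20T00:00:00.000Z,2018-03-21T00:00:00.000Z,10\n"
    "2018-03-21T00:00:00.000Z,2018-03-22T00:00:00.000Z,0\n"
)

shares = topic_share(read_daily_counts(all_csv), read_daily_counts(topic_csv))
for day, percent in shares.items():
    print(day, f"{percent:.1f}%")
```

With real files, replace the io.StringIO objects with open("newsmedia.csv", newline="") and open("newsmedia_topic.csv", newline="").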

Then, with a separate R script, I processed all the data and generated visualizations like these:

For one news media outlet:

[Image: cifuentes-00-4panel_cifuentes-elconfidencial]

[Image: cifuentes-01-n-tweets-only-newsmedia_cifuentes-elconfidencial]

Now I calculate the percentage for this news media outlet in two ways:

[Image: cifuentes-03-percent_cifuentes-elconfidencial]

All the news media together:

[Image: cifuentes-news-media-tweets_from-total_twarc]

[Image: cifuentes-news-media-tweets-twarc_colored]

[Image: cifuentes-news-media-tweets-with-RT-twarc_colored]

I also replicated these analyses by the hour and calculated their correlations.