Closed numeroteca closed 2 years ago
The full list of operators for the v2 API is here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list
Unfortunately, the operators are not consistent with the web interface. The closest equivalent of `filter:nativeretweets` is `is:retweet`, so `-filter:nativeretweets` becomes `-is:retweet`.
To get what you want, the best query would be:
(from:newsmedia OR retweets_of:newsmedia)
Note that you will also get tweets that @newsmedia retweets, which you may not want. You can filter those out yourself or, to get only original tweets, try:
((from:newsmedia -is:retweet) OR retweets_of:newsmedia)
Quote tweets count as original tweets and are unaffected by `-is:retweet` or `retweets_of:` as far as I can tell (I haven't double-checked this). See also the note at https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#quote-tweets.
`--exclude-replies` can be specified in the query too, as `-is:reply`, which is probably better because it's more explicit and predictable: you may not want `-is:reply` added exactly where `--exclude-replies` appends it. So I would not pass it as a command-line option; I would put it in the query where it should go.
To deal with quote tweets, you need to use the `url:` operator. To get all the times someone quote tweets an original @newsmedia tweet, you need `url:"https://twitter.com/newsmedia/status"`, because quote tweets are just ordinary tweets with a permalink URL to another tweet inside somewhere.
So the full query:
((from:newsmedia -is:retweet -is:reply) OR retweets_of:newsmedia OR url:"https://twitter.com/newsmedia/status")
Will get all original newsmedia tweets that aren't retweets or replies by newsmedia, and any time someone else Retweets or Quotes any of their tweets.
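If you need the same query for several accounts, a small helper can assemble it. This is only a minimal Python sketch: `build_query` and its `include_quotes` parameter are hypothetical names, not part of twarc, and the URL pattern is the `/status` permalink form mentioned above.

```python
def build_query(account: str, include_quotes: bool = True) -> str:
    """Compose the search query described above for one account handle."""
    # Original tweets by the account: exclude its retweets and replies.
    originals = f"(from:{account} -is:retweet -is:reply)"
    # Retweets of the account's tweets by anyone.
    parts = [originals, f"retweets_of:{account}"]
    if include_quotes:
        # Quote tweets embed a permalink to the quoted tweet, so match on the URL.
        parts.append(f'url:"https://twitter.com/{account}/status"')
    # Wrap the whole OR group in parentheses so the grouping stays explicit.
    return "(" + " OR ".join(parts) + ")"


print(build_query("newsmedia"))
```

The resulting string can be passed as the query argument to `twarc2 search` or `twarc2 counts`.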
Hope that helps!
Thanks for the quick response @igorbrigadir.
For my research I do want the tweets retweeted by the newsmedia. The purpose is to compare what media are tweeting (and how much) regarding a scandal to what the general public is tweeting.
These are the comparative queries I put together:
Tweets by eldiario:
twarc2 search --archive "from:eldiarioes" --start-time 2018-03-21T05:55:01 --end-time 2018-03-21T07:01:59 eldiarioes_00.json
9 tweets total
Tweets by eldiario OR retweets of eldiario:
twarc2 search --archive "from:eldiarioes OR retweets_of:eldiarioes" --start-time 2018-03-21T05:55:01 --end-time 2018-03-21T07:01:59 eldiarioes_01.json
954 tweets total
Tweets by eldiario OR retweets of eldiario OR tweets that are retweets and have links to eldiario tweets (this is redundant and provides the same result):
twarc2 search --archive "from:eldiarioes OR retweets_of:eldiarioes OR url:twitter.com/newsmedia/status is:retweet" --start-time 2018-03-21T05:55:01 --end-time 2018-03-21T07:01:59 eldiarioes_02.json
954 tweets total
Tweets by eldiario OR retweets of eldiario OR tweets that are retweets and have links to eldiario twitter (this is more generic and gathers a few more tweets):
twarc2 search --archive "from:eldiarioes OR url:twitter.com/eldiarioes is:retweet OR retweets_of:eldiarioes" --start-time 2018-03-21T05:55:01 --end-time 2018-03-21T07:01:59 eldiarioes_03.json
1014 tweets total
What I am still working out is whether this query will return the retweets of tweets retweeted by newsmedia. I'd love to have this:
author | text | timestamp |
---|---|---|
@newsmedia | RT @anotheruser blablabla | - |
@oneuser | RT @newsmedia RT @anotheruser blablabla | - |
@otheruser | RT @newsmedia RT @anotheruser blablabla | - |
@onemoreuser | RT @newsmedia RT @anotheruser blablabla | - |
since a retweet has its own id and timestamp. But I guess I will get no mention of the first RT by newsmedia, and the result will be:
author | text | timestamp |
---|---|---|
@newsmedia | RT @anotheruser blablabla | 00:45:12 |
@oneuser | RT @anotheruser blablabla | 00:45:34 |
@otheruser | RT @anotheruser blablabla | 00:45:39 |
@onemoreuser | RT @anotheruser blablabla | 00:45:45 |
Is this last assumption correct? I'd expect to get the first RT by newsmedia (00:45:12) and all the RTs of that RT. I'll come back with the results.
> For my research I do want the tweets retweeted by the newsmedia.

Ah ok, in that case it's better to get all tweets, and not specify `is:retweet` at all.
This query
"from:eldiarioes OR retweets_of:eldiarioes OR url:twitter.com/newsmedia/status is:retweet"
is ambiguous because of operator precedence, and it may break and give you weird results if you add other operators. You're mixing implicit AND and OR operators; any time you have more than one operator, use parentheses (). As written, this reads like: "from:eldiarioes OR retweets_of:eldiarioes OR url:twitter.com/newsmedia/status AND is:retweet"
same with this one
from:eldiarioes OR url:twitter.com/eldiarioes is:retweet OR retweets_of:eldiarioes
You are missing parentheses (), so it reads like: from:eldiarioes OR url:twitter.com/eldiarioes AND is:retweet OR retweets_of:eldiarioes
The broadest query would be:
(from:eldiarioes OR retweets_of:eldiarioes OR url:"https://twitter.com/eldiarioes")
Unfortunately, it is not possible to get "retweets of retweets". The API and Twitter do not provide this data; you can only infer it later, perhaps by analyzing the friend/follower network, but that is an entirely different (and very non-trivial) problem. And yes, retweets only ever contain the original author and the retweeting account, never anything that came in between.
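To illustrate what the data looks like, here is a minimal Python sketch over v2-style tweet objects, where a retweet carries a `referenced_tweets` entry of type `retweeted` pointing at the original tweet. The sample tweets are made up, the helper names are hypothetical, and handles stand in for the numeric `author_id` values that real data contains.

```python
from collections import defaultdict


def retweeted_id(tweet: dict):
    """Return the id of the tweet this one retweets, or None if it is not a retweet."""
    for ref in tweet.get("referenced_tweets", []):
        if ref.get("type") == "retweeted":
            return ref["id"]
    return None


def group_retweets(tweets):
    """Map each original tweet id to the authors who retweeted it, in input order."""
    by_original = defaultdict(list)
    for tweet in tweets:
        original = retweeted_id(tweet)
        if original is not None:
            by_original[original].append(tweet["author_id"])
    return dict(by_original)


# Made-up sample: handles are used instead of numeric author ids for readability.
sample = [
    {"id": "100", "author_id": "anotheruser", "text": "blablabla"},
    {"id": "101", "author_id": "newsmedia", "text": "RT @anotheruser: blablabla",
     "referenced_tweets": [{"type": "retweeted", "id": "100"}]},
    {"id": "102", "author_id": "oneuser", "text": "RT @anotheruser: blablabla",
     "referenced_tweets": [{"type": "retweeted", "id": "100"}]},
    {"id": "103", "author_id": "otheruser", "text": "RT @anotheruser: blablabla",
     "referenced_tweets": [{"type": "retweeted", "id": "100"}]},
]

print(group_retweets(sample))  # {'100': ['newsmedia', 'oneuser', 'otheruser']}
```

Every retweet points at tweet 100 by @anotheruser; nothing in the objects records that @oneuser saw it via @newsmedia's retweet (tweet 101).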
BTW, `twarc2 counts` is great for testing:
twarc2 counts --archive '(from:eldiarioes OR retweets_of:eldiarioes OR url:"https://twitter.com/eldiarioes")' --start-time "2018-03-21T05:55:01" --end-time "2018-03-21T07:01:59" --text
Total Tweets: 1,142
So the parentheses make everything work together.
For a given query `xxx OR yyy OR zzz AND jjj`, how are the operators grouped to decide what to query?
a. (xxx OR yyy OR zzz) AND jjj
or
b. (xxx OR yyy) OR (zzz AND jjj)
or?
Indeed, `twarc2 counts` is awesome!
In that case, you would get the second - Twitter applies the AND operator first as specified in the docs: https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#order-of-operations
If in doubt it's best to throw in brackets to make it explicit and more readable.
(This is also something that changed from V1.1 to V2, the OR used to have priority)
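To see how a flat query is grouped when AND binds tighter than OR, here is a minimal Python sketch. The helper `make_grouping_explicit` is a hypothetical name (it is not part of twarc or the Twitter API), and it only handles flat queries: no existing parentheses and no quoted phrases containing spaces.

```python
def make_grouping_explicit(query: str) -> str:
    """Parenthesize a flat query so that AND visibly binds tighter than OR."""
    groups = [[]]  # each inner list collects the operators of one AND group
    for token in query.split():
        if token == "OR":
            groups.append([])  # OR closes the current AND group and starts a new one
        elif token != "AND":   # an explicit AND means the same as a plain space
            groups[-1].append(token)

    rendered = []
    for group in groups:
        if not group:
            raise ValueError("OR must have an operator on both sides")
        text = " ".join(group)
        # Only groups with more than one operator need parentheses.
        rendered.append(f"({text})" if len(group) > 1 else text)
    return " OR ".join(rendered)


print(make_grouping_explicit("xxx OR yyy OR zzz AND jjj"))
# xxx OR yyy OR (zzz jjj)
print(make_grouping_explicit(
    "from:eldiarioes OR url:twitter.com/eldiarioes is:retweet OR retweets_of:eldiarioes"
))
# from:eldiarioes OR (url:twitter.com/eldiarioes is:retweet) OR retweets_of:eldiarioes
```

The first result corresponds to the second grouping discussed above: the AND pair is evaluated as one unit, and the OR applies across the units.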
Can this be closed?
Yes, I managed to do it with your advice. Thanks again!
For the record, I used these scripts, working just with `twarc2 counts`, to download the tweet counts for every news media outlet I am studying. I realized I didn't need the actual tweets (though in parallel I am analyzing the tweets to check these results).
All tweets by news media
twarc2 counts --archive '(from:elconfidencial)' --start-time 2018-03-20T00:00:00 --end-time 2018-05-01T23:59:59 --csv --granularity day > newsmedia.csv
All tweets by news media and search topic
twarc2 counts --archive '(from:elconfidencial cifuentes)' --start-time 2018-03-20T00:00:00 --end-time 2018-05-01T23:59:59 --csv --granularity day > newsmedia_topic.csv
All tweets by news media and their RT
twarc2 counts --archive '(from:elconfidencial OR retweets_of:elconfidencial)' --start-time 2018-03-20T00:00:00 --end-time 2018-05-01T23:59:59 --csv --granularity day > newsmedia-RT.csv
All tweets by news media and their RT and search topic
twarc2 counts --archive '( (from:elconfidencial OR retweets_of:elconfidencial) cifuentes)' --start-time 2018-03-20T00:00:00 --end-time 2018-05-01T23:59:59 --csv --granularity day > newsmedia-RT_topic.csv
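As an illustration of what can be done with these CSV files, the Python sketch below computes the daily share of an outlet's tweets that mention the topic. It is not the R script mentioned in this thread, only one plausible calculation. The column names `start` and `day_count` are an assumption about the CSV header, so check the first line of your files, and the sample numbers are made up.

```python
import csv
import io


def read_counts(handle, count_column="day_count"):
    """Read a counts CSV into {period start: tweet count}.

    Assumes columns named "start" and `count_column`; adjust to your header.
    """
    return {row["start"]: int(row[count_column]) for row in csv.DictReader(handle)}


def topic_share(all_counts, topic_counts):
    """Percentage of tweets per period that match the topic query."""
    shares = {}
    for start, total in all_counts.items():
        topic = topic_counts.get(start, 0)
        # Avoid dividing by zero on days without any tweets.
        shares[start] = 100.0 * topic / total if total else 0.0
    return shares


# Made-up sample data standing in for newsmedia.csv and newsmedia_topic.csv.
all_csv = io.StringIO(
    "start,end,day_count\n"
    "2018-03-20T00:00:00.000Z,2018-03-21T00:00:00.000Z,200\n"
    "2018-03-21T00:00:00.000Z,2018-03-22T00:00:00.000Z,250\n"
)
topic_csv = io.StringIO(
    "start,end,day_count\n"
    "2018-03-20T00:00:00.000Z,2018-03-21T00:00:00.000Z,10\n"
    "2018-03-21T00:00:00.000Z,2018-03-22T00:00:00.000Z,50\n"
)

shares = topic_share(read_counts(all_csv), read_counts(topic_csv))
for start, percent in shares.items():
    print(start, f"{percent:.1f}%")
```

With real files, replace the `io.StringIO` objects with `open("newsmedia.csv", newline="")` and `open("newsmedia_topic.csv", newline="")`.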
Then with this other R script I processed all the data and generated visualizations like these ones:
For 1 news media:
Now I calculate the percentage for this news media in two ways:
All the news media together:
I also replicated these analyses by the hour and calculated their correlations.
I am trying to get all the tweets (and their native retweets) tweeted by a news media account. I want to measure the impact of their tweets over time. I expect to produce this type of list of tweets to make data visualizations.
So I'd need to filter to get the original tweets by the news media and the native retweets of those tweets.
This is the query I am trying, which is not ready yet. I'd still need to add the filter to return only native retweets (like `-filter:nativeretweets`) and include the original tweets by the news media user:
twarc2 search 'url:"twitter.com/eldiarioes"' --exclude-replies --start-time 2018-03-21T00:00:01 --end-time 2018-03-21T23:59:59 --archive > 2018-03-21_eldiario_rts.json