JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Getting different results than with Tweepy #162

Closed ajc356 closed 3 years ago

ajc356 commented 3 years ago

I’m using snscrape instead of Tweepy in order to get more than 7 days’ worth of tweets (I want to go back to March 1). However, to cross-check the quality of snscrape, I’ve compared the number of tweets for the past 7 days using the exact same query terms, and I get very different results in terms of quantity: 5,000 tweets from Tweepy, 12,000 tweets from snscrape. This makes me think that snscrape is not removing retweets, despite my query terms. Perhaps snscrape simply is not identifying (or cannot identify) retweets? If that’s indeed the case, have you found a hack for removing retweets manually? (The data for retweetedTweet came back all null.)

FWIW, the command line query for snscrape:

snscrape --jsonl --max-results 2000000 --since 2020-11-13 twitter-search '((CRB or #CRB) -clubes -coupe -SuperCoupe -SUPER_COUPE -football -belouizdad -chabab -USMA -brasileño -Brasileirão -Algérie -Railway -@IR_CRB -@RailMinIndia -@PiyushGoyal) OR CERB OR #CERB OR (#EI (Canada OR #onpoli OR #cdnpoli)) OR (Canada unemployment) OR ((#CRA or cra) (#covid19canada OR #covid19relief OR covid OR covid19 OR coronavirus or pandemic)) -filter:retweets' > cerb_no_retweets_11-13.json

Tweepy query:

cerb_tw = '((CRB or #CRB) -clubes -coupe -SuperCoupe -SUPER_COUPE -football -belouizdad -chabab -USMA -brasileño -Brasileirão -Algérie -Railway -@IR_CRB -@RailMinIndia -@PiyushGoyal) OR CERB OR #CERB OR (#EI (Canada OR #onpoli OR #cdnpoli)) OR (Canada unemployment) OR ((#CRA or cra) (#covid19canada OR #covid19relief OR covid OR covid19 OR coronavirus or pandemic)) -filter:retweets'

(This returns the most recent week’s worth of data, back to Nov 13.)

JustAnotherArchivist commented 3 years ago

Well, retweetedTweet being None (or null in JSON) means they definitely aren't retweets. The Twitter search never returns retweets unless you explicitly request them using include:nativeretweets (which only includes retweets from the past 7 days). Cf. #8 and issues referenced there.

I have no experience with Tweepy and can't test it currently. Can you compare the returned tweets and figure out which are missing from Tweepy's output?

ajc356 commented 3 years ago

I've made a list of tweet ids from both datasets, compared their contents, and found some pretty confusing discrepancies. I would have expected Tweepy's results to contain at least the majority of the ids from snscrape's results; however, the two sets share only 1,799 ids. Even if snscrape is pulling in more data than Tweepy, I can't understand why the overlap is so limited given the exact same searches.


sns_ids = list(mytest22['id'].dropna().astype(int)) #snscrape results 
twp_ids = list(twp_1113['id'].dropna().astype(int)) #tweepy results 
print("num tweet ids from snscrape:", len(sns_ids)) 
print("num tweet ids from tweepy:", len(twp_ids))
print("difference:", (len(sns_ids) - len(twp_ids)))

num tweet ids from snscrape: 12552
num tweet ids from tweepy: 5162
difference: 7390

discrep = [i for i in sns_ids if i not in twp_ids]
len(discrep)

10753

overlap = [j for j in sns_ids if j in twp_ids]
len(overlap)

1799
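For larger id lists, the same comparison can be sketched with sets, which avoids the repeated list scans and also shows the ids that appear only in the Tweepy results. This is only a minimal sketch: the helper name `compare_ids` and the toy ids below are hypothetical and not taken from the datasets in this thread.

```python
def compare_ids(sns_ids, twp_ids):
    """Compare two lists of tweet ids using set operations.

    Hypothetical helper for illustration; both arguments are plain lists of ints.
    """
    sns_set, twp_set = set(sns_ids), set(twp_ids)
    return {
        "only_snscrape": sns_set - twp_set,   # ids found only by snscrape
        "only_tweepy": twp_set - sns_set,     # ids found only by Tweepy
        "overlap": sns_set & twp_set,         # ids present in both results
        # Repeated ids inside one result list would inflate the raw counts.
        "snscrape_duplicates": len(sns_ids) - len(sns_set),
        "tweepy_duplicates": len(twp_ids) - len(twp_set),
    }


# Toy ids for demonstration only, not real tweet ids.
result = compare_ids([1, 2, 3, 3, 4], [3, 4, 5])
print(result)
```

Looking at the ids that appear only on the Tweepy side may help show whether one result is roughly a subset of the other or whether the two searches diverge in both directions.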

ajc356 commented 3 years ago

Follow-up question regarding retweets specifically: how would I be able to identify retweets in the snscrape results to confirm that there really are none present? There does not seem to be any kind of RT label in front of the text, but is that correct? Or do you have any other tips for filtering? I continue to be skeptical about whether my results contain retweets, because I am getting tweets with the same content but unique id numbers.

JustAnotherArchivist commented 3 years ago

Interesting. Your search query is quite complex. Perhaps it's an encoding issue in either snscrape or Tweepy, or the Twitter API works differently than the search on the web interface. Do you also see large differences on simple queries, e.g. a single word or hashtag?

Regarding retweets, I simply checked for retweetedTweet being None. This happens iff Twitter returns no retweeted_status_id_str (code), so it should always be accurate for native retweets, which were introduced in 2009. I'd only expect it to break down on older tweets or when people manually repost tweet contents (which is rare nowadays with the native retweets and quotes). But of course there are many tweets with the same contents that are not retweets. Bots, memes, short replies, etc. all contribute to that. There's no way to generically detect that.
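The retweetedTweet check described above could be applied to the --jsonl output roughly as follows. This is a sketch under assumptions: the helper name `split_retweets` and the two sample records are hypothetical, and only the `id` and `retweetedTweet` fields mentioned in this thread are used.

```python
import io
import json


def split_retweets(lines):
    """Split JSONL tweet records into (originals, retweets).

    A record counts as a retweet when its retweetedTweet field is not null.
    Hypothetical helper for illustration only.
    """
    originals, retweets = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        if tweet.get("retweetedTweet") is None:
            originals.append(tweet)
        else:
            retweets.append(tweet)
    return originals, retweets


# Made-up sample records standing in for a real JSONL file.
sample = io.StringIO(
    '{"id": 1, "retweetedTweet": null}\n'
    '{"id": 2, "retweetedTweet": {"id": 1}}\n'
)
originals, retweets = split_retweets(sample)
print(len(originals), "originals,", len(retweets), "retweets")
```

With a real file, an open file handle can be passed in place of the `io.StringIO` sample. As noted above, this does not catch manual reposts or unrelated tweets that happen to share the same text.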

JustAnotherArchivist commented 3 years ago

Closing as non-actionable. As far as I can tell, snscrape works as correctly as Twitter allows given the constraint of using the unauthenticated web interface rather than the API (which is used by Tweepy) – differences between the two are not entirely unexpected and constitute either bugs or intentional design limitations on Twitter's side.

Please feel free to reopen if you find any signs that this is untrue.

Mitthoon commented 2 years ago

How can I get the retweets and replies for a whole panel discussion on Twitter? Is that possible with this library? My current project involves tweets about COVID-19 topics, and I need to get the quoted tweets and all replies/conversations for the original tweets that I scraped. Is there a workaround? Looking forward to your reply.