JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Support for excluding replies #45

Closed ealgase closed 5 years ago

ealgase commented 5 years ago

It would be useful to be able to get a list of posts that aren't replies (to other people). Threads should still be included, though.

JustAnotherArchivist commented 5 years ago

This is possible with the twitter-search scraper:

snscrape twitter-search 'from:textfiles exclude:replies'

However, this also excludes the second and later posts of threads. I think it's not possible to exclude replies to other people but include threads without first collecting all tweets and then filtering: a self-reply is not always part of a thread, since the first tweet of a run may be a response to another user. But collecting all tweets may be quite memory-intensive for large scrapes (I've run scrapes with millions of results before), so I won't add that.

But the problems go further than that: while the Twitter HTML indicates whether a tweet is a reply or not and also has a list of the replied-to users, it does not contain a reference to the replied-to tweet. So it's impossible to match the tweets together. snscrape would have to request the page for each tweet individually, which would slow down the scrape by at least one or two orders of magnitude...
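
For illustration only, here is a minimal Python sketch of the collect-then-filter approach described above. It is not snscrape code. The `Tweet` record, its `in_reply_to_id` field, and the function `keep_non_replies_and_threads` are all hypothetical; in particular, the sketch assumes each tweet carries the ID of the tweet it replies to, which is exactly the piece of information the Twitter HTML does not provide. It also shows why the whole result set has to be held in memory: a self-reply can only be judged once the start of its run is known.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Tweet:
    """Hypothetical record; not an snscrape type."""
    id: int
    user: str
    # Assumption: ID of the replied-to tweet, or None for a non-reply.
    in_reply_to_id: Optional[int] = None


def keep_non_replies_and_threads(tweets: Iterable[Tweet]) -> List[Tweet]:
    """Keep non-replies plus self-reply chains that start at a non-reply."""
    collected = list(tweets)  # the entire scrape is held in memory here
    by_id: Dict[int, Tweet] = {t.id: t for t in collected}
    verdict: Dict[int, bool] = {}  # tweet ID -> keep or drop

    def is_kept(tweet: Tweet) -> bool:
        chain = []
        current = tweet
        while True:
            if current.id in verdict:
                result = verdict[current.id]
                break
            chain.append(current)
            if current.in_reply_to_id is None:
                result = True  # reached a non-reply: the run is a thread
                break
            parent = by_id.get(current.in_reply_to_id)
            if parent is None or parent.user != current.user:
                result = False  # the run starts as a reply to someone else
                break
            current = parent
        # Every tweet visited on the way shares the same verdict.
        for visited in chain:
            verdict[visited.id] = result
        return result

    return [t for t in collected if is_kept(t)]


if __name__ == "__main__":
    # Newest first, as a scrape would typically return them.
    sample = [
        Tweet(5, "textfiles", 4),    # self-reply in a run that starts at 3
        Tweet(4, "textfiles", 3),
        Tweet(3, "textfiles", 100),  # reply to another user's tweet
        Tweet(2, "textfiles", 1),    # second post of a real thread
        Tweet(1, "textfiles"),       # ordinary non-reply
    ]
    print([t.id for t in keep_non_replies_and_threads(sample)])  # [2, 1]
```

Tweets 4 and 5 are self-replies, yet they are dropped because their run begins with a reply to another user; this is the case that prevents a simple streaming filter.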