JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

TwitterThreadScraper (twitter-thread) crashes with a TypeError #176

Closed hockeybro12 closed 3 years ago

hockeybro12 commented 3 years ago

First off, thanks for the great library, I've been using TwitterSearchScraper and it works great for me.

I'm trying to use TwitterThreadScraper now like so:

for i,tweet in enumerate(sntwitter.TwitterThreadScraper(tweetID=str('1343829375804977153')).get_items()):

However, when I run it I keep getting errors like "Tweet does not exist, is not a thread, or does not have ancestors". The given tweet definitely exists (it was posted by CNN), so I'm not sure what the problem is here. Looking at the source code, maybe Twitter changed the source page so that the BeautifulSoup search doesn't work anymore? Do you know how it can be fixed?

JSMboli commented 3 years ago

I used it just now and it still works for me. However, I did not use a tweet ID; I searched for some terms and was able to retrieve the date, ID, username, content, and URL.

I also want to retrieve retweet and like counts, but I'm not sure how to do that. Thank you.

hockeybro12 commented 3 years ago

@JSMboli is that TwitterThreadScraper or TwitterSearchScraper that you used?

JustAnotherArchivist commented 3 years ago

The TwitterThreadScraper is indeed broken currently. I mentioned it briefly in a comment here but forgot to file an issue. Thanks!

hockeybro12 commented 3 years ago

@JustAnotherArchivist Thanks, do you have any idea when it may be fixed?

JustAnotherArchivist commented 3 years ago

No ETA at this time I'm afraid.

hockeybro12 commented 3 years ago

I see, is it a complex fix or is it something I could potentially do? Maybe you can provide a summary of what needs to be done and I can look into it?

JustAnotherArchivist commented 3 years ago

So there are a few different factors at play here.

  1. The scraper broke when I added extraction of the various extra information on tweets (like/reply/retweet/quote counts, source, media, etc.).
  2. The scraper is still based on the old web interface of Twitter.
  3. Your usage of the scraper never worked in the first place. The TwitterThreadScraper always went backwards in a thread, i.e. you'd start from some tweet and it'd retrieve the chain of tweets that led to this particular tweet. Recursing through replies to a tweet has never been supported, is a whole different can of worms to implement, and is tracked in #51.

The combination of points 1 and 2 means that a proper fix is unfortunately not easy. I don't want to spend time trying to extract all this information from the old interface since that will just inevitably break again anyway once Twitter decides to shut it down. But reverse-engineering the new interface and implementing that API cruft is quite time-intensive.

As a workaround until I have time to fix this properly, you can use commit 588ec415 (not on Python 3.9+ though). This works but doesn't extract all the nice extra stuff, only the most basic information.

hockeybro12 commented 3 years ago

@JustAnotherArchivist I installed that commit like so: pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git@588ec415ffc169a46752e225fcc90eacea4e65cd

But I still get the same error running for i,tweet in enumerate(sntwitter.TwitterThreadScraper(tweetID=str('1343829375804977153')).get_items()):

I'm on python version 3.8.6. I'm ok with just getting the basic stuff.

JustAnotherArchivist commented 3 years ago

Yes, see point 3 above: TwitterThreadScraper goes backwards in a thread. It can't work for 1343829375804977153 because that tweet is not a reply. But if you plug in 1343836648753291264 for example, you should get three results.
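
To illustrate the direction of traversal, here is a toy sketch that does not touch Twitter or snscrape at all. The `IN_REPLY_TO` mapping, its made-up IDs, and the `ancestors` helper are all hypothetical. It only shows why walking backwards from a tweet that is not a reply yields nothing, and it does not model whether the real scraper also returns the starting tweet.

```python
# Hypothetical reply graph: each key is a tweet ID, each value is the ID of
# the tweet it replies to (None for a tweet that starts a conversation).
IN_REPLY_TO = {
    "root": None,
    "reply1": "root",
    "reply2": "reply1",
    "other": "root",
}


def ancestors(tweet_id, in_reply_to=IN_REPLY_TO):
    """Return the chain of tweets that led to tweet_id, oldest first."""
    chain = []
    parent = in_reply_to.get(tweet_id)
    # Follow the "in reply to" links upwards until a tweet has no parent.
    while parent is not None:
        chain.append(parent)
        parent = in_reply_to.get(parent)
    chain.reverse()
    return chain


print(ancestors("reply2"))  # ['root', 'reply1']
print(ancestors("root"))    # [] because "root" is not a reply
```

Note that the sibling reply "other" is never visited when starting from "reply2": going backwards follows a single chain and does not enumerate the replies to a tweet.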

hockeybro12 commented 3 years ago

I see, I did that and it does work. So basically there is no way to get the replies to a tweet? Or is there some hacky (maybe slow) way to do it?

JustAnotherArchivist commented 3 years ago

Looks like using the search scraper with a query like conversation_id:1343829375804977153 (filter:safe OR -filter:safe) works to some degree, but only for tweets that are not replies themselves and it might miss some replies due to the general search constraints.
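
A minimal sketch of that workaround, assuming snscrape is installed and that `TwitterSearchScraper` takes the query string as its first argument. The helper names `build_conversation_query` and `iter_conversation` are hypothetical and not part of snscrape; the query text is copied from the comment above.

```python
import itertools


def build_conversation_query(tweet_id):
    """Build the search query quoted above for the given tweet ID."""
    return f"conversation_id:{tweet_id} (filter:safe OR -filter:safe)"


def iter_conversation(tweet_id, limit=None):
    """Yield search results for the conversation, at most `limit` of them."""
    # Imported lazily so the query helper can be used without snscrape.
    import snscrape.modules.twitter as sntwitter

    scraper = sntwitter.TwitterSearchScraper(build_conversation_query(tweet_id))
    # limit=None means "no cap"; the search itself may still miss replies.
    return itertools.islice(scraper.get_items(), limit)


# Example usage (needs network access, so it is left commented out):
# for tweet in iter_conversation("1343829375804977153", limit=20):
#     print(tweet.id, tweet.url)
```

Capping the iteration with `limit` keeps experiments short, since `get_items()` is a generator that keeps requesting further result pages as long as it is consumed.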

hockeybro12 commented 3 years ago

@JustAnotherArchivist Oh, awesome! That actually works fine for me for now, I'm ok with missing some replies and I don't want tweets that are replies themselves (even if I do I could just backtrack them). Thanks for all the help!!! I'll leave the issue open for the future unless you want to close it.

JustAnotherArchivist commented 3 years ago

Happy to help. Yes, this should stay open until the thread scraper is either fixed or replaced.

JSMboli commented 3 years ago

Hi @JustAnotherArchivist, is there a limit to the number of tweets one can scrape when using snscrape?