ferguras / twitter-analysis

Scrape the Twitter Frontend API without authentication.
MIT License

Retrieving more results #3

Closed yosh7 closed 5 years ago

yosh7 commented 6 years ago

Kudos for the project - really helpful. The only issue is the limit of 450 tweets (I managed to get 600+ in some cases). The limitation appears to be in the number of pages that can be retrieved; the run eventually ends with the following error:

  File "C:/pyprojects/twitter-cluster/twitter_analysis.py", line 123, in <module>
    for tweet in get_tweets('realDonaldTrump', tweets=10000,maxpages=1000):
  File "C:/pyprojects/twitter-cluster/twitter_analysis.py", line 119, in get_tweets
    yield from gen_tweets(tweets, retweets, notext, adddot, maxpages)
  File "C:/pyprojects/twitter-cluster/twitter_analysis.py", line 27, in gen_tweets
    url='bunk', default_encoding='utf-8')
  File "C:\pyprojects\twitter-cluster\venv\lib\site-packages\requests_html.py", line 419, in __init__
    element=PyQuery(html)('html') or PyQuery(f'<html>{html}</html>')('html'),
  File "C:\pyprojects\twitter-cluster\venv\lib\site-packages\pyquery\pyquery.py", line 255, in __init__
    elements = fromstring(context, self.parser)
  File "C:\pyprojects\twitter-cluster\venv\lib\site-packages\pyquery\pyquery.py", line 99, in fromstring
    result = getattr(lxml.html, meth)(context)
  File "C:\pyprojects\twitter-cluster\venv\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\pyprojects\twitter-cluster\venv\lib\site-packages\lxml\html\__init__.py", line 765, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty

When I try to do an endless scroll manually in Twitter, I manage to get deeper into the history, which suggests there might be a solution to the issue. I'd be happy to assist (although I haven't managed to solve this myself yet...)

ferguras commented 6 years ago

My guess is that the manual scroll uses a different URL to get its data. Maybe we could fix it if you are able to figure out how that works.

The error could be caught, but that doesn't solve your issue. If you need the tweets from Trump, they are available on the internet. Another solution I made requires the tweet IDs of the tweets you need.

yosh7 commented 6 years ago

When I double-checked I found that the web version fails at exactly the same point as the scraper, so unfortunately this is a barrier we can't get through. My tweets are not limited to Trump, but if you can point me to a DB with thousands of tweets from the same user (possibly Trump, but any celebrity/politician will do), that would be awesome!

ferguras commented 6 years ago

Trump Twitter archive: http://www.trumptwitterarchive.com/archive

And you may be able to find archives for other people if you search for "Twitter archive".

I'll add a catch for the error so that at least you are able to get the maximum number of tweets available.
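A minimal sketch of what that catch could look like - `gen_pages()` and `fetch_page()` are hypothetical names standing in for the request/parse step inside `gen_tweets()`, not the actual functions in twitter_analysis.py:

```python
# Sketch: stop paginating when Twitter returns an empty body instead
# of crashing with lxml.etree.ParserError ("Document is empty").
try:
    from lxml.etree import ParserError  # the error seen in the traceback
except ImportError:                     # fallback so this sketch runs without lxml
    class ParserError(Exception):
        pass

def gen_pages(fetch_page, maxpages):
    """Yield parsed pages until fetch_page signals the end of history."""
    for page_no in range(maxpages):
        try:
            yield fetch_page(page_no)
        except ParserError:
            # Empty document: no deeper history is available, so stop.
            break

# Stub that "runs out" after three pages, mimicking the observed limit.
def fake_fetch(page_no):
    if page_no >= 3:
        raise ParserError("Document is empty")
    return f"page-{page_no}"

pages = list(gen_pages(fake_fetch, maxpages=1000))
print(pages)  # the first three pages, then a clean stop
```

With a guard like this, hitting the page limit simply ends the generator, so the caller still receives every tweet retrieved up to that point.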

jonashaag commented 6 years ago

Same here