bocchilorenzo / ntscraper

Scrape from Twitter using Nitter instances
MIT License
176 stars 29 forks

How to skip empty page #50

Open selihadji opened 10 months ago

selihadji commented 10 months ago

Hello, thank you very much for sharing this nice tool. However, I have an issue. I use Jupyter Notebook with the default settings (e.g., searching for a term using random instances). After a while, I get a message such as: 02-Jan-24 12:40:55 - Empty page on https://nitter.soopy.moe/

Since I would like to fetch a large number of tweets, I would like to skip this and restart from where it stopped. Is that possible?

AritzUMA commented 9 months ago

You can write a loop in Python that checks whether the variable holding the tweets is empty and, if its length is 0, starts the request again.
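A minimal sketch of that retry idea, not a tested recipe. The helper name `fetch_until_non_empty` and its parameters are hypothetical; it only assumes that the fetch call returns a dict with a 'tweets' list, as `scraper.get_tweets(...)` does in the snippet further down this thread.

```python
import time
from typing import Callable, Dict, List


def fetch_until_non_empty(
    fetch: Callable[[], Dict],
    max_attempts: int = 5,
    delay_seconds: float = 2.0,
) -> List[dict]:
    """Call fetch() until its 'tweets' list is non-empty or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        tweets = result.get("tweets", [])
        if tweets:
            return tweets
        # Empty page: wait a little before asking again, except after the last try.
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return []  # every attempt came back empty


# Possible use with ntscraper (assumption: each call may land on a different
# random instance when no instance is given):
# tweets = fetch_until_non_empty(lambda: scraper.get_tweets("some term", mode="term"))
```

Capping the number of attempts keeps the loop from spinning forever when the history has genuinely ended rather than an instance having returned an empty page.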

BradKML commented 9 months ago

My current hack (also related to https://github.com/bocchilorenzo/ntscraper/issues/53)

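!pip install ntscraper tqdm  # notebook shell command; in a terminal, drop the leading "!"
from datetime import date, datetime, timedelta

from ntscraper import Nitter

scraper = Nitter(log_level=1, skip_instance_check=False)
user = "visakanv"  # replace with the account you want to scrape
cache = []
until = date.today()  # defaults to the day of the query; renamed so it no longer shadows datetime.date
while True:
    dl_tweets = scraper.get_tweets(user, mode='user', until=str(until))['tweets']
    if len(dl_tweets) == 0:
        break  # stop when the end of the history (or an empty page) is reached
    cache.extend(dl_tweets)  # append this batch to the cache
    # Keep only the date part, dropping time and timezone. This assumes the Nitter
    # format "Jan 2, 2024 · 12:40 PM UTC"; a fixed [:12] slice would leave a trailing
    # space on single-digit days, which strptime rejects.
    date_str = dl_tweets[-1]['date'].split('·')[0].strip()
    # Resume one day after the last tweet's date so that day is covered in full.
    next_until = datetime.strptime(date_str, '%b %d, %Y').date() + timedelta(days=1)
    if next_until >= until:
        break  # no progress since the previous request; avoid an endless loop
    until = next_until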

I would convert the output into a DataFrame if I had the time, but at the moment it seems to work.
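One caveat with the loop above: because each request restarts one day after the last tweet's date, consecutive batches can overlap, so `cache` may contain the same tweet more than once. A small sketch for removing duplicates before any DataFrame conversion follows. The helper `dedupe_tweets` is hypothetical, and it assumes each tweet dict carries a unique 'link' value, which appears to be the case for ntscraper output but should be checked against your own data.

```python
from typing import Dict, List


def dedupe_tweets(tweets: List[Dict], key: str = "link") -> List[Dict]:
    """Return tweets in their original order, dropping repeats of the same key value."""
    seen = set()
    unique = []
    for tweet in tweets:
        identifier = tweet.get(key)
        if identifier is None:
            # No identifier to compare on: keep the tweet rather than risk losing it.
            unique.append(tweet)
        elif identifier not in seen:
            seen.add(identifier)
            unique.append(tweet)
    return unique
```

With pandas installed, `pandas.DataFrame(dedupe_tweets(cache))` would then give one row per tweet, with nested fields left as dict-valued columns.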