bocchilorenzo / ntscraper

Scrape from Twitter using Nitter instances
MIT License
182 stars 29 forks

Error in getting information from Nitter instances #5

Closed redfalcoon closed 1 year ago

redfalcoon commented 1 year ago

Hi!

Starting today, scraping no longer works correctly and returns this error:

    Traceback (most recent call last):
      File "CrawlerNitter.py", line 7, in <module>
        user_information = scraper.get_profile_info(user)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 587, in get_profile_info
        is_encrypted = self.is_instance_encrypted(instance)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 43, in is_instance_encrypted
        instance_new, soup = self.__get_page("/Twitter", instance)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 106, in __get_page
        if soup.find_all("div", class_="show-more")[-1].find("a").text == "Load newest":
    IndexError: list index out of range

This happens with user_information = scraper.get_profile_info(user), and also with user_tweets = scraper.get_tweets(user, mode='user', number=posts).

Until yesterday the crawling process worked fine. The instance used for the test was https://nitter.net (the most recently updated), with Version [2023.07.24-20b5cce].

Any suggestions?

Thanks
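For context, the IndexError reported in this comment is what Python raises when `[-1]` is applied to an empty list, which is what `find_all` gives back when the page contains no matching `div`. Below is a minimal sketch of a guard, using a plain list in place of the BeautifulSoup result. `last_or_none` is a hypothetical helper for illustration, not part of ntscraper.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def last_or_none(items: Sequence[T]) -> Optional[T]:
    """Return the last element, or None when the sequence is empty."""
    return items[-1] if items else None


# Stand-in (assumption) for soup.find_all("div", class_="show-more")
# on a page that has no such element: the result is an empty list.
matches: list = []

try:
    matches[-1]
except IndexError as exc:
    error_message = str(exc)  # the same message as in the traceback

safe_last = last_or_none(matches)  # no exception, just None
```

With a guard like this, the caller can treat `None` as "this instance returned an unexpected page" and move on to another instance instead of crashing.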

psegovias commented 1 year ago

I am facing the same error. From what I can see, Twitter broke several operators in its search.

muenze8 commented 1 year ago

Hello

I am having the same issue. The Nitter UI returns results for the queries I am passing in the code, but ntscraper keeps showing "error fetching from instances". And yes, passing the same search query into Twitter doesn't return anything either!

psegovias commented 1 year ago

The Twitter search operators OR and AND are broken, but I can't identify how this broke the scraper.

psegovias commented 1 year ago

I found the error, incoming pull request @redfalcoon @muenze8

redfalcoon commented 1 year ago

Patricio,

With your patch, user_information = scraper.get_profile_info(user) works fine, but user_tweets = scraper.get_tweets(user, mode='user', number=posts) still raises the same error.

Thanks

psegovias commented 1 year ago

@redfalcoon please paste your code so I can try to replicate it in my environment.

redfalcoon commented 1 year ago

Patricio

this is the source that I use:

    import sys
    import json
    from ntscraper import Nitter

    scraper = Nitter(log_level=None)

    server = sys.argv[1]
    user = 'elonmusk'
    posts = 100
    user_information = scraper.get_profile_info(user)
    print(json.dumps(user_information, indent=4))
    user_tweets = scraper.get_tweets(user, mode='user', number=posts, instance=server)
    print(json.dumps(user_tweets, indent=4))

To execute the code I use:

python3 CrawlerNitter.py https://nitter.net

    Traceback (most recent call last):
      File "CrawlerNitter.py", line 9, in <module>
        user_information = scraper.get_profile_info(user)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 587, in get_profile_info
        is_encrypted = self.is_instance_encrypted(instance)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 43, in is_instance_encrypted
        instance_new, soup = self.__get_page("/x", instance)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 106, in __get_page
        if soup.find_all("div", class_="show-more")[-1].find("a").text == "Load Newest":
    IndexError: list index out of range

psegovias commented 1 year ago

I got the same result with user='elonmusk'. Then I tried user='jack' and it succeeded.

    {
        "image": "https://pbs.twimg.com/profile_images/1661201415899951105/azNjKOSH_400x400.jpg",
        "name": "jack",
        "username": "@jack",
        "id": "12",
        "bio": "#bitcoin and chill.\n\n#nostr: npub1sg6plzptd64u62a878hep2kev88swjh3tw00gjsfl8f237lmu63q0uf63m",
        "location": "",
        "website": "https://primal.net/jack",
        "joined": "8:50 PM - 21 Mar 2006",
        "stats": { "tweets": 29315, "following": 4658, "followers": 6523202, "likes": 35955, "media": 2917 }
    }
    27-Jul-23 10:49:22 - Empty profile on http://localhost, trying another random instance
    27-Jul-23 10:49:23 - Error fetching https://nitter.onthescent.xyz/, trying another random instance
    27-Jul-23 10:49:25 - Error fetching https://twitter.owacon.moe, trying another random instance
    {
        "tweets": [
            {
                "link": "https://twitter.com/sza/status/1684461977647845377#m",
                "text": "DND: a lifestyle",
                "user": { "name": "SZA", "username": "@sza", "avatar": "https://pbs.twimg.com/profile_images/1600953582605328384/t5skIcVh_bigger.jpg" },
                "date": "Jul 27, 2023 \u00b7 7:13 AM UTC",
                "is-retweet": true,
                "external-link": "",
                "quoted-post": {},
                "stats": { "comments": 0, "retweets": 20415, "quotes": 0, "likes": 42522 },
                "pictures": [],
                "videos": [],
                "gifs": []
            },
            {
                "link": "https://twitter.com/lilyallen/status/1684338954550616069#m",
                "text": "The world really did a number on Sinead O\u2019Oconnor. RIP Angel.",
                "user": { "name": "Lily Allen", "username": "@lilyallen", "avatar": "https://pbs.twimg.com/profile_images/1562055589584404483/C5v4nJvC_bigger.jpg" },
                "date": "Jul 26, 2023 \u00b7 11:04 PM UTC",
                "is-retweet": true,
                "external-link": "",
                "quoted-post": {},
                "stats": { "comments": 158, "retweets": 623, "quotes": 0, "likes": 17847 },
                "pictures": [],
                "videos": [],
                "gifs": []
            },
            {
                "link": "https://twitter.com/jack/status/1684455493333557251#m",
                "text": "https://piped.video/watch?v=DOiDUbaBL9E",
                "user": { "name": "jack", "username": "@jack", "avatar": "https://pbs.twimg.com/profile_images/1661201415899951105/azNjKOSH_bigger.jpg" },
                "date": "Jul 27, 2023 \u00b7 6:47 AM UTC",
                "is-retweet": false,
                "external-link": "",
                "quoted-post": {},
                "stats": { "comments": 90, "retweets": 179, "quotes": 0, "likes": 880 },
                "pictures": [],
                "videos": [],
                "gifs": []
            },
    etc....

redfalcoon commented 1 year ago

Patricio,

I've changed the user to jack and the result changes a little. This is the output:

    python3 CrawlerNitter.py https://nitter.net
    27-Jul-23 16:59:05 - Error fetching https://nitter.unixfox.eu, trying another random instance
    27-Jul-23 16:59:07 - Error fetching https://nitter.tokhmi.xyz, trying another random instance
    {
        "image": "https://pbs.twimg.com/profile_images/1661201415899951105/azNjKOSH_400x400.jpg",
        "name": "jack",
        "username": "@jack",
        "id": "12",
        "bio": "#bitcoin and chill.\n\n#nostr: npub1sg6plzptd64u62a878hep2kev88swjh3tw00gjsfl8f237lmu63q0uf63m",
        "location": "",
        "website": "https://primal.net/jack",
        "joined": "8:50 PM - 21 Mar 2006",
        "stats": { "tweets": 29315, "following": 4658, "followers": 6523202, "likes": 35955, "media": 2917 }
    }
    27-Jul-23 16:59:15 - Error fetching https://nitter.net, trying another random instance
    Traceback (most recent call last):
      File "CrawlerNitter.py", line 11, in <module>
        user_tweets = scraper.get_tweets(user, mode='user', number=posts, instance=server)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 572, in get_tweets
        return self.search(term, mode, number, since, until, max_retries, instance)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 491, in __search
        to_append = self.extract_tweet(tweet, is_encrypted)
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 409, in extract_tweet
        if quoted_tweet
      File "/usr/local/lib/python3.6/site-packages/ntscraper/nitter.py", line 345, in get_tweet_link
        return "https://twitter.com" + tweet.find("a")["href"]
    TypeError: 'NoneType' object is not subscriptable
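The TypeError at the end of this output is what Python raises when `["href"]` is applied to `None`, and `find("a")` returns `None` when the element contains no link. The following is only a sketch of a guard under that assumption. `link_or_empty` is a hypothetical helper, and a plain dict stands in for the BeautifulSoup tag.

```python
from typing import Mapping, Optional


def link_or_empty(anchor: Optional[Mapping[str, str]]) -> str:
    """Build the tweet URL, or return "" when no anchor was found."""
    if anchor is None:
        return ""
    return "https://twitter.com" + anchor["href"]


# Stand-ins (assumptions) for tweet.find("a"): a dict mimics a tag's
# attribute lookup, and None is what find() returns on no match.
found = {"href": "/jack/status/1684455493333557251#m"}
missing = None

try:
    missing["href"]
except TypeError as exc:
    error_message = str(exc)  # mentions 'NoneType', as in the traceback

good_link = link_or_empty(found)
empty_link = link_or_empty(missing)
```

Returning an empty string matches how the output above represents absent values such as "external-link", but whether that is the right fallback for the link field is a design choice for the maintainers.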

I was able to read the profile information but not the posts. I changed nitter.py to print the instance in use, and I verified that not all of the randomly selected instances produce correct results, as you can see in the log above. With this patch (which only prints the instance used) I got the following result:

    python3 CrawlerNitter.py https://nitter.net
    27-Jul-23 17:06:02 - https://ntr.frail.duckdns.org unreachable, trying another random instance
    27-Jul-23 17:06:11 - https://nitter.inpt.fr unreachable, trying another random instance
    https://nitter.riverside.rocks
    https://nitter.riverside.rocks
    {
        "image": "https://pbs.twimg.com/profile_images/1661201415899951105/azNjKOSH_400x400.jpg",
        "name": "jack",
        "username": "@jack",
        "id": "12",
        "bio": "#bitcoin and chill.\n\n#nostr: npub1sg6plzptd64u62a878hep2kev88swjh3tw00gjsfl8f237lmu63q0uf63m",
        "location": "",
        "website": "https://primal.net/jack",
        "joined": "8:50 PM - 21 Mar 2006",
        "stats": { "tweets": 29315, "following": 4658, "followers": 6523212, "likes": 35955, "media": 2917 }
    }
    https://nitter.net
    27-Jul-23 17:06:17 - Error fetching https://nitter.net, trying another random instance
    https://nitter.freedit.eu
    https://nitter.freedit.eu
    https://nitter.freedit.eu
    https://nitter.freedit.eu
    https://nitter.freedit.eu
    https://nitter.freedit.eu
    27-Jul-23 17:06:34 - Empty profile on https://nitter.freedit.eu, trying another random instance
    https://nitter.tokhmi.xyz
    27-Jul-23 17:06:36 - Error fetching https://nitter.tokhmi.xyz, trying another random instance
    https://nitter.weiler.rocks
    {
        "tweets": [
            {
                "link": "https://twitter.com/sza/status/1684461977647845377#m",
                "text": "DND: a lifestyle",
                "user": { "name": "SZA", "username": "@sza", "avatar": "https://pbs.twimg.com/profile_images/1600953582605328384/t5skIcVh_bigger.jpg" },
                "date": "Jul 27, 2023 \u00b7 7:13 AM UTC",
                "is-retweet": true,
                "external-link": "",
                "quoted-post": {},
                "stats": { "comments": 0, "retweets": 21292, "quotes": 0, "likes": 44352 },
                "pictures": [],
                "videos": [],
                "gifs": []

It seems that the Nitter instances respond in different ways, which leads to misinterpretations. https://nitter.riverside.rocks Both instances that gave the correct behaviour are updated to Version [2023.07.24-20b5cce] (nitter.net has the same version).
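Since instances answer differently, one way to cope is to try a list of instances in order and keep the first usable answer. The sketch below illustrates that idea only. `first_working` and `fake_fetch` are hypothetical names, the example hosts are made up, and no real ntscraper call is used.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_working(
    instances: Iterable[str],
    fetch: Callable[[str], Optional[dict]],
) -> Tuple[Optional[str], Optional[dict]]:
    """Return (instance, result) for the first instance that yields data."""
    for instance in instances:
        try:
            result = fetch(instance)
        except Exception:
            # Unreachable or misbehaving instance: move on to the next one.
            continue
        if result:  # skip empty answers such as None or {}
            return instance, result
    return None, None


def fake_fetch(instance: str) -> Optional[dict]:
    """Stub used only for this example; it performs no network access."""
    if instance == "https://down.example":
        raise ConnectionError("unreachable")
    if instance == "https://empty.example":
        return {}
    return {"tweets": ["ok"]}


chosen, data = first_working(
    ["https://down.example", "https://empty.example", "https://good.example"],
    fake_fetch,
)
```

In real use, `fetch` would wrap whatever call retrieves the profile or tweets from a given instance, so a bad instance costs one failed attempt rather than the whole run.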

psegovias commented 1 year ago

it could be something related to the instance and not to the scraper

bocchilorenzo commented 1 year ago

Thanks @psegovias, I just merged your pull request. I also think there are some issues with some instances. For example, the nitter.tokhmi.xyz instance is having issues right now, as seen here: [image]

Closing the issue for now.

redfalcoon commented 1 year ago

Hi Lorenzo and Patricio, I've made a small patch to the code and now the process ends without errors. In the code I changed the following block (lines 113-121)

                if soup.find_all("div", class_="show-more")[-1].find("a").text == "Load newest":
                    keep_trying = False
                    soup = None
                else:
                    logging.warning(
                        f"Empty profile on {instance}, trying another random instance"
                    )
                    instance = self.get_random_instance()
                    count += 1

with:

                try:
                    if soup.find_all("div", class_="show-more")[-1].find("a").text == "Load Newest":
                        keep_trying = False
                        soup = None
                    else:
                        logging.warning(
                            f"Empty profile on {instance}, trying another random instance"
                        )
                        instance = self.get_random_instance()
                        count += 1
                except:
                    continue

and now the scraper no longer raises the error. It's a workaround and not really optimized; I believe the random selection of a new instance could be handled in the except section, but as it stands it seems to work well and solves my issue. Thank you for your great work.
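As a possible refinement of this workaround, the except branch could catch only IndexError and pick a new instance there, with a bounded number of attempts. A bare `except:` also hides unrelated failures such as KeyboardInterrupt, and if the surrounding loop retries without changing the instance or the counter it may keep hitting the same page. The sketch below is not the library's code. `pick_usable_instance` and `show_more_texts` are hypothetical names, and the callable stands in for the `soup.find_all(...)` lookup.

```python
import logging
import random
from typing import Callable, List, Optional


def pick_usable_instance(
    instances: List[str],
    show_more_texts: Callable[[str], List[str]],
    max_retries: int = 5,
) -> Optional[str]:
    """Return an instance whose page has a "show-more" entry, else None."""
    instance = random.choice(instances)
    for _ in range(max_retries):
        try:
            # Stand-in (assumption) for
            # soup.find_all("div", class_="show-more")[-1].find("a").text
            show_more_texts(instance)[-1]
            return instance
        except IndexError:
            logging.warning(
                f"No show-more element on {instance}, trying another random instance"
            )
            # Handle the instance change here instead of a bare `continue`.
            instance = random.choice(instances)
    return None
```

Because the retry count is bounded, the function returns `None` when every attempt fails, so the caller can decide whether to report an error or give up quietly.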