JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.51k stars 712 forks

Use original tweet info if note tweet is empty #954

Closed by novucs 1 year ago

novucs commented 1 year ago

Fixes exception in _make_tweet:

kwargs['rawContent'] = noteTweet['text']
KeyError: 'text'

Reproducible when running TwitterProfileScraper for user: petemoore
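For readers unfamiliar with this code path, the fallback named in the title can be sketched roughly as below. This is a hypothetical stand-alone helper, not the actual `_make_tweet` code from snscrape: the name `pick_raw_content` and the argument shapes are assumptions, based only on the key names quoted in this thread (`text` on the note tweet, `full_text` on the tweet).

```python
from typing import Optional


def pick_raw_content(tweet: dict, note_tweet: Optional[dict]) -> str:
    """Hypothetical helper: prefer the note tweet's text, else use the tweet's full_text."""
    # Use the extended text only when the note tweet actually carries it.
    if note_tweet and 'text' in note_tweet:
        return note_tweet['text']
    # Fallback: the note tweet is missing, empty, or has no 'text' key.
    return tweet['full_text']


# A populated note tweet wins; an empty one falls back to full_text.
print(pick_raw_content({'full_text': 'short'}, {'text': 'long form'}))
print(pick_raw_content({'full_text': 'short'}, {}))
```

Note that the fallback is silent: the caller cannot tell from the return value whether the extended text was available.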

JustAnotherArchivist commented 1 year ago

Interesting. I'm not able to reproduce it. Was this a temporary problem, or do you still see it?

novucs commented 1 year ago

> Interesting. I'm not able to reproduce it. Was this a temporary problem, or do you still see it?

Unfortunately I'm no longer able to reproduce this issue for this particular user either. I can assure you it certainly was an issue, but as I'm using my own fork of this, I've not got any more up-to-date profiles to reproduce this error on. This did fix the issue for the user at the time. IIRC, I got back an empty note tweet {}, and a fully populated tweet['full_text'].

JustAnotherArchivist commented 1 year ago

I don't suppose you still have the ID of the tweet that caused it?

The reason I'm hesitant about this is that I'd rather have something crash than return incomplete data. For example, if the affected tweets are note tweets that occasionally, at random, don't return the extended text, silently ignoring that would be very bad.
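The "crash rather than return incomplete data" preference can be illustrated with a minimal sketch. Everything here is hypothetical: the exception class `IncompleteNoteTweetError` and the helper `require_note_text` are names invented for illustration and are not part of snscrape.

```python
class IncompleteNoteTweetError(Exception):
    """Hypothetical error: a note tweet was returned without its extended text."""


def require_note_text(note_tweet: dict) -> str:
    """Return the note tweet's text, or fail loudly instead of falling back."""
    try:
        return note_tweet['text']
    except KeyError:
        # Surface the problem with some context rather than returning partial data.
        raise IncompleteNoteTweetError(
            f"note tweet {note_tweet.get('id')!r} has no 'text' field"
        ) from None
```

Compared with a bare `KeyError`, a dedicated exception lets a caller decide per tweet whether to abort, retry, or log and skip, while still making the gap impossible to miss.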

novucs commented 1 year ago

I've found another occurrence! User ID: 58992791

Note tweet looks like this:

{'id': 'Tm90ZVR3ZWV0OjE2NTcwNzIxNzE5ODQ4NTUwNDA='}

Tweet looks like this:

{'bookmark_count': 0, 'bookmarked': False, 'created_at': 'Fri May 12 17:16:07 +0000 2023', 'conversation_control': {'policy': 'ByInvitation', 'conversation_owner_results': {'result': {'__typename': 'User', 'legacy': {'screen_name': 'pepecoinethl'}}}}, 'conversation_id_str': '1657072172064440326', 'display_text_range': [0, 279], 'entities': {'media': [{'display_url': 'pic.twitter.com/9DbGlWJ8DA', 'expanded_url': 'https://twitter.com/pepecoinethl/status/1657072172064440326/photo/1', 'id_str': '1657071836612395024', 'indices': [280, 303], 'media_url_https': 'https://pbs.twimg.com/media/Fv8aDkyWABA163c.jpg', 'type': 'photo', 'url': 'https://t.co/9DbGlWJ8DA', 'features': {'large': {'faces': [{'x': 806, 'y': 1341, 'h': 143, 'w': 143}]}, 'medium': {'faces': [{'x': 535, 'y': 890, 'h': 94, 'w': 94}]}, 'small': {'faces': [{'x': 303, 'y': 504, 'h': 53, 'w': 53}]}, 'orig': {'faces': [{'x': 806, 'y': 1341, 'h': 143, 'w': 143}]}}, 'sizes': {'large': {'h': 1808, 'w': 1449, 'resize': 'fit'}, 'medium': {'h': 1200, 'w': 962, 'resize': 'fit'}, 'small': {'h': 680, 'w': 545, 'resize': 'fit'}, 'thumb': {'h': 150, 'w': 150, 'resize': 'crop'}}, 'original_info': {'height': 1808, 'width': 1449, 'focus_rects': [{'x': 0, 'y': 273, 'w': 1449, 'h': 811}, {'x': 0, 'y': 0, 'w': 1449, 'h': 1449}, {'x': 0, 'y': 0, 'w': 1449, 'h': 1652}, {'x': 226, 'y': 0, 'w': 904, 'h': 1808}, {'x': 0, 'y': 0, 'w': 1449, 'h': 1808}]}}], 'user_mentions': [], 'urls': [{'display_url': 'stake-pepe.net', 'expanded_url': 'https://stake-pepe.net', 'url': 'https://t.co/SJS6UydK0N', 'indices': [175, 198]}], 'hashtags': [{'indices': [159, 164], 'text': 'PEPE'}, {'indices': [203, 210], 'text': 'CRYPTO'}, {'indices': [211, 215], 'text': 'ETH'}, {'indices': [216, 220], 'text': 'BTC'}, {'indices': [221, 226], 'text': 'BAYC'}, {'indices': [227, 233], 'text': 'LADYS'}, {'indices': [234, 243], 'text': 'MONGARMY'}, {'indices': [244, 249], 'text': 'CAPO'}, {'indices': [250, 254], 'text': 'BOB'}, {'indices': [255, 264], 'text': 
'PEPEARMY'}, {'indices': [265, 270], 'text': 'DEFI'}, {'indices': [271, 278], 'text': 'MILADY'}], 'symbols': [{'indices': [0, 5], 'text': 'PEPE'}, {'indices': [113, 118], 'text': 'PEPE'}]}, 'extended_entities': {'media': [{'display_url': 'pic.twitter.com/9DbGlWJ8DA', 'expanded_url': 'https://twitter.com/pepecoinethl/status/1657072172064440326/photo/1', 'id_str': '1657071836612395024', 'indices': [280, 303], 'media_key': '3_1657071836612395024', 'media_url_https': 'https://pbs.twimg.com/media/Fv8aDkyWABA163c.jpg', 'type': 'photo', 'url': 'https://t.co/9DbGlWJ8DA', 'ext_media_availability': {'status': 'Available'}, 'features': {'large': {'faces': [{'x': 806, 'y': 1341, 'h': 143, 'w': 143}]}, 'medium': {'faces': [{'x': 535, 'y': 890, 'h': 94, 'w': 94}]}, 'small': {'faces': [{'x': 303, 'y': 504, 'h': 53, 'w': 53}]}, 'orig': {'faces': [{'x': 806, 'y': 1341, 'h': 143, 'w': 143}]}}, 'sizes': {'large': {'h': 1808, 'w': 1449, 'resize': 'fit'}, 'medium': {'h': 1200, 'w': 962, 'resize': 'fit'}, 'small': {'h': 680, 'w': 545, 'resize': 'fit'}, 'thumb': {'h': 150, 'w': 150, 'resize': 'crop'}}, 'original_info': {'height': 1808, 'width': 1449, 'focus_rects': [{'x': 0, 'y': 273, 'w': 1449, 'h': 811}, {'x': 0, 'y': 0, 'w': 1449, 'h': 1449}, {'x': 0, 'y': 0, 'w': 1449, 'h': 1652}, {'x': 226, 'y': 0, 'w': 904, 'h': 1808}, {'x': 0, 'y': 0, 'w': 1449, 'h': 1808}]}}]}, 'favorite_count': 11844, 'favorited': False, 'full_text': '$PEPE has surpassed 100,000 holders! \nTo honor this milestone, we have decided to launch staking support for the $PEPE token!  \n\n👇 Stake only via the official #PEPE website!!\nhttps://t.co/SJS6UydK0N\n\n--\n#CRYPTO #ETH #BTC #BAYC #LADYS #MONGARMY #CAPO #BOB #PEPEARMY #DEFI #MILADY… https://t.co/9DbGlWJ8DA', 'is_quote_status': False, 'lang': 'en', 'possibly_sensitive': False, 'possibly_sensitive_editable': True, 'quote_count': 1, 'reply_count': 34, 'retweet_count': 14352, 'retweeted': False, 'user_id_str': '1165187445534658560', 'id_str': '1657072172064440326'}

Tweet: https://twitter.com/pepecoinethl/status/1657072172064440326
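As a side note on the note tweet `id` quoted above: the string is Base64, and decoding it with the standard library shows a type label followed by a numeric ID. The sketch below only decodes the value from this thread; what the numeric part refers to is not confirmed anywhere in the discussion.

```python
import base64

# The 'id' value of the (otherwise empty) note tweet quoted in this thread.
note_id = 'Tm90ZVR3ZWV0OjE2NTcwNzIxNzE5ODQ4NTUwNDA='

decoded = base64.b64decode(note_id).decode('ascii')
kind, _, numeric_id = decoded.partition(':')

print(decoded)     # NoteTweet:1657072171984855040
print(kind)        # NoteTweet
print(numeric_id)  # 1657072171984855040
```

The decoded number is close to, but not the same as, the tweet's `id_str` (1657072172064440326).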

JustAnotherArchivist commented 1 year ago

Thanks! The full_text ending with an ellipsis, and the account's other tweets all ending with two dozen hashtags and cashtags, both indicate that this is indeed a broken, truncated note tweet. So I don't think silently ignoring that is a good idea.
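The ellipsis observation could be turned into a rough check along these lines. This is only a heuristic sketch, not something snscrape does: the helper name `looks_truncated` is hypothetical, and it assumes that `display_text_range` can be applied directly as Python string indices on `full_text`.

```python
def looks_truncated(tweet: dict) -> bool:
    """Heuristic (assumption): the displayed part of full_text ends with an ellipsis."""
    text = tweet['full_text']
    # Restrict to the displayed range, which excludes a trailing media link if present.
    start, end = tweet.get('display_text_range', [0, len(text)])
    return text[start:end].rstrip().endswith('…')


# Synthetic examples, not real API responses.
print(looks_truncated({'full_text': 'abc… https://t.co/x', 'display_text_range': [0, 4]}))
print(looks_truncated({'full_text': 'hello world'}))
```

An ordinary tweet can legitimately end with an ellipsis, so a check like this can produce false positives; it would be more useful as a trigger for raising or logging than as proof of truncation.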