JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.47k stars, 709 forks

Note tweets not complete #826

Closed: samuell07 closed this issue 1 year ago

samuell07 commented 1 year ago

Describe the bug

Hi, I have been scraping data from multiple profiles for some time, so I am not sure whether this is a new issue.

When a user has a blue check mark, they can post longer tweets, but only part of the tweet's text is returned.

For example, the tweet https://twitter.com/RaviBhalla/status/1643041677232295936 contains:

"A beautiful day to officially open the new field at the Northwest Resiliency Park with our softball and baseball teams! Couldn’t be prouder to have this available for the spring season, to help us bolster our recreation programs for both children and adults. This is a great moment for Hoboken, as the first athletic field to open in 10 years. I hope everyone enjoys! Looking forward to the grand opening of the rest of the park in June!"

but snscrape returns:

"A beautiful day to officially open the new field at the Northwest Resiliency Park with our softball and baseball teams! Couldn’t be prouder to have this available for the spring season, to help us bolster our recreation programs for both children and adults. This is a great…"

How to reproduce

The code below reproduces the problem. The first part uses the search scraper, which is where I found the issue; the second part fetches the specific tweet directly.

from datetime import date

import dateutil.relativedelta
import snscrape.modules.twitter as sntwitter

# Part 1: the search scraper, where the issue was first noticed.
# Search operators are separated by spaces, not commas.
from_date = date.today() - dateutil.relativedelta.relativedelta(days=1)
tweets_list = list(
    sntwitter.TwitterSearchScraper(
        f'since:{from_date} from:RaviBhalla'
    ).get_items()
)
print(tweets_list)

# Part 2: fetch the specific tweet directly by its ID.
tweets_list = list(
    sntwitter.TwitterTweetScraper(
        tweetId="1643041677232295936"
    ).get_items()
)
print(tweets_list)

Expected behaviour

Should return whole tweet.

Screenshots and recordings

No response

Operating system

Windows 11

Python version: output of python3 --version

Python 3.9.12

snscrape version: output of snscrape --version

0.6.2.20230320

Scraper

TwitterSearchScraper

How are you using snscrape?

CLI (snscrape ... as a command, e.g. in a terminal)

Backtrace

No response

Log output

No response

Dump of locals

No response

Additional context

No response

JustAnotherArchivist commented 1 year ago

Great job, Twitter: full_text does not contain the full text... Thank you for pointing this out! I thought I had tested this, but apparently not or not thoroughly enough.

AntoinePaix commented 1 year ago

It looks like there is a new key 'note_tweet' in the JSON which contains the full, untruncated text.

The 'full_text' key contains only the text shown before the 'show more' button in the frontend.
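A minimal sketch of the fallback this suggests: prefer the long text when it is present, otherwise use 'full_text'. The payload layout used here (a 'note_tweet' dict with a 'text' entry next to 'full_text') is a simplified assumption for illustration, not Twitter's actual schema, and best_text is a hypothetical helper name.

```python
from typing import Any, Mapping, Optional


def best_text(tweet: Mapping[str, Any]) -> Optional[str]:
    """Return the longest available text for a tweet payload.

    Assumed (hypothetical, simplified) layout:
      tweet["note_tweet"]["text"] -> full text of a long tweet, if present
      tweet["full_text"]          -> text shown before the 'show more' button
    """
    note = tweet.get("note_tweet")
    if isinstance(note, Mapping) and isinstance(note.get("text"), str):
        return note["text"]
    # No usable note_tweet entry: fall back to the regular text (may be None).
    return tweet.get("full_text")


# Made-up sample payloads, not real API responses.
short = {"full_text": "hello"}
long_ = {
    "full_text": "This is a great\u2026",
    "note_tweet": {"text": "This is a great moment for Hoboken."},
}

print(best_text(short))
print(best_text(long_))
```

In a real response the long text may be nested more deeply, so the lookup path would need adjusting to whatever the endpoint actually returns.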

AntoinePaix commented 1 year ago

It seems that the "note_tweet" key is not present in the advanced search JSON feed...

JustAnotherArchivist commented 1 year ago

... nor does it seem to give any indication whatsoever that there's a longer text. It even explicitly includes truncated: false.

This might be impossible to fix for searches. The tweet and profile scrapers can be fixed, though.

AntoinePaix commented 1 year ago

Twitter may well add the note_tweet key to the advanced search API in the next few days.

They did something similar in December when they added views_count: they first modified the profile and tweet APIs, then the advanced search 4 weeks later.

Write commented 1 year ago

Is there any workaround for now to grab note_tweet?

AntoinePaix commented 1 year ago

There is currently no workaround in the advanced search API as the full tweet is simply not returned. The only way would be to retrieve the tweet id and make a request to the TweetDetail API to retrieve the full tweet (but that would slow down the scraper significantly).

It is currently only possible to retrieve the note_tweet by scraping the UserTweets(AndReplies) API.
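The "retrieve the tweet id and request it again" idea could look roughly like the sketch below. It is written against plain (id, text) pairs and an injected fetch function so that it runs without network access; refetch_truncated and fake_fetch are hypothetical names. Treating a trailing "…" as a sign of truncation is only a heuristic assumption, since the search feed gives no explicit indication.

```python
from typing import Callable, Iterable, Iterator, Tuple

ELLIPSIS = "\u2026"  # the single-character "…" seen at the end of cut-off texts


def refetch_truncated(
    search_results: Iterable[Tuple[int, str]],
    fetch_text_by_id: Callable[[int], str],
) -> Iterator[Tuple[int, str]]:
    """Yield (tweet_id, text), re-fetching texts that look cut off.

    Only texts ending in an ellipsis trigger the extra request, which
    limits the slowdown to tweets that are probably long ones.
    """
    for tweet_id, text in search_results:
        if text.rstrip().endswith(ELLIPSIS):
            text = fetch_text_by_id(tweet_id)
        yield tweet_id, text


# Stand-in for a per-tweet lookup; records which IDs were requested.
calls = []


def fake_fetch(tweet_id: int) -> str:
    calls.append(tweet_id)
    return "This is a great moment for Hoboken."


results = list(
    refetch_truncated(
        [(1, "Short tweet"), (2, "This is a great" + ELLIPSIS)],
        fake_fetch,
    )
)
print(results)
print(calls)
```

With snscrape, the pairs could presumably be built from TwitterSearchScraper(...).get_items() and the fetch function could wrap TwitterTweetScraper(tweetId=...), as in the reproduction code above. A tweet that genuinely ends in "…" would cause one unnecessary request.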

Write commented 1 year ago

Should I wait for an snscrape update, or is this doable without one? I am guessing the new field needs to be added to snscrape before it is usable. Thanks

JustAnotherArchivist commented 1 year ago

The SearchTimeline endpoint does return the relevant data, at least if you have the right feature flags enabled.

JustAnotherArchivist commented 1 year ago

This is now implemented and seems to work correctly on all scrapers. However, it is a bit difficult to find good examples to test edge cases, such as user mentions in the expanded part, hashtags, or cashtags. I also couldn't find any community with note tweets. If you have examples of these, please let me know, and if you find that anything is still incorrectly or incompletely extracted, please file a bug report.

Write commented 1 year ago

Thanks for the update.

I can't get it to work. For example,

snscrape --max-results 1 --jsonl twitter-tweet 1656693151694725122 | jq

doesn't return note_tweet.

Tweet URL: https://twitter.com/Mediavenir/status/1656693151694725122

Same behavior for twitter-profile.

EDIT: Sorry, I missed that the fix can't yet be installed with pip install -U snscrape:

(2) > snscrape --version
snscrape 0.6.2.20230320

Just upgraded with pip install -U git+https://github.com/JustAnotherArchivist/snscrape.git and I can confirm that rawContent now contains the whole tweet.

Thanks a lot for your work again.
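For anyone who wants to check the same thing on their own output, here is a small sketch that reads --jsonl output and reports the length of rawContent per tweet. The rawContent field is the one mentioned above; the presence of an id field in each record is an assumption, the sample records are made up, and raw_content_lengths is a hypothetical helper name.

```python
import json
from typing import Any, Iterator, Tuple


def raw_content_lengths(jsonl: str) -> Iterator[Tuple[Any, int]]:
    """Yield (id, len(rawContent)) for each non-empty JSONL line."""
    for line in jsonl.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # A missing or null rawContent counts as length 0.
        yield record.get("id"), len(record.get("rawContent") or "")


# Made-up records standing in for real `snscrape --jsonl` output.
sample = "\n".join(
    [
        json.dumps({"id": 1, "rawContent": "short"}),
        json.dumps({"id": 2, "rawContent": "x" * 300}),
    ]
)

for tweet_id, length in raw_content_lengths(sample):
    print(tweet_id, length)
```

A length above 280 characters suggests that the expanded text of a long tweet made it into rawContent.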