JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Content of tweet includes non written mentions #992

Open enzoferey opened 1 year ago

enzoferey commented 1 year ago

Describe the bug

When scraping the following tweet, the content returned starts with "@GitHubCopilot @tabnine @Replit @vercel Have you tried them ?" instead of just "Have you tried them ?" as expected.

How to reproduce

Use the TwitterTweetScraper and pass the tweet id 1674020720458776576.

Expected behaviour

There should be no non-written mentions at the beginning of the content.

Screenshots and recordings

No response

Operating system

macOS 13.4.1

Python version: output of python3 --version

3.9

snscrape version: output of snscrape --version

0.7.0.20230622

Scraper

TwitterTweetScraper

How are you using snscrape?

Module (import snscrape.modules.something in Python code)

Backtrace

No response

Log output

No response

Dump of locals

No response

Additional context

No response

JustAnotherArchivist commented 1 year ago

These mentions are technically part of the tweet text. This is exactly what Twitter returns:

...['tweet_results']['result']['legacy']['full_text'] = '@GitHubCopilot @tabnine @Replit @vercel Have you tried them ? What’s your opinion ? We read you 👀'

There is however also a display_text_range field. That should probably be taken into account for the renderedContent.
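As a rough illustration of what taking display_text_range into account could look like, here is a minimal sketch. It assumes the field is a [start, end] pair of code point indices into full_text, which is how Python string slicing counts as well. The helper name visible_text and the sample range values are made up for the example; the actual range Twitter returns for this tweet is not shown above, and this is not snscrape's code.

```python
def visible_text(legacy: dict) -> str:
    """Return the part of full_text that falls inside display_text_range.

    Assumption: display_text_range is [start, end] in Unicode code points.
    If the field is missing, the whole text is returned unchanged.
    """
    full_text = legacy["full_text"]
    start, end = legacy.get("display_text_range", (0, len(full_text)))
    return full_text[start:end]


# Hypothetical sample modelled on the tweet from this issue. The range is
# computed here for illustration, not copied from a real API response.
prefix = "@GitHubCopilot @tabnine @Replit @vercel "
body = "Have you tried them ? What’s your opinion ? We read you 👀"
sample = {
    "full_text": prefix + body,
    "display_text_range": [len(prefix), len(prefix + body)],
}

print(visible_text(sample))
```

With a range like this, the leading reply mentions fall outside the slice and only the written part of the tweet remains.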

enzoferey commented 1 year ago

Thanks for pointing it out @JustAnotherArchivist 🙏🏻

I did not realize that all accounts mentioned in a tweet are internally included in its replies (since you get notified about replies it makes sense 😄).

This might be a good opportunity for me to ask as well about the differences between content, renderedContent, and rawContent ?

JustAnotherArchivist commented 1 year ago

Forget that content exists; it's a deprecated alias from the early days that will be removed eventually. (It emits a warning if you try to use it.)

rawContent is the exact tweet text Twitter returns, while renderedContent is (roughly) the text as it would be rendered on Twitter's web interface. The only difference there currently is the replacement of links, so it doesn't exactly match. For example, replies start with a mention of the replied-to user, which gets rendered separately on the web interface.
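To make the link replacement concrete, here is a simplified sketch of the idea, not snscrape's actual implementation. It assumes each link entity carries the shortened url and a display_url, which is how Twitter's URL entities are usually shaped. The function name render_links and the sample values are hypothetical.

```python
def render_links(raw_text: str, url_entities: list[dict]) -> str:
    """Replace each shortened t.co link with its display form.

    Simplification: uses plain string replacement instead of the
    entity indices, so it is only an approximation of real rendering.
    """
    rendered = raw_text
    for entity in url_entities:
        rendered = rendered.replace(entity["url"], entity["display_url"])
    return rendered


# Hypothetical raw text and entity, for illustration only.
raw = "Docs are here: https://t.co/abc123"
entities = [{"url": "https://t.co/abc123", "display_url": "example.com/docs"}]

print(render_links(raw, entities))
```

In this picture, raw corresponds to rawContent and the return value to renderedContent. Leading reply mentions pass through untouched, which is the mismatch described above.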

enzoferey commented 1 year ago

By links replacement, you mean the https://t.co ones instead of the originals, right? I’m using Puppeteer to navigate those and get the actual URLs.

So as far as I understand, I should be using renderedContent, and there needs to be a fix so that it does not include mentions on replies. Is this right ?