d60 / twikit

Twitter API Scraper | Without an API key | Twitter Internal API | Free | Twitter scraper | Twitter Bot
https://twikit.readthedocs.io/en/latest/twikit.html
MIT License
1.45k stars 170 forks

Get entire Tweet thread #78

Closed ashutoshsaboo closed 6 months ago

ashutoshsaboo commented 6 months ago

Hi, I was wondering if this lib can be used in any way to unroll entire tweet threads into HTML/Markdown content of some sort, similar to what threadreaderapp and the like do.

Basically, if there's a thread of 10 tweets posted one after the other by the author, the lib should be able to get a list of those 10 tweets (along with their content) in serial order. I think the last time I checked the Twitter API, there was some thread_id/conversation_id (I forget which precisely) assigned to each tweet and a pointer to next_tweet in the same thread to accomplish this. But given this lib doesn't use the Twitter API, I was wondering whether the same is possible here. Is this possible currently? If not, would it be possible to add support for it in this lib?

Also many thanks for your great work on this lib! @d60

d60 commented 6 months ago

Hi, @ashutoshsaboo

Added Tweet.thread in version 1.5.7.

Example:

>>> tweet = client.get_tweet_by_id('1779883783321272691')
>>> tweet.thread
[<Tweet id="1779910037533536356">, <Tweet id="1779910453306560552">]

ashutoshsaboo commented 6 months ago

Woah! That was super quick! 🚀 So for any tweet I query, even if it's not the first tweet in the thread, will this still return all the tweets in the original thread in sequential order? @d60

d60 commented 6 months ago

@ashutoshsaboo tweet.thread only includes tweets that are below that tweet. If you want to retrieve tweets above it, you need to use tweet.reply_to. So, to get all tweets in the thread, it would be like this:

t = client.get_tweet_by_id('123456789')
thread = [*t.reply_to, t, *t.thread]

ashutoshsaboo commented 6 months ago

Great, this works perfectly, thanks! One small edge case I found for t.thread: could it return an empty list when no thread exists and it's a single tweet? It currently returns None, which needs extra handling before iterating. I think you are already doing that for t.reply_to, so it would be good to do the same for t.thread. Would this small change be possible? @d60
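Until that change lands, the None case can be handled on the caller's side. The following is only a sketch: `full_thread` is a hypothetical helper name, not part of twikit, and it assumes nothing beyond the `reply_to` and `thread` attributes mentioned in this thread. The stand-in objects are there so the snippet runs without a logged-in client.

```python
from types import SimpleNamespace


def full_thread(tweet):
    """Return the tweets above `tweet`, the tweet itself, then those below it."""
    # `or []` turns a None value into an empty list so unpacking never fails.
    above = tweet.reply_to or []
    below = tweet.thread or []
    return [*above, tweet, *below]


# Stand-in objects for illustration only; real code would pass twikit Tweet objects.
lone = SimpleNamespace(id="1", reply_to=None, thread=None)
first = SimpleNamespace(id="10", reply_to=[], thread=None)
last = SimpleNamespace(id="12", reply_to=None, thread=[])
middle = SimpleNamespace(id="11", reply_to=[first], thread=[last])

print([t.id for t in full_thread(lone)])
print([t.id for t in full_thread(middle)])
```

With a real client, the call would presumably be `full_thread(client.get_tweet_by_id(...))`.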

One small off-topic but relevant ask: is it possible to expose the source links in tweets as well? When someone links to a page in a tweet, the link is generally shortened automatically to a t.co link, and I'm wondering if there's a way to get the original source link. I think for images and videos you expose t.media, which helps there; can something similar be done for links? If I recall correctly, this was possible via the Twitter APIs. Given this accesses the source directly, it should ideally be possible here too?

d60 commented 6 months ago

@ashutoshsaboo Indeed, it would be better to have Tweet.thread return an empty list instead of None. I'll make that change in the next update. And you can use Tweet.urls to get the original URLs.

>>> tweet = client.get_tweet_by_id('1780709582958100825')

>>> print(tweet.text)
Whether you’re planning a drive or already on the road, we’re making it easier to find information about electric vehicle charging stations. Learn more ↓ https://t.co/mdVCGnYY2V
>>> print(tweet.urls)
[{'display_url': 'goo.gle/3U0OcPe', 'expanded_url': 'https://goo.gle/3U0OcPe', 'url': 'https://t.co/mdVCGnYY2V', 'indices': [154, 177]}]

ashutoshsaboo commented 6 months ago

Ohh, I couldn't find a urls attribute in the Tweet class docs here, which is why I was confused - https://twikit.readthedocs.io/en/latest/_modules/twikit/tweet.html#Tweet . I see it now as part of the constructor, but since it isn't in the docs I couldn't spot it earlier. Adding it to the docs should be a quick fix. @d60

This looks perfect but!
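Given the entry shape printed by tweet.urls above (`url`, `expanded_url`, `display_url`, `indices`), the t.co links in tweet.text can be swapped for the original links. A minimal sketch, assuming every entry follows that shape; `expand_urls` is a hypothetical helper, not a twikit function. It replaces by string match and deliberately ignores `indices`, since those offsets only apply to the unmodified original text.

```python
def expand_urls(text, urls):
    """Replace each shortened link in `text` with its expanded form."""
    for entry in urls or []:
        short = entry.get("url")
        full = entry.get("expanded_url")
        # Skip malformed entries rather than guessing at a replacement.
        if short and full:
            text = text.replace(short, full)
    return text


# Sample data: a shortened version of the tweet text and the entry shown above.
sample_text = "Learn more ↓ https://t.co/mdVCGnYY2V"
sample_urls = [
    {
        "display_url": "goo.gle/3U0OcPe",
        "expanded_url": "https://goo.gle/3U0OcPe",
        "url": "https://t.co/mdVCGnYY2V",
        "indices": [154, 177],
    }
]

print(expand_urls(sample_text, sample_urls))
# prints: Learn more ↓ https://goo.gle/3U0OcPe
```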

ashutoshsaboo commented 6 months ago

Hey @d60, I just noticed that for long tweets such as this one - https://twitter.com/EFarraro/status/1669480575542116353 - tweet.text seems to contain only part of the text, not all of it. Is there some bug causing that?

I believe browsing long tweets has no constraint even if your account is not verified, so given this accesses the source directly, it should ideally work. Any idea how to go about getting the content of such tweets?

d60 commented 6 months ago

@ashutoshsaboo You can retrieve the full text of a tweet using Tweet.full_text.

ashutoshsaboo commented 6 months ago

@d60 The problem is that full_text doesn't seem to have the t.co media URLs in place the way t.text does. Also, for short tweets I saw that t.full_text is often empty and only t.text contains the text. It looks like t.full_text has the text only for longer tweets, not otherwise (at least from my limited testing).

Can this be made uniform, so that t.full_text is valid in all cases? I don't know if you need to make any additional network requests to get full_text, but if it's a short tweet with no extra content, maybe just return t.text for t.full_text as well. Could full_text also have the t.co media/URL links in place, just as t.text does? @d60

Without that, it's kind of challenging to access the entire tweet in a unified manner.
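As a stopgap on the caller's side, the two attributes can be read through one accessor. This is only a sketch of the fallback described above: `best_text` is a hypothetical helper name, and the stand-in objects model the observed behaviour (an empty or missing full_text on short tweets) so the snippet runs without a client.

```python
from types import SimpleNamespace


def best_text(tweet):
    """Prefer full_text when it is populated, otherwise fall back to text."""
    full = getattr(tweet, "full_text", None)
    # Both None and "" count as "not populated".
    return full if full else tweet.text


# Stand-ins for illustration; real code would pass twikit Tweet objects.
short_tweet = SimpleNamespace(text="short tweet", full_text=None)
long_tweet = SimpleNamespace(text="long tweet, cut off", full_text="long tweet, complete text")

print(best_text(short_tweet))
print(best_text(long_tweet))
```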

ashutoshsaboo commented 6 months ago

Any idea about the above? Can you help, @d60?

d60 commented 6 months ago

@ashutoshsaboo Hi, sorry for the delay.

I've released Version 1.5.11, and as you suggested, I've made adjustments to Tweet.full_text and Tweet.urls. Now, Tweet.full_text will contain Tweet.text even for short tweets. Additionally, Tweet.urls will include all URLs within the tweet.
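For the Markdown unrolling asked about at the top of this issue, one possible shape is sketched below. It assumes the tweets are already in thread order (for example from `[*t.reply_to, t, *t.thread]`) and that `full_text` is populated for every tweet, as described for version 1.5.11. `thread_to_markdown` and the heading layout are illustrative choices, not part of twikit.

```python
from types import SimpleNamespace


def thread_to_markdown(tweets):
    """Render an ordered list of tweets as numbered Markdown sections."""
    if not tweets:
        return ""
    total = len(tweets)
    sections = []
    for position, tweet in enumerate(tweets, start=1):
        # One heading per tweet, e.g. "### 2/10", followed by its text.
        sections.append(f"### {position}/{total}\n\n{tweet.full_text.strip()}")
    return "\n\n".join(sections) + "\n"


# Stand-ins for illustration; real code would pass twikit Tweet objects.
demo_thread = [
    SimpleNamespace(full_text="first tweet"),
    SimpleNamespace(full_text="second tweet"),
]

print(thread_to_markdown(demo_thread))
```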