Russell-Newton / TikTokPy

Extract data from TikTok without needing any login information or API keys.
https://pypi.org/project/tiktokapipy/
MIT License

Allow selection of exactly which parts of a model need to be retrieved #13

Closed hopperelec closed 1 year ago

hopperelec commented 1 year ago

I'm trying to write a program that gives me desktop notifications when someone I'm following posts a new TikTok. That means fetching a lot of users and videos, and searching through them all takes around 10 minutes, so it would be significantly quicker to just check manually. I suspect a lot of that time is spent retrieving data I'm not interested in:

I'd imagine that if I could restrict TikTokPy to retrieving only the data I need, it would significantly improve how quickly I can find new videos. And I doubt this is some niche case that applies only to me, because I can think of few programs that would need all of the available data.
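For context, the core of the notification program is just diffing the video IDs seen on each poll against the IDs seen previously. A minimal sketch of that logic, with the tiktokapipy calls shown only as a hypothetical usage comment (names like `api.user` are taken from the library's docs; network access would be required to actually run them):

```python
from typing import Iterable, List, Set


def find_new_video_ids(current_ids: Iterable[int], seen_ids: Set[int]) -> List[int]:
    """Return the IDs present in current_ids that have not been seen before."""
    return [vid for vid in current_ids if vid not in seen_ids]


# Hypothetical usage with tiktokapipy (requires network access):
# from tiktokapipy.api import TikTokAPI
# with TikTokAPI() as api:
#     user = api.user("some_username")
#     ids = [video.id for video in user.videos]
#     new = find_new_video_ids(ids, seen_ids)
#     seen_ids.update(new)  # notify on each entry in `new`
```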

Russell-Newton commented 1 year ago

TikTokPy is slow for this use case because it loads every page you visit. There are a couple of reasons why I did this instead of just making API calls:

For getting the IDs quickly, you could use user.videos._light_models. This is something I hid from the documentation, but it's the internal list used by the iterator. It contains partial models.video.Video models that only contain the video id, slightly inaccurate stats, and the video create_time. I can definitely make this clearer in the documentation, because the video.challenges iterator has the same field.
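To illustrate, here is a sketch of picking the newest video out of those partial models. `LightVideo` below is a stand-in for the partial models.video.Video described above (only the fields the comment mentions), and the tiktokapipy usage is shown as a comment since `_light_models` is a private attribute and network access would be needed:

```python
from collections import namedtuple

# Stand-in for the partial Video model: only id and create_time are
# reliable, per the maintainer's comment above.
LightVideo = namedtuple("LightVideo", ["id", "create_time"])


def newest_video_id(light_models):
    """Return the id of the most recently created partial video model."""
    return max(light_models, key=lambda v: v.create_time).id


# Hypothetical usage (requires network access):
# from tiktokapipy.api import TikTokAPI
# with TikTokAPI() as api:
#     user = api.user("some_username")
#     latest = newest_video_id(user.videos._light_models)
```

Because `_light_models` is private, it may change or disappear in a future release; treat this as a stopgap until a public API exists.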

I had two reasons for doing the iterators this way:

  1. The Video data collected from a UserPage is actually incomplete and inaccurate. For example, stats are only accurate to about 4 significant figures. The same inaccuracy applies to models.user.User and models.challenge.Challenge objects collected on VideoPages, and to the videos under ChallengePages.
  2. If there are a lot of videos you want to grab, the memory footprint could stay large for a while.
hopperelec commented 1 year ago

Yes, that's fair, and I have already run into some TikTok-related errors writing this program, which suggests I'm being affected by some bot-prevention or rate-limiting measures. (In fact, I've been given a few bot-related errors in the past just using TikTok from the browser, so it must be pretty strict!) I'd still imagine some requests are being made that aren't needed for the data you want to access, though, or at the very least that not all of the information needs to be parsed, since TikTok can't tell whether you parse it or not. I know comments are being parsed, for example, because every time I start my program I get an error message saying the comments couldn't be loaded and that I can try again.

Russell-Newton commented 1 year ago

Currently, all the requests are just requests made by the frontend scripts that run when you open a TikTok link. TikTokPy simply loads a link and intercepts the data those frontend scripts collect from TikTok's API.

The comment thing is a known issue, #9. For whatever reason, TikTok blocks every Playwright instance I have tried from loading comments, even if you perform a manual launch and navigation without TikTokPy. Specifically, when the frontend script that populates the comment section runs, it can't perform its normal requests due to a CORS error.

I've looked into autogenerating the Pydantic models used to store the scraped data so that their schemas could be determined at runtime. It's kind of a pain and seems pretty incompatible with any sort of IntelliSense or autodoc features. It would definitely be interesting to dig into, though, to see whether such a thing is possible.

Again, in the meantime you can iterate over the partial video objects (ID, stats, and create_time) with user.videos._light_models.

hopperelec commented 1 year ago

Alright, thank you!

Russell-Newton commented 1 year ago

This should also be faster with the regular iteration strategy if you make use of the updated scroll_down_time functionality. You can now specify a default for all API calls, or set it for just a single call. That means you could leave the default unset (so API calls made during iteration won't also scroll down) and set a scroll time only for the user call.

This is probably why it was taking so long for you: whenever it iterated to the next video, it had to scroll down again.