[QUESTION] Scroll time on Docker

emacollins commented 1 year ago

I tried containerizing my script with this package in Docker (Dockerfile below). When it runs, I get user information back, but the scroll time does not seem to be taken into account. When I set a high scroll time and run on my host, it returns all of a user's videos, even if they have a lot. When I run the same code in my container, it returns only a fraction of the data (the first 30 videos). I am using data_dump_file, and I can see that the dump file is much smaller when running through Docker. Any ideas?

# Use an official Python runtime as a parent image
FROM python:3.10-slim

# Set the working directory to /app
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

# Install system dependencies for Playwright
RUN playwright install-deps

# Install Playwright browser dependencies
RUN python -m playwright install

# Run run.py when the container launches
CMD ["python", "run.py"]

Russell-Newton commented 1 year ago

What version of the library are you using? Also can you share the snippet of code you're using that doesn't seem to be working in Docker?

emacollins commented 1 year ago

Hi, I am no longer going the Docker route. But I did run into the same problem (only 30 videos scraped even with a scroll time set) on my local computer. It was working fine, but now, no matter how high the scroll_time value is set, it only gets the first 30 videos. I am doing the JSON dump, and the extra videos are usually in the "extras" field. That field is now blank.

I am using version 0.1.11.

I had usually been using scroll times between 10 and 300 sec, and it always seemed to return the extras pages with the full list of videos. Now it is not? Hmmm.

with TikTokAPI(scroll_down_time=scroll_time, navigation_retries=5, navigation_timeout=0,
               data_dump_file=filename) as api:
    try:
        user_object = api.user(user, video_limit=0)
    except:
        pass  # note: this silently swallows every error

CarlCochet commented 1 year ago

I currently have the same issue; I can't seem to load more than 30 videos no matter how I set scroll_down_time. I'm also using version 0.1.11.

Russell-Newton commented 1 year ago

Previously, this problem arose due to what seemed to be a bug in Playwright. The fix at that time was to switch the web driver to Firefox, but if you're both having issues, it might mean the issue is presenting itself in Firefox now. I don't have a whole lot of time to address this, being a full-time master's student, but I'll try to take a look soon.

vladisalv commented 1 year ago

Hi! Have the same issue. Tried:

driver: firefox/chromium
mobile emulate True/False
playwright 1.29/1.31/1.33

all these combinations. Nothing works out of the box :(

vladisalv commented 1 year ago

What can I use for scraping user video stats? Used LightVideo from user model. Can I get it the other way?

Russell-Newton commented 1 year ago

> What can I use for scraping user video stats? Used LightVideo from user model. Can I get it the other way?

If you have a User object and want to grab data on that user's videos, use the user.videos iterator. Iterating through this will load each video on demand, getting accurate statistics, video info, etc. If all you need are loose statistics, the LightVideos are faster.
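A minimal sketch of the two access paths described above. It assumes only what this thread mentions: that `user.videos` is iterable and that `user.videos.light_models` yields the lighter objects. The helper name `iter_user_videos` is hypothetical and not part of TikTokPy.

```python
from typing import Any, Iterator


def iter_user_videos(user: Any, full: bool = True) -> Iterator[Any]:
    """Yield a user's videos, either fully loaded or as light models.

    Hypothetical helper: `user` is expected to look like a TikTokPy User,
    i.e. `user.videos` is iterable and exposes `light_models`.
    """
    # Full videos are loaded on demand (slower, accurate statistics);
    # light models are faster but carry only loose statistics.
    source = user.videos if full else user.videos.light_models
    yield from source
```

Calling `iter_user_videos(user, full=False)` would be the faster option when loose statistics are enough.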

Russell-Newton commented 1 year ago

@emacollins @CarlCochet @vladisalv Please try again with version 0.1.12. I've added new parameters to the API constructors that you can try messing with:

scroll_down_delay sets the time (in seconds) before scrolling down is started. This is useful if your network is slow (e.g.: you're running TikTokPy in a Docker container)
scroll_down_iter_delay sets the time (in seconds) between scrolls. This can also be useful to tinker with if your network is slow.

I also suggest updating all dependencies.

Use:

scroll_down_delay now defaults to 1 second instead of an implicit 0 seconds. If this does not immediately fix your problems, my suggestions are as follows:

Try increasing scroll_down_iter_delay to 0.5 from the default 0.2. This will slow down the scrolling, which could help load the msToken cookie (see Explanation)
Try increasing scroll_down_delay to 3. This should also help load the msToken cookie.
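The two suggestions above could be passed to the constructor roughly like this. This is a sketch only: the parameter names come from this comment, while the import path `tiktokapipy.api` and the helper names `slow_network_settings` and `count_light_videos` are assumptions made for illustration.

```python
def slow_network_settings(scroll_down_time: float = 10) -> dict:
    """Keyword arguments for TikTokAPI that slow the scrolling down.

    Hypothetical helper; the values are the ones suggested in this thread,
    not tuned recommendations.
    """
    return {
        "scroll_down_time": scroll_down_time,  # total seconds spent scrolling
        "scroll_down_delay": 3,                # wait before the first scroll (default 1)
        "scroll_down_iter_delay": 0.5,         # wait between scrolls (default 0.2)
    }


def count_light_videos(username: str) -> int:
    # Imported lazily so the helper above works without TikTokPy installed.
    # The import path is an assumption about the package layout.
    from tiktokapipy.api import TikTokAPI

    with TikTokAPI(**slow_network_settings()) as api:
        user = api.user(username)
        # Consume the videos while the browser session is still open.
        return len(list(user.videos.light_models))
```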

Explanation:

Notably, TikTok provides browsers with an msToken cookie, and scrolling down doesn't work until this cookie is provided. If you scroll down too fast, you'll deadlock TikTok. Scrolling down further won't make any more API calls. The only way for this deadlock to be removed is to scroll back up and then back down. TikTokPy scrolls up a bit every other scroll-down, but if the iterative scroll-downs happen too fast, the deadlock might not let up. These two new parameters can alleviate these issues.
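To make the timing concrete, here is a rough simulation of the scrolling pattern described above. It is not TikTokPy's actual implementation: the function name `scroll_with_nudges`, the pixel distances, and the injected `scroll_by`, `sleep`, and `clock` callables are all assumptions chosen so the loop can be run without a browser.

```python
import time
from typing import Callable


def scroll_with_nudges(
    scroll_by: Callable[[int], None],
    scroll_down_time: float,
    scroll_down_delay: float = 1.0,
    scroll_down_iter_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Scroll down for `scroll_down_time` seconds and return the iteration count."""
    # Initial wait, giving the page time to receive the msToken cookie.
    sleep(scroll_down_delay)
    deadline = clock() + scroll_down_time
    iteration = 0
    while clock() < deadline:
        scroll_by(1000)  # scroll down (distance is an arbitrary assumption)
        if iteration % 2 == 1:
            # Every other iteration, scroll up a bit so that a stuck page
            # gets a chance to issue new requests when scrolled down again.
            scroll_by(-100)
        sleep(scroll_down_iter_delay)
        iteration += 1
    return iteration
```

A larger `scroll_down_iter_delay` means fewer scrolls within the same `scroll_down_time`, which is the trade-off the two parameters expose.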

vladisalv commented 1 year ago

Hi @Russell-Newton! I tested it outside Docker with a fast internet connection, but it still doesn't work. It scraped only 30 videos out of 300.

I looked at the code and saw that you scroll using evaluate. Maybe use the mouse wheel instead?
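For reference, the two scrolling approaches being compared might look like this with a Playwright page. `page.evaluate` and `page.mouse.wheel` are Playwright calls; the JavaScript expression and the helper names are assumptions, since the exact script TikTokPy evaluates is not shown in this thread.

```python
from typing import Any


def scroll_with_evaluate(page: Any, pixels: int) -> None:
    # Scrolls by running JavaScript in the page; no wheel event is dispatched.
    page.evaluate(f"window.scrollBy(0, {pixels})")


def scroll_with_wheel(page: Any, pixels: int) -> None:
    # Dispatches a mouse wheel event, closer to what a real user's input does.
    page.mouse.wheel(0, pixels)
```

Whether the wheel variant avoids the 30-video limit is untested here; it only shows what the suggested change would involve.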

Russell-Newton commented 1 year ago

What values have you set for scroll_down_time, scroll_down_delay, and scroll_down_iter_delay, @vladisalv?

vladisalv commented 1 year ago

As you suggested above I started with:

scroll_down_time: 10
scroll_down_delay: 3
scroll_down_iter_delay: 0.5

I increased it step by step and finished with these values:

scroll_down_time: 120
scroll_down_delay: 60
scroll_down_iter_delay: 10

But I still get only 30 videos out of more than 300.

As I understand it, the page does scroll, because by default I got just 27 videos. So it scrolls the page but stops after the first pagination request.

Russell-Newton commented 1 year ago

@vladisalv Please try again on version 0.1.13, if you aren't already using it. I made some changes that should hopefully fix an issue with collecting extra videos.

vladisalv commented 1 year ago

@Russell-Newton It still doesn't work.

To clarify how I use the code:

        with TikTokAPI(scroll_down_time=20, scroll_down_delay=5, scroll_down_iter_delay=5) as api:
            user_stat = api.user(self.username, video_limit=1)
            video_count = user_stat.stats.video_count

            videos = []
            scroll_time = 20
            while True:
                print("Scroll time:", scroll_time)
                user = api.user(self.username,
                    scroll_down_time=scroll_time,
                    scroll_down_delay=5,
                    scroll_down_iter_delay=5,
                )
                scroll_time *= 2
                videos.clear()

                for v in user.videos.light_models:
                    videos.append(v)

                print("len of videos:", len(videos))
                print("we are waiting for", video_count)

                if len(videos) == video_count:
                    break

Output:

Scroll time: 20
len of videos: 30
we are waiting for 302
...
Scroll time: 160
len of videos: 30
we are waiting for 302

Also, with the new 0.1.13 version I got this exception:

  File ".../.venv/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 96, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Protocol error (Network.getResponseBody): No data found for resource with given identifier
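The retry loop in this comment doubles the scroll time indefinitely if the expected count is never reached. A bounded variant could look like the sketch below. The names `collect_with_backoff` and `fetch` are hypothetical; `fetch` stands in for whatever call returns the videos for a given scroll time.

```python
from typing import Any, Callable, Iterable, List


def collect_with_backoff(
    fetch: Callable[[float], Iterable[Any]],
    expected: int,
    start_scroll_time: float = 20,
    max_scroll_time: float = 160,
) -> List[Any]:
    """Retry `fetch` with a doubling scroll time, keeping the largest result."""
    scroll_time = start_scroll_time
    best: List[Any] = []
    while scroll_time <= max_scroll_time:
        videos = list(fetch(scroll_time))
        if len(videos) > len(best):
            best = videos
        # Use >= rather than == in case the reported count is slightly off.
        if len(best) >= expected:
            break
        scroll_time *= 2
    return best
```

With the numbers from the output above, this would stop after the 160-second attempt and return the 30 videos instead of looping forever.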

Russell-Newton commented 1 year ago

@vladisalv could you try again on your system with the 38-post-list-scroll-failure branch's code? Just to help with my debugging.

pip install -U git+https://github.com/Russell-Newton/TikTokPy.git@38-post-list-scroll-failure

And then you can try something simple like:

with TikTokAPI(scroll_down_time=120) as api:
    api.user("tiktok")

If my hunch is correct, the message Something went wrong should get printed out if you only collect 30 or so videos. If this is the case, that'll give me some more information about what's going wrong so that I may be able to fix it. My hunch is that it's related to this Reddit post: https://www.reddit.com/r/Tiktokhelp/comments/wybfcg/something_went_wrong_error_on_tiktok_web_via/.

Looking at the network logs, it seems like the API requests that attempt to grab the user posts sometimes return with a completely empty body. I'm able to recreate this locally, but it's inconsistent. I suspect I may have to do an overhaul like I suggest in #21 in order to completely fix this issue.
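One way to make the empty-body case explicit, sketched under assumptions: the helper `parse_api_body` is hypothetical, and how TikTokPy actually reads and decodes responses is not shown in this thread.

```python
import json
from typing import Any, Optional


def parse_api_body(body: bytes) -> Optional[Any]:
    """Return the decoded JSON, or None when the body is empty or not valid JSON."""
    if not body or not body.strip():
        # The failure mode described above: a response with no content at all.
        return None
    try:
        return json.loads(body)
    except (ValueError, UnicodeDecodeError):
        # Not JSON (for example, an HTML error page).
        return None
```

A caller could treat a `None` result as a signal to retry the request rather than as an empty list of posts.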

Russell-Newton commented 1 year ago

I think the changes I've been working on with v0.2 might fix this issue. It could be worth checking out:

pip install -U git+https://github.com/Russell-Newton/TikTokPy.git@v0.2-overhaul

I removed the scrolling parameters, but it should (fingers crossed) work without any API constructor parameters. You should be able to get away with:

with TikTokAPI() as api:
    user = api.user("tiktok")
    for video in user.videos:
        ...  # do something with each video

This should iterate over all of a user's videos. You can limit this using the video_limit parameter in api.user or using the limit method attached to user.videos (for video in user.videos.limit(30)).

@emacollins @CarlCochet @vladisalv If one or all of you could try with the WIP changes, that would be very helpful. It works for me, but it's worth verifying that it works for you.

Russell-Newton / TikTokPy

[QUESTION] Scroll time on Docker #38