[Feature] Download image collection

cogscides commented 4 months ago

I see that similar functionality was requested and apparently implemented in issue 25, but it isn't working, and I don't see any code that handles it.

Photo gallery example: https://www.tiktok.com/@repostpls0/photo/7354851498569846049

captaincolonelfox commented 4 months ago

Hi, you are right. It was removed intentionally during the migration from aiogram v2 to v3, because that functionality was broken anyway. I will take a look and see if I can fix it.

captaincolonelfox commented 4 months ago

I did some quick research: they removed photo URLs from the response, so we can only get a video URL now. To get the photos we must query api/item/detail/, and we need to sign the request params with X-Bogus, a special param that TikTok uses to verify requests. We can get it from the browser (it is exposed as window.byted_acrawler.frontierSign), but I'm trying to stick to plain HTTP requests and avoid using a browser to scrape TikToks. So we need to replicate the signature algorithm, which is obfuscated in the JS code.
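To make the moving parts concrete, here is a minimal sketch of how a signed api/item/detail/ URL might be assembled once some signer is available. The host, the `itemId` parameter name, the `build_signed_url` helper, and the `dummy_sign` stand-in are all assumptions for illustration; the real X-Bogus algorithm is not reproduced here.

```python
from typing import Callable
from urllib.parse import urlencode

# Endpoint path taken from the comment above; the host is an assumption.
DETAIL_ENDPOINT = "https://www.tiktok.com/api/item/detail/"


def build_signed_url(params: dict[str, str], sign: Callable[[str], str]) -> str:
    """Append an X-Bogus value, computed over the query string, to the URL."""
    query = urlencode(params)
    # `sign` is a placeholder for whatever produces the signature
    # (a ported algorithm, an embedded JS engine, or an external service).
    x_bogus = sign(query)
    return f"{DETAIL_ENDPOINT}?{query}&{urlencode({'X-Bogus': x_bogus})}"


def dummy_sign(query: str) -> str:
    """Stand-in signer for demonstration only; NOT the real algorithm."""
    return f"fake-{len(query)}"


# Hypothetical parameter name; the real request likely needs more params.
url = build_signed_url({"itemId": "7354851498569846049"}, dummy_sign)
print(url)
```

Keeping the signer as a plain callable means the HTTP code would not need to change if the signature source changes later.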

I'm not sure if I can fix it. I will leave the issue open in case someone else wants to take it on, or until I find a workaround.

arslan-charyyev commented 4 months ago

Regarding the X-Bogus: I understand the reason why it is not optimal to compute it using a browser. But could you perhaps consider implementing a workaround in the meantime?

A proof of concept using headless Chromium via Playwright:

from playwright.async_api import async_playwright

try:
    data = script["__DEFAULT_SCOPE__"]["webapp.video-detail"]["itemInfo"]["itemStruct"]
except KeyError as ex:
    # TODO: Cache the browser instance?
    pw = await async_playwright().start()
    browser = await pw.chromium.launch()
    browser_page = await browser.new_page()
    await browser_page.goto('https://tiktok.com/')

    # TODO: Get the actual payload
    payload = 'foo:bar'
    # Pass the payload as an argument instead of interpolating it into the JS source
    signed = await browser_page.evaluate('(payload) => window.byted_acrawler.frontierSign(payload)', payload)

    # TODO: Use the signature
    print("Signed:", signed)

    await browser_page.close()
    await browser.close()
    await pw.stop()

    # TODO: Assign data instead
    raise NoDataError from ex

This feature could be optionally enabled via an env flag (e.g. EXPERIMENTAL_IMAGES_SUPPORT=true)
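A minimal sketch of how such a flag might be read, assuming a hypothetical `env_flag` helper and an assumed set of accepted truthy spellings (neither is taken from the project):

```python
import os

# Spellings treated as "enabled"; this set is an assumption, not project behavior.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean feature flag from the environment."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


# The experimental code path would only run when this is True.
IMAGES_ENABLED = env_flag("EXPERIMENTAL_IMAGES_SUPPORT")
```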

Would you be open to such a PR? (If I ever manage to make some time for it...)

arslan-charyyev commented 4 months ago

Actually, my previous comment doesn't make much sense. If we are going to resort to a headless browser, why even bother making an api/item/detail/ request when we can just open the page and scrape the image URLs from under the .swiper-wrapper class?
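As a rough sketch of the scraping step, the following collects image URLs from inside a .swiper-wrapper element using only the standard library. It assumes the rendered HTML has already been obtained (for example from Playwright's page.content()); the `SwiperImageParser` class and the sample markup are made up for illustration, and the real page structure may differ.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SwiperImageParser(HTMLParser):
    """Collect <img src> values nested inside an element with class 'swiper-wrapper'."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # nesting depth inside the wrapper; 0 means outside
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self.depth:
            if tag == "img" and attr.get("src"):
                self.urls.append(attr["src"])
            if tag not in VOID_TAGS:
                self.depth += 1
        elif "swiper-wrapper" in (attr.get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


# Made-up markup standing in for the rendered gallery page.
sample = """
<div class="swiper-wrapper">
  <div class="swiper-slide"><img src="https://example.com/1.jpeg"></div>
  <div class="swiper-slide"><img src="https://example.com/2.jpeg"></div>
</div>
<img src="https://example.com/avatar.jpeg">
"""
parser = SwiperImageParser()
parser.feed(sample)
print(parser.urls)
```

Images outside the wrapper, such as the avatar in the sample, are ignored because the depth counter is zero there.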

captaincolonelfox commented 4 months ago

@arslan-charyyev Thanks for the PoC example. Still, I would rather not have this feature than use a headless browser. With browsers the environment setup is much harder, even with Docker, and I’m almost sure they will detect the browser and ask us to solve a captcha, which we would then need to bypass. It’s a rabbit hole that I don’t want to go down. By the way, using api/item/detail/ to get the URLs still makes sense: it is easier on TikTok's servers if I just use the API rather than load a full HTML page. But yeah, that doesn't matter much for us.

arslan-charyyev commented 4 months ago

For anyone interested, the feature/images branch in my fork adds optional support for image downloads via the SignTok service. To enable it, you have to provide a SIGNTOK_URL environment variable that points to a deployed SignTok instance. Both TeleTok and SignTok can be easily deployed on a single machine via docker compose. You could also use the instance deployed by the SignTok author: https://signtok.pabloferreiro.es

The fork also adds a DISABLE_NOTIFICATION environment setting and enables info-level logging, which shows useful logs from both the httpx and aiogram libraries. The logs are written to stdout, which makes them observable via docker.
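A minimal sketch of what this kind of setup can look like, not the fork's actual code: the `load_signtok_url` helper, the log format, and the sample URL in the comment are assumptions for illustration.

```python
import logging
import os
import sys
from typing import Mapping, Optional

# Send INFO-level records (including those emitted by library loggers)
# to stdout so that `docker logs` can show them.
logging.basicConfig(
    level=logging.INFO,
    stream=sys.stdout,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)


def load_signtok_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured SignTok URL, or None when it is unset or blank."""
    url = env.get("SIGNTOK_URL", "").strip()
    return url or None


# Image support is treated as optional: it is on only when a URL is configured,
# e.g. SIGNTOK_URL=http://signtok:8080 (hypothetical value).
SIGNTOK_URL = load_signtok_url()
logging.getLogger(__name__).info("Image support enabled: %s", SIGNTOK_URL is not None)
```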

In the future I might add support for pulling the teletok image from the GHCR, so that it is not necessary to clone the fork.

captaincolonelfox commented 4 months ago

Here is an example of how someone else is dealing with this problem: https://github.com/NearHuiwen/TiktokDouyinCrawler

It contains a copy of the obfuscated JS function from the site, which is executed through py_mini_racer (embedded V8 for Python). There is also https://github.com/PiotrDabkowski/Js2Py, which can not only run JS code but also translate it to Python.

captaincolonelfox / TeleTok

[Feature] Download image collection #28