Open cogscides opened 6 months ago
Hi, you are right, it was removed intentionally while migrating aiogram from v2 to v3, because that functionality was broken anyway. I will take a look and see if I can fix it.
I did some quick research: they removed photo URLs from the response, so we can only get a video URL now. To get the photos, we must query api/item/detail/, and we need to sign the request params with X-Bogus, a special param that TikTok uses to verify requests. We can get it from a browser (it is exposed as window.byted_acrawler.frontierSign), but I'm trying to stick to plain HTTP requests and avoid using a browser to scrape TikToks. So we would need to replicate the signature algorithm, which is obfuscated in the JS code.
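To illustrate the signing step described above, here is a minimal sketch. It rests on several assumptions: `sign` is a hypothetical callable standing in for whatever produces the X-Bogus value (the obfuscated algorithm itself is not reproduced here), the host `https://www.tiktok.com` and the `itemId` parameter name are guesses, and the signature is assumed to be appended as an `X-Bogus` query parameter.

```python
from typing import Callable, Mapping
from urllib.parse import urlencode

# Assumed full URL; the thread only names the path "api/item/detail/".
DETAIL_URL = "https://www.tiktok.com/api/item/detail/"


def build_signed_url(params: Mapping[str, str], sign: Callable[[str], str]) -> str:
    """Encode the params, sign the query string, and append the signature."""
    query = urlencode(params)
    # `sign` is a hypothetical stand-in, not the real X-Bogus algorithm.
    signature = sign(query)
    return f"{DETAIL_URL}?{query}&{urlencode({'X-Bogus': signature})}"


if __name__ == "__main__":
    # Placeholder signer for demonstration only; a real request would be rejected.
    def fake_sign(query: str) -> str:
        return "PLACEHOLDER"

    print(build_signed_url({"itemId": "7354851498569846049"}, fake_sign))
```

Passing the signer in as a callable keeps the HTTP code independent of how the signature is obtained, whether from a browser, an external service, or a reimplementation.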
I'm not sure if I can fix it. I will leave the issue open, so maybe someone else will want to do that, or I will find a workaround.
Regarding the X-Bogus: I understand the reason why it is not optimal to compute it using a browser. But could you perhaps consider implementing a workaround in the meantime?
A proof of concept using the headless Chrome via playwright:
    from playwright.async_api import async_playwright

    try:
        data = script["__DEFAULT_SCOPE__"]["webapp.video-detail"]["itemInfo"]["itemStruct"]
    except KeyError as ex:
        # TODO: Cache the browser instance?
        pw = await async_playwright().start()
        browser = await pw.chromium.launch()
        browser_page = await browser.new_page()
        await browser_page.goto('https://tiktok.com/')

        # TODO: Get the actual payload
        payload = 'foo:bar'
        signed = await browser_page.evaluate(f'() => window.byted_acrawler.frontierSign("{payload}")')

        # TODO: Use the signature
        print("Signed:", signed)

        await browser_page.close()
        await browser.close()
        await pw.stop()

        # TODO: Assign data instead
        raise NoDataError from ex
This feature could be optionally enabled via an env flag (e.g. EXPERIMENTAL_IMAGES_SUPPORT=true).
Would you be open to such a PR? (if I ever manage to make some time for it...)
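A minimal sketch of how such an opt-in flag might be read. The function name and the set of accepted truthy values are assumptions for illustration, not part of the project:

```python
import os
from typing import Mapping

# Assumed set of values treated as "enabled".
_TRUTHY = {"1", "true", "yes", "on"}


def images_support_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if EXPERIMENTAL_IMAGES_SUPPORT is set to a truthy value."""
    value = env.get("EXPERIMENTAL_IMAGES_SUPPORT", "")
    return value.strip().lower() in _TRUTHY


if __name__ == "__main__":
    if images_support_enabled():
        print("Experimental image support is enabled")
    else:
        print("Experimental image support is disabled (default)")
```

With the flag unset, the headless-browser path would never run, so the default setup would not need a browser installed.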
Actually, my previous comment doesn't make much sense. If we are to resort to using a headless browser, then why even bother with making an api/item/detail/ request when we can just open the page and scrape the image URLs from under the .swiper-wrapper class?
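A rough sketch of that scraping idea, working on HTML that has already been rendered (for example, the string returned by Playwright's `page.content()`). It assumes the gallery images are plain `<img src=...>` elements nested inside an element whose class list includes `swiper-wrapper`; the real page markup may differ, and the class and function names below are made up for the example:

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SwiperImageParser(HTMLParser):
    """Collect <img src> values found inside a .swiper-wrapper element."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # 0 means we are outside the wrapper
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.depth == 0:
            # Enter the wrapper when its class list contains "swiper-wrapper".
            if "swiper-wrapper" in (attr_map.get("class") or "").split():
                self.depth = 1
            return
        if tag == "img" and attr_map.get("src"):
            self.urls.append(attr_map["src"])
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def extract_image_urls(html: str) -> list[str]:
    """Return unique image URLs in document order."""
    parser = SwiperImageParser()
    parser.feed(html)
    return list(dict.fromkeys(parser.urls))
```

Duplicates are dropped because carousel widgets often clone slides for looping; whether TikTok's page does so is not confirmed here.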
@arslan-charyyev Thanks for the PoC example. Still, I would prefer not to have this feature at all rather than use a headless browser. With browsers the environment setup is much harder, even if we are using Docker, and I'm almost sure they will detect the browser and ask us to solve a captcha, and then we will need to bypass that. It's a rabbit hole that I don't want to go down. By the way, using api/item/detail/ to get the URLs still makes sense: it is easier on TikTok's servers if I just use the API rather than load the full HTML page. But yeah, it doesn't make much difference for us.
For anyone interested, the feature/images branch in my fork adds optional support for image downloads via the SignTok service. To enable the support, you have to provide a SIGNTOK_URL environment variable that points to a deployed SignTok instance. Both TeleTok and SignTok can be easily deployed on a single machine via docker compose. But you could also use the version deployed by the SignTok author: https://signtok.pabloferreiro.es
The fork also adds a DISABLE_NOTIFICATION environment config, and enables info-level logging, which shows useful logs from both the httpx and aiogram libraries. The logs are written to stdout, which makes them observable via docker.
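For illustration, info-level logging to stdout can be set up with the standard library roughly as follows. This is a sketch of the general technique, not necessarily how the fork does it; the function name and the format string are assumptions:

```python
import logging
import sys


def configure_logging() -> None:
    """Send INFO-and-above records from all loggers to stdout."""
    logging.basicConfig(
        level=logging.INFO,
        stream=sys.stdout,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )


if __name__ == "__main__":
    configure_logging()
    # Library loggers such as "httpx" and "aiogram" inherit the root level,
    # so their INFO records show up without further configuration.
    logging.getLogger("httpx").info("example record from a library logger")
```

Because the records go to stdout, they can be followed with `docker logs` on the running container.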
In the future I might add support for pulling the teletok image from the GHCR, so that it is not necessary to clone the fork.
I've made a new project which incorporates the ideas outlined above: https://github.com/arslan-charyyev/dinogram. This repo is listed in my list of acknowledgments, since I used some of its ideas in my own project.
Here is an example of how someone else is dealing with this problem: https://github.com/NearHuiwen/TiktokDouyinCrawler. It contains a copy of the obfuscated JS function from the site, which is executed through py_mini_racer (embedded V8 for Python). There is also https://github.com/PiotrDabkowski/Js2Py, which can not only run JS code but also translate it to Python.
I see that similar functionality was requested and seems to have been implemented in issue 25. But it isn't working, and I don't see any code that could handle this.
Photo gallery example: https://www.tiktok.com/@repostpls0/photo/7354851498569846049