drawrowfly / tiktok-scraper

TikTok Scraper. Download video posts, collect user/trend/hashtag/music feed metadata, sign URL and etc.
4.44k stars 805 forks

0 posts in collector #663

Open AyQWERTY opened 3 years ago

AyQWERTY commented 3 years ago

Describe the bug

In my project I collect basic profile information (subscribers, likes) and, for each video, views, comments and shares. The project is for personal use, so the request volume is small: about 15 profileInfo requests and 15 user requests (about 35 posts per account) per hour.

After half a day or a day, the scraper starts returning 0 posts. I tried running the script on two VPSes (different IPs), and both stopped collecting posts after that time (collector.length = 0). I put a breakpoint where tiktok-scraper receives the response to see what, if anything, comes back. It does receive a response, but one with no body content, just the head, scripts and so on. (If needed, I can send you the response.)
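
For illustration, a caller could treat an empty collector as a failure instead of a valid result, so the problem surfaces immediately rather than as silently missing data. This is only a sketch: `fetchPostsOrThrow` and the `EMPTY_COLLECTOR` code are hypothetical names, and `fetchPosts` stands in for whatever call resolves to an object with a `collector` array.

```js
// Hypothetical guard: an empty collector is reported as an error, not as "no posts".
// `fetchPosts` is any async function resolving to { collector: [...] },
// for example () => TikTokScraper.user('USERNAME', { number: 35 }) (assumed usage).
async function fetchPostsOrThrow(fetchPosts, expectedMinimum = 1) {
  const result = await fetchPosts();
  const collector = Array.isArray(result && result.collector) ? result.collector : [];
  if (collector.length < expectedMinimum) {
    const error = new Error(
      `collector has ${collector.length} posts, expected at least ${expectedMinimum}`
    );
    error.code = 'EMPTY_COLLECTOR'; // hypothetical code so callers can tell this case apart
    throw error;
  }
  return collector;
}
```

A caller can then log or alert on `EMPTY_COLLECTOR` and record the time it first appears, which is the kind of timestamp that helps when comparing notes across machines.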

To Reproduce

Just try to get user info throughout the day.

Info (VPS#1 | VPS#2)

DanKaplanSES commented 3 years ago

I'm experiencing errors around getting videos and user metadata, too. I run through a proxy. We use both the CLI and Module.

rileymiller commented 3 years ago

Seeing this too

BK-Go-Python commented 3 years ago

I'm experiencing errors around getting user metadata, too

boscacci commented 3 years ago

I was getting 0 data back, even after setting --session sid_tt=MY_SID_TT but I turned on NordVPN in "obfuscated" mode (which is supposed to avoid looking like a VPN) and it's flowing now

AyQWERTY commented 3 years ago

I was getting 0 data back, even after setting --session sid_tt=MY_SID_TT but I turned on NordVPN in "obfuscated" mode (which is supposed to avoid looking like a VPN) and it's flowing now

As a temporary solution, of course, not bad, but setting up somehow VPN on VPS which also hosts various websites, I think, would be a bad idea 😁

asportnoy commented 3 years ago

+1. Using rotating residential proxies. @drawrowfly please take a look.

asportnoy commented 3 years ago

Issue just fixed itself for me. Anyone else? Edit: nevermind, now failing again.

asportnoy commented 3 years ago

This was not the fix and was just a coincidence.

Original fix

I found a fix which I released to my app about half an hour ago. So far it seems to be working pretty well. The "options" on the `tt_webid_v2` cookie seems to have caused some issues, not sure why. Removing them seems to fix it. I can't make a pull request right now but someone else is welcome to do so.

As a temporary fix, go into the node module and edit `build/core/TikTok.js`. Replace line 167 with the following:

```js
this.cookieJar.setCookie(`tt_webid_v2=69${helpers_1.makeid(17)};`, 'https://tiktok.com');
```

Let me know if this works for anyone or if you notice any issues with it. Enjoy!

Tip: If you're using the CLI, you can find the location of your `node_modules` folder by running `npm root -g`.

asportnoy commented 3 years ago

So the issue has come back for me intermittently

From what I can tell, the main thing we need to implement is the ttwid cookie (note: different from tt_webid_v2). The cookie is generated from a post request to https://www.tiktok.com/ttwid/check/ (set-cookie header). That request requires a fid parameter but I can't tell what generates that.

I also noticed this new script which seems relevant to the changes: https://www.tiktok.com/acrawler/webmssdk.js I was able to partially de-obfuscate that with http://jsnice.org/.

Hopefully this information was helpful.

asportnoy commented 3 years ago

Related: https://github.com/davidteather/TikTok-Api/issues/685

AyQWERTY commented 3 years ago

I found a fix which I released to my app about half an hour ago. So far it seems to be working pretty well. The "options" on the tt_webid_v2 cookie seems to have caused some issues, not sure why. Removing them seems to fix it. I can't make a pull request right now but someone else is welcome to do so.

As a temporary fix, go into the node module and edit build/core/TikTok.js. Replace line 167 with the following:

this.cookieJar.setCookie(`tt_webid_v2=69${helpers_1.makeid(17)};`, 'https://tiktok.com');

Let me know if this works for anyone or if you notice any issues with it. Enjoy!

Tip: If you're using the CLI, you can find the location of your node_modules folder by running npm root -g.

All my requests stopped working last night. Previously at least getUserProfileInfo worked; now even that fails for me. All calls end with "Error: Can't extract user metadata from the html page. Make sure that user does exist and try to use proxy". I tried your fix, but unfortunately it didn't work for me. So I routed all TikTok traffic through Proxifier and it worked (not a solution, but at least it worked for me ;)).

rzv commented 3 years ago

Seeing this as well. Tested with mobile phone IPs (which worked best in testing so far - no blocking whatsoever).

asportnoy commented 3 years ago

I heavily reduced my request rate and I seem to be having a decent success rate now. Don't want to speak too soon but looks promising.

isaackogan commented 3 years ago

So far most "solutions" seem to be using proxies, which is what I thought we're all already doing... If you reduce requests down to the ground and constantly rotate proxies, sure you won't get blocked, but you're also going to be rotating more frequently costing more $$ and on top of that getting data less frequently too.

garoto commented 3 years ago

Youtube does the same thing to IPs that request too much data during an undisclosed threshold amount of time. Just put your script behind a reasonable timed sleep call.
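
A minimal sketch of the "timed sleep" idea, assuming the scraper calls are wrapped as async functions and run strictly one after another. `runThrottled` is a hypothetical helper, not part of tiktok-scraper, and the right delay is unknown since the threshold is undisclosed.

```js
// Resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs async tasks sequentially with a fixed pause between them.
// `wait` is injectable so the pause can be replaced in tests.
async function runThrottled(tasks, delayMs, wait = sleep) {
  const results = [];
  for (let i = 0; i < tasks.length; i += 1) {
    if (i > 0) await wait(delayMs); // no pause before the first task
    results.push(await tasks[i]());
  }
  return results;
}
```

Each entry in `tasks` would be a function such as `() => someScraperCall()`; a fixed pause like this only spaces requests out and does nothing about an IP that is already throttled.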

rzv commented 3 years ago

Youtube does the same thing to IPs that request too much data during an undisclosed threshold amount of time. Just put your script behind a reasonable timed sleep call.

I have 30 minutes between calls

garoto commented 3 years ago

I have 30 minutes between calls

Then your originating IP is already throttled down perhaps?

isaackogan commented 3 years ago

I have 30 minutes between calls

Then your originating IP is already throttled down perhaps?

They've always done this, but TikTok seems to be more strict now for user posts, i.e something has changed recently and we just don't know what

asportnoy commented 3 years ago

My guess is some missing or invalid parameter is what's flagging it, so we should figure out what that parameter is and how to generate/implement it.

From what I can tell, there are 2 different cookies added: ttwid and R6kq3TV7. There is also a msToken and X-Bogus query param, although msToken is sometimes blank
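
One way to test this guess is to compare a request captured from the browser with the one the scraper sends and list which of these values are absent. The sketch below assumes only the names mentioned above; `findMissingSignals` is a hypothetical helper, and whether any of these values is actually required is not established here.

```js
// Names taken from the comment above; whether they are required is an open question.
const EXPECTED_QUERY_PARAMS = ['msToken', 'X-Bogus'];
const EXPECTED_COOKIES = ['ttwid', 'R6kq3TV7'];

// Hypothetical helper: reports which expected query params and cookies a request lacks.
// A param that is present but blank (as msToken sometimes is) is reported too.
function findMissingSignals(requestUrl, cookieHeader = '') {
  const query = new URL(requestUrl).searchParams;
  const cookieNames = cookieHeader
    .split(';')
    .map((pair) => pair.split('=')[0].trim())
    .filter(Boolean);
  return {
    missingOrBlankParams: EXPECTED_QUERY_PARAMS.filter((name) => !query.get(name)),
    missingCookies: EXPECTED_COOKIES.filter((name) => !cookieNames.includes(name)),
  };
}
```

Running it on both the browser's request and the scraper's request gives a quick diff of what the scraper is not sending.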

rzv commented 3 years ago

I have 30 minutes between calls

Then your originating IP is already throttled down perhaps?

No - the tests were made with 4G ips (tethering from several colleagues phones) like this:

  1. Local script on my laptop (the script that used to work fine)
  2. Connect to colleague's phone wifi
  3. Test the script -> fail
  4. Repeat for several phone providers and locations -> fail

We still see random successes in production, but the fail rate is ~80-90%. The thing that surprises me is that it fails for 'fresh' 4G IPs. Also, I'm not talking about proxies here, and I got consistent fails.

Sorry for the incoherent bug report, but we still struggle to see a pattern.

ThatSameer commented 3 years ago

Hi,

This is the first time I'm writing on a Github issue so apologies if my reply is incomplete.

I am also having this issue whereby I am now getting 0 posts in the collector using the .user function. I use a rotating proxy. The issue began around 2nd September 18:00 UTC. All of my proxies now are getting this.

I hope the time stamp of when it started may be useful for this issue. Thanks.

isaackogan commented 3 years ago

I use the rotating proxy data can be requested, if your is residential proxies will be sealing IP

what

BK-Go-Python commented 3 years ago

I use the rotating proxy data can be requested, if your is residential proxies will be sealing IP

what

Rotating proxy can be used to request data, if your agent is residential proxies, then your IP will be limited

isaackogan commented 3 years ago

Rotating proxy can be used to request data, if your agent is residential proxies, then your IP will be limited

This is not a proxy issue. We have clearly established that this is an issue with the library.

roman-hrybinchuk commented 3 years ago

Any update with this issue, any fix or so ?

drawrowfly commented 3 years ago

Any update with this issue, any fix or so ?

patience

boscacci commented 3 years ago

I was getting 0 data back, even after setting --session sid_tt=MY_SID_TT but I turned on NordVPN in "obfuscated" mode (which is supposed to avoid looking like a VPN) and it's flowing now

This ^ isn't doing it for me anymore. Switching to a new NordVPN server works only for one or two API calls. Implementing an exponential backoff didn't help. I have been looking for a way to programmatically make NordVPN switch VPN servers on macOS client but that functionality only seems widely available for *nix or windows.

I even tried creating a Tor proxy on my localhost that rotates IP's every couple requests and feeding that to tiktok-scraper, like --proxy socks5://localhost:9050. It seems like I was successful in creating a local tor proxy, like it works for my web browser, but it doesn't work with TikTok-API. Maybe I need to find a way to pass verifyfP or something.
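
For reference, exponential backoff of the kind mentioned above could look like the sketch below. `withBackoff` and its option names are hypothetical, and as noted in the comment, retrying this way did not get past the blocking; it only shows what was tried.

```js
// Resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retries `operation`, doubling the pause after each failure.
// Pauses are baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
async function withBackoff(
  operation,
  { retries = 4, baseMs = 1000, maxMs = 60000, wait = sleep } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === retries) break; // out of retries, give up
      await wait(Math.min(maxMs, baseMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Backoff helps when failures are transient; if the server is rejecting the request itself rather than its rate, every retry fails the same way, which matches the behaviour reported here.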

BK-Go-Python commented 3 years ago

i find msToken in "https://www.tiktok.com/acrawler/webmssdk.js"

function () { let _0x5a5c69 = _0x4c1915(); _0x5a5c69 && (_0x366f76['msToken'] = _0x5a5c69, _0x366f76['msStatus'] = _0x5f1248['asgw']), setTimeout(function () { _0x31f5e3(), _0xeabaae(), _0x1e84bf(), _0x57197e(); }, -0x670 + 0x7be * 0x4 + -0xcd0), _0x2f5cef(['/web/report']); }()

but i can't run it

BK-Go-Python commented 3 years ago

i find msToken in "https://www.tiktok.com/acrawler/webmssdk.js"

function () { let _0x5a5c69 = _0x4c1915(); _0x5a5c69 && (_0x366f76['msToken'] = _0x5a5c69, _0x366f76['msStatus'] = _0x5f1248['asgw']), setTimeout(function () { _0x31f5e3(), _0xeabaae(), _0x1e84bf(), _0x57197e(); }, -0x670 + 0x7be * 0x4 + -0xcd0), _0x2f5cef(['/web/report']); }()

but i can't run it

It's too difficult

isaackogan commented 3 years ago

i find msToken in "https://www.tiktok.com/acrawler/webmssdk.js"

function () { let _0x5a5c69 = _0x4c1915(); _0x5a5c69 && (_0x366f76['msToken'] = _0x5a5c69, _0x366f76['msStatus'] = _0x5f1248['asgw']), setTimeout(function () { _0x31f5e3(), _0xeabaae(), _0x1e84bf(), _0x57197e(); }, -0x670 + 0x7be * 0x4 + -0xcd0), _0x2f5cef(['/web/report']); }()

but i can't run it

The issue isn't necessarily the msToken, and either way that isn't actually the token, but a reference to it

BK-Go-Python commented 3 years ago

either

So where should we start

ncsft commented 3 years ago

I was getting 0 data back, even after setting --session sid_tt=MY_SID_TT but I turned on NordVPN in "obfuscated" mode (which is supposed to avoid looking like a VPN) and it's flowing now

This ^ isn't doing it for me anymore. Switching to a new NordVPN server works only for one or two API calls. Implementing an exponential backoff didn't help. I have been looking for a way to programmatically make NordVPN switch servers on macOS but that functionality only seems widely available for nix or windows.

I even tried creating a tor proxy on my localhost and feeding that to tiktok-scraper, like --proxy socks5://localhost:9050. It seems like I was successful in creating a local tor proxy, like it works for my web browser, but it doesn't work with TikTok-API. Maybe I need to find a way to pass verifyfP or something.

They detecting tiktok-scraper and block by ip address. With fresh proxy you can make 1-2 requests, then you are banned.

boscacci commented 3 years ago

They detecting tiktok-scraper and block by ip address. With fresh proxy you can make 1-2 requests, then you are banned.

Right; so I tried rotating my Tor IP's with each request, and I still haven't gotten one request to work through Tor. It seems like TikTok can tell the request is coming over Tor, or something is acting unexpectedly with TikTok-API's proxy feature.

fwiw, I get this warning (node:2983) [DEP0123] DeprecationWarning: Setting the TLS ServerName to an IP address is not permitted by RFC 6066. This will be ignored in a future version. when I pass TikTok-API my localhost Tor proxy

JimJones13 commented 3 years ago

I have 30 minutes between calls

Then your originating IP is already throttled down perhaps?

No - the tests were made with 4G ips (tethering from several colleagues phones) like this:

1. Local script on my laptop (the script that used to work fine)

2. Connect to colleague's phone wifi

3. Test the script -> fail

4. Repeat for several phone providers and locations -> fail

We still see random successes in production, but the fail rate is ~80-90%. The thing that surprises me is that it fails for 'fresh' 4G IPs. Also, I'm not talking about proxies here, and I got consistent fails.

Sorry for the incoherent bug report, but we still struggle to see a pattern.

Also just did a test via a 4G hotspot AND created a new account, and was able to get the script to produce results while my IP was blocked on the regular wifi network.

denispostilnyak commented 3 years ago

I've investigated the requests made by the TikTok UI, and it seems we need to understand how to generate the x-tt-params request header. The issue is in this part.

garoto commented 3 years ago

All I see on this GH issue is a bunch of people who never managed to produce/generate a reproducible scenario. No wonder drawrowfly is ignoring this issue for this long lol. I wish this was locked and set to members-only so I could stop receiving notifications.

isaackogan commented 3 years ago

Damn that's cool...

Sorry for the notification btw. But not really.

Most comments are people actively trying to reproduce and pool together resources, and while it would be quieter if he locked it, it wasn't reaaallly worth the snarky comment

garoto commented 3 years ago

"snarky"

There's nothing "snarky" about what I wrote.

garoto commented 3 years ago

Typical third-world native mindset right there, lol (expat BRs sao hilarios).

So far reading this issue, nothing reproducible was demonstrated. All I see is a bunch of what ifs, but not a single tcpdump to be seen.

I have being downloading tiktok content using this program for quite some time now with ZERO issues, but NO, i'm the retarded one; yeah right.

This is a shitty bug report to begin with.

asportnoy commented 3 years ago

All I see on this GH issue is a bunch of people who never managed to produce/generate a reproducible scenario. No wonder drawrowfly is ignoring this issue for this long lol

There are many comments in here that have been able to reproduce it, so I don't know what makes you say that. And on top of that Andrew (creator) acknowledged the issue. We know it is an issue, we just don't know the fix.

Andrew has other things to do and we can't expect him to work on this 24/7 for free.

This is a shitty bug report to begin with.

No it's not. They thoroughly explained the issue and gave all the requested information. It is a legitimate issue that other people are able to reproduce.

In conclusion, kindly shut up.

drawrowfly commented 3 years ago

Please no off topic, i'm aware about changes made by tiktok and that the scraper isn't stable right now

JorjanLLT commented 3 years ago

A few days ago the scraper worked fine, but today it always fails. I think it's caused by a TikTok upgrade, especially the TikTok API domain change... just my guess.

boscacci commented 3 years ago

Not a single tcpdump to be seen.

I've been writing python for a living for ~3 short years and I haven't yet needed to know what a tcpdump is. Call me green but it might actually be helpful if you demonstrated how posting a tcpdump gets us closer to understanding the problem at hand

I have being downloading tiktok content using this program for quite some time now with ZERO issues

Have you been able to identify what you're doing right that we're not doing? Can you add it to the readme?

victoryforphil commented 3 years ago

I have being downloading tiktok content using this program for quite some time now with ZERO issues, but NO, i'm the retarded one; yeah righ

FIRST off, I've been following this thread closely to figure out possibly how to resolve this issue. So thank you to everyone trying to figure this out!

2nd, how much data are you pulling? I can reproduce this bug across 3 environments (personal, GCP and a self-hosted cluster) with 100+ proxies. I could MAYBE get 1000k requests if I was lucky; now it's more like 3-5. We had many scrapers all running in their own ways (CLI, Docker, and custom scripts) and ALL started to see increased failure rates.

Tests:

Python's scraper is also failing

https://tikapi.io/ (paid) DOES seem to be working, but judging by their features, they're doing a more emulated approach, so probably not as useful.

I would love to contribute more to this conversation and share reproducibility and tests for anyone who finds that useful.

JorjanLLT commented 3 years ago

I was getting 0 data back, even after setting --session sid_tt=MY_SID_TT but I turned on NordVPN in "obfuscated" mode (which is supposed to avoid looking like a VPN) and it's flowing now

This ^ isn't doing it for me anymore. Switching to a new NordVPN server works only for one or two API calls. Implementing an exponential backoff didn't help. I have been looking for a way to programmatically make NordVPN switch servers on macOS but that functionality only seems widely available for nix or windows. I even tried creating a tor proxy on my localhost and feeding that to tiktok-scraper, like --proxy socks5://localhost:9050. It seems like I was successful in creating a local tor proxy, like it works for my web browser, but it doesn't work with TikTok-API. Maybe I need to find a way to pass verifyfP or something.

They detecting tiktok-scraper and block by ip address. With fresh proxy you can make 1-2 requests, then you are banned.

I don't think so. I use my personal Mac with the scraper and a proxy, and of course it fails, but visiting the TikTok website in Chrome always works, with no failures. So I guess these failures are caused by a TikTok API upgrade. Getting the correct query params is important.

JorjanLLT commented 3 years ago

I was getting 0 data back, even after setting --session sid_tt=MY_SID_TT but I turned on NordVPN in "obfuscated" mode (which is supposed to avoid looking like a VPN) and it's flowing now

This ^ isn't doing it for me anymore. Switching to a new NordVPN server works only for one or two API calls. Implementing an exponential backoff didn't help. I have been looking for a way to programmatically make NordVPN switch servers on macOS but that functionality only seems widely available for nix or windows. I even tried creating a tor proxy on my localhost and feeding that to tiktok-scraper, like --proxy socks5://localhost:9050. It seems like I was successful in creating a local tor proxy, like it works for my web browser, but it doesn't work with TikTok-API. Maybe I need to find a way to pass verifyfP or something.

They detecting tiktok-scraper and block by ip address. With fresh proxy you can make 1-2 requests, then you are banned.

I don't think so. I use my personal Mac with the scraper and a proxy, and of course it fails, but visiting the TikTok website in Chrome always works, with no failures. So I guess these failures are caused by a TikTok API upgrade. Getting the correct query params is important.

Bad news: today I tested the scraper with a proxy and it failed. Then I went to the TikTok website and, unfortunately, my IP was banned and I can't view the web content... It seems the TikTok developers know how the scraper works and ban the IPs it runs from... 😭

Agrafador commented 3 years ago

I was getting 0 data back, even after setting --session sid_tt=MY_SID_TT but I turned on NordVPN in "obfuscated" mode (which is supposed to avoid looking like a VPN) and it's flowing now

This ^ isn't doing it for me anymore. Switching to a new NordVPN server works only for one or two API calls. Implementing an exponential backoff didn't help. I have been looking for a way to programmatically make NordVPN switch servers on macOS but that functionality only seems widely available for nix or windows. I even tried creating a tor proxy on my localhost and feeding that to tiktok-scraper, like --proxy socks5://localhost:9050. It seems like I was successful in creating a local tor proxy, like it works for my web browser, but it doesn't work with TikTok-API. Maybe I need to find a way to pass verifyfP or something.

They detecting tiktok-scraper and block by ip address. With fresh proxy you can make 1-2 requests, then you are banned.

I don't think so. I use my personal Mac with the scraper and a proxy, and of course it fails, but visiting the TikTok website in Chrome always works, with no failures. So I guess these failures are caused by a TikTok API upgrade. Getting the correct query params is important.

Bad news: today I tested the scraper with a proxy and it failed. Then I went to the TikTok website and, unfortunately, my IP was banned and I can't view the web content... It seems the TikTok developers know how the scraper works and ban the IPs it runs from... 😭

Same thing happened to me: it returns status 200 with an empty body. They are now blocking requests, but not all of them. Right now I'm blocked on the feed requests (/api/post/item_list), but the user profile request (/node/share/user/) is fine. Also, if you log in, everything works as expected.
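
Since the block shows up as a 200 with an empty body rather than an error status, a scraper has to look at the body to notice it. The sketch below is one hedged way to label responses; `classifyResponse` and its labels are hypothetical, and it assumes the endpoint normally returns JSON.

```js
// Hypothetical classifier for a response given its status code and raw body text.
function classifyResponse(statusCode, body) {
  const text = typeof body === 'string' ? body.trim() : '';
  if (statusCode !== 200) return 'http-error';
  if (text.length === 0) return 'soft-block'; // 200 with an empty body, as described above
  try {
    JSON.parse(text);
    return 'ok';
  } catch {
    return 'unexpected-body'; // e.g. an HTML page where JSON was expected
  }
}
```

Logging the label per endpoint would make it easier to confirm reports like this one, where the feed endpoint is blocked while the profile endpoint still answers.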

ncsft commented 3 years ago

The guys from davidteather's TikTok-Api seem to be moving forward.

It might be worth checking their progress so far to learn what they have learned: davidteather/TikTok-Api#695

It makes no sense to develop such scrapers. Tiktok will introduce new traps and ban those who use them. It's an endless race that cannot be won. Use headless browsers.

drawrowfly commented 3 years ago

The guys from davidteather's TikTok-Api seem to be moving forward.

It might be worth checking their progress so far to learn what they have learned: davidteather/TikTok-Api#695

It makes no sense to develop such scrapers. Tiktok will introduce new traps and ban those who use them. It's an endless race that cannot be won. Use headless browsers.

Lol, then it makes no sense why you are here writing this message

BK-Go-Python commented 3 years ago

The guys from davidteather's TikTok-Api seem to be moving forward. It might be worth checking their progress so far to learn what they have learned: davidteather/TikTok-Api#695

It makes no sense to develop such scrapers. Tiktok will introduce new traps and ban those who use them. It's an endless race that cannot be won. Use headless browsers.

I tried using a headless browser and it often popped up a sliding captcha. I was desperate.