dfreelon / pyktok

A simple module to collect video, text, and metadata from Tiktok.
BSD 3-Clause "New" or "Revised" License

Expecting value: line 1 column 1 (char 0) issue with save_hashtag_video_urls() #6

Closed yjqian02 closed 2 years ago

yjqian02 commented 2 years ago

Thanks for all your hard work in this module! I've been using it to scrape TikTok videos by hashtag for a research study, but today when I try to run save_hashtag_video_urls() with any hashtag, I keep getting the following output:

Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)

and it keeps repeating. I'm still able to use the other functions, and I first noticed this issue around 12 pm CDT today. Would this be an issue with the TikTok API changing?
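
For context, the repeated line is the message of the `json.JSONDecodeError` that Python's standard `json` module raises when it is given an empty or non-JSON string, which would fit a request that now comes back without a JSON body. A minimal sketch of that behaviour follows; `describe_json_error` is a hypothetical helper for illustration and is not part of pyktok:

```python
import json


def describe_json_error(text):
    """Return the decoder's error message for `text`, or None if it parses.

    Hypothetical helper for illustration only; not part of pyktok.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return str(err)
    return None


# An empty body and an HTML body both fail on the very first character.
print(describe_json_error(""))
print(describe_json_error("<html></html>"))
```

Checking the raw response text before decoding it is usually the quickest way to tell an empty reply from a genuine JSON error.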

JBGruber commented 2 years ago

I think TikTok deprecated the search for hashtags functionality (or moved it). The entire endpoint is gone / returning 404: https://www.tiktok.com/tag/

I already have an issue over at the R package repo: https://github.com/JBGruber/traktok/issues/4

dfreelon commented 2 years ago

So the endpoint is not gone, I just checked. When I look at the following API URL, it delivers the expected data. This means TikTok has changed the required parameters to deliver a valid response, which means that someone needs to go through the URL params to figure out which ones are necessary. The earliest I'll be able to get to that is next week most likely, but @JBGruber if you have time this week, please LMK which parameters are required and I'll fix it ASAP.

JBGruber commented 2 years ago

I also don't have time right now but will let you know if I find out more. Two new insights:

  1. Searches work from the browser with this url and (I think) only when the user is logged in.
  2. This curl call just worked for me:
    curl 'https://www.tiktok.com/api/search/item/full/?aid=1988&app_language=en&app_name=tiktok_web&battery_info=1&browser_language=en-GB&browser_name=Mozilla&browser_online=true&browser_platform=Linux%20x86_64&browser_version=5.0%20%28X11%3B%20Linux%20x86_64%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F107.0.0.0%20Safari%2F537.36&channel=tiktok_web&cookie_enabled=true&count=20&device_id=7156603669521303045&device_platform=web_pc&focus_state=true&from_page=search&history_len=2&is_fullscreen=false&is_page_visible=true&keyword=%23rstats&offset=12&os=linux&priority_region=DE&referer=&region=NL&screen_height=1200&screen_width=1920&search_id=20221115133736010190209216170B7D15&tz_name=Europe%2FAmsterdam&verifyFp=verify_l9h6422m_m9VsjncG_5Ki6_49SS_BPx5_IeVGoLXl4P9h&webcast_language=en' \
    -H 'authority: www.tiktok.com' \
    -H 'accept: */*' \
    -H 'accept-language: en-GB,en;q=0.9,de-DE;q=0.8,de;q=0.7,en-US;q=0.6' \
    -H 'cookie: ***REDACTED***' \
    -H 'referer: https://www.tiktok.com/search/video?q=%23rstats&t=1668512698958' \
    -H 'sec-ch-ua: "Chromium";v="107", "Not=A?Brand";v="24"' \
    -H 'sec-ch-ua-mobile: ?0' \
    -H 'sec-ch-ua-platform: "Linux"' \
    -H 'sec-fetch-dest: empty' \
    -H 'sec-fetch-mode: cors' \
    -H 'sec-fetch-site: same-origin' \
    -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36' \
    --compressed

    challengeID and cursor are both gone. Pagination seems to work through the offset=12 bit. referer is just the search url + unix epoch time stamp. This might actually make things easier.
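
A minimal sketch of how the `offset` parameter and the `referer` header from the curl call could be assembled in Python. The function name `build_search_request` is hypothetical, and the sketch assumes the 13-digit `t=` value is a Unix timestamp in milliseconds; nothing here has been verified against TikTok beyond what the curl call shows.

```python
import time
from urllib.parse import quote


def build_search_request(keyword, offset=0, now=None):
    """Return (params, referer) mirroring the curl call above.

    `now` is a Unix time in seconds; it defaults to the current time.
    Hypothetical helper, shown only to illustrate the pieces involved.
    """
    seconds = time.time() if now is None else now
    # Assumption: t= is the epoch time in milliseconds (13 digits).
    timestamp_ms = round(seconds * 1000)
    referer = (
        "https://www.tiktok.com/search/video"
        f"?q={quote(keyword, safe='')}&t={timestamp_ms}"
    )
    # Pagination: increase `offset` on each subsequent request.
    params = {"keyword": keyword, "offset": offset}
    return params, referer


params, referer = build_search_request("#rstats", offset=12, now=1668512698)
print(params)
print(referer)
```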

dfreelon commented 2 years ago

Yeah, /search/item/full works for me and it's probably better generally as it can be used to retrieve stitched videos as well. I'll make the change when time permits...

Jmallone commented 2 years ago

@JBGruber and @dfreelon Same thing happened in this issue from another project: https://github.com/davidteather/TikTok-Api/issues/976#issuecomment-1316747795

If any of the parameters is removed from the https://www.tiktok.com/api/challenge/item_list/ request, it no longer returns anything.

https://www.tiktok.com/api/challenge/item_list/ still uses cursor rather than offset. I tried changing it and it still doesn't work; it seems that __signature, or msToken and X-Bogus, has to change along with it.

Do you have any idea how to resolve this?

dfreelon commented 2 years ago

@Jmallone Yes, use the endpoint @JBGruber identifies above in his curl code. You'll need to manually go through the parameters to figure out the required ones, but it should work better than /challenge/item_list/

Jmallone commented 2 years ago

@dfreelon Awesome! I was using /challenge/item_list/ to get videos from hashtags. Does /search/item/full/ do the same thing?

I ask because I used to pass the challengeID of the hashtag, and with /search/item/full I use keyword instead of challengeID.

dfreelon commented 2 years ago

I think so--we use a two-step process to pull video URLs from search using one function and get the videos themselves applying a second function to those URLs. You could at least follow a similar approach.

JBGruber commented 2 years ago

I've got a working wrapper for it now over at the R package. I was able to cut down the api call quite a bit. You can search for users, hashtags or just keywords without specifying anything. This seems to be an improvement for data access:

https://github.com/JBGruber/traktok/blob/a64b9e585b55831690661685e9408ee186f3a4c9/R/traktok.R#L357-L366

Essentially, a URL would read like this for a user: https://www.tiktok.com/api/search/item/full/?keyword=%40chilipeppers&offset=0

For a hashtag: https://www.tiktok.com/api/search/item/full/?keyword=%2523rstats&offset=0

The only header I send is the cookie header.

The only downside now is that the user who provides the cookies needs to be logged in.
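
The two example URLs can be reproduced with the standard library. This is a sketch, not the traktok or pyktok implementation: `build_search_url` and its `double_encode` flag are hypothetical names. In the examples above the user keyword is percent-encoded once (`%40`) while the hashtag is encoded twice (`%2523`); the thread does not establish whether the double encoding is required, so it is left as an option.

```python
from urllib.parse import quote

SEARCH_ENDPOINT = "https://www.tiktok.com/api/search/item/full/"


def build_search_url(keyword, offset=0, double_encode=False):
    """Build a search URL from a raw keyword such as '@name' or '#tag'."""
    encoded = quote(keyword, safe="")
    if double_encode:
        # Encoding a second time turns "%23" into "%2523".
        encoded = quote(encoded, safe="")
    return f"{SEARCH_ENDPOINT}?keyword={encoded}&offset={offset}"


print(build_search_url("@chilipeppers"))
print(build_search_url("#rstats", double_encode=True))
```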

Jmallone commented 2 years ago

@JBGruber I have a question: is the offset for the next page the value of the cursor?

https://www.tiktok.com/api/search/item/full/?keyword=%2523perobal&offset=36

JBGruber commented 2 years ago

Yes! Using the cursor is a way better idea than what I came up with :facepalm:. I simply counted the videos that were already returned and used that number as offset.
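
A minimal sketch of the cursor-based pagination discussed here. It assumes each decoded response is a dict with `item_list`, `has_more`, and `cursor` keys; those key names are assumptions about the response format. `fetch_page` is a hypothetical callable standing in for the real HTTP request, so the loop can be tried without network access.

```python
def collect_items(fetch_page, limit=100):
    """Collect up to `limit` items, feeding each cursor back as the next offset.

    `fetch_page(offset)` is a hypothetical callable returning one decoded
    response; the key names used below are assumptions.
    """
    items = []
    offset = 0
    while len(items) < limit:
        data = fetch_page(offset)
        page = data.get("item_list") or []
        items.extend(page)
        if not page or not data.get("has_more"):
            break
        # Prefer the server-provided cursor; fall back to counting items,
        # which is the earlier approach mentioned in this comment.
        offset = data.get("cursor", offset + len(page))
    return items[:limit]


# Fake two-page result set standing in for real responses.
fake_pages = {
    0: {"item_list": ["a", "b", "c"], "has_more": 1, "cursor": 3},
    3: {"item_list": ["d", "e"], "has_more": 0, "cursor": 5},
}
print(collect_items(fake_pages.__getitem__, limit=10))
```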

Jmallone commented 2 years ago

More Insights:

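Minimal code in Python:

import requests

# Replace the placeholders with cookie values from a logged-in browser session.
cookies = {
    'ttwid': '***REMOVED***',
    'sessionid': '***REMOVED***',
}

params = {
    # requests percent-encodes this value again, so the URL carries %2523rstats
    'keyword': '%23rstats',
    'offset': '0',
}

response = requests.get('https://www.tiktok.com/api/search/item/full/', params=params, cookies=cookies)
print(response.text)

If we remove ttwid from the cookies, this message is returned: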

{
  "status_code":2483,
  "status_msg":"Please login your account first",
  "log_pb":{
    "impr_id":"2022111718004042A4CCC7DE79FF1E30D0"
  }
}
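
Because an error like this arrives as an ordinary JSON body, it can be worth checking `status_code` before reading any results. A minimal sketch, assuming that a missing or zero `status_code` means success (an assumption, not something confirmed in this thread); `status_error` is a hypothetical helper name:

```python
def status_error(data):
    """Return 'code: message' for an error response, or None on success.

    Assumption: a missing or zero status_code indicates success.
    """
    code = data.get("status_code", 0)
    if code == 0:
        return None
    return f"{code}: {data.get('status_msg', 'unknown error')}"


login_required = {
    "status_code": 2483,
    "status_msg": "Please login your account first",
}
print(status_error(login_required))
```

A caller could stop paginating, or prompt for fresh cookies, as soon as this returns a message.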

Jmallone commented 2 years ago

I think you will have to use it with delays, because here I got:

{
  "status_code":2484,
  "status_msg":"Too many attempts. Try again in 1 hour.",
}

TimoBaeuerle commented 2 years ago

I took the save_hashtag_video_urls function from pyktok.py as draft. The params must contain any device_id, the keyword and offset. (The device_id is just a number with any 19 digits) The cookies only have to contain the ttwid-Cookie. The itemList has been renamed to item_list, the same applies to hasMore which has been renamed to has_more. This works for me:

import random
import requests
import sys
import time

def get_videos_by_keyword(keyword, limit=1000):
    cursor = 0
    # Only the ttwid cookie is needed; substitute your own value.
    cookies = {
        'ttwid': '1%7CPm9bTMLMzjZ48RTfSWSxsyFOpGIaDfICGUjuSUtm4ng%7C1668717040%7Cdc9c307a7f02eeae1fed06237bd2d7635c52cf583dfcde8963d1580efc90cb35'}
    while cursor < limit:
        params = {
            'device_id': '1234567890123456789',  # any 19-digit number
            'keyword': keyword,
            'offset': cursor,
        }
        try:
            response = requests.get(
                'https://www.tiktok.com/api/search/item/full/',
                params=params,
                cookies=cookies)
            data = response.json()
            videos = data['item_list']
            # Print at most the number of URLs still allowed by the overall limit.
            for video in videos[:limit - cursor]:
                #desc = video['desc']
                #created = video['createTime']
                #author = video['author']
                #views = video['stats']['playCount']
                url = 'https://tiktok.com/@' + video['author']['uniqueId'] + '/video/' + video['id']
                print(url)
            cursor = cursor + len(videos)
            # Stop when there are no more results (or the page came back empty).
            if data["has_more"] != 1 or not videos:
                break
            time.sleep(random.randint(1, 3))
        except Exception as e:
            # cursor is an int, so format it rather than concatenating strings,
            # and stop instead of retrying the same offset forever.
            print(f'Stopped at cursor="{cursor}": {e}')
            break
    print('Done.')

def main():
    args = sys.argv[1:]
    if len(args) == 2 and args[0] == '-keyword':
        get_videos_by_keyword(args[1])
    # elif, so a valid two-argument call does not fall through to the usage message
    elif len(args) == 4 and args[0] == '-keyword' and args[2] == '-limit':
        get_videos_by_keyword(args[1], int(args[3]))
    else:
        print('You have to enter some keyword, for example: -keyword "#fun #cats"')

if __name__ == "__main__":
    main()

dfreelon commented 2 years ago

@TimoBaeuerle Thanks for this first draft of a new function! Do you want to add it as a pull req, or are you OK with me copy-pasting the code in and crediting you in the README? If you do the former you will be listed as an official contributor, in case you care about that sort of thing.

TimoBaeuerle commented 2 years ago

@dfreelon Sure, I can add this function to the project's repo. Should I just add the function, or also update the existing save_hashtag_video_urls function to the new API URL and params?

Jmallone commented 2 years ago

Hi @TimoBaeuerle, did you resolve the "Too many attempts" error with delays?

TimoBaeuerle commented 2 years ago

Hi @Jmallone, since I started using the device_id parameter I have never gotten this error message again. Currently I'm not sure whether the delay at the end or the device_id is responsible for this. Maybe I'll find out today.

JBGruber commented 2 years ago

"Too many attempts" was also returned for me when I sent a malformed cookie string by accident. I haven't seen it since even without pauses between requests (but I only requested a couple 1000 videos so far for testing).

Jmallone commented 2 years ago

@TimoBaeuerle let me know what you find afterward :)

@JBGruber Interesting observation. In my tests I sent the complete cookie parameters, and after a few requests it simply "burned" the cookies: they stopped working and this "Too many attempts" message started to appear.

A bit of trivia: TikTok announced a research API update yesterday. I think that must be the reason for the recent API changes.

dfreelon commented 2 years ago

@TimoBaeuerle If it's OK with you, I'd like to make extensive revisions to your code before I merge it back in--I realized it's possible to pull not only URLs but also other metadata with each call to search/item/full, but it will require me to rethink other pieces of pyktok first. So it will likely take a few days, but I'll credit you in the README when I push the changes, unless you object for whatever reason.

JBGruber commented 2 years ago

For inspiration, these are the fields I pull. vpluck just means to return NA if the field doesn't exist in the json and to check the returned type (e.g., integer):

  tibble::tibble(
    video_id              = vpluck(json[[entries]], "video", "id"),
    video_timestamp       = video_timestamp,
    video_url             = vpluck(json[[entries]], "video", "downloadAddr"),
    video_length          = vpluck(json[[entries]], "video", "duration", val = "integer"),
    video_title           = vpluck(json[[entries]], "desc"),
    video_diggcount       = vpluck(json[[entries]], "stats", "diggCount", val = "integer"),
    video_sharecount      = vpluck(json[[entries]], "stats", "shareCount", val = "integer"),
    video_commentcount    = vpluck(json[[entries]], "stats", "commentCount", val = "integer"),
    video_playcount       = vpluck(json[[entries]], "stats", "playCount", val = "integer"),
    video_description     = vpluck(json[[entries]], "desc"),
    video_is_ad           = vpluck(json[[entries]], "isAd", val = "logical"),
    author_name           = author_name,
    author_followercount  = vpluck(json[[entries]], "authorStats", "followerCount", val = "integer"),
    author_followingcount = vpluck(json[[entries]], "authorStats", "followingCount", val = "integer"),
    author_heartcount     = vpluck(json[[entries]], "authorStats", "heartCount", val = "integer"),
    author_videocount     = vpluck(json[[entries]], "authorStats", "videoCount", val = "integer"),
    author_diggcount      = vpluck(json[[entries]], "authorStats", "diggCount", val = "integer")
  )

dfreelon commented 2 years ago

@JBGruber Thanks--my idea is to build out a separate function that pulls all the metadata fields that can be used either for a single video or for the results of a search/item/full request. That will minimize the number of requests to the TikTok server and speed up runtime... trouble is finding time to actually write out the code...
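
A rough Python counterpart to the `vpluck` idea, applied to a single item so the same function could serve both a single video and each entry of a search result. This is a sketch only: `pluck` and `extract_metadata` are hypothetical names, the JSON paths are copied from the R field list above, and returning `None` on a type mismatch is an assumption about how such a check might behave.

```python
def pluck(record, *keys, expected_type=None):
    """Walk nested dicts; return None if a key is missing or the type is wrong."""
    value = record
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    if expected_type is not None and not isinstance(value, expected_type):
        return None
    return value


def extract_metadata(item):
    """Pull a few of the fields listed above from one item dict."""
    return {
        "video_id": pluck(item, "video", "id"),
        "video_length": pluck(item, "video", "duration", expected_type=int),
        "video_description": pluck(item, "desc"),
        "video_playcount": pluck(item, "stats", "playCount", expected_type=int),
        "author_followercount": pluck(
            item, "authorStats", "followerCount", expected_type=int
        ),
    }


# Made-up item: one field has the wrong type and one section is missing.
sample = {
    "video": {"id": "1", "duration": 15},
    "desc": "hi",
    "stats": {"playCount": "many"},
}
print(extract_metadata(sample))
```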

azickri commented 2 years ago

Hey, thanks for the great research and work. I'm currently researching the TikTok API, and I recently found out that the API I'm using (https://m.tiktok.com/api/challenge/item_list/) doesn't work anymore.

I want to ask, is it possible to specify the return data per page with the search/item/full API? I've tried using the cursor but it doesn't work.

dfreelon commented 2 years ago

@azickri We're working on it, see upthread...

azickri commented 2 years ago

@dfreelon, that's great. If I find a solution, can I share it here?

dfreelon commented 2 years ago

@azickri Sure thing, although I have a pretty good idea of how I want to do it, so I may borrow bits of your code rather than integrating it intact, if that's OK

TimoBaeuerle commented 2 years ago

@dfreelon That's OK with me, thanks ;)

stefanuq commented 2 years ago

Hi all, thanks for finding a solution so fast! If I am not mistaken, the latest changes in the TikTok API also broke the save_video_comments() function. Today I was trying to find a solution, but it looks like we now have to supply additional parameters like X-Bogus, which are URL-specific. If we change any of the other parameters (e.g. cursor), X-Bogus somehow has to be changed as well, or else we start getting empty responses.

Jmallone commented 2 years ago

Another error message, which appeared after trying to fetch too much:

{
  "status_msg":"You have reached the maximum number of searched today.",
  "log_pb":{"impr_id":"2022111811254781019205416127D860D2"},
  "status_code":2484
}

This happened even with deviceId set and delays in place. @TimoBaeuerle, did this ever happen to you with your code?

dfreelon commented 2 years ago

@stefanuq Well in my defense I did try to warn y'all... I'll look into it and try to have some workable solutions by early next week. In the meantime, anyone is free to post working code here at any time, and I'll credit you if I end up using it.

ccshit commented 2 years ago

@Jmallone You need a delay of about 2 seconds between offsets and 10 to 20 seconds between searches, and then you can run it 24/7.
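
A minimal sketch of that pacing, taking the numbers in this comment at face value and assuming both are in seconds. `crawl` and `fetch_page` are hypothetical names, and the `sleep` argument is injectable so the schedule can be checked without actually waiting.

```python
import random
import time


def crawl(keywords, fetch_page, pages_per_keyword=3, sleep=time.sleep):
    """Fetch several pages per keyword with pauses between requests.

    `fetch_page(keyword, page_index)` is a hypothetical callable that
    performs one request and returns its result.
    """
    results = {}
    for k_index, keyword in enumerate(keywords):
        if k_index > 0:
            sleep(random.uniform(10, 20))  # pause between searches
        results[keyword] = []
        for p_index in range(pages_per_keyword):
            if p_index > 0:
                sleep(2)  # pause between offsets of the same search
            results[keyword].append(fetch_page(keyword, p_index))
    return results


# Dry run: record the requested pauses instead of sleeping.
pauses = []
crawl(["#a", "#b"], lambda kw, page: (kw, page), pages_per_keyword=2,
      sleep=pauses.append)
print(pauses)
```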

dfreelon commented 2 years ago

OK, I looked into this over the weekend and have a few observations:

  1. Pull the comments directly from the emulated source using something like BeautifulSoup, or
  2. Capture the relevant XHR request to /api/comment/list/, resend the request with all necessary URL params, and capture the results.

I hesitate to implement anything using Selenium for several reasons:

But if anyone figures out anything faster for comments between now and then, LMK and I'll consider implementing it.

Jmallone commented 2 years ago

@dfreelon Nice observations.

The problem with waiting for tiktok to do something about it is not knowing what date it will release this official api.

dfreelon commented 2 years ago

@Jmallone I'm willing to wait and see in the short term (also because I have other commitments...) but I will at least fix the search function, likely over the next few days

TimoBaeuerle commented 2 years ago

@Jmallone I have never seen this error message. The bot I'm building is still under development, so maybe i will see more errors in production.

dfreelon commented 2 years ago

OK, just pushed out new functions that can pull from search/item/full and get comments initially visible on a video page. Try it out and LMK if you encounter problems... also the new version is not available on PyPI yet, will do that later tonight.

dfreelon commented 2 years ago

OK, the latest version is now up on PyPI, so I'm going to close this until someone finds something wrong with it...

mandys commented 2 years ago

Has anyone observed that this search isn't exactly the same as the hashtag search? For example, tiktok.com/tag/ will give all entries with that tag,

but this is what I get from this endpoint: https://us.tiktok.com/api/search/item/full/?keyword=#votigo&offset=0

For the "votigo" keyword (something less popular), I see results for vertigo, veriligo, vitiligo, virgo, and voti.

This is doing an actual search for anything that remotely matches the keyword.

Any ideas on how to do an exact match?

dfreelon commented 2 years ago

@mandys Yes, this is also what happens with the search when you use a browser. I find it annoying as well--the only thing I can think of is to search first and filter your results. (Also hashtag search is now prohibitively difficult to use programmatically; see upthread.)
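
A minimal sketch of the search-first, filter-afterwards approach. It assumes each returned item is a dict whose `desc` field holds the caption including its hashtags, which is an assumption about the response format; `has_exact_hashtag` and `filter_by_hashtag` are hypothetical names.

```python
import re


def has_exact_hashtag(description, tag):
    """True if `description` contains `tag` as a complete hashtag."""
    # The lookarounds reject longer tags such as #votigolive.
    pattern = r"(?<!\w)#" + re.escape(tag.lstrip("#")) + r"(?!\w)"
    return re.search(pattern, description, flags=re.IGNORECASE) is not None


def filter_by_hashtag(items, tag):
    """Keep only the items whose caption carries the exact hashtag."""
    return [
        item for item in items
        if has_exact_hashtag(item.get("desc") or "", tag)
    ]


# Made-up captions standing in for real search results.
results = [
    {"desc": "my #votigo contest"},
    {"desc": "#vertigo feels"},
    {"desc": "#votigolive now"},
    {"desc": "#Votigo!"},
]
print(filter_by_hashtag(results, "votigo"))
```

The match is case-insensitive here; drop `re.IGNORECASE` if case matters for the study.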

azickri commented 2 years ago

Does anybody have the same issue? I try to hit the API (tiktok.com/api/search/full) with the full request cookie, but the response is:

{
 status_msg: 'Please login your account first',
 log_pb: { impr_id: '202211240158570102450591401C07BBFC' },
 status_code: 2483
}

dfreelon commented 2 years ago

You may need to be logged in to TikTok for some functions to work. Let me put that in the docs... I would also try setting the browser param to a browser on your system from which you have visited TT while logged in.

dfreelon commented 2 years ago

@mandys I should point out that it's fairly easy to pull the first 15 videos displayed on a hashtag search. I will probably write a function to do this at some point, but right now you can do something like:


import pyktok as pyk

ui = pyk.get_tiktok_json('https://www.tiktok.com/tag/uidesign')
#then parse through the data in ui['ItemModule']

Not nearly as good as before, but better than nothing.
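
A sketch of what parsing that data might look like. It assumes `ui['ItemModule']` is a dict keyed by video ID whose values carry an `id` and an `author` (either a plain username string or a dict with `uniqueId`); that structure is an assumption and should be checked against the actual JSON. `urls_from_item_module` is a hypothetical helper, not a pyktok function.

```python
def urls_from_item_module(item_module):
    """Build video URLs from an ItemModule-style mapping (structure assumed)."""
    urls = []
    for video_id, item in item_module.items():
        author = item.get("author")
        if isinstance(author, dict):
            author = author.get("uniqueId")
        if author:
            urls.append(
                f"https://www.tiktok.com/@{author}/video/{item.get('id', video_id)}"
            )
    return urls


# Made-up data in the assumed shape; with real data this would be
# urls_from_item_module(ui['ItemModule']).
fake_item_module = {
    "123": {"id": "123", "author": "alice"},
    "456": {"author": {"uniqueId": "bob"}},
}
print(urls_from_item_module(fake_item_module))
```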

azickri commented 2 years ago

That's great, I will use this method if the API doesn't work. Thanks for the advice.

dfreelon commented 2 years ago

OK I did this, see new function save_tiktok_multi_page which works with hashtag, user, and music pages. Only 30 videos per user and 15 per hashtag and song, but sometimes it be like that...

idksomethinggeneric commented 2 years ago

I feel like I'm missing something extremely obvious - pyk.save_tiktok_by_keyword (with save videos enabled) was working fine all day, now suddenly errors out immediately due to no item_list?

KeyError: 'item_list'
Stopped at cursor= 0

Other functions still seem to be working fine so I don't think I've been banned by tiktok

dfreelon commented 2 years ago

@idksomethinggeneric If you've been running it all day, my guess is TT might be throttling your usage. You can troubleshoot this by plugging any video URL into get_tiktok_json and inspecting the resulting JSON object. It should be quite large, but if it's small and/or conveys a message like this:

{
  "status_code":2484,
  "status_msg":"Too many attempts. Try again in 1 hour.",
}

...you should probably give it some time.

azickri commented 2 years ago

Does anybody have the same issue now? When I make a request with the full cookie after logging in to TikTok, the API at https://www.tiktok.com/api/search/item/full/ returns a blank response with status 200.

JBGruber commented 2 years ago

Have you gone back to the TikTok website? You should be logged in, and you sometimes get a captcha you need to solve. Only then are your cookies valid for making requests. Annoyingly, the API almost always returns 200.

azickri commented 2 years ago

@JBGruber Of course. I checked tiktok.com and copied a new cookie, but I get the same result. Do you have the same issue, or does it work normally for you?