benjaminguinaudeau / tiktokr

An R Scraper for TikTok

tiktokr

Lifecycle: experimental

Disclaimer (January 7th, 2021)

At the beginning of December 2020, TikTok changed its API structure and its security measures to control metadata traffic. As a result, requests made with tiktokr are blocked very often, if not systematically; the failure shows up as an error when parsing the JSON data structure.

After trying minor patches, we concluded that tiktokr needs to be completely rewritten to fit TikTok's new infrastructure. Because none of the authors currently has the time to rewrite the package, we are putting it on hold for now and apologize for the resulting inconvenience. If you are interested in taking over the challenge, we are glad to share the knowledge that we have accumulated during the development of tiktokr.

The goal of tiktokr is to provide a scraper for the video-sharing social networking service TikTok.

While writing this library, we were broadly inspired by the Python module davidteather/TikTok-Api. You will need Python 3.6 or Docker to use tiktokr. If you want to use Docker, see the section "Using tiktokr with Docker" below.

Many thanks go to Vivien Fabry for creating the hexagon logo.

Overview

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("benjaminguinaudeau/tiktokr")

Load library

library(tiktokr)

Make sure to use your preferred Python installation

library(reticulate)

use_python(py_config()$python)
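Since tiktokr expects Python 3.6, a quick version check before going further can save some debugging. The sketch below is not part of tiktokr: python_is_supported() is a hypothetical helper, it assumes that 3.6 is a minimum rather than an exact requirement, and the commented usage line assumes that reticulate::py_config() exposes the interpreter version in its version field.

```r
# Hypothetical helper (not part of tiktokr): compare a Python version string
# against a minimum, component by component rather than as plain text.
python_is_supported <- function(version, minimum = "3.6") {
  numeric_version(version) >= numeric_version(minimum)
}

python_is_supported("3.10")  # TRUE: 10 > 6 when compared numerically
python_is_supported("2.7")   # FALSE

# With reticulate loaded, the active interpreter could be checked like this
# (assumption: py_config() returns a `version` field):
# python_is_supported(as.character(reticulate::py_config()$version))
```

Comparing with numeric_version() avoids the classic pitfall where "3.10" sorts before "3.6" as text.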

Install necessary Python libraries

tk_install()

Authentication

In November 2020, TikTok tightened its security protocol. It now frequently shows a captcha, which is easily triggered after a few requests. This can be solved by specifying the cookie parameter. To get a session cookie:

  1. Open a browser and go to http://tiktok.com
  2. Scroll down a bit to make sure that you don’t get a captcha
  3. Open the JavaScript console (in Chrome: View > Developer > JavaScript Console)
  4. Run document.cookie in the console. Copy the entire output (your cookie).
  5. Run tk_auth() in R and paste the cookie.

(Screen recording of how to get your TikTok cookie.)

The tk_auth function saves the cookie (and user agent) as environment variables in your .Renviron file. You only need to run it once to use tiktokr, or again whenever you want to update your cookie or user agent.

tk_auth(cookie = "<paste here the output from document.cookie>")
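Before passing the string to tk_auth(), it can help to confirm that the whole document.cookie output was copied. The sketch below is only an illustration: parse_cookie() is a hypothetical helper that is not part of tiktokr, and the cookie shown is made up.

```r
# Hypothetical helper (not part of tiktokr): split a document.cookie string
# into a named character vector so the copied value can be inspected.
parse_cookie <- function(cookie) {
  pairs <- strsplit(trimws(cookie), ";\\s*")[[1]]
  pairs <- pairs[nzchar(pairs)]
  # Split each pair at the first "=" only, since values may contain "=".
  eq <- as.integer(regexpr("=", pairs, fixed = TRUE))
  has_eq <- eq > 0
  keys <- ifelse(has_eq, substr(pairs, 1, eq - 1), pairs)
  values <- ifelse(has_eq, substr(pairs, eq + 1, nchar(pairs)), "")
  stats::setNames(values, keys)
}

cookie <- "name_a=abc; name_b=x=y"  # made-up example, not a real cookie
parsed <- parse_cookie(cookie)
names(parsed)                       # the cookie names found in the string
```

If the list of names looks suspiciously short, the string was probably truncated while copying.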

Using tiktokr with Docker

TikTok requires API queries to be identified with a unique hash. To get this hash, tiktokr runs a puppeteer-chrome session in the background. Because puppeteer sometimes causes issues on some operating systems, we also created a Docker image that can be run on any computer with Docker installed. Note: if you run tiktokr with Docker, you won’t need a Python installation.

To find out if you are experiencing puppeteer problems, run:

library(tiktokr)
Sys.setenv("TIKTOK_DOCKER" = "")
tk_auth(cookie = "<your_cookie_here>")
tk_init()
out <- get_signature("test")

# Check the signature obtained through puppeteer above
if(stringr::str_length(out) > 16){
  message("Puppeteer works well on your computer")
} else {
  message("Puppeteer does not work, please consider using Docker")
}

If you experience problems, try installing Docker as outlined in the steps below.

Installing Docker

If you have a Mac, Linux (for example Ubuntu), or Windows 10 Professional / Education / Enterprise operating system, simply install Docker for your platform.

If you only have Windows 10 Home, the installation of Docker requires more steps:

  1. Follow the steps to install Windows Subsystem for Linux

  2. Follow the steps to install Docker on Windows Home
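Once Docker is installed, it is worth confirming that the docker executable is visible from the R session that will run tiktokr. The following sketch uses only base R; has_executable() is a hypothetical helper and not part of tiktokr.

```r
# Hypothetical helper (not part of tiktokr): TRUE if an executable with the
# given name can be found on the PATH visible to this R session.
has_executable <- function(name) {
  nzchar(unname(Sys.which(name)))
}

if (has_executable("docker")) {
  message("Docker was found at: ", Sys.which("docker"))
} else {
  message("Docker was not found on the PATH used by this R session.")
}
```

This only checks that the binary can be located; it does not verify that the Docker daemon is running.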

Initialize Docker

To run tiktokr with Docker, you need to call tk_auth() with docker = TRUE, which sets the necessary environment variable.

tk_auth(docker = TRUE)
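To see whether Docker mode is switched on in the current session, the environment variable can be read back with base R. This is a sketch under assumptions: it assumes the variable in question is TIKTOK_DOCKER (the name used in the puppeteer check above) and that any non-empty value means Docker mode is enabled. docker_mode_enabled() is a hypothetical helper, not part of tiktokr.

```r
# Hypothetical helper (not part of tiktokr). Assumption: TIKTOK_DOCKER is the
# variable that controls Docker mode, and a non-empty value means "enabled".
docker_mode_enabled <- function() {
  nzchar(Sys.getenv("TIKTOK_DOCKER"))
}

docker_mode_enabled()
```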

Now run tk_init() to set up the Docker container.

tk_init()

You can check whether your Docker container is working correctly by running the following code:

if(stringr::str_length(get_docker_signature("")) > 16){
  message("Signature successful. Your Docker container is working.")
} else {
  message("Unable to get signature")
}
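The check above treats a signature longer than 16 characters as valid. A small helper can make that rule explicit and also guard against NULL or NA results. This is a minimal sketch: signature_looks_valid() is a hypothetical name and not part of tiktokr, and the threshold of 16 is simply taken from the check above.

```r
# Hypothetical helper (not part of tiktokr): TRUE only for a single,
# non-missing string longer than `min_length` characters.
signature_looks_valid <- function(signature, min_length = 16) {
  is.character(signature) &&
    length(signature) == 1 &&
    !is.na(signature) &&
    nchar(signature) > min_length
}

signature_looks_valid(strrep("a", 20))  # TRUE
signature_looks_valid("")               # FALSE
signature_looks_valid(NULL)             # FALSE

# Possible use with tiktokr (not run here):
# signature_looks_valid(get_docker_signature(""))
```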

Now try running the examples below.

Examples

For every session involving tiktokr, you need to initialize the package with tk_init(). Once it is initialized, you can run as many queries as you want.

tk_init()

Get TikTok trends

Returns a tibble with trends.

# Trend
trends <- tk_posts(scope = "trends", n = 200)

Get TikToks from User

user_posts <- tk_posts(scope = "user", query = "willsmith", n = 50) 

Get TikToks from hashtag

Note: A hashtag query only provides 2k hits, which are not drawn randomly or based on the most recent post date but are rather some mix of recent and popular TikToks.

hash_post <- tk_posts(scope = "hashtag", query = "maincharacter", n = 100) 

Get TikToks from music id

Note: A hashtag query only provides 2k hits, which are not drawn randomly or based on the most recent post date but are rather some mix of recent and popular TikToks.

user_posts <- tk_posts(scope = "user", query = "willsmith", n = 50)
music_post <- tk_posts(scope = "music", query = user_posts$music_id[1], n = 100)

Download TikTok Videos

With tk_dwnl you can download videos from TikTok.

From Trends:

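library(magrittr)  # provides the %>% pipe used below

# The target directory must exist before downloading:
# fs::dir_create("video")
trends <- tk_posts(scope = "trends", n = 5)

# Split the tibble into one-row pieces and download each video to video/<id>.mp4
trends %>%
  split(seq_len(nrow(.))) %>%
  purrr::walk(~ tk_dwnl(.x$video_downloadAddr, paste0("video/", .x$id, ".mp4")))

# Clean up afterwards if the files are no longer needed:
# fs::dir_delete("video")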
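Because requests can be blocked (see the disclaimer at the top), a single failed download would stop the loop above. The sketch below shows one way to keep going and record which files succeeded. download_all() and fake_dwnl() are hypothetical helpers, not part of tiktokr; with the package one would presumably pass tk_dwnl as download_fun, assuming it takes a URL and a path as in the example above.

```r
# Hypothetical helper (not part of tiktokr): call download_fun(url, path) for
# every pair and record success or failure instead of stopping at an error.
download_all <- function(urls, paths, download_fun) {
  ok <- vapply(seq_along(urls), function(i) {
    tryCatch({
      download_fun(urls[[i]], paths[[i]])
      TRUE
    }, error = function(e) {
      message("Failed: ", paths[[i]], " (", conditionMessage(e), ")")
      FALSE
    })
  }, logical(1))
  data.frame(path = paths, ok = ok, stringsAsFactors = FALSE)
}

# Stand-in downloader for illustration only; it fails on an empty URL.
fake_dwnl <- function(url, path) {
  if (!nzchar(url)) stop("empty url")
  invisible(path)
}

result <- download_all(
  urls = c("https://example.com/a", ""),
  paths = c("video/1.mp4", "video/2.mp4"),
  download_fun = fake_dwnl
)
result$ok
```

The returned data frame makes it easy to retry only the rows where ok is FALSE.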