Marcus-Friis / thesis

Characterizing the structure of communication on TikTok using frequent subgraph mining, graph embeddings, and sentiment analysis.

TikTok Stitch Graph 🎵

With the introduction of TikTok, Instagram Reels, YouTube Shorts, and similar platforms, short-form video has become one of the main mediums for public discourse. This poses an interesting challenge, because understanding these videos analytically requires analyzing their visual, auditory, and textual components. Furthermore, platforms such as TikTok allow users to respond to other videos through “stitches”, creating a network-like structure in which videos respond to other videos. Fully grasping the nature of such a network requires understanding both the topological structure of the TikTok stitch network and the individual contents of each video. That is what this project aims to explore: using video content, how can we improve our understanding of how people communicate using stitches? To this end, we use a combination of image processing, NLP methods, and network analysis.


Project pipeline

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#4f98ca', 'edgeLabelBackground':'#2b2b2b', 'nodeTextColor': '#ffffff', 'background': '#1e1e1e'}}}%%

graph TD

    %% Define each step in the pipeline

    A[🤓 Start: Set up TikTok API Access]
    B[📝Collect Hashtag Videos using get_hashtag.py]
    C[🤏Extract edges using get_edges.py]
    D[🤏Extract targets using get_targets.py]
    E[📎Combine sources & targets using compose_vertices_files.py]
    F[(File storage)]
    G[🔽Download Videos using download_tiktok_videos.py]
    H[📈Perform Graph Analysis using graph_analysis.py <br> obtaining metrics and plots]
    I[✂Split videos into stitchee and stitcher]
    J[🗣️Get Transcriptions]
    K[🙈Sentiment Analysis]

    %% I dont know the alphabet
    %% KLMNOPQRTSUVXYZ

    %% Connect the steps together

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
    G --> I
    I --> J
    J --> K
   %% Add hrefs to the steps

    click A href "https://github.com/Marcus-Friis/thesis/tree/cleanup?tab=readme-ov-file#tiktok-api" "click A"

    click B href "https://github.com/Marcus-Friis/thesis/tree/cleanup?tab=readme-ov-file#get-hashtag-stitches" "click B"

    click C href "https://github.com/Marcus-Friis/thesis?tab=readme-ov-file#stitch-edge-scraper" "click C"

    click D href "https://github.com/Marcus-Friis/thesis?tab=readme-ov-file#extract-targets" "click D"

    click E "https://github.com/Marcus-Friis/thesis?tab=readme-ov-file#script-1-stitcher-and-stitchee-data-processing" "click E"

    click F "https://github.com/Marcus-Friis/thesis/tree/main/data" "click F"

    click G "https://github.com/Marcus-Friis/thesis?tab=readme-ov-file#helper-script-for-quickly-downloading-videos" "click G"

    click H "https://github.com/Marcus-Friis/thesis?tab=readme-ov-file#graph-analysis" "click H"

    click I "https://github.com/Marcus-Friis/thesis?tab=readme-ov-file#split-videos-into-sticher-and-stichee" "click I"

    %%click J "" "click J"

TikTok API

Getting started

The TikTok API documentation describes how to use the API. We have created a notebook to explore it.

Set up TikTok API access

To access the TikTok API, you need a client_key and a client_secret. These go in a /secrets/ directory. To set it up, copy /secrets_template/ as follows:

cp -r secrets_template secrets

Then fill in the /secrets/tiktok.json file with your client_key and client_secret.
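
As a rough illustration of how a script might read these credentials, here is a minimal sketch. The function name load_tiktok_secrets is hypothetical, and the JSON key names client_key and client_secret are an assumption based on the names used above; check secrets_template/tiktok.json for the actual layout.

```python
import json
from pathlib import Path


def load_tiktok_secrets(path="secrets/tiktok.json"):
    """Read the client key and secret from the secrets file (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as f:
        secrets = json.load(f)
    # Assumed key names; adjust to whatever the template file actually uses.
    missing = [k for k in ("client_key", "client_secret") if not secrets.get(k)]
    if missing:
        raise ValueError(f"Missing values in {path}: {', '.join(missing)}")
    return secrets["client_key"], secrets["client_secret"]
```

Failing early with a clear message when a value is empty is usually easier to debug than an authentication error from the API later on.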

Get Hashtag stitches

The script get_hashtag.py scrapes TikTok videos that are stitches and carry a specific hashtag.

Usage

python src/get_hashtag.py HASHTAG_NAME

The script scrapes videos between 2024-05-01 and 2024-05-31 and saves them as {hashtag}.json in the data/ directory.

Example
python src/get_hashtag.py cooking

Stitch Edge Scraper

The script get_edges.py scrapes stitch relationships between TikTok videos using previously downloaded data.

Usage

python src/get_edges.py HASHTAG_NAME [START_INDEX]

The script processes videos from {hashtag}_.json and outputs the edges (stitcher -> stitchee) to {hashtag}_edges.txt.
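
To show what can be done with the resulting edge list, the sketch below counts how often each stitchee was stitched. The exact layout of {hashtag}_edges.txt is not described here, so the one-edge-per-line format with a comma separator is purely an assumption, and read_edges and stitch_counts are hypothetical helpers, not functions from this repository.

```python
from collections import Counter


def read_edges(lines, sep=","):
    """Parse 'stitcher<sep>stitchee' lines into (stitcher, stitchee) pairs."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: exactly two fields per line, stitcher first.
        stitcher, stitchee = (part.strip() for part in line.split(sep, 1))
        edges.append((stitcher, stitchee))
    return edges


def stitch_counts(edges):
    """Count how many times each stitchee video was stitched (its in-degree)."""
    return Counter(stitchee for _, stitchee in edges)


sample = ["v1,v9", "v2,v9", "v3,v7"]  # made-up video IDs
print(stitch_counts(read_edges(sample)).most_common(1))  # [('v9', 2)]
```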

Repair Mode

To repair incomplete edges:

python src/get_edges.py HASHTAG_NAME repair

Extract targets

The get_targets.py script processes a list of TikTok video URLs to extract stitchee video IDs—the videos that have been stitched by other users (stitchers). It then collects detailed data about these stitchee videos over specified date intervals using the TikTok API. The aggregated data is saved into a JSON file for further analysis.

Arguments

Usage

python get_targets.py cooking 5000

This will retrieve information for the stitchee videos associated with the cooking hashtag, processing 5,000 video IDs per API request batch.
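
The batching described above can be sketched as follows. This is an illustration of the general idea only; batched is a hypothetical helper and the IDs are placeholders, not the logic actually used in get_targets.py.

```python
def batched(ids, batch_size):
    """Yield consecutive slices of ids with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


video_ids = [str(n) for n in range(12_000)]  # placeholder IDs
sizes = [len(batch) for batch in batched(video_ids, 5000)]
print(sizes)  # [5000, 5000, 2000]
```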

Script for downloading TikTok videos

download_tiktok_videos.py is a simple script for quickly querying and downloading videos.

Usage

Run the script from the project root with the following command:

python src/download_tiktok_videos.py --start_date YYYYMMDD --end_date YYYYMMDD [options]
Arguments:

Example

python src/download_tiktok_videos.py --start_date 20240101 --end_date 20240110 --max_count 10 --keyword "stitch with"

Graph Analysis

The script graph_analysis.py produces various metrics for our graphs, covering both the video graph and the user graph.

Usage

python src/graph_analysis.py HASHTAG_NAME [CREATE_PLOTS] [DO_PROJECTION]

Arguments:

Alternatively, pass all in place of HASHTAG_NAME to run the script on every hashtag graph located in the vertices folder.

Example
python src/graph_analysis.py all true project

The above example performs graph analysis on all hashtags and their projections, and plots everything.

Graph Embeddings

The script graph_embed.py embeds graphs using various algorithms and provides additional options for graph manipulation, visualization, and clustering. It embeds all graphs created from the hashtags located in the vertices folder.

python src/graph_embed.py ALGORITHM [DIRECTED] [CREATE_PLOTS] [ADD_RANDOM] [CLUSTER] [SAVE_PLOT] [HELP]

Arguments:

Split videos into sticher and stichee

The script split_videos.py detects scene boundaries and splits each video into two parts: the "stitcher" and the "stitchee". It uses the AdaptiveDetector from scenedetect to find scene transitions and can apply custom thresholds for scene detection. If no significant scenes are detected, a default split at 5 seconds is applied. The 5-second default follows from the nature of stitches: a stitched clip can be at most 5 seconds long.
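
The fallback rule can be sketched in plain Python. Here cut_times stands in for the scene-cut timestamps a detector such as AdaptiveDetector would report. Picking the first cut within the 5-second window is an assumption made for this illustration; the actual selection rule in split_videos.py may differ, and choose_split_point is a hypothetical name.

```python
DEFAULT_SPLIT_SECONDS = 5.0  # a stitched clip lasts at most 5 seconds


def choose_split_point(cut_times, max_stitch=DEFAULT_SPLIT_SECONDS):
    """Return the time in seconds at which to split a stitch video.

    cut_times: scene-cut timestamps in seconds, in any order.
    Assumption: the first cut inside the stitch window separates the two parts.
    """
    candidates = [t for t in sorted(cut_times) if 0 < t <= max_stitch]
    # Fall back to the 5-second default when no usable cut was detected.
    return candidates[0] if candidates else max_stitch


print(choose_split_point([3.2, 8.0]))  # 3.2
print(choose_split_point([7.5]))       # 5.0
```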

Usage

python split_videos.py HASHTAG_NAME [START_INDEX]

Arguments

Video Transcription Script

The script get_transcriptions.py processes videos for a specific hashtag, transcribes their audio using the Whisper model, and saves the transcriptions to a text file. It can handle multiple videos at once and supports resuming from a specific index.

Usage

python get_transcriptions.py HASHTAG_NAME [START_INDEX]

Arguments

Sentiment Analysis for Video Transcriptions

The script sentiment.py analyzes the sentiment of video transcriptions associated with a given hashtag. It reads transcriptions, uses VADER Sentiment Analysis to classify them as positive, negative, or neutral, and outputs the results to a file.

Usage

python sentiment.py HASHTAG_NAME

Arguments

How it works

The script reads from the path specified above, extracting every third line as transcription text. It uses VADER Sentiment Analysis to score each transcription and classifies it as positive, negative, or neutral based on the compound sentiment score. The classifications are saved to ../data/hashtags/videos/sentiments/HASHTAG_NAME_sentiment.txt, with each line containing the video index and its sentiment.
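
A minimal sketch of the two steps described above follows. The ±0.05 cut-offs are the thresholds conventionally recommended for VADER's compound score and are an assumption here, as is the guess that the transcription is the third line of each three-line record. Both function names are hypothetical.

```python
def transcription_lines(lines, offset=2, step=3):
    """Keep every third line; offset=2 assumes the text is line 3 of each record."""
    return [line.strip() for line in lines[offset::step]]


def classify_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    # In practice the score would come from vaderSentiment, e.g.
    # SentimentIntensityAnalyzer().polarity_scores(text)["compound"].
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


records = ["0", "video_a.mp4", "I love this recipe",
           "1", "video_b.mp4", "this is awful"]  # made-up file contents
print(transcription_lines(records))  # ['I love this recipe', 'this is awful']
print(classify_compound(0.64), classify_compound(-0.46), classify_compound(0.0))
```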

Smaller helper scripts

This section contains scripts that are useful for specific tasks but are not significant enough to warrant their own dedicated sections.

Script 1: Stitcher and Stitchee Data Processing

The script compose_vertices_files.py processes video data related to the "stitcher" and "stitchee" relationships from the hashtag's sources, targets, and edges files. It updates the JSON files in the vertices folder by adding information about which videos are stitchers and stitchees.

Usage:
python compose_vertices_files.py HASHTAG_NAME

Script 2: Graph Utils

The functions in graph_utils perform centralization analysis, checking that the graph is undirected, simple, and has at least 3 vertices and 1 edge.

How to use:

# Example with degree_centralization only; the other centralization functions are called the same way.
import igraph as ig

from graph_utils import degree_centralization

G = ig.Graph.Full(10, directed=False)  # Example graph: complete graph on 10 vertices
degree_centrality = degree_centralization(G)  # To project first, call project_graph(G).
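
For intuition about what such a measure computes, here is a self-contained sketch of Freeman's degree centralization on a plain edge list. It is not the graph_utils implementation, whose details are not shown here; the function name and the normalization by (n - 1)(n - 2) are the textbook formulation and should be read as an assumption about what the repository does.

```python
def freeman_degree_centralization(n_vertices, edges):
    """Freeman degree centralization of an undirected simple graph.

    Vertices are integers 0..n_vertices-1; edges are (u, v) pairs.
    The result is 0.0 for a regular graph and 1.0 for a star.
    """
    if n_vertices < 3 or not edges:
        raise ValueError("need at least 3 vertices and 1 edge")
    seen = set()
    degree = [0] * n_vertices
    for u, v in edges:
        key = frozenset((u, v))
        if u == v or key in seen:
            raise ValueError("graph must be simple (no self-loops or multi-edges)")
        seen.add(key)
        degree[u] += 1
        degree[v] += 1
    max_degree = max(degree)
    # Sum of gaps to the most central vertex, divided by the star-graph maximum.
    return sum(max_degree - d for d in degree) / ((n_vertices - 1) * (n_vertices - 2))


star = [(0, leaf) for leaf in range(1, 5)]               # hub 0 with four leaves
complete = [(u, v) for u in range(4) for v in range(u + 1, 4)]
print(freeman_degree_centralization(5, star))            # 1.0
print(freeman_degree_centralization(4, complete))        # 0.0
```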

Script 3: TikTok Utils

The tiktok_utils script is not designed to be a standalone tool, but rather a utility module used across various scripts for TikTok-related data collection and scraping. It provides key functionalities such as interacting with the TikTok API and scraping stitch links using Selenium.

Several scripts make use of tiktok_utils to handle specific tasks:

Key Features:

This utility module streamlines video data collection and scraping for TikTok, serving as the core component behind these scripts.

(NOT UPDATED) How Argument Parsing Works with sys.argv

These scripts (graph_analysis.py & graph_embed.py) use a flexible, keyword-based argument system. Instead of requiring flags (e.g., --flag), users can input keywords directly, with the order being unimportant and case ignored (e.g., true or True both work). The scripts scan for specific keywords to activate different features or modes of operation.
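
The keyword scanning described above could look roughly like this. It is a sketch only: parse_keywords is a hypothetical helper, the keywords true and project are taken from the graph_analysis.py example, and the assumption that the hashtag comes first is ours. Since this section is marked as not updated, the scripts themselves may behave differently.

```python
def parse_keywords(args, keywords):
    """Return {keyword: bool} telling which known keywords appear in args."""
    given = {arg.lower() for arg in args}  # case is ignored
    return {kw: kw in given for kw in keywords}


# Equivalent of: python src/graph_analysis.py cooking PROJECT True
args = ["cooking", "PROJECT", "True"]  # what sys.argv[1:] would hold
hashtag, flags = args[0], parse_keywords(args[1:], ("true", "project"))
print(hashtag, flags)  # cooking {'true': True, 'project': True}
```

Because membership is tested on a set of lowercased arguments, both the order and the casing of the keywords are irrelevant, which matches the behavior described above.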