alwaysbegrowing / pre-processing


Create dataset for viral clip prediction algorithm #32

Closed gatesyp closed 3 years ago

gatesyp commented 3 years ago

We want to better predict which moments in a Twitch stream have the potential to go viral, and recommend those clips to users.

The first step is to have a dataset of viral clips, so we can test out different algorithms quickly.

We will likely start testing algorithms using only chat data, but having the dataset leaves the door open for other types of analyses too.

The goal is to create a list of clipObjects from the past 14 days:

```
startTime: int,
endTime: int,
views: int,
url: string,
videoId: string,
category: string,
streamer_username: string
```

And then we can use data science tools like pandas to perform different analyses and find shared properties among the top clips.
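To make that concrete, here is a minimal sketch of loading clipObjects into pandas and ranking them by views. The field names follow the schema above; the sample data and the `duration` column are made up for illustration.

```python
import pandas as pd

# Hypothetical clipObjects matching the schema above (sample data is invented)
clips = [
    {"startTime": 120, "endTime": 150, "views": 5400, "url": "https://clips.twitch.tv/a",
     "videoId": "v1", "category": "Just Chatting", "streamer_username": "streamer_a"},
    {"startTime": 30, "endTime": 55, "views": 800, "url": "https://clips.twitch.tv/b",
     "videoId": "v2", "category": "Valorant", "streamer_username": "streamer_b"},
]

df = pd.DataFrame(clips)
df["duration"] = df["endTime"] - df["startTime"]  # derived predictor, not in the schema

# Look for shared properties among the top clips, e.g. average views per category
top = df.sort_values("views", ascending=False).head(100)
print(top.groupby("category")["views"].mean())
```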

gatesyp commented 3 years ago

@pomkos any additional thoughts?

pomkos commented 3 years ago

Just that, for now, we should focus on the genres and languages our users are in, or related genres.

pomkos commented 3 years ago

@gatesyp asked me to write out my workflow and stuff. So here it is!

Workflow

  1. Import data
  2. Process data - generally involves extracting the relevant columns and reorganizing them into a pandas.DataFrame
  3. Clean data - figure out what to do with missing data, standardize values (lowercase strings, etc.), and decide what to do with outliers (are they typos, mis-measurements, etc.?). Code categorical variables; format sentences and punctuation.
  4. EDA (exploratory data analysis) - get a feel for the data with descriptive stats and visualizations
  5. Create preliminary model - test whatever basic idea I got from EDA and see what the results are.
  6. Feature engineering (you are here) - create new variables that might be useful (number of emojis used, user participation, etc.), then test them in the preliminary model or create a new preliminary model.
  7. Create model - use my tests to create an overarching script. Everything before this was done in Jupyter Lab; this step consolidates the Jupyter cells into a Python script.
  8. Deploy to PyPI - includes unit testing, etc.

Then start back at step 5 if any bugs or optimizations are needed. Tagging @Geczy since he was interested in the workflow.
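Steps 1-4 of the workflow above could be sketched roughly like this, assuming chat data arrives as a list of dicts (the field names here are made up for illustration):

```python
import pandas as pd

# 1. Import data (hypothetical chat records)
raw = [
    {"user": "Alice", "message": "POGGERS!!", "timestamp": 10.5},
    {"user": "bob", "message": None, "timestamp": 11.0},
]

# 2. Process: extract the relevant columns into a pandas.DataFrame
df = pd.DataFrame(raw)[["user", "message", "timestamp"]]

# 3. Clean: standardize (lowercase strings) and handle missing data
df["user"] = df["user"].str.lower()
df = df.dropna(subset=["message"])

# 4. EDA: descriptive stats to get a feel for the data
print(df.describe(include="all"))
```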

Goals

(goal: 10k viewers)

Prequel: From the data POV, my goal is to ultimately have enough data to create an ML algorithm. To do this, there are a couple of milestones that need to be achieved.

| Out.var   | Pred.var1 | Pred.var2 | ... | Pred.varn |
|-----------|-----------|-----------|-----|-----------|
| Good Clip | str       | float     | ... | int       |
| Bad Clip  | str       | float     | ... | int       |
| ...       | ...       | ...       | ... | ...       |
| Bad Clip  | str       | float     | ... | int       |
| Bad Clip  | str       | float     | ... | int       |

Where the outcome variable is a label, and the predictor variables are strings (probably categorical variables), integers, floats, etc. of different types. The dataset size is probably in the thousands of clips; 1000 is a good start, and the model will update as we get more data.
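An illustrative sketch of that table: one row per clip, an outcome label plus mixed-type predictor variables. The predictor names (`chat_rate`, `num_emojis`) are invented placeholders, not actual pillaralgos outputs.

```python
import pandas as pd

# One row per clip: outcome label + mixed-type predictors (all names invented)
df = pd.DataFrame({
    "label":      ["good", "bad", "bad"],                      # Out.var
    "category":   ["Just Chatting", "Valorant", "Valorant"],   # str predictor
    "chat_rate":  [12.3, 1.1, 0.4],                            # float predictor
    "num_emojis": [45, 3, 0],                                  # int predictor
})

# Categorical predictors get coded before modeling (workflow step 3)
X = pd.get_dummies(df.drop(columns="label"))
y = df["label"]
print(X.shape, y.value_counts().to_dict())
```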

If we adapt my workflow to the overall goal of an ML algorithm, right now we are at step 5. The four algorithms are preliminary models that, based on user feedback, can easily be turned into calculators to give us new predictor variables. The brain algorithm is a super rudimentary model that uses those predictor variables. I'll leave it at that for now, but I'm willing to go into it more if anyone wants to discuss.

Algorithm Options

Prequel: "good clips" are defined as "viral".

So far we have 4 algos in the pillaralgos:

Future algos (algorithms to test that might make good predictor variables) that I think might be worth trying:

I'll add more as I think of them. Since we are going the viral route, all the CCCs with > 1000 views (or some similar threshold) can be analyzed using these algos, and we can find out what makes them popular. Then we can compare them with < 1000-view clips: > 1000 views can be "good" clips and < 1000 "bad" clips. Then we can run something like a logistic regression and predict where any particular clip will fall.
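A hedged sketch of that labeling + logistic regression idea: clips with > 1000 views become "good" (1), the rest "bad" (0). The predictors and the synthetic data are invented purely to show the shape of the approach, not a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
chat_rate = rng.uniform(0, 20, n)     # hypothetical predictor variable
num_emojis = rng.integers(0, 50, n)   # hypothetical predictor variable
# Fake views that correlate with chat rate, just to have a learnable signal
views = (chat_rate * 100 + rng.normal(0, 300, n)).clip(0)

X = np.column_stack([chat_rate, num_emojis])
y = (views > 1000).astype(int)        # 1 = "good" clip, 0 = "bad" clip

model = LogisticRegression().fit(X, y)
# Estimated probability that a new clip lands in the "good" bucket
print(model.predict_proba([[15.0, 30]])[0, 1])
```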