alwaysbegrowing / pre-processing


Create dataset for viral clip prediction algorithm #32

Closed gatesyp closed 3 years ago

gatesyp commented 3 years ago

We want to better predict which moments in a Twitch stream have the potential to go viral, and recommend those clips to users.

The first step is to have a dataset of viral clips, so we can test out different algorithms quickly.

We will likely start testing algorithms using only chat data, but having the dataset leaves the door open for other types of analyses too.

The goal is to create a list of clipObjects from the past 14 days:

```
startTime: int,
endTime: int,
views: int,
url: string,
videoId: string,
category: string,
streamer_username: string
```

And then we can use data science tools like pandas to perform different analyses and find shared properties among the top clips.
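To make that concrete, here is a minimal sketch of loading clipObjects into pandas and ranking them by views. The field names follow the schema above; the sample data and the `duration` column are made up for illustration.

```python
import pandas as pd

# Hypothetical clipObjects matching the schema above (sample data is invented)
clips = [
    {"startTime": 120, "endTime": 150, "views": 5400, "url": "https://clips.twitch.tv/a",
     "videoId": "v1", "category": "Just Chatting", "streamer_username": "streamer_a"},
    {"startTime": 30, "endTime": 55, "views": 800, "url": "https://clips.twitch.tv/b",
     "videoId": "v2", "category": "Valorant", "streamer_username": "streamer_b"},
]

df = pd.DataFrame(clips)
df["duration"] = df["endTime"] - df["startTime"]  # derived predictor, not in the schema

# Look for shared properties among the top clips, e.g. average views per category
top = df.sort_values("views", ascending=False).head(100)
print(top.groupby("category")["views"].mean())
```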

gatesyp commented 3 years ago

@pomkos any additional thoughts?

pomkos commented 3 years ago

Just that, for now, we should focus on the genres and languages our users are in, or related genres.

pomkos commented 3 years ago

@gatesyp asked me to write out my workflow and stuff. So here it is!

Workflow

  1. Import data
  2. Process data - generally involves extracting the relevant columns and reorganizing them into a pandas.DataFrame
  3. Clean data - figure out what to do with missing data, standardize values (lowercase strings, etc.), and decide what to do with outliers (are they typos, mis-measurements, etc.?). Code categorical variables; format sentences and punctuation.
  4. EDA (exploratory data analysis) - get a feel for the data with descriptive stats and visualizations
  5. Create preliminary model - test whatever basic idea I got from EDA and see what the results are.
  6. Feature engineering (you are here) - create new variables that might be useful (number of emojis used, user participation, etc.), then test them in the preliminary model or create a new preliminary model.
  7. Create model - use my tests to create an overarching script. Everything before this was done in Jupyter Lab; this step consolidates the Jupyter cells into a Python script.
  8. Deploy to PyPI - includes unit testing, etc.

Then start back at step 5 if any bugs or optimizations are needed. Tagging @Geczy since he was interested in the workflow.
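Steps 1-4 of the workflow above could be sketched roughly like this, assuming chat data arrives as a list of dicts (the field names here are made up for illustration):

```python
import pandas as pd

# 1. Import data (hypothetical chat records)
raw = [
    {"user": "Alice", "message": "POGGERS!!", "timestamp": 10.5},
    {"user": "bob", "message": None, "timestamp": 11.0},
]

# 2. Process: extract the relevant columns into a pandas.DataFrame
df = pd.DataFrame(raw)[["user", "message", "timestamp"]]

# 3. Clean: standardize (lowercase strings) and handle missing data
df["user"] = df["user"].str.lower()
df = df.dropna(subset=["message"])

# 4. EDA: descriptive stats to get a feel for the data
print(df.describe(include="all"))
```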

Goals

(goal: 10k viewers)

Prequel: From the data POV, my goal is to ultimately have enough data to create an ML algorithm. To do this, there are a couple of milestones that need to be achieved.

| Out.var   | Pred.var1 | Pred.var2 | ... | Pred.varn |
|-----------|-----------|-----------|-----|-----------|
| Good Clip | str       | float     | ... | int       |
| Bad Clip  | str       | float     | ... | int       |
| ...       | ...       | ...       | ... | ...       |
| Bad Clip  | str       | float     | ... | int       |
| Bad Clip  | str       | float     | ... | int       |

Where the outcome variable is a label, and the predictor variables are strings (probably categorical variables), integers, floats, etc. of different types. The dataset size is probably in the thousands of clips; 1000 is a good start, and the model will update as we get more data.
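An illustrative sketch of that table: one row per clip, an outcome label plus mixed-type predictor variables. The predictor names (`chat_rate`, `num_emojis`) are invented placeholders, not actual pillaralgos outputs.

```python
import pandas as pd

# One row per clip: outcome label + mixed-type predictors (all names invented)
df = pd.DataFrame({
    "label":      ["good", "bad", "bad"],                      # Out.var
    "category":   ["Just Chatting", "Valorant", "Valorant"],   # str predictor
    "chat_rate":  [12.3, 1.1, 0.4],                            # float predictor
    "num_emojis": [45, 3, 0],                                  # int predictor
})

# Categorical predictors get coded before modeling (workflow step 3)
X = pd.get_dummies(df.drop(columns="label"))
y = df["label"]
print(X.shape, y.value_counts().to_dict())
```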

If we adapt my workflow to the overall goal of an ML algorithm, right now we are at step 5. The four algorithms are preliminary models that, based on user feedback, can easily be turned into calculators to give us new predictor variables. The brain algorithm is a super rudimentary model that uses those predictor variables. I'll leave it at that for now, but I'm willing to go into it more if anyone wants to discuss.

Algorithm Options

Prequel: "good clips" are defined as "viral".

So far we have 4 algos in the pillaralgos:

Future algos (algorithms to test that might make good predictor variables) that I think might be worth trying:

I'll add more as I think of them. Since we are going the viral route, all the CCCs with > 1000 views (or some similar threshold) can be analyzed using these algos, and we can find out what makes them popular. Then we can compare them with < 1000-view clips: > 1000 views can be "good" clips and < 1000 "bad" clips. Then we can run something like a logistic regression and predict where any particular clip will fall.
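A hedged sketch of that labeling + logistic regression idea: clips with > 1000 views become "good" (1), the rest "bad" (0). The predictors and the synthetic data are invented purely to show the shape of the approach, not a real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
chat_rate = rng.uniform(0, 20, n)     # hypothetical predictor variable
num_emojis = rng.integers(0, 50, n)   # hypothetical predictor variable
# Fake views that correlate with chat rate, just to have a learnable signal
views = (chat_rate * 100 + rng.normal(0, 300, n)).clip(0)

X = np.column_stack([chat_rate, num_emojis])
y = (views > 1000).astype(int)        # 1 = "good" clip, 0 = "bad" clip

model = LogisticRegression().fit(X, y)
# Estimated probability that a new clip lands in the "good" bucket
print(model.predict_proba([[15.0, 30]])[0, 1])
```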