Closed by gatesyp 3 years ago
@pomkos any additional thoughts?
Just that we should focus on the genres and languages that our users are in for now, or related genres.
@gatesyp asked me to write out my workflow and stuff. So here it is!
Workflow
pandas.DataFrame
Then start back at step 5 if any bugs or optimizations are needed. Tagging @Geczy cuz he was interested in the workflow.
Goals
(goal: 10k viewers)
Prequel: From the data POV, my goal is to ultimately have enough data to create an ML algorithm. To do this there are a couple of milestones that need to be achieved.
Out.var | Pred.var1 | Pred.var2 | ... | Pred.varn |
---|---|---|---|---|
Good Clip | str | float | ... | int |
Bad Clip | str | float | ... | int |
... | ... | ... | ... | ... |
Bad Clip | str | float | ... | int |
Bad Clip | str | float | ... | int |
Where the outcome variable is a label, and the predictor variables are strings (probably categorical variables), integers, floats, etc. The dataset will probably be in the thousands of clips. 1000 is a good start, and the model will update as we get more data.
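To make the table above concrete, here is a rough sketch of what a few rows could look like in code. The predictor names (`category`, `chat_rate`, `chatters`) and all the values are made up for illustration; the real predictors are still to be decided.

```python
from dataclasses import dataclass, asdict


@dataclass
class ClipRow:
    label: str        # outcome variable: "good" or "bad"
    category: str     # hypothetical categorical predictor
    chat_rate: float  # hypothetical float predictor
    chatters: int     # hypothetical int predictor


# Made-up example rows, one per clip
rows = [
    ClipRow("good", "fps", 4.2, 310),
    ClipRow("bad", "irl", 0.7, 45),
    ClipRow("bad", "fps", 1.1, 80),
]


def to_columns(rows):
    """Turn a list of rows into a dict mapping column name -> list of values."""
    records = [asdict(r) for r in rows]
    return {name: [rec[name] for rec in records] for name in records[0]}


columns = to_columns(rows)
```

A dict of equal-length lists like `columns` can be passed straight to `pandas.DataFrame(columns)` if we want it as a dataframe.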
If we adapt my workflow to the overall goal of an ML algorithm, right now we are at step 5. The four algorithms are preliminary models that, based on user feedback, can be easily turned into calculators to get us new predictor variables. The brain algorithm is a super rudimentary model that uses those predictor variables. I'll leave it at that for now, but I'm willing to go into it more if anyone wants to discuss.
Algorithm Options
Prequel: "good clips" are defined as "viral".
So far we have 4 algos in the pillaralgos:
number of words this user typed in the entire stream / largest number of words any single user typed in the entire stream
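A minimal sketch of that ratio, assuming chat arrives as `(username, text)` pairs for one whole stream. The function name and the input shape are my assumptions here, not how pillaralgos actually does it.

```python
from collections import Counter


def word_share_per_user(messages):
    """Return {username: words typed / largest word count of any single user}.

    `messages` is an iterable of (username, text) pairs for one whole stream.
    """
    counts = Counter()
    for user, text in messages:
        counts[user] += len(text.split())
    top = max(counts.values(), default=0)
    if top == 0:
        # Nobody typed any words, so avoid dividing by zero
        return {user: 0.0 for user in counts}
    return {user: n / top for user, n in counts.items()}


# Made-up chat log
chat = [
    ("alice", "that was insane"),
    ("bob", "lol"),
    ("alice", "clip it"),
    ("carol", ""),
]
scores = word_share_per_user(chat)
```

The most talkative user always scores 1.0 and everyone else lands between 0 and 1, so the value is comparable across streams of different sizes.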
But the intent is to expand and refine that definition of "good clips".
Future algos (algorithms to test that might make good predictor variables) that I think might be a good idea:
I'll add more as I think of them. Since we are going the viral route, all the CCCs that have > 1000 views or something can be analyzed using these clips. Then we can find out what makes them popular, and compare with < 1000 view clips. > 1000 views can be "good" clips and < 1000 views "bad" clips. Then we can do something like logistic regression and predict where any particular clip will land.
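Roughly what that could look like, as a sketch only: label clips by the 1000 view cutoff, then fit a one-predictor logistic regression with plain gradient descent. The data is made up, the single predictor stands in for whatever score an algo produces, and in practice we would probably reach for a library instead of hand-rolling the fit.

```python
import math

VIEW_THRESHOLD = 1000  # cutoff from the comment above; the exact value is still open


def label_clip(views):
    """Return 1 for a "good" clip (more than 1000 views), 0 for a "bad" one."""
    return 1 if views > VIEW_THRESHOLD else 0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(good) = sigmoid(w * x + b) by batch gradient descent on one predictor."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data: (predictor score from one algo, view count)
clips = [
    (0.1, 120), (0.2, 300), (0.3, 90), (0.4, 800),
    (0.6, 1500), (0.7, 4000), (0.8, 2500), (0.9, 12000),
]
xs = [score for score, _ in clips]
ys = [label_clip(views) for _, views in clips]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 0.15 + b)   # predicted chance of "good" for a low score
p_high = sigmoid(w * 0.85 + b)  # predicted chance of "good" for a high score
```

With real data we would have several predictors per clip, but the idea is the same: the fitted model gives each new clip a probability of being "good".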
We want to better predict which moments in a Twitch stream have the potential to go viral, and recommend those clips to users.
The first step is to have a dataset of viral clips, so we can test out different algorithms quickly.
We will likely start testing algorithms using only chat data, but having the dataset leaves the door open for other types of analyses too.
The goal is to create a list of clipObjects from the past 14 days:
And then we can use data science tools like pandas to perform different analyses, and find shared properties amongst the top clips.
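As a rough illustration of that kind of analysis, here is a sketch that keeps clips from the past 14 days, takes the top ones by views, and counts a shared property. The field names (`id`, `views`, `game`, `created_at`), the reference date, and the data are all hypothetical and not the actual clipObject schema.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def recent_top_clips(clips, now, days=14, top_n=2):
    """Keep clips created in the last `days` days, then return the top_n by views."""
    cutoff = now - timedelta(days=days)
    recent = [c for c in clips if c["created_at"] >= cutoff]
    return sorted(recent, key=lambda c: c["views"], reverse=True)[:top_n]


def shared_property_counts(clips, key):
    """Count how often each value of `key` appears among the given clips."""
    return Counter(c[key] for c in clips)


# Arbitrary reference date and made-up clips
now = datetime(2021, 6, 15, tzinfo=timezone.utc)
clips = [
    {"id": "a", "views": 5200, "game": "fps", "created_at": now - timedelta(days=2)},
    {"id": "b", "views": 900, "game": "irl", "created_at": now - timedelta(days=5)},
    {"id": "c", "views": 8000, "game": "fps", "created_at": now - timedelta(days=30)},
    {"id": "d", "views": 3100, "game": "fps", "created_at": now - timedelta(days=13)},
]

top = recent_top_clips(clips, now)
games = shared_property_counts(top, "game")
```

Clip "c" has the most views but falls outside the 14 day window, so it is dropped before ranking.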