aisobran / Adv-ML-NFL


Temporal prep of play by play #2

Open aisobran opened 8 years ago

aisobran commented 8 years ago

Added initial logic for creating a temporal based data frame for ANNs.

It creates a data frame in which each row combines the data for plays n-x, n-x+1, ..., n-2, n-1, n for learning.
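A minimal sketch of the idea, not the code in the repository: it assumes pandas, a hypothetical helper name `temporal_pivot`, a hypothetical label column `playType`, and that the features come from the `length` plays before play n while the label comes from play n itself. It also ignores game and drive boundaries, which a real pivot would probably need to respect.

```python
import pandas as pd

def temporal_pivot(plays: pd.DataFrame, length: int, label_col: str):
    """Hypothetical helper: pair each play n with the `length` plays before it."""
    frames = []
    for lag in range(length, 0, -1):
        # shift(lag) moves the row for play n-lag down onto the row for play n.
        frames.append(plays.shift(lag).add_suffix(f"_n-{lag}"))
    features = pd.concat(frames, axis=1)
    # The first `length` rows have no complete history, so drop them.
    features = features.iloc[length:].reset_index(drop=True)
    labels = plays[label_col].iloc[length:].reset_index(drop=True)
    return features, labels

# Toy data for illustration only; the column names are assumptions.
plays = pd.DataFrame({
    "down": [1, 2, 3, 1, 2],
    "togo": [10, 7, 2, 10, 4],
    "playType": ["run", "pass", "pass", "run", "pass"],
})
train, label = temporal_pivot(plays, length=2, label_col="playType")
print(train.columns.tolist())
print(label.tolist())
```

Each row of `train` holds the columns of the two preceding plays side by side, and `label` holds the play type of the play that followed them.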

aisobran commented 8 years ago

The subset that is probably most relevant to prediction is: 'year', 'week', 'possession', 'yardsToGoalLine', 'quarter', 'down', 'togo', 'quarterTime', 'shotgun', 'complete', 'distance', 'direction', 'yardsGained', 'intercepted', 'noHuddle', 'touchdown', 'fumble', 'sacked', 'spiked', 'runDirection'

It's in temporalPivot.py

aadithya93 commented 8 years ago

Are some plays missing in the play by play data? I see that there are only about 193k rows in the play by play data, while the total number of plays in the raw data is about 275k.

aisobran commented 8 years ago

Yeah, special teams plays (field goal, kickoff, punt), injury updates, time warnings, substitutions, and some other plays were removed. Only offensive run and pass plays are included.

aadithya93 commented 8 years ago

Okay. Doesn't a field goal affect the decision between a pass and a run play? I feel that there may be other offensive plays missing as well. For example, the third play in the first match of the raw data, PIT vs TEN, is an incomplete pass play, but it is not in the play by play data.

aisobran commented 8 years ago

I'll double check why it's missing; it could have been filtered out. I'll post back in a few.

Usually a field goal/punt will not affect the decision, because they happen on fourth down, where the offense has no more plays left and will turn the ball over to the other team. The data may be useful, but I had to weigh the benefits against the extra time for parsing.

aadithya93 commented 8 years ago

Okay sure. Makes sense. Thanks

aisobran commented 8 years ago

Good catch, the parser was filtering that play out. I'll push a new parsed set once it's done processing.

anandvij3 commented 8 years ago

I'll start working on the Logistic Regression Algorithm.

aisobran commented 8 years ago

If you look at temporalPivot.py, all the code from line 76 and above prepares the data for learning: it label-encodes the categorical variables, then one-hot encodes them. It also prepares a temporal pivot in which all the data from the n-x plays is combined with the label of play n for training (train and label in the code). The result can be fed into any algorithm. For subsetting, you can either filter with pandas or use the built-in function in the code, selectedTeamAndWeek(data, team, week number).
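As a rough illustration of the label-encode-then-one-hot step, here is a sketch that uses pandas and NumPy as stand-ins (`pd.factorize` for the label encoding, an identity-matrix lookup for the one-hot step). It is not the code in temporalPivot.py, and the toy values in the `direction` column are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
plays = pd.DataFrame({
    "direction": ["left", "right", "middle", "left"],
    "down": [1, 2, 3, 1],
})

# Step 1: label-encode. With sort=True the categories are sorted
# (left, middle, right) and each value is replaced by its integer index.
codes, categories = pd.factorize(plays["direction"], sort=True)

# Step 2: one-hot encode the integer codes. Row i of the identity matrix
# is the one-hot vector for code i.
one_hot = pd.DataFrame(
    np.eye(len(categories), dtype=int)[codes],
    columns=[f"direction_{c}" for c in categories],
)

# Replace the categorical column with its one-hot columns.
encoded = pd.concat([plays.drop(columns="direction"), one_hot], axis=1)
print(encoded)
```

The one-hot step matters because the integer codes alone would suggest an ordering between categories that does not exist.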

aisobran commented 8 years ago

Pushed the new code. temporalPivot.py now contains an object that handles all the reading, subsetting, and prepping of the data. You can see usage in annAnalysis.py, lines 6-8, 16, and 21.

If you need additional functionality you can either add it or ask me to.

aadithya93 commented 8 years ago

That's very useful, thanks. I am trying support vector machines instead of Markov models. I have issues downloading packages for HMM on Windows. I'll try SVM and then figure out how to get HMM working.

aadithya93 commented 8 years ago

I experimented with SVM under different parameter settings. The best performance I got was 56% on test data and 67% on train data.

aisobran commented 8 years ago

I get similar results with the ANN. I haven't used test data yet, as I'm setting up a framework for that, but I get at best 67.8% on training data. I added some of the results to annStructureResults.txt, which shows some of the ANN structures and their associated training accuracies.

The optimization is actually still running so I'll have more results later.

Interestingly, the Rectifier activation function is in all ann structures.

https://en.wikipedia.org/wiki/Rectifier_(neural_networks)

aisobran commented 8 years ago

So I set up a testing framework in temporalPivot.py. You pass in a model (it needs fit and predict) with optional arguments for year, length (the pivot length, i.e. how many plays back), and dataSplit (the percentage for the train/test split). It goes through all the teams for that year, trains and predicts with the model, and outputs the results to the console like so:

PIT,0.5
TEN,0.572327044025
ATL,0.556074766355
MIA,0.528662420382
KC,0.577540106952
BAL,0.621890547264
CAR,0.568965517241
PHI,0.533707865169
DEN,0.579881656805
CIN,0.606060606061
CLE,0.632653061224
MIN,0.60248447205
HOU,0.616071428571
NYJ,0.573033707865
IND,0.566502463054
JAC,0.51724137931
NO,0.641509433962
DET,0.671875
DAL,0.515337423313
TB,0.555555555556
ARI,0.590425531915
SF,0.531073446328
NYG,0.565445026178
WAS,0.598837209302
SEA,0.515789473684
STL,0.531468531469
GB,0.473684210526
CHI,0.517647058824
NE,0.68023255814
BUF,0.506097560976
OAK,0.604938271605
SD,0.658653846154

By the way, those are the results for the ANN run I just did. I'm putting all the results in the ann folder.

I think I will also start including the training accuracy and the actual distribution of plays for a team just as a benchmark.
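For anyone who wants the shape of such a test runner without opening the file, here is a minimal sketch in plain Python. The names `run_tests`, `MajorityModel`, and `team_data` are hypothetical, the year and pivot length arguments are left out, and the split is assumed to be chronological (earlier plays train, later plays test). The toy data is for illustration only and is unrelated to the results above.

```python
from collections import Counter

class MajorityModel:
    """Stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def run_tests(model, team_data, data_split=0.8):
    """Hypothetical runner: team_data maps team -> (features, labels)."""
    results = {}
    for team, (X, y) in team_data.items():
        cut = int(len(X) * data_split)  # earlier plays train, later plays test
        model.fit(X[:cut], y[:cut])
        predicted = model.predict(X[cut:])
        actual = y[cut:]
        correct = sum(p == a for p, a in zip(predicted, actual))
        results[team] = correct / len(actual)
        print(f"{team},{results[team]}")  # same "TEAM,accuracy" console format
    return results

# Toy features (down, togo) and labels, made up for this sketch.
team_data = {
    "PIT": ([[1, 10], [2, 7], [3, 2], [1, 10], [2, 5]],
            ["run", "run", "pass", "run", "pass"]),
    "TEN": ([[1, 10], [2, 8], [3, 8], [1, 10], [2, 3]],
            ["pass", "pass", "run", "pass", "pass"]),
}
results = run_tests(MajorityModel(), team_data, data_split=0.8)
```

Any object exposing `fit` and `predict` could be passed in place of `MajorityModel`, which is the only requirement the runner places on the model.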

aisobran commented 8 years ago

For usage, check out line 150 in annAnalysis.py.

aadithya93 commented 8 years ago

Added results and plot

aisobran commented 8 years ago

Added additional functionality to the test runner. It will now output Team, Test Accuracy, Train Accuracy, and Play Distribution.
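The play distribution works as a benchmark because a model that always predicts a team's most common play type already scores that play's share of the distribution. A small sketch of that calculation, with a hypothetical helper name and toy labels:

```python
from collections import Counter

def play_distribution(labels):
    """Hypothetical helper: fraction of plays of each type."""
    counts = Counter(labels)
    total = len(labels)
    return {play: count / total for play, count in counts.items()}

# Toy labels for illustration only.
labels = ["pass", "run", "pass", "pass"]
distribution = play_distribution(labels)

# Accuracy of always guessing the most common play type.
baseline = max(distribution.values())
print(distribution, baseline)
```

A test accuracy close to `baseline` would suggest the model has learned little beyond the team's overall run/pass tendency.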

anandvij3 commented 8 years ago

I have pushed the logistic regression results.