blueharmony opened this issue 5 years ago
Hmm interesting, don't think I'm getting those same results.
One thing to make sure of is that the team vectors are getting loaded correctly. Try printing out the values for team1Vector and team2Vector, make sure they're different, and do a sanity check (e.g., Duke's 2019 team vector should show 29 wins) to confirm the right ones have been loaded.
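A minimal sketch of that check (the dict-style lookup, the team-name keys, and allow_pickle=True are assumptions about how the TeamVectors files are stored, so adjust as needed):

```python
import numpy as np

# Sketch of the sanity check; assumes each TeamVectors .npy holds a pickled
# dict mapping team name -> stats vector (hence allow_pickle=True / .item()).
vectors_2019 = np.load('Data/PrecomputedMatrices/TeamVectors/2019TeamVectors.npy',
                       allow_pickle=True).item()

team1Vector = vectors_2019['Duke']
team2Vector = vectors_2019['North Carolina']
print('Duke 2019:          ', team1Vector)
print('North Carolina 2019:', team2Vector)

# The two vectors should differ, and Duke's 2019 vector should show 29 wins
# in whichever slot holds the win count.
assert not np.array_equal(team1Vector, team2Vector), 'team vectors are identical'
assert np.any(team1Vector), 'Duke vector is all zeros -- 2019 data not loaded'
```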
I checked the 2019 team vectors and they are all zeros, while the other years have real numbers. So I redownloaded the Kaggle and sports-reference data and fixed the csv headers for 2019 to be identical to 2018, but I'm still getting the same results.
But if I run the same data for 2018 instead of 2019, I get reasonable results! The .npy file for 2019 is smaller than the others, which may mean it is missing data. What would cause that?
Here is my TeamVectors directory listing, showing the 2019 file at 149948 bytes:
kcason@ubuntu:~/Desktop/newer/March-Madness-ML/Data/PrecomputedMatrices/TeamVectors$ ll
-rw-rw-r-- 1 kcason kcason 185340 Mar 20 08:50 2014TeamVectors.npy
-rw-rw-r-- 1 kcason kcason 185397 Mar 20 12:36 2015TeamVectors.npy
-rw-rw-r-- 1 kcason kcason 185473 Mar 20 12:37 2016TeamVectors.npy
-rw-rw-r-- 1 kcason kcason 184366 Mar 20 12:38 2017TeamVectors.npy
-rw-rw-r-- 1 kcason kcason 184208 Mar 20 12:39 2018TeamVectors.npy
-rw-rw-r-- 1 kcason kcason 149948 Mar 20 12:42 2019TeamVectors.npy
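A quick way to compare what each year's file actually contains (again assuming each .npy is a pickled dict of team name → vector):

```python
import numpy as np

# Compare the per-year TeamVectors files: number of teams and how many
# vectors are all zeros. Assumes each file is a pickled dict.
for year in range(2014, 2020):
    path = f'Data/PrecomputedMatrices/TeamVectors/{year}TeamVectors.npy'
    data = np.load(path, allow_pickle=True).item()
    n_zero = sum(1 for v in data.values() if not np.any(v))
    print(f'{year}: {len(data)} teams, {n_zero} all-zero vectors')
```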
When you run DataPreprocessing, are you entering 2018 or 2019 for the question "What year do you have data until?"? It should be 2019. If you're already doing that, maybe try testing the getSeasonData method with a random team and the year 2019, and see whether any of the pandas dataframes are getting filled or whether it's immediately returning.
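Something along these lines might work; the import path, argument order, and the team id (1181 should be Duke in the Kaggle data) are guesses, so adjust to whatever getSeasonData actually takes:

```python
# Hypothetical probe of getSeasonData; the import path, arguments,
# and the Kaggle team id are assumptions.
from DataPreprocessing import getSeasonData

vector = getSeasonData(1181, 2019)
print(vector)
# An all-zero result here means the 2019 csvs never make it into the
# pandas dataframes, which would match the empty TeamVectors.
```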
It is immediately returning, as you said. I added a print() to see when that path is hit, and all predictions are the same.
So are all the dataframes empty? Not sure why they would be because the csvs for MMStats_2019 and RatingStats_2019 are in the Data folder.
Not all of them, but a lot are returning as empty:
Empty DataFrame
Columns: [Rk, School, G, W, L, W.L., SRS, SOS, W.1, L.1, W.2, L.2, W.3, L.3, Tm., Opp., X., MP, FG, FGA, FG., X3P, X3PA, X3P., FT, FTA, FT., ORB, TRB, AST, STL, BLK, TOV, PF]
Index: []
This is printed for every empty team.index and team_rating.index.
I should add that this output comes from some debugging lines I added; it isn't printed by your version. I didn't want to cause any confusion about why it was being printed.
Hmm so I don't think I'm able to reproduce the same problems. When I run DataPreprocessing.py with 2019 as the answer to the input prompt, I get a dataframe with 353 rows with team info and a final xTrain of (131372, 17).
Maybe to be more specific, which years are returning empty for you?
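If it's easier, something like this shows at a glance which years' csvs load as empty (the file names are based on the MMStats_2019 / RatingStats_2019 files mentioned above; the Data/ paths are an assumption about the layout):

```python
import pandas as pd

# List row counts per year for the two csv families mentioned above.
for year in range(2014, 2020):
    for prefix in ('MMStats', 'RatingStats'):
        df = pd.read_csv(f'Data/{prefix}_{year}.csv')
        print(f'{prefix}_{year}.csv: {len(df)} rows, empty={df.empty}')
```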
Mine is returning Shape of xTrain: (126047, 17). I'll get back to you shortly about the dataframe rows and years.
The 2019 data I added this morning should get you up to 131372. But even regardless of those changes, that still doesn't explain why the team vectors for 2019, like OP was describing, are all zeros.
Gotcha. I was pulling from the sources listed in your README. I'll try to use your data.
Ah gotcha, yeah sorry that's my bad. I should've documented that there are some slight changes I had to make when I was copying over from BasketballReference. I think the only one was that I didn't include the topmost line that BasketballReference includes.
Using your data and then running DataPreprocessing.py has fixed the issue for me. I don't know about the author, but the same issue was happening to me when pulling the new data from the listed sources.
Hmm okay, do you mind checking what the differences are between the format of the data you pulled and the ones in my Data folder? Would be good to add info on that in the Readme.
Yes! I'm about to step out, but I'll try to include some automation to correct the data format differences.
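In case it's useful, a rough sketch of that clean-up: dropping the extra topmost row that the BasketballReference exports include so that the second row (the real column names) becomes the csv header. The file paths here are placeholders.

```python
import csv

# Drop the decorative first row from a BasketballReference csv export so the
# second row (the real column names) becomes the header.
def strip_top_header(src_path, dst_path):
    with open(src_path, newline='') as src:
        rows = list(csv.reader(src))
    with open(dst_path, 'w', newline='') as dst:
        csv.writer(dst).writerows(rows[1:])

strip_top_header('raw/MMStats_2019.csv', 'Data/MMStats_2019.csv')
```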
With your latest changes pulled I now get the expected results. I still don't understand why my previous results were all zeroes, but at least it wasn't just me. :-)
I am running the version you posted just recently, with the pipenv stuff and Python 3.7.2. When I run MarchMadness.py, the sample results for the East bracket always choose team2 as the winner, and the probabilities for all games are the same. Is this expected behavior?
Here is my run:

kcason@ubuntu:~/Desktop/newer/March-Madness-ML$ pipenv run python MarchMadness.py
Using TensorFlow backend.
Shape of xTrain: (126047, 17)
Shape of yTrain: (126047,)
What year are these predictions for? 2019
Starting run #0:
Finished run #0: Accuracy = 0.7552995684183803 Time taken: 0:00:30.114437
Starting run #1:
Finished run #1: Accuracy = 0.7573940086316324 Time taken: 0:00:31.292216
Starting run #2:
Finished run #2: Accuracy = 0.7557438436151307 Time taken: 0:00:29.775752
Starting run #3:
Finished run #3: Accuracy = 0.7544427519675044 Time taken: 0:00:31.763139
Starting run #4:
Finished run #4: Accuracy = 0.7525069814673775 Time taken: 0:00:29.563869
The average accuracy is 0.755077430820005
Loaded the team vectors
/home/kcason/.local/share/virtualenvs/March-Madness-ML-XxyoxCtg/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Loaded the team vectors
/home/kcason/.local/share/virtualenvs/March-Madness-ML-XxyoxCtg/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Probability that NC Central wins over Duke: 0.5168695665038832
Probability that UCF wins over VA Commonwealth: 0.5168695665038832
Probability that Liberty wins over Mississippi St: 0.5168695665038832
Probability that St Louis wins over Virginia Tech: 0.5168695665038832
Probability that Belmont wins over Maryland: 0.5168695665038832
Probability that Yale wins over LSU: 0.5168695665038832
Probability that Minnesota wins over Louisville: 0.5168695665038832
Probability that Bradley wins over Michigan St: 0.5168695665038832
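For what it's worth, identical probabilities for every matchup are exactly what you'd expect if the 2019 team vectors are all zeros: every prediction feeds the model the same (zero) input, so it returns the same output. The sklearn FutureWarning is unrelated; it just means the default solver is changing in 0.22, and passing solver='lbfgs' to LogisticRegression silences it. A small guard like this (a hypothetical helper, not something in the repo) would surface the zero-vector problem right away:

```python
import numpy as np

# Hypothetical guard: refuse to predict on an all-zero team vector, since
# that makes every matchup look identical to the model.
def check_vector(name, vector):
    if not np.any(vector):
        raise ValueError(f'{name} has an all-zero team vector; '
                         'the data for that season was probably not loaded')
```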