UBC-MDS / NBA-Minutes-Predictor

Predicting minutes played for a player in an upcoming NBA game.
MIT License

Picking a Dataset #1

Status: Closed (jnederlo closed 4 years ago)

jnederlo commented 4 years ago

TO DO:

To start the discussion off, I think we can split ideas up into two general categories:

jnederlo commented 4 years ago

I'm more partial to a predicting/estimating problem. Some of my ideas are:

With the sports data, I would want to build a player classification system and/or predict some of a player's stats for upcoming games. For basketball specifically, I would like to build a model that predicts a player's on-court minutes in the upcoming game. Both NHL.com and NBA.com have accessible APIs, and there are lots of datasets.

I'm open to other datasets, though, if somebody has a good idea.

Zhang-Haipeng commented 4 years ago

Thanks, Jarvis (@jnederlo), for the good kick-start. Personally, I'd prefer a predicting/estimating problem as well, and the "on-court minutes" question sounds interesting to me. Besides that, I'm also looking into a financial dataset for building a model that predicts whether clients will repay their loans. The problem is that it's a Kaggle dataset. I'll check with the instructors whether that's an acceptable data source before we consider it as an option.

jnederlo commented 4 years ago

@Zhang-Haipeng That's not a bad idea either. I know how to use the Kaggle API to get datasets programmatically if it's of any use.
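
For reference, here is a rough sketch of what "programmatically" could look like with the official Kaggle command-line tool. It only builds the command and does not run it; the dataset slug is a hypothetical placeholder, not a dataset discussed in this thread, and actually running the command assumes `pip install kaggle` plus an API token in `~/.kaggle/kaggle.json`.

```python
import shlex
from typing import List


def kaggle_download_command(dataset_slug: str, dest_dir: str = "data") -> List[str]:
    """Build the Kaggle CLI call that downloads and unzips a dataset.

    dataset_slug is the "<owner>/<dataset-name>" part of the Kaggle URL.
    """
    return [
        "kaggle", "datasets", "download",
        "-d", dataset_slug,  # which dataset to fetch
        "-p", dest_dir,      # folder to download into
        "--unzip",           # extract the archive after downloading
    ]


if __name__ == "__main__":
    # Hypothetical slug, for illustration only; substitute the real one.
    cmd = kaggle_download_command("some-owner/some-dataset")
    print(shlex.join(cmd))
    # To actually run it (needs the kaggle package and an API token):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Keeping the command as a list makes it easy to hand to `subprocess.run` from a download script, so the data step stays reproducible.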

jacktan1 commented 4 years ago

Great ideas! Personally I would be more interested in a financial one (predicting loan repayments sounds pretty applicable to future employment). Otherwise, predicting NBA court times would be an amusing topic. So ya, either works for me!

Zhang-Haipeng commented 4 years ago

I checked with Firas, and it seems the Kaggle dataset I was looking into has no license, so it might not be an option for us.
Also, I agree with Jack that it's applicable to employment. Credit scoring is one of the major machine learning tasks in the financial industry, IMO, so maybe I'll take some time tonight to see if I can find other similar datasets. But again, I'm totally fine with Jarvis' proposal, so if you guys want to just go with it, that's all good with me. One question though: isn't it more of a regression problem? More specifically, I see it as a time series analysis (which isn't really covered yet in this program), where we would use autoregressive models to predict future court minutes from historical court minutes.
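
To make the autoregressive idea concrete, here is a minimal sketch of the simplest case, an AR(1) model fit by ordinary least squares: the next game's minutes are modelled as `intercept + slope * previous game's minutes`. The function names and the sample numbers are made up for illustration; this is not the model the project ended up using, and a real attempt would likely use more lags and more features.

```python
from typing import List, Tuple


def fit_ar1(series: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y[t] = intercept + slope * y[t-1]."""
    if len(series) < 3:
        raise ValueError("need at least 3 observations")
    x = series[:-1]  # minutes in game t-1
    y = series[1:]   # minutes in game t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        # Constant history: no slope can be estimated, predict the mean.
        return mean_y, 0.0
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_next(series: List[float], intercept: float, slope: float) -> float:
    """One-step-ahead forecast from the most recent observation."""
    return intercept + slope * series[-1]


if __name__ == "__main__":
    # Made-up minutes for one hypothetical player, oldest game first.
    minutes = [31.5, 28.0, 33.2, 30.1, 29.4, 34.0, 32.3]
    intercept, slope = fit_ar1(minutes)
    print("forecast for next game:", predict_next(minutes, intercept, slope))
```

This treats the problem as regression on lagged values, which is why it needs nothing beyond the player's own history of minutes.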

jnederlo commented 4 years ago

Final note: FiveThirtyEight has good, clean datasets. @Zhang-Haipeng, regarding the loan repayment dataset, my only concern is making sure the data preparation step isn't too large.

Zhang-Haipeng commented 4 years ago

Let's focus on sports then. I've tried several datasets. They are either without a license or too heavy to work with. @jnederlo Do we have any specific dataset in mind?

jnederlo commented 4 years ago

We will use the NBA boxscore dataset from Kaggle: data.

We can access the data without authentication: here.