Open drguthals opened 5 years ago
I was considering the question: What's the best time during a possession (24 second shot clock period) to attempt a field goal? In other words, does shooting the ball right when you dribble up the court (defenders still unsettled) lead to more success than waiting and attacking the defense (defenders might be tired out)?
Still looking for other real-life/NBA applicable questions to answer if this one doesn't work out.
A few articles that might give us more ideas: https://towardsdatascience.com/insights-from-raw-nba-shot-log-data-and-an-exploration-of-the-hot-hand-phenomenon-1f1c6c63685a
I like that idea.
So let's treat this like a real research project:
Does shooting the ball right when you dribble up the court (defenders still unsettled) lead to more success than waiting and attacking the defense (defenders might be tired out)?
What's the best time during a possession (24 second shot clock period) to attempt a field goal?
You should be able to use this:
from azureml import Workspace
ws = Workspace(
workspace_id='9ecb1b49360047818035f516eff41782',
authorization_token='ATA7xvVKEODB3+gqsSdldMs2acnGHgmZl3MTXP7FfyRjkxt5T67HDiTDutmVC8w8PneRkRNByriOJBHkzzmQKw==',
endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['NBA_Shots_2018_19.csv']
frame = ds.to_dataframe()
Glenn and I discussed and came up with a few more specific questions/subtopics that we'd like to answer (besides 'what is the optimal time during the 24 sec shotclock period to attempt a field goal?') that still fall under the category of the shot clock:
Does a team's shot profile change over the course of a possession? (More layups or more jumpshots later in the possession, etc.)
Does the chance that you get fouled increase later in the shot clock?
Pandas Visualizations: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
To define the shot profile of a team, we want these graphs:
Time vs Success (scatter) Y: Success, Not Success X: shot clock second (1 to 24)
Location vs Success (scatter) Y: Success, not success X: the 6 areas
Type of Shot vs Location Y: 5 types of shots X: the 6 areas
Type of Shot vs Time
Type of Shot vs Success
Foul vs time
Foul vs Success
Foul vs Location
From there we will start to identify what might be indicators of what, and start running some ML models against the data
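Before plotting, several of the graphs listed above can be summarised as tables. This is a minimal sketch on made-up rows, not the real shot log. It assumes `outcome` and `fouled` are 0/1 flags and that `shot_clock` is in seconds; the shot type labels are placeholders.

```python
import pandas as pd

# Tiny made-up sample; the real rows would come from the shot log dataframe.
shots = pd.DataFrame({
    "shot_clock": [22.4, 22.9, 15.0, 15.5, 3.2, 3.8, 3.1, 10.0],
    "outcome":    [1, 0, 1, 1, 0, 0, 1, 0],   # assumed: 1 = made, 0 = missed
    "fouled":     [0, 0, 0, 1, 1, 0, 0, 0],   # assumed: 1 = fouled on the shot
    "shot_type":  ["layup", "jumpshot", "layup", "jumpshot",
                   "jumpshot", "jumpshot", "layup", "jumpshot"],
})

# Bucket the shot clock into whole seconds so shots can be grouped.
shots["clock_second"] = shots["shot_clock"].astype(int)

# Time vs Success and Foul vs Time, as rates per shot clock second.
by_second = shots.groupby("clock_second").agg(
    attempts=("outcome", "size"),
    success_rate=("outcome", "mean"),
    foul_rate=("fouled", "mean"),
)

# Type of Shot vs Time, as a table of counts.
type_by_second = pd.crosstab(shots["clock_second"], shots["shot_type"])

print(by_second)
print(type_by_second)
```

Either table can then be passed to `.plot()` once we know which ones look interesting.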
For location, it might make more sense to create a string for each area on the court and use the coordinates to determine where on the court the shot was taken: top right, top middle, top left, bottom right, bottom middle, bottom left.
Create a dataframe with the following columns:
Right: (y >= -24 and y < -5 and x < -27 and distance < 50) or (y <= 24 and y > 10 and x > 27 and distance < 50)
Left: (y <= 24 and y > 10 and x < -27 and distance < 50) or (y >= -24 and y < -5 and x > 27 and distance < 50)
Center: (y >= -5 and y <= 10 and x < -27 and distance < 50) or (y >= -5 and y <= 10 and x > 27 and distance < 50)
Three: three == 1
Then: sns.pairplot(newDataFrame)
I got this error when making the new data frame
shot_data["location"] = ""
shot_data.loc[((shot_data['location_y'] >= -24) & (shot_data['location_y'] < -5) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50)) | ((shot_data['location_y'] <= 24) & (shot_data['location_y'] > 10) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50)), 'location']='Right'
shot_data.loc[((shot_data['location_y'] <= 24) & (shot_data['location_y'] > 10) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50)) | ((shot_data['location_y'] >= -24) & (shot_data['location_y'] < -5) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50)), 'location']='Left'
shot_data.loc[((shot_data['location_y'] >= -5) & (shot_data['location_y'] <= 10) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50)) | ((shot_data['location_y'] >= -5) & (shot_data['location_y'] <= 10) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50)), 'location']='Center'
shot_data.loc[shot_data['three'] == 1, 'location'] = 'Three'
new_frame = shot_data.loc[:,['shot_type', 'shot_clock', 'outcome', 'three', 'fouled', 'distance', 'location']]
((shot_data['location_y'] >= -24) & (shot_data['location_y'] < -5) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50))
((shot_data['location_y'] <= 24) & (shot_data['location_y'] > 10) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50))
((shot_data['location_y'] <= 24) & (shot_data['location_y'] > 10) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50))
((shot_data['location_y'] >= -24) & (shot_data['location_y'] < -5) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50))
((shot_data['location_y'] >= -5) & (shot_data['location_y'] <= 10) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50))
((shot_data['location_y'] >= -5) & (shot_data['location_y'] <= 10) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50))
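The same labelling could probably be written more compactly with `numpy.select`. This is a sketch under assumptions, not a drop-in replacement. The column names are the ones used above, the demo rows are made up, and `label_location` is a hypothetical helper name. `np.select` takes the first matching condition, so 'Three' is listed first to mirror the way the last `.loc` assignment overrides the earlier ones.

```python
import numpy as np
import pandas as pd

def label_location(shots: pd.DataFrame) -> pd.Series:
    """Return 'Three', 'Right', 'Left', 'Center' or '' for each shot."""
    x, y = shots["location_x"], shots["location_y"]
    near = shots["distance"] < 50

    # Name each band once so every threshold appears in a single place.
    low = (y >= -24) & (y < -5)
    high = (y > 10) & (y <= 24)
    middle = (y >= -5) & (y <= 10)
    neg_x = x < -27
    pos_x = x > 27

    conditions = [
        shots["three"] == 1,
        near & ((low & neg_x) | (high & pos_x)),
        near & ((high & neg_x) | (low & pos_x)),
        near & middle & (neg_x | pos_x),
    ]
    labels = ["Three", "Right", "Left", "Center"]
    return pd.Series(np.select(conditions, labels, default=""), index=shots.index)

# Made-up rows, one per branch.
demo = pd.DataFrame({
    "location_x": [-30, 30, -30, 30, -30, 0],
    "location_y": [-10, 15, 15, -10, 0, 0],
    "distance":   [10, 12, 14, 9, 5, 3],
    "three":      [0, 0, 0, 0, 0, 1],
})
demo["location"] = label_location(demo)
print(demo)
```

Rows that match none of the conditions keep the empty string, the same as the `shot_data["location"] = ""` default.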
Graphs: We have the following columns:
For things that are interesting, it would be good to have graphs, something like:
Pie chart - Outcome 0 or 1
Pie chart - Shot type
df.whatever.value_counts().plot(kind='pie')
https://stackoverflow.com/questions/38337918/plot-pie-chart-and-table-of-pandas-dataframe
new_frame['shot_type'].value_counts().plot(kind='pie')
new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1])['shot_type'].value_counts().plot(kind='pie')
OR
only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1])
only_success['shot_type'].value_counts().plot(kind='pie')
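For what it's worth, the `groupby(...).apply(...)` step may not be needed just to keep the successful shots. A plain boolean mask should do the same filtering and leaves the index alone. This is a sketch on a made-up frame, and the shot type labels are placeholders.

```python
import pandas as pd

# Made-up stand-in for new_frame.
new_frame = pd.DataFrame({
    "shot_type": ["layup", "jumpshot", "layup", "dunk", "jumpshot"],
    "outcome":   [1, 0, 1, 1, 0],
})

# Keep only the rows where the shot went in.
only_success = new_frame[new_frame["outcome"] == 1]

success_counts = only_success["shot_type"].value_counts()
print(success_counts)

# With matplotlib installed, the pie chart is the same call as before:
# success_counts.plot(kind="pie")
```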
Additions since we last met: Using the same code structure as we did for location, I added a new column to our new dataframe called 'situation'. For example, if the shot clock was between 0 and 8 seconds, I categorized it as a desperate shot, and so on. (See screenshot below)
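A column like 'situation' could also be built with `pandas.cut` instead of one `.loc` line per bucket. In this sketch only the 0-8 second 'desperate' bucket comes from the comment above. The other bin edges and label names are hypothetical placeholders, since the real ones are in the screenshot.

```python
import pandas as pd

# Made-up shot clock readings in seconds.
new_frame = pd.DataFrame({"shot_clock": [0.0, 8.0, 9.0, 16.0, 17.0, 24.0]})

# Only "desperate" (0 to 8 seconds) is from the thread; the rest are placeholders.
bins = [0, 8, 16, 24]
labels = ["desperate", "mid_clock", "early"]

# include_lowest=True puts a reading of exactly 0 into the first bucket.
new_frame["situation"] = pd.cut(
    new_frame["shot_clock"], bins=bins, labels=labels, include_lowest=True
)
print(new_frame)
```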
Different Pie Charts made from our new data frame: (See comments for details)
Pretty much what I've done so far. Going to look over the graphs/make more before we meet to determine what to use in our model
Is there a way to filter rows based on more than one variable? For example, when outcome ==1 and shot_clock <=3 and distance > 25 ...
Try this:
only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1 and g['shot_clock'] <= 3 and g['distance'] > 25])
only_success['shot_type'].value_counts().plot(kind='pie')
Oh these graphs look super interesting! I would recommend writing 1-2 sentences on what you're noticing. It will help having it in English instead of having to look at the graph again.
For example, it seems like you're more likely to get a successful outcome if you attempt a layup. It would be interesting to know the percentage of total shots of each type.
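One way to get those percentages, sketched on made-up numbers: `value_counts(normalize=True)` gives each shot type's share of all attempts, and the mean of a 0/1 `outcome` column per group gives the success rate. The shot type labels are placeholders.

```python
import pandas as pd

# Made-up sample: 4 layups (3 made) and 6 jumpshots (2 made).
new_frame = pd.DataFrame({
    "shot_type": ["layup"] * 4 + ["jumpshot"] * 6,
    "outcome":   [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Share of all attempts taken as each shot type.
shot_share = new_frame["shot_type"].value_counts(normalize=True)

# Fraction of attempts of each type that went in.
success_rate = new_frame.groupby("shot_type")["outcome"].mean()

summary = pd.DataFrame({"share_of_shots": shot_share, "success_rate": success_rate})
print(summary)
```

Putting the two columns side by side helps separate "layups go in more often" from "layups are attempted more often".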
I got this error message when I used the code above to filter by multiple variables:
KeyError: 'outcome'
Can you share with me the entire error?
Maybe add parentheses?
only_success = new_frame.groupby('outcome').apply(lambda g: (g[(g['outcome'] == 1) and (g['shot_clock'] <= 3) and (g['distance'] > 25))])
only_success['shot_type'].value_counts().plot(kind='pie')
ValueError                                Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    688             try:
--> 689                 result = self._python_apply_general(f)
    690             except Exception:
~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    706         keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707                                                    self.axis)
    708
~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    189             group_axes = _get_axes(group)
--> 190             res = f(group)
    191             if not _is_indexed_like(res, group_axes):
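The traceback is cut off before the message, so this is a guess, but a ValueError at `f(group)` is consistent with using Python's `and` between two pandas comparisons. `and` asks each Series for a single True/False, which pandas refuses to give; the element-wise operator is `&`, with each comparison in its own parentheses. A small made-up reproduction:

```python
import pandas as pd

# Made-up group with the two columns used in the filter.
g = pd.DataFrame({"outcome": [1, 1, 0], "shot_clock": [2.0, 10.0, 1.0]})

# `and` tries to collapse a whole Series to one bool, which raises.
try:
    g[(g["outcome"] == 1) and (g["shot_clock"] <= 3)]
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# `&` compares row by row and returns a boolean mask.
mask = (g["outcome"] == 1) & (g["shot_clock"] <= 3)
print(g[mask])
```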
Or actually maybe it's like this (commas instead of and):
only_success = new_frame.groupby('outcome').apply(lambda g: (g[(g['outcome'] == 1), (g['shot_clock'] <= 3), (g['distance'] > 25))])
only_success['shot_type'].value_counts().plot(kind='pie')
I think the commas with no parentheses worked, but then I got this:
Try the one with commas and parentheses that I posted 9 minutes ago
Tried that one just now. It says invalid syntax
Can you print only_success to make sure we have that column?
Also, I think you have a mismatched parenthesis on the previous line. Can you copy that whole code into a comment so that we can make sure they are aligned?
yep we have only_success
I fixed that mismatched parenthesis and then it gave me: SyntaxError: unexpected EOF while parsing
Hmmm might be best to debug this when on the call then.
only_success['situation'].value_counts().plot(kind='pie', autopct='%1.0f%%')
only_success = new_frame.groupby('outcome').apply(lambda g: (g[(g['outcome'] == 1), (g['shot_clock'] <= 3), (g['distance'] > 25)]))
only_success['shot_type'].value_counts().plot(kind='pie')
only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1])
only_success['situation'].value_counts().plot(kind='pie', autopct='%1.0f%%')
only_success = new_frame.groupby('outcome').apply(lambda g: (g[(g['outcome'] == 1), (g['shot_clock'] <= 3), (g['distance'] > 25)]))
only_success['shot_type'].value_counts().plot(kind='pie')
only_success = new_frame.groupby('outcome').apply(lambda g: (g[g['outcome'] == 1], g[g['shot_clock'] <= 3], g[g['distance'] > 25]))
only_success['shot_type'].value_counts().plot(kind='pie')
only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1])
short_success = only_success.groupby('shot_clock').apply(lambda g: g[g['shot_clock'] < 3])
far_short_success = short_success.groupby('distance').apply(lambda g: g[g['distance'] > 25])
far_short_success.head()
far_short_success['shot_type'].value_counts().plot(kind='pie')
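The three chained `groupby(...).apply(...)` filters above can likely be collapsed into a single step. This is a sketch on made-up rows, using the `shot_clock < 3` cutoff from the chained version rather than the `<= 3` in the original question. `DataFrame.query` is shown as an alternative spelling of the same filter.

```python
import pandas as pd

# Made-up stand-in for new_frame.
new_frame = pd.DataFrame({
    "shot_type":  ["jumpshot", "layup", "jumpshot", "jumpshot", "jumpshot"],
    "shot_clock": [2.0, 1.5, 2.5, 12.0, 2.9],
    "outcome":    [1, 1, 0, 1, 1],
    "distance":   [27, 2, 30, 26, 26],
})

# One mask: made shots, under 3 seconds on the clock, from beyond 25.
mask = (
    (new_frame["outcome"] == 1)
    & (new_frame["shot_clock"] < 3)
    & (new_frame["distance"] > 25)
)
far_short_success = new_frame[mask]

# The same filter written as a query string.
same_rows = new_frame.query("outcome == 1 and shot_clock < 3 and distance > 25")

print(far_short_success)
```

Unlike the chained version, this keeps the original row index instead of adding a new index level per groupby.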
We need to start thinking about the data that we need to predict shots. What are your thoughts?