drguthals / Introduction-to-Data-Science

This repository contains an introduction to data science for high school students using Azure Notebooks, Python, and Azure.

Planning #9

Open drguthals opened 5 years ago

drguthals commented 5 years ago

We need to start thinking about the data that we need to predict shots. What are your thoughts?

dashtsai02 commented 5 years ago

I was considering the question: What's the best time during a possession (24 second shot clock period) to attempt a field goal? In other words, does shooting the ball right when you dribble up the court (defenders still unsettled) lead to more success than waiting and attacking the defense (defenders might be tired out)?

Still looking for other real-life/NBA applicable questions to answer if this one doesn't work out.

A few articles that might give us more ideas: https://towardsdatascience.com/insights-from-raw-nba-shot-log-data-and-an-exploration-of-the-hot-hand-phenomenon-1f1c6c63685a

https://www.ksl.com/article/46267592/the-cutting-edge-of-sports-data-6-projects-that-could-change-our-understanding-of-basketball

drguthals commented 5 years ago

I like that idea.

So let's treat this like a real research project:

Question

Does shooting the ball right when you dribble up the court (defenders still unsettled) lead to more success than waiting and attacking the defense (defenders might be tired out)?

Desired Conclusion

What's the best time during a possession (24 second shot clock period) to attempt a field goal?

Important Data

Methodology

dashtsai02 commented 5 years ago

http://savvastjortjoglou.com/nba-shot-sharts.html

drguthals commented 5 years ago

You should be able to use this:

from azureml import Workspace
ws = Workspace(
    workspace_id='9ecb1b49360047818035f516eff41782',
    authorization_token='ATA7xvVKEODB3+gqsSdldMs2acnGHgmZl3MTXP7FfyRjkxt5T67HDiTDutmVC8w8PneRkRNByriOJBHkzzmQKw==',
    endpoint='https://studioapi.azureml.net'
)
ds = ws.datasets['NBA_Shots_2018_19.csv']
frame = ds.to_dataframe()
dashtsai02 commented 5 years ago

Glenn and I discussed and came up with a few more specific questions/subtopics that we'd like to answer (besides 'what is the optimal time during the 24 sec shotclock period to attempt a field goal?') that still fall under the category of the shot clock:

Does a team’s shot profile change over the course of a possession? (More layups or more jump shots later in the possession, etc.)

Does the chance that you get fouled increase later in the shotclock?
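Editor's note: a minimal sketch of how the three shot-clock questions could be tabulated with pandas. It assumes `shot_clock` is the number of seconds remaining and that `outcome` and `fouled` are 0/1 flags; the sample rows and the 6-second window size are invented for illustration.

```python
import pandas as pd

# Tiny made-up sample; the column names match the thread's data frame,
# but the values are invented purely for illustration.
shots = pd.DataFrame({
    "shot_clock": [23, 21, 17, 14, 11, 9, 5, 3, 2, 1],
    "shot_type":  ["layup", "jump", "jump", "layup", "jump",
                   "jump", "layup", "jump", "jump", "jump"],
    "outcome":    [1, 0, 1, 1, 0, 0, 1, 0, 0, 1],
    "fouled":     [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Label each shot with the start of its 6-second window (0, 6, 12, 18).
# The window size is an arbitrary choice for this sketch.
shots["clock_bin"] = (shots["shot_clock"] // 6) * 6

# Make rate and foul rate per window: the mean of a 0/1 column is a proportion.
rates = shots.groupby("clock_bin")[["outcome", "fouled"]].mean()

# Shot profile per window: share of each shot type within each row.
profile = pd.crosstab(shots["clock_bin"], shots["shot_type"], normalize="index")

print(rates)
print(profile)
```

Each row of `rates` answers "how often do shots go in, and how often is the shooter fouled, in this part of the clock?", while each row of `profile` shows how the mix of shot types shifts.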

drguthals commented 5 years ago

Pandas Visualizations: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

To define the shot profile of a team, we want these graphs:

From there we will start to identify what might be indicators of what, and start running some ML models against the data

drguthals commented 5 years ago

For location, it might make more sense to create a string for each area on the court and use the coordinates to determine WHERE on the court the shot was taken: top right, top middle, top left, bottom right, bottom middle, or bottom left.

  1. Create a column of empty strings: df["Location"] = ""
  2. Change the column value to one of the areas depending on the location. Wrap each comparison in parentheses, because & binds more tightly than the comparison operators: df.loc[(df.y >= 0) & (df.y < 10) & (df.x > 0) & (df.x <= 10), 'Location'] = "topRight" Do that for each of the other five areas, changing the values in the x and y comparisons (the 0 and 10) and the "topRight" label.
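Editor's note: the two steps above can be sketched as follows. The 0 and 10 boundaries are the placeholder values from the comment, not real court dimensions, and the `topLeft` range is a hypothetical second area added only to show the pattern.

```python
import pandas as pd

# Made-up coordinates; the 0/10 boundaries are the placeholder values
# from the comment above, not real court dimensions.
df = pd.DataFrame({"x": [5, -5, 5, -5], "y": [5, 5, -5, -5]})
df["Location"] = ""

# Each comparison needs its own parentheses because & binds more
# tightly than >=, <, > and <=.
top_right = (df.y >= 0) & (df.y < 10) & (df.x > 0) & (df.x <= 10)
# Hypothetical mirror-image area, only to show a second assignment.
top_left = (df.y >= 0) & (df.y < 10) & (df.x > -10) & (df.x <= 0)

df.loc[top_right, "Location"] = "topRight"
df.loc[top_left, "Location"] = "topLeft"
print(df)
```

Naming each mask before using it in `df.loc` keeps the lines short and makes it easy to inspect one area at a time.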
drguthals commented 5 years ago

Create a dataframe with the following columns:

Right: (y >= -24 and y < -5 and x < -27 and distance < 50) or (y <= 24 and y > 10 and x > 27 and distance < 50)

Left: (y <= 24 and y > 10 and x < -27 and distance < 50) or (y >= -24 and y < -5 and x > 27 and distance < 50)

Center: (y >= -5 and y <= 10 and x < -27 and distance < 50) or (y >= -5 and y <= 10 and x > 27 and distance < 50)

Three: three == 1

Then: sns.pairplot(newDataFrame)

dashtsai02 commented 5 years ago

I got this error when making the new data frame

Screen Shot 2019-08-06 at 1 54 38 PM
dashtsai02 commented 5 years ago
shot_data["location"] = ""

shot_data.loc[((shot_data['location_y'] >= -24) & (shot_data['location_y'] < -5) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50)) | ((shot_data['location_y'] <= 24) & (shot_data['location_y'] > 10) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50)), 'location']='Right'

shot_data.loc[((shot_data['location_y'] <= 24) & (shot_data['location_y'] > 10) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50)) | ((shot_data['location_y'] >= -24) & (shot_data['location_y'] < -5) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50)), 'location']='Left'

shot_data.loc[((shot_data['location_y'] >= -5) & (shot_data['location_y'] <= 10) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50)) | ((shot_data['location_y'] >= -5) & (shot_data['location_y'] <= 10) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50)), 'location']='Center'

shot_data.loc[shot_data['three'] == 1, 'location'] = 'Three'

new_frame = shot_data.loc[:,['shot_type', 'shot_clock', 'outcome', 'three', 'fouled', 'distance', 'location']]

Checking each condition:

((shot_data['location_y'] >= -24) & (shot_data['location_y'] < -5) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50))
((shot_data['location_y'] <= 24) & (shot_data['location_y'] > 10) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50))
((shot_data['location_y'] <= 24) & (shot_data['location_y'] > 10) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50))
((shot_data['location_y'] >= -24) & (shot_data['location_y'] < -5) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50))
((shot_data['location_y'] >= -5) & (shot_data['location_y'] <= 10) & (shot_data['location_x'] < -27) & (shot_data['distance'] < 50))
((shot_data['location_y'] >= -5) & (shot_data['location_y'] <= 10) & (shot_data['location_x'] > 27) & (shot_data['distance'] < 50))
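Editor's note: one way to check each condition is to give every zone a named mask and count how many rows it selects. This is a sketch on invented rows; the thresholds are copied from the code above, with the shared `distance < 50` test factored out, which is logically equivalent.

```python
import pandas as pd

# Invented rows; column names follow the thread's shot_data frame.
shot_data = pd.DataFrame({
    "location_x": [-30, 30, -30, 30, -30, 0],
    "location_y": [-10, 15, 15, -10, 0, 0],
    "distance":   [12, 14, 14, 12, 8, 3],
})

x, y = shot_data["location_x"], shot_data["location_y"]
near = shot_data["distance"] < 50

# One named mask per zone, using the same thresholds as the comment above.
zones = {
    "Right":  near & (((y >= -24) & (y < -5) & (x < -27)) | ((y <= 24) & (y > 10) & (x > 27))),
    "Left":   near & (((y <= 24) & (y > 10) & (x < -27)) | ((y >= -24) & (y < -5) & (x > 27))),
    "Center": near & (y >= -5) & (y <= 10) & ((x < -27) | (x > 27)),
}

# How many rows does each condition select?
counts = {name: int(mask.sum()) for name, mask in zones.items()}

# Rows that no zone claims would keep an empty 'location' string.
unassigned = int((~(zones["Right"] | zones["Left"] | zones["Center"])).sum())
print(counts, unassigned)
```

A zone with a count of zero, or a large `unassigned` number, is a hint that one of the comparisons points the wrong way.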
drguthals commented 5 years ago

Graphs: We have the following columns:

For things that are interesting, it would be good to have graphs something like:

When three == 1:

Pie chart - Outcome 0 or 1

When outcome == 1:

Pie chart - Shot type

When outcome == 0:

Pie chart - Shot type

drguthals commented 5 years ago

df.whatever.value_counts().plot(kind='pie')

https://stackoverflow.com/questions/38337918/plot-pie-chart-and-table-of-pandas-dataframe

drguthals commented 5 years ago

Count of every shot_type value, whether outcome is 0 or 1:

new_frame['shot_type'].value_counts().plot(kind='pie')

Count of every shot_type value, only when outcome is 1:

new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1])['shot_type'].value_counts().plot(kind='pie')

OR

only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1])
only_success['shot_type'].value_counts().plot(kind='pie')
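Editor's note: the same "only when outcome is 1" selection can also be written with a plain boolean mask, without `groupby`. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented sample with the thread's column names.
new_frame = pd.DataFrame({
    "shot_type": ["layup", "jump", "jump", "layup", "dunk"],
    "outcome":   [1, 0, 1, 1, 0],
})

# Keep only made shots with a boolean mask; no groupby is needed.
only_success = new_frame[new_frame["outcome"] == 1]

# Counts per shot type among made shots.
success_counts = only_success["shot_type"].value_counts()
print(success_counts)
```

`success_counts.plot(kind='pie')` would then draw the chart. Because the mask keeps the original flat index, `outcome` stays available as an ordinary column for any further filtering.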

Google Queries

General Steps:

dashtsai02 commented 5 years ago

Additions since we last met: Using the same code structure as we did for location, I added a new column to our new dataframe called 'situation'. For example, if the shotclock was between 0-8 seconds I categorized it as a desperate shot etc... (See screenshot below)

Screen Shot 2019-08-07 at 1 32 56 AM
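Editor's note: the screenshot's code is not reproduced in the thread. As a compact alternative to one `.loc` assignment per category, `pd.cut` can label shot-clock ranges in a single call. Only the 0-8 second "desperate" bucket comes from the comment above; the other bin edges and label names are hypothetical.

```python
import pandas as pd

new_frame = pd.DataFrame({"shot_clock": [0, 3, 8, 9, 16, 20, 24]})

# Only the 0-8 second "desperate" bucket comes from the comment above;
# the other two edges and labels are hypothetical placeholders.
new_frame["situation"] = pd.cut(
    new_frame["shot_clock"],
    bins=[0, 8, 16, 24],
    labels=["desperate", "normal", "early"],
    include_lowest=True,  # so that a shot clock of exactly 0 is kept
)
print(new_frame)
```

By default each bin includes its right edge, so a shot clock of exactly 8 lands in the first bucket and 9 in the second.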

Different Pie Charts made from our new data frame: (See comments for details)

Screen Shot 2019-08-07 at 1 33 11 AM Screen Shot 2019-08-07 at 1 33 22 AM Screen Shot 2019-08-07 at 1 33 32 AM Screen Shot 2019-08-07 at 1 33 40 AM

Pretty much what I've done so far. Going to look over the graphs/make more before we meet to determine what to use in our model

dashtsai02 commented 5 years ago

Is there a way to filter rows based on more than one variable? For example, when outcome ==1 and shot_clock <=3 and distance > 25 ...

drguthals commented 5 years ago

Try this:

only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1 and g['shot_clock'] <= 3 and g['distance'] > 25])
only_success['shot_type'].value_counts().plot(kind='pie')

Oh these graphs look super interesting! I would recommend writing 1-2 sentences on what you're noticing. It will help having it in English instead of having to look at the graph again.

For example, it seems like you're more likely to get a successful outcome if you attempt a layup. It would be interesting to know the percentage of total shots of each type.
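Editor's note: a small sketch of the "percentage of total shots of each type" idea, next to the make rate per type so the two can be compared. The rows are invented; only the column names come from the thread.

```python
import pandas as pd

# Invented sample with the thread's column names.
new_frame = pd.DataFrame({
    "shot_type": ["layup", "layup", "jump", "jump", "jump", "jump", "dunk", "layup"],
    "outcome":   [1, 1, 0, 1, 0, 0, 1, 0],
})

# Share of all attempts taken as each shot type (sums to 100).
attempt_share = new_frame["shot_type"].value_counts(normalize=True) * 100

# Fraction of attempts of each type that went in.
make_rate = new_frame.groupby("shot_type")["outcome"].mean()

# Both Series are indexed by shot type, so they line up automatically.
summary = pd.DataFrame({"attempt_pct": attempt_share, "make_rate": make_rate})
print(summary.round(2))
```

A pie chart of made shots alone can make a frequently attempted shot look successful, so reading the two columns together is a useful sanity check.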

dashtsai02 commented 5 years ago

Got this error message when I used the code above to filter by multiple variables:

KeyError: 'outcome'

drguthals commented 5 years ago

Can you share with me the entire error?

drguthals commented 5 years ago

Maybe add parentheses?

only_success = new_frame.groupby('outcome').apply(lambda g: (g[(g['outcome'] == 1) and (g['shot_clock'] <= 3) and (g['distance'] > 25))])
only_success['shot_type'].value_counts().plot(kind='pie')
dashtsai02 commented 5 years ago

ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    688 try:
--> 689 result = self._python_apply_general(f)
    690 except Exception:

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707 self.axis)
    708

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    189 group_axes = _get_axes(group)
--> 190 res = f(group)
    191 if not _is_indexed_like(res, group_axes):

in (g)
----> 1 only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1 and g['shot_clock'] <= 3 and g['distance'] > 25])
      2 only_success['shot_type'].value_counts().plot(kind='pie')

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1477 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1478 .format(self.__class__.__name__))
   1479

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2656 try:
-> 2657 return self._engine.get_loc(key)
   2658 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'outcome'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in
----> 1 only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1 and g['shot_clock'] <= 3 and g['distance'] > 25])
      2 only_success['shot_type'].value_counts().plot(kind='pie')

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    699
    700 with _group_selection_context(self):
--> 701 return self._python_apply_general(f)
    702
    703 return result

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    705 def _python_apply_general(self, f):
    706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707 self.axis)
    708
    709 return self._wrap_applied_output(

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    188 # group might be modified
    189 group_axes = _get_axes(group)
--> 190 res = f(group)
    191 if not _is_indexed_like(res, group_axes):
    192 mutated = True

in (g)
----> 1 only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1 and g['shot_clock'] <= 3 and g['distance'] > 25])
      2 only_success['shot_type'].value_counts().plot(kind='pie')

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2925 if self.columns.nlevels > 1:
   2926 return self._getitem_multilevel(key)
-> 2927 indexer = self.columns.get_loc(key)
   2928 if is_integer(indexer):
   2929 indexer = [indexer]

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2657 return self._engine.get_loc(key)
   2658 except KeyError:
-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661 if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'outcome'
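Editor's note: the first exception in the traceback above, "The truth value of a Series is ambiguous", is what pandas raises when Series comparisons are joined with the Python keyword `and`. The trailing KeyError: 'outcome' appears to come from pandas retrying the `apply` with the grouping column left out after that first failure, though that reading is an inference from the traceback. A minimal sketch on invented rows, showing the failure and the element-wise `&` form:

```python
import pandas as pd

# Invented sample with the thread's column names.
new_frame = pd.DataFrame({
    "shot_type":  ["jump", "jump", "layup", "jump"],
    "shot_clock": [2, 3, 1, 10],
    "distance":   [27, 30, 2, 26],
    "outcome":    [1, 1, 1, 0],
})

# `and` asks Python for a single True/False from a whole Series,
# which pandas refuses to guess.
try:
    new_frame[(new_frame["outcome"] == 1) and (new_frame["shot_clock"] <= 3)]
    raised = False
except ValueError:
    raised = True

# `&` compares element by element; each comparison gets its own parentheses.
mask = (
    (new_frame["outcome"] == 1)
    & (new_frame["shot_clock"] <= 3)
    & (new_frame["distance"] > 25)
)
late_deep_makes = new_frame[mask]
print(raised, len(late_deep_makes))
```

Separating the conditions with commas does not combine them either; inside square brackets, commas build a tuple that pandas then tries to interpret as a key.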
drguthals commented 5 years ago

Or actually maybe it's like this (commas instead of and):

only_success = new_frame.groupby('outcome').apply(lambda g: (g[(g['outcome'] == 1), (g['shot_clock'] <= 3), (g['distance'] > 25))])
only_success['shot_type'].value_counts().plot(kind='pie')
dashtsai02 commented 5 years ago

I think the commas with no parentheses worked but then I got this:

Screen Shot 2019-08-07 at 1 40 46 PM
drguthals commented 5 years ago

Try the one with commas and parentheses that I posted 9 minutes ago

dashtsai02 commented 5 years ago

Tried that one just now. It says invalid syntax

drguthals commented 5 years ago

Can you print only_success to make sure we have that column?

Also, I think you have a mismatched parenthesis on the previous line. Can you copy that whole code into a comment so that we can make sure they are aligned?

dashtsai02 commented 5 years ago

yep we have only_success

I fixed that mismatched parenthesis and then it gave me: SyntaxError: unexpected EOF while parsing

drguthals commented 5 years ago

Hmmm might be best to debug this when on the call then.

dashtsai02 commented 5 years ago
only_success['situation'].value_counts().plot(kind='pie', autopct='%1.0f%%')
only_success = new_frame.groupby('outcome').apply(lambda g: (g[(g['outcome'] == 1), (g['shot_clock'] <= 3), (g['distance'] > 25)]))
only_success['shot_type'].value_counts().plot(kind='pie')
dashtsai02 commented 5 years ago
only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1])
only_success['situation'].value_counts().plot(kind='pie', autopct='%1.0f%%')
only_success = new_frame.groupby('outcome').apply(lambda g: (g[(g['outcome'] == 1), (g['shot_clock'] <= 3), (g['distance'] > 25)]))
only_success['shot_type'].value_counts().plot(kind='pie')
drguthals commented 5 years ago
only_success = new_frame.groupby('outcome').apply(lambda g: (g[g['outcome'] == 1], g[g['shot_clock'] <= 3], g[g['distance'] > 25]))
only_success['shot_type'].value_counts().plot(kind='pie')
drguthals commented 5 years ago
only_success = new_frame.groupby('outcome').apply(lambda g: g[g['outcome'] == 1])
short_success = only_success.groupby('shot_clock').apply(lambda g: g[g['shot_clock'] <= 3])
far_short_success = short_success.groupby('distance').apply(lambda g: g[g['distance'] > 25])
far_short_success.head()
far_short_success['shot_type'].value_counts().plot(kind='pie')
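Editor's note: the filter originally asked for (outcome == 1, shot_clock <= 3, distance > 25) can also be expressed in one step with `DataFrame.query`, which accepts `and` inside its expression string. A sketch on invented rows, with a check that it selects the same rows as filtering one condition at a time:

```python
import pandas as pd

# Invented sample with the thread's column names.
new_frame = pd.DataFrame({
    "shot_type":  ["jump", "jump", "layup", "jump", "jump"],
    "shot_clock": [2, 3, 1, 10, 3],
    "distance":   [27, 30, 2, 26, 28],
    "outcome":    [1, 1, 1, 0, 0],
})

# One expression covers all three conditions and keeps the original flat index.
far_short_success = new_frame.query(
    "outcome == 1 and shot_clock <= 3 and distance > 25"
)

# Same rows as filtering one condition at a time.
step = new_frame[new_frame["outcome"] == 1]
step = step[step["shot_clock"] <= 3]
step = step[step["distance"] > 25]

print(far_short_success.equals(step))
print(far_short_success["shot_type"].value_counts())
```

Unlike nested `groupby(...).apply(...)` calls, neither form adds extra index levels, so the result can be passed straight to `value_counts().plot(kind='pie')`.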