dec1costello / TOUR-Championship-Strokes-Gained-Analysis

Develop an expected strokes model to identify player performance for the 2011 PGA TOUR Championship
ai expected-value golf ml sports sports-analytics sports-data sports-programming sports-stats strokes-gained

Author: Declan Costello

TOUR Championship Analysis

Table of Contents
  1. Objectives
  2. Repo Overview
  3. Code Quality
  4. Dataset
  5. EDA
    1. SG per Round
    2. SG per Hole
    3. SG per Drive
  6. Expected Strokes Model
    1. Model Selection
    2. Training Architecture
    3. Fighting Bias
    4. Model Performance
  7. Applying xS Model
    1. SG per Shot Type
  8. Conclusion
  9. Future Roadmap

🎯 Objectives

Welcome to my analysis of the 2011 TOUR Championship at East Lake Golf Club. The primary objective of this project is to:

Develop an expected strokes model to identify player performance

I hope to contribute meaningful insights to the golf community through this project. Although the 2011 TOUR Championship took place over a decade ago and the tournament's rules have since changed, its extensive shot-level dataset remains a valuable resource. If you come across another complete shot-level dataset, I would greatly appreciate it if you could share it with me! I encourage you to check out the interactive JavaScript visuals on NBViewer and the Streamlit dashboard.

🌡 Repo Structure

This repo is organized as follows:

📂 TOUR-Championship-Strokes-Gained-Analysis
│
├── 📂 Data
├── CITATION
├── README.md
├── CODE_OF_CONDUCT.md
│
├── 📂 EDA
│   ├── EDA.ipynb
│   ├── SGperHole.ipynb
│   ├── SGperRound.ipynb
│   ├── SGperDrive.ipynb
│   ├── FeatureEngineering.ipynb
│   └── 📂 EDAUtils
│
├── 📂 Creating Model
│   ├── LazyPredict.ipynb
│   ├── PuttingModel.ipynb
│   ├── ApproachModel.ipynb
│   └── 📂 OptimizingUtils
│
├── 📂 Applying Model
│   ├── SGCreation.ipynb
│   └── SGAnalysis.ipynb
│
└── 📂 Streamlit Dashboard

⭐ Code Quality

This project uses a security linter, a code formatter, a type checker, and a code linter to keep the code reliable and maintainable. Together they flag security vulnerabilities, enforce a consistent coding style and type safety, and catch potential errors early in development.

| Security Linter | Code Formatting | Type Checking | Code Linting |
| --- | --- | --- | --- |
| [`bandit`](https://github.com/PyCQA/bandit) | [`ruff-format`](https://github.com/astral-sh/ruff) | [`mypy`](https://github.com/python/mypy) | [`ruff`](https://github.com/astral-sh/ruff) |

📊 Dataset

This dataset consists of shot-level data from the 2011 TOUR Championship. The TOUR Championship differs from other tournaments in that only the top 30 golfers compete and there is no cut after the second round, which yields consistent data on highly skilled golfers across all 4 rounds. Note that the dataset lacks data from the playoff, which is crucial for understanding the tournament's conclusion. Note also that landing in the rough at East Lake doesn't necessarily disadvantage a player: despite the challenge it presents, the ball could still have a favorable lie, and the golfer might have chosen that position strategically.

πŸ” EDA

I analyze the data, focusing on feature engineering to understand, clean, and refine the dataset. This process guides model selection and validates assumptions, while also uncovering insights through visualization. By addressing data quality and recognizing patterns early on, I establish a solid foundation for the project. For instance, exploring Strokes Gained (SG) at the round, hole, and drive levels helps us make assumptions for building a model to examine SG on a shot-level basis later.

Libraries: scipy, bokeh

SG per Round

I analyze the Strokes Gained distribution for each round of the Championship, revealing player performance trends during the tournament. This examination on a round-by-round basis helps uncover patterns in golfers' strategies and identifies challenges posed by difficult pin locations on the course.
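
A minimal sketch of the round-level calculation, assuming round-level Strokes Gained is taken as the field's average score for the round minus the player's score (the usual convention; the notebook may compute it differently). The player names and scores are made up for illustration and are not from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (player, round, score) rows; NOT real tournament data.
scores = [
    ("Player A", 1, 68), ("Player B", 1, 70), ("Player C", 1, 72),
    ("Player A", 2, 71), ("Player B", 2, 67), ("Player C", 2, 69),
]


def sg_per_round(rows):
    """Return {(player, round): field average for that round - player's score}."""
    by_round = defaultdict(list)
    for _, rnd, score in rows:
        by_round[rnd].append(score)
    field_avg = {rnd: mean(vals) for rnd, vals in by_round.items()}
    # Positive values mean the player beat the field average in that round.
    return {(player, rnd): field_avg[rnd] - score for player, rnd, score in rows}


sg = sg_per_round(scores)
for (player, rnd), value in sorted(sg.items()):
    print(f"{player} round {rnd}: {value:+.1f}")
```

Under this definition the values within a round always sum to zero, which is a quick sanity check on the calculation.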

[Figure: Event Scatter]

SG per Hole

In this analysis, I investigate the distribution of Strokes Gained for each hole of every round of the Championship. Notably, Mahan ties Haas in Strokes Gained on the 72nd hole, a significant moment in the tournament. However, Haas ultimately secured victory in the playoff!

[Figure: Event Scatter]

SG per Drive

Here I explore the distribution of Strokes Gained against Driving Distance Gained and Driving Accuracy Gained for each drive of the Championship. Both Driving Distance and Driving Accuracy are normalized per hole before totaling. My results align with Data Golf's Course Fit Tool.
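
The per-hole normalization can be sketched as follows. This assumes "normalized per hole" means subtracting the hole's mean driving distance from each drive before summing per player; the notebook may use a different normalization (for example a z-score). The drives below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical drives: (player, hole, distance in yards); NOT real data.
drives = [
    ("Player A", 1, 300.0), ("Player B", 1, 280.0),
    ("Player A", 2, 310.0), ("Player B", 2, 320.0),
]


def distance_gained(rows):
    """Total driving distance gained per player, relative to each hole's mean."""
    by_hole = defaultdict(list)
    for _, hole, dist in rows:
        by_hole[hole].append(dist)
    hole_mean = {hole: mean(vals) for hole, vals in by_hole.items()}

    totals = defaultdict(float)
    for player, hole, dist in rows:
        # Normalizing per hole removes the effect of hole length and layout.
        totals[player] += dist - hole_mean[hole]
    return dict(totals)


gained = distance_gained(drives)
print(gained)
```

Normalizing before totaling keeps a long par 5 from dominating the sum simply because every player hits driver there.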

[Figure: Event Scatter]

⛳ Expected Strokes Model

The Stacked Expected Strokes Model uses ensemble learning: it combines predictions from multiple base models to improve accuracy and robustness. I developed separate models for putting and approach scenarios, each with input features tailored to its situation, so that each model can focus on one aspect of play. The model will eventually enable a granular, shot-by-shot analysis of Strokes Gained, a significant departure from the hole-by-hole and round-by-round evaluations above, making it possible to evaluate each shot's impact on overall performance. I'm not concerned about data leakage here, since the model is trained on discrete data and predicts continuous values.
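
As a rough sketch of how separate putting and approach models might be combined behind one interface, the example below routes each shot to a model based on its lie. Everything here is hypothetical: the `Shot` fields, the lie labels, and the two toy formulas stand in for fitted models and do not reflect the repo's actual features or coefficients.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Shot:
    lie: str         # hypothetical labels, e.g. "Green", "Fairway", "Rough"
    distance: float  # distance to the hole (feet on the green, yards elsewhere)


def toy_putting_model(shot: Shot) -> float:
    # Placeholder for a fitted putting model; the coefficients are made up.
    return 1.0 + 0.02 * shot.distance


def toy_approach_model(shot: Shot) -> float:
    # Placeholder for a fitted approach model; the coefficients are made up.
    rough_penalty = 0.2 if shot.lie == "Rough" else 0.0
    return 2.0 + 0.005 * shot.distance + rough_penalty


def expected_strokes(
    shot: Shot,
    putting: Callable[[Shot], float] = toy_putting_model,
    approach: Callable[[Shot], float] = toy_approach_model,
) -> float:
    """Route the shot to the putting or approach model and return its xS."""
    model = putting if shot.lie == "Green" else approach
    return model(shot)


print(expected_strokes(Shot("Green", 10.0)))
print(expected_strokes(Shot("Rough", 100.0)))
```

Keeping the routing in one function means downstream Strokes Gained code can request an expected-strokes value without knowing which model produced it.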

Libraries: mlflow, optuna, scikit-learn

Model Selection

Although the training data is discrete, the model needs to produce continuous predictions, so I had to choose among regression models. As with all my models, I stratified the training and testing data before predicting. I started by using LazyPredict to compare a broad range of candidate models.
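
A stratified split can be sketched in plain Python as below. This assumes the stratification variable is the discrete target (strokes remaining to hole out); the README does not say which column is actually used, and the labels here are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.25, seed=0):
    """Split row indices so each label keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, test = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Hypothetical target: strokes remaining to hole out from each shot.
strokes_to_hole_out = [1] * 8 + [2] * 8 + [3] * 4
train_idx, test_idx = stratified_split(strokes_to_hole_out)
print(len(train_idx), len(test_idx))
```

In practice, scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose; the hand-rolled version above only makes the mechanics visible.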

Key Insight

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|-----------------------------------|-------|--------|-------|-------|
| GradientBoostingRegressor | 0.85 | 0.85 | 0.46 | 0.93 |
| HistGradientBoostingRegressor | 0.85 | 0.85 | 0.46 | 0.60 |
| LGBMRegressor | 0.85 | 0.85 | 0.47 | 0.14 |
| MLPRegressor | 0.84 | 0.84 | 0.48 | 5.23 |
| KNeighborsRegressor | 0.82 | 0.83 | 0.50 | 0.16 |
| AdaBoostRegressor | 0.82 | 0.83 | 0.50 | 0.49 |
| RandomForestRegressor | 0.82 | 0.82 | 0.50 | 3.46 |
| XGBRegressor | 0.82 | 0.82 | 0.50 | 0.24 |
| BaggingRegressor | 0.81 | 0.81 | 0.52 | 0.37 |
| NuSVR | 0.81 | 0.81 | 0.52 | 3.58 |
| ExtraTreesRegressor | 0.80 | 0.80 | 0.53 | 2.02 |
| SVR | 0.80 | 0.80 | 0.53 | 3.35 |

Training Architecture

After identifying the top-performing models, I combine them in a stacked ensemble. I used Optuna's CMA-ES sampler to search not only for the parameters of each model in the stack that minimize MAE, but also for the best preprocessing scalers, encoders, imputation, and feature selection methods. Every trial draws its offline training data from a Feast feature store. I used an MLflow model registry to track all Optuna trials, and Databricks to store production-ready models. Finally, I packaged the whole tuning process as a Poetry-built wheel called 'OptimizingUtils' for reproducibility.

[Figure: Event Scatter]

Fighting Bias

I attempted to limit bias by stratifying my training data and by using nested stratified cross-validation to prune biased trials. I plan to go a step further with bootstrapping, imbalanced-learning libraries, and Optuna's terminator, distribution, and multi-objective study features. I use SHAP and LIME to evaluate the bias that remains, which also improves our understanding of the model's predictive behavior. Below is a SHAP chart for the putting model's LGBMRegressor.

[Figure: SHAP chart for the putting model's LGBMRegressor]

Model Performance

This chart helps evaluate the model by showing how predicted values compare to actual ones and revealing patterns in prediction errors. The histogram below assesses if errors follow a normal distribution, crucial for reliable predictions.
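
The quantities behind such a chart can be sketched as follows: residuals, MAE, RMSE, and a simple binned count of residuals that approximates the histogram. The actual and predicted values are hypothetical and are not outputs of the repo's models.

```python
from collections import Counter
from math import floor, sqrt


def residuals(actual, predicted):
    """Actual minus predicted, one value per observation."""
    return [a - p for a, p in zip(actual, predicted)]


def mae(res):
    return sum(abs(r) for r in res) / len(res)


def rmse(res):
    return sqrt(sum(r * r for r in res) / len(res))


def histogram(res, width=0.5):
    """Count residuals per bin; each key is the left edge of its bin."""
    return Counter(floor(r / width) * width for r in res)


# Hypothetical strokes-to-hole-out values and model predictions.
actual = [1, 2, 2, 3]
predicted = [1.5, 2.0, 1.5, 3.5]

res = residuals(actual, predicted)
print("MAE:", mae(res))
print("RMSE:", round(rmse(res), 3))
print("Bins:", sorted(histogram(res).items()))
```

A roughly symmetric, bell-shaped spread of bin counts around zero is what the histogram check looks for; a skewed or multi-peaked shape suggests the model is systematically off for some kinds of shots.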

[Figure: predicted vs. actual values, with a histogram of prediction errors]

🏌🏻 Applying xS Model

Now that we have a stacked machine learning model that works on a shot-by-shot basis, we can use it to gain insight into golfer performance. Applying the trained model lets golf analysts, coaches, and players extract actionable insights, optimize strategies, refine skills, and make better-informed decisions on the course.

SG per Shot Type

Now that we have a reliable model, we can use it to identify a player's strengths and weaknesses by subtracting Expected Strokes (xS) from the result of each shot, which gives us true Strokes Gained (SG). The plots below display Woodland's Total SG and SG Percentile by shot type, showing his performance across different lies and distances.
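
A minimal sketch of the per-shot calculation, using the standard strokes-gained definition: expected strokes before the shot, minus expected strokes after it, minus the one stroke taken. Whether the repo's notebooks apply exactly this form is an assumption, and the xS values below are made up.

```python
def strokes_gained(xs_before: float, xs_after: float, strokes_taken: int = 1) -> float:
    """SG for one shot: xS at the start, minus xS at the end, minus strokes used."""
    return xs_before - xs_after - strokes_taken


# Hypothetical xS at each position on one hole:
# on the tee, after the drive, after the approach, and holed out (0.0).
xs_by_position = [4.0, 2.8, 1.5, 0.0]

sg_by_shot = [
    strokes_gained(before, after)
    for before, after in zip(xs_by_position, xs_by_position[1:])
]
print([round(value, 2) for value in sg_by_shot])
print(round(sum(sg_by_shot), 2))
```

The per-shot values sum to the xS on the tee minus the strokes actually taken, so shot-level SG stays consistent with the hole-level total.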

[Figures: Total SG and SG Percentile by shot type]

🎬 Conclusion

Looking back, I wish I had known about Strokes Gained during my time as a caddy. I've come to understand that Strokes Gained provides a more accurate reflection of performance on the hole, while SG Percentiles based on shot location offer deeper insights into a golfer's true abilities. I'm excited to explore more golf-related projects in the future.


πŸ—ΊοΈ Future Roadmap