GitHub | Kaggle | LinkedIn
Author: Declan Costello

TOUR Championship Analysis

Table of Contents

Objectives
Repo Structure
Code Quality
Dataset
EDA

SG per Round
SG per Hole
SG per Drive

Expected Strokes Model

Model Selection
Training Architecture
Fighting Bias
Model Performance

Applying xS Model

SG per Shot Type

Conclusion
Future Roadmap

🎯 Objectives

Welcome to my analysis of the 2011 TOUR Championship at East Lake Golf Club. The primary objective of this project is to:

Develop an expected strokes model to identify player performance

I hope to contribute meaningful insights to the golf community through this project. Although the 2011 TOUR Championship took place over a decade ago and the tournament's rules have since changed, its extensive shot-level dataset remains a valuable resource. If you happen to come across another complete shot-level dataset, I would greatly appreciate it if you could share it with me! I encourage you to check out the JavaScript visuals on NBViewer and the Streamlit Dashboard!

🌵 Repo Structure

This repo is organized as follows:

📂 TOUR-Championship-Strokes-Gained-Analysis 📍
│
├── 📂 Data
├── CITATION
├── README.md
├── CODE_OF_CONDUCT.md
│
├── 📂 EDA
│   ├── EDA.ipynb
│   ├── SGperHole.ipynb
│   ├── SGperRound.ipynb
│   ├── SGperDrive.ipynb
│   ├── FeatureEngineering.ipynb
│   └── 📂 EDAUtils
│
├── 📂 Creating Model
│   ├── LazyPredict.ipynb 
│   ├── PuttingModel.ipynb
│   ├── ApproachModel.ipynb 
│   └── 📂 OptimizingUtils
│
├── 📂 Applying Model
│   ├── SGCreation.ipynb 
│   └── SGAnalysis.ipynb
│
└── 📂 Streamlit Dashboard

⭐ Code Quality

This project uses a security linter, a code formatter, a type checker, and a code linter to keep the code robust. Together they help identify security vulnerabilities, maintain a consistent coding style, enforce type safety, and catch potential errors early in development, which improves the project's reliability and maintainability.

| Security Linter | Code Formatting | Type Checking | Code Linting |
| --- | --- | --- | --- |
| [`bandit`](https://github.com/PyCQA/bandit) | [`ruff-format`](https://github.com/astral-sh/ruff) | [`mypy`](https://github.com/python/mypy) | [`ruff`](https://github.com/astral-sh/ruff) |

📊 Dataset

This dataset consists of shot-level data from the PGA TOUR Championship. The TOUR Championship differs from other tournaments in that only the top 30 golfers compete and there is no cut after the second round, which ensures consistent data from highly skilled golfers across all 4 rounds. The dataset lacks data from the playoff that occurred, which is crucial for understanding the tournament's conclusion. It is also worth emphasizing that landing in the rough at East Lake doesn't necessarily disadvantage a player: despite the challenge it presents, the ball could still have a favorable lie, which the golfer might have chosen strategically.

🔍 EDA

I analyze the data, focusing on feature engineering to understand, clean, and refine the dataset. This process guides model selection and validates assumptions, while also uncovering insights through visualization. By addressing data quality and recognizing patterns early on, I establish a solid foundation for the project. For instance, exploring Strokes Gained (SG) at the round, hole, and drive levels helps us make assumptions for building a model to examine SG on a shot-level basis later.

SG per Round

I analyze the Strokes Gained distribution for each round of the Championship, revealing player performance trends during the tournament. This examination on a round-by-round basis helps uncover patterns in golfers' strategies and identifies challenges posed by difficult pin locations on the course.

Key Insights

Strokes Gained averages 0 in every round, which is the expected result for a metric measured against the field
Round 3 seemed to be the most chaotic, with significant variance in player performance throughout the day
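
To make the "mean of 0" insight concrete, here is a minimal sketch of round-level Strokes Gained, assuming SG per round is defined as the field's average score for that round minus the player's score. The player names and scores are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (player, round, score) rows; values are illustrative only.
scores = [
    ("Player A", 1, 68), ("Player B", 1, 70), ("Player C", 1, 72),
    ("Player A", 2, 71), ("Player B", 2, 69), ("Player C", 2, 67),
]

def sg_per_round(rows):
    """Return {(player, round): field average for that round minus the player's score}."""
    by_round = defaultdict(list)
    for _, rnd, score in rows:
        by_round[rnd].append(score)
    field_avg = {rnd: mean(vals) for rnd, vals in by_round.items()}
    return {(player, rnd): field_avg[rnd] - score for player, rnd, score in rows}

sg = sg_per_round(scores)

# Because SG is measured against the field, it sums to zero within each round.
round_1_total = sum(v for (_, rnd), v in sg.items() if rnd == 1)
```

Under this definition a lower score than the field average gives a positive SG, and the per-round total is zero by construction.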

(back to top)

SG per Hole

In this analysis, I investigate the distribution of Strokes Gained for each hole of every round of the Championship. Notably, Mahan tied Haas in Strokes Gained on the 72nd hole, a significant moment in the tournament. However, Haas ultimately secured victory in the playoff!

Key Insights

Players appear to continue to play relative to their initial performance of round 1
Poorly performing players seem to give up come the back 9 of round 3

(back to top)

SG per Drive

Here I explore the distribution of Strokes Gained vs Driving Distance Gained and Driving Accuracy Gained for each drive of the Championship. Both Driving Distance and Driving Accuracy are normalized per hole before totalling. Happy to say my analysis aligns with Data Golf's Course Fit Tool.

Key Insights

Driving Accuracy has a strong correlation to Strokes Gained per Hole
Driving Distance has only a slight correlation to Strokes Gained per Hole
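
The per-hole normalization described above can be sketched as follows. This assumes "normalized per hole" means subtracting the field's average driving distance on that hole before totalling per player; the drives, the `distance_gained` helper, and the `pearson` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

# Hypothetical drives: (player, hole, distance_yards, strokes_gained_on_hole).
drives = [
    ("A", 1, 300, 0.4), ("B", 1, 280, -0.1), ("C", 1, 290, -0.3),
    ("A", 2, 310, 0.2), ("B", 2, 305, 0.1), ("C", 2, 285, -0.3),
]

def distance_gained(rows):
    """Total each player's distance relative to the field average on each hole."""
    by_hole = defaultdict(list)
    for _, hole, dist, _ in rows:
        by_hole[hole].append(dist)
    hole_avg = {hole: mean(vals) for hole, vals in by_hole.items()}
    totals = defaultdict(float)
    for player, hole, dist, _ in rows:
        totals[player] += dist - hole_avg[hole]
    return dict(totals)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

gained = distance_gained(drives)
sg_total = defaultdict(float)
for player, _, _, sg in drives:
    sg_total[player] += sg

players = sorted(gained)
r = pearson([gained[p] for p in players], [sg_total[p] for p in players])
```

Normalizing per hole keeps a long par 5 from dominating the totals; the same pattern would apply to Driving Accuracy Gained.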

(back to top)

⛳ Expected Strokes Model

The Stacked Expected Strokes Model uses ensemble learning, combining predictions from multiple base models to improve accuracy and robustness. I developed separate models for putting and approach scenarios, each with input features tailored to its situation, so that each model can focus on one aspect of play. This model will eventually enable a granular, shot-by-shot analysis of Strokes Gained, a significant departure from the hole-by-hole and round-by-round evaluations above, making it possible to evaluate each shot's impact on overall performance. Additionally, I'm unconcerned about data leakage since I'll be predicting continuous variables while training on discrete data.

Model Selection

Although the training data is discrete, I want continuous predictions, so I had to choose among regression models. As with all my models, I stratified the training and testing data before predicting. Initially, I used LazyPredict to compare a wide range of model options.
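
As a rough illustration of the stratified split mentioned above, the sketch below uses scikit-learn's `train_test_split` with `stratify` on a discrete strokes target. The synthetic data, class probabilities, and the `class_shares` helper are assumptions for the example, not the project's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for a discrete target such as strokes needed to hole out.
y = rng.choice([1, 2, 3], size=300, p=[0.6, 0.3, 0.1])
X = rng.normal(size=(300, 2))  # two placeholder features

# stratify=y keeps the share of each stroke count similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    values, counts = np.unique(labels, return_counts=True)
    return {int(v): c / len(labels) for v, c in zip(values, counts)}

train_shares = class_shares(y_train)
test_shares = class_shares(y_test)
```

Comparing `train_shares` with `test_shares` is a quick check that rare outcomes, such as three-stroke results, are not missing from the test set.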

Key Insight

The GradientBoostingRegressor and HistGradientBoostingRegressor models performed the best
If I had to retrain the model constantly, I would avoid the MLPRegressor, as it was by far the slowest to train

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|-----------------------------------|-------|--------|-------|-------|
| GradientBoostingRegressor | 0.85 | 0.85 | 0.46 | 0.93 |
| HistGradientBoostingRegressor | 0.85 | 0.85 | 0.46 | 0.60 |
| LGBMRegressor | 0.85 | 0.85 | 0.47 | 0.14 |
| MLPRegressor | 0.84 | 0.84 | 0.48 | 5.23 |
| KNeighborsRegressor | 0.82 | 0.83 | 0.50 | 0.16 |
| AdaBoostRegressor | 0.82 | 0.83 | 0.50 | 0.49 |
| RandomForestRegressor | 0.82 | 0.82 | 0.50 | 3.46 |
| XGBRegressor | 0.82 | 0.82 | 0.50 | 0.24 |
| BaggingRegressor | 0.81 | 0.81 | 0.52 | 0.37 |
| NuSVR | 0.81 | 0.81 | 0.52 | 3.58 |
| ExtraTreesRegressor | 0.80 | 0.80 | 0.53 | 2.02 |
| SVR | 0.80 | 0.80 | 0.53 | 3.35 |
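
For readers less familiar with the columns in this table, the sketch below computes R-Squared, Adjusted R-Squared, and RMSE from their usual textbook definitions. The `regression_metrics` helper and the sample values are hypothetical; this is not necessarily the exact code path the comparison tool uses.

```python
from math import sqrt

def regression_metrics(y_true, y_pred, n_features):
    """Compute R^2, adjusted R^2, and RMSE for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of input features.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = sqrt(ss_res / n)
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}

# Made-up stroke counts and predictions, with two input features assumed.
metrics = regression_metrics(
    y_true=[1, 2, 2, 3, 4, 3],
    y_pred=[1.2, 1.9, 2.3, 2.8, 3.6, 3.2],
    n_features=2,
)
```

With many rows and few features, Adjusted R-Squared and R-Squared nearly coincide, which is consistent with the two columns matching for most models above.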

(back to top)

Training Architecture

After finding the top-performing models, I ensemble them in a stack. I used Optuna's CMA-ES sampler to find not only the parameters for each model in the stack that minimize MAE, but also the data preprocessing scalers, encoders, imputation, and feature selection methods. All trials are fed the appropriate offline training data from a Feast feature store. I used an MLflow model registry to track all Optuna trials, and Databricks stores the production-ready models. Finally, I wrapped this whole tuning process in a Poetry wheel file called 'OptimizingUtils' for reproducibility.

(back to top)

Fighting Bias

I attempted to prevent bias by stratifying my training data and by using nested stratified cross-validation to prune biased trials. I plan to go a step further by bootstrapping, implementing imbalanced-learning libraries, and exploring Optuna's terminator, distribution, and multi-objective study features. I evaluate the bias that remains with SHAP and LIME, which enriches our understanding of the model's predictive behavior. Below, you'll find a SHAP chart for the putting model's LGBMRegressor.

Key Insight

I was surprised to see that "Distance to Edge" matters more than "Distance to Pin" for putting; I'm curious whether this would still be the case with a larger dataset
"Downhill Slope" and "Elevation Below Ball" are distinct features despite their similar names; a pairwise correlation check confirmed this

(back to top)

Model Performance

This chart helps evaluate the model by showing how predicted values compare to actual ones and revealing patterns in prediction errors. The histogram below assesses if errors follow a normal distribution, crucial for reliable predictions.

Key Insight

Excited to see the residuals have a low standard deviation with a mean hovering around 0
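
The residual check behind this insight can be sketched in a few lines. The `residual_summary` helper and the sample values are hypothetical; in practice the inputs would be the held-out stroke counts and the model's predictions.

```python
from statistics import mean, stdev

def residual_summary(y_true, y_pred):
    """Return the residuals (actual minus predicted) with their mean and sample std."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return {"residuals": residuals, "mean": mean(residuals), "std": stdev(residuals)}

# Made-up actual strokes and expected-strokes predictions.
summary = residual_summary(
    y_true=[1, 2, 2, 3, 1, 2],
    y_pred=[1.1, 1.8, 2.2, 2.9, 1.0, 2.1],
)
```

A mean near 0 suggests the model is not systematically over- or under-predicting, while the standard deviation gives the typical size of an error in strokes.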

(back to top)

🏌🏻 Applying xS Model

Now that we have a stacked SG machine learning model that works on a shot-by-shot basis, we can use it to gain insight into golfer performance. Applying the model after training lets golf analysts, coaches, and players extract actionable insights, optimize strategies, and refine skills.

SG per Shot Type

Now that we have a reliable model, we can use it to identify a player's strengths and weaknesses by subtracting Expected Strokes (xS) from the result of each shot to give us true Strokes Gained (SG). The plots below display Woodland's Total SG and SG Percentile by shot type, providing a clear visualization of his performance across different lies and distances.

Key Insight

Woodland was very successful gaining strokes on the green

Woodland's SG Percentile shows that he truly underperformed from 200+ yards out, as opposed to having one or two shots damage his 200+ SG Total
Woodland only had six shots within 50-100 yards of the pin; perhaps this was by design, to avoid putting himself in a position where he consistently underperforms
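
To show how shot-level SG can roll up by shot type, here is a minimal sketch. It assumes the commonly used formulation, SG = xS before the shot minus xS after the shot minus 1, which may differ in detail from the notebooks; the shot types and xS values are made up.

```python
from collections import defaultdict

# Hypothetical shots on one hole: (shot_type, xs_before, xs_after).
# xs_after is 0.0 once the ball is holed.
shots = [
    ("Tee", 4.1, 2.9),
    ("Approach 200+", 2.9, 2.1),
    ("Putt", 2.1, 1.0),
    ("Putt", 1.0, 0.0),
]

def strokes_gained(xs_before, xs_after):
    """SG for one shot: improvement in expected strokes, minus the stroke taken."""
    return xs_before - xs_after - 1

sg_by_type = defaultdict(float)
for shot_type, xs_before, xs_after in shots:
    sg_by_type[shot_type] += strokes_gained(xs_before, xs_after)

total_sg = sum(sg_by_type.values())
```

The per-shot values telescope: the hole total equals the expected strokes from the tee (4.1) minus the 4 strokes actually taken, so the by-type breakdown always reconciles with the hole-level figure.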

(back to top)

🎬 Conclusion

Looking back, I wish I had known about Strokes Gained during my time as a caddy. I've come to understand that Strokes Gained provides a more accurate reflection of performance on the hole, while SG Percentiles based on shot location offer deeper insights into a golfer's true abilities. I'm excited to explore more golf-related projects in the future.

(back to top)

🗺️ Future Roadmap

[ ] Model Refinement
- [x] CI Orchestration
- [x] Model Registry
- [ ] Drift Detection
- [ ] Feature Store
- [ ] Deploy
[ ] External Data
- [ ] Player Course History
- [ ] Career Earnings
- [ ] Equipment
- [ ] Biometrics
- [ ] Weather
- [x] SVGs
- [x] HCP
[ ] Bayesian Integration
- [x] Refer To
- [x] Watch
- [ ] Utilize
  
  (back to top)
