DeepBlockDeepak / kaggle_titanic

Titanic Survivor Predictor: a multi-model machine learning project that forecasts survival outcomes of Titanic passengers. Engineered from historical data, refined with feature selection, tested with CI/CD, and reporting validation metrics.
https://www.kaggle.com/c/titanic

Titanic Survival Prediction Project

Overview

This project is an approach to the Kaggle Titanic competition, aiming to predict the survival of passengers aboard the Titanic using machine learning techniques. The project involves data preprocessing, feature engineering, model training, and predicting survival outcomes. Models range from traditional models like RandomForest and SVM to custom implementations of decision trees and ensemble methods, to deep learning with PyTorch. My larger aim is to provide a comprehensive comparison of these methods by showcasing their functionalities and resulting performances.

Setup and Prerequisites

To set up the project:

  1. Clone the repository.
  2. Install dependencies specified in pyproject.toml using Poetry with:
    poetry install
  3. Run
    poetry run python main.py

    for training and predictions on the test set or

    poetry run python user_passenger.py

    for custom predictions.

This project uses isort, black, and ruff for formatting and linting, and unittest for testing.

Model Details

Model Performance

The project features three distinct models in its accuracy comparison: RandomForestClassifier, SVM, and a hand-rolled Decision Tree Classifier. Each model has been evaluated for accuracy.
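As a minimal, self-contained illustration of how such an accuracy comparison can be computed (the labels and predictions below are made up for the example, not the project's actual results):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Illustrative held-out labels and per-model predictions (0 = not survived, 1 = survived).
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
predictions = {
    "random_forest": [0, 1, 1, 0, 1, 1, 0, 1],
    "svm":           [0, 1, 0, 0, 1, 0, 0, 1],
    "decision_tree": [0, 0, 1, 0, 1, 1, 0, 1],
}

scores = {name: accuracy(y_true, pred) for name, pred in predictions.items()}
```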

Visual Insights

Model Accuracy Comparison

Model Accuracy Comparison: A comparative view of the accuracy scores achieved by each model.

Feature Importances: This bar chart ranks the features by their importance in the RandomForestClassifier model. The length of the bar represents the feature's weight in the model, with Title_Mr, Fare, and Age being among the most influential for predicting survival on the Titanic. Notably, Title_Mr emerges as a significant predictor — a result of extracting titles from passenger names and applying one-hot encoding during the preprocessing phase, as defined in extract_title() in src/kaggle_titanic/features.py.
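The real extract_title() lives in src/kaggle_titanic/features.py and is not reproduced here; the sketch below shows one plausible way such title extraction and one-hot encoding could work, assuming names follow the dataset's "Surname, Title. Given Names" convention:

```python
import re

def extract_title(name):
    """Pull the honorific (e.g. 'Mr', 'Mrs', 'Miss') out of a raw
    passenger name such as 'Braund, Mr. Owen Harris'."""
    match = re.search(r",\s*([A-Za-z]+)\.", name)
    return match.group(1) if match else "Unknown"

def one_hot_titles(titles, vocabulary):
    """One-hot encode titles against a fixed vocabulary, yielding
    indicator columns like Title_Mr for the classifier."""
    return [[1 if t == v else 0 for v in vocabulary] for t in titles]

names = ["Braund, Mr. Owen Harris",
         "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"]
titles = [extract_title(n) for n in names]
encoded = one_hot_titles(titles, ["Mr", "Mrs", "Miss"])
```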

Confusion Matrix

Confusion Matrix: 0 = did not survive, 1 = survived. The y-axis represents the actual labels.
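For reference, a 2×2 confusion matrix of this shape can be tallied in a few lines (illustrative data, not the project's actual predictions):

```python
def confusion_matrix(y_true, y_pred):
    """2x2 matrix with rows = actual label and columns = predicted label
    (0 = did not survive, 1 = survived)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Example: one true negative, one false positive, two true positives.
cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1])
```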

ROC Curve: evaluating the model's diagnostic ability.

Survival Probability Histogram: histogram of predicted survival probabilities.

Data Description

The project utilizes the Titanic dataset from Kaggle, obtained via kaggle competitions download -c titanic. From Kaggle:

  • train.csv contains the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the "ground truth".
  • The test.csv dataset contains similar information but does not disclose the "ground truth" for each passenger. It's your job to predict these outcomes.
  • gender_submission.csv: The expected submission format.
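A minimal sketch of reading this data with the standard library (the two rows below are a stand-in for the real train.csv, which has 891 rows and more columns):

```python
import csv
import io

# Stand-in for train.csv; note the quoted Name field containing a comma.
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38\n'
)

rows = list(csv.DictReader(sample))          # one dict per passenger
labels = [int(r["Survived"]) for r in rows]  # the "ground truth" column
```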

Scripts and Functionality

The project is structured to provide a comprehensive approach to the Titanic survival prediction task. The key entry points are main.py, which trains a selected model and produces predictions for the test set, and user_passenger.py, which makes predictions for custom passenger input.

Example Usage:

poetry run python main.py --model random_forest
poetry run python main.py --model decision_tree
poetry run python main.py --model svm
poetry run python main.py --model custom_rfc
poetry run python main.py --model naive_bayes
poetry run python main.py --model pytorch
poetry run python main.py # defaults to random_forest
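The actual argument handling in main.py is not shown here; a plausible sketch of a --model flag matching the usage above (the function name parse_model is hypothetical) is:

```python
import argparse

# Model names mirroring the example usage above.
MODELS = ["random_forest", "decision_tree", "svm",
          "custom_rfc", "naive_bayes", "pytorch"]

def parse_model(argv):
    """Parse the --model flag, defaulting to random_forest."""
    parser = argparse.ArgumentParser(description="Train a Titanic model")
    parser.add_argument("--model", choices=MODELS, default="random_forest")
    return parser.parse_args(argv).model
```

Passing an unknown model name makes argparse exit with an error listing the valid choices.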

Contributing

Please see CONTRIBUTING.md for guidelines on how to contribute, set up your environment, run tests, and more.

Additional Notes

This project is an initial implementation of the Titanic survival prediction. Ideally, this model will someday achieve a perfect accuracy score!


For more detailed information about the scripts and model training, please refer to the source code within the src/kaggle_titanic package.