DeepBlockDeepak / kaggle_titanic

Titanic Survivor Predictor: a multi-model machine learning project that forecasts survival outcomes of Titanic passengers. It is built from historical data, refined with feature selection, tested with CI/CD, and reports validation metrics.
https://www.kaggle.com/c/titanic

Incorporate Preprocessing Pipeline #21

Open DeepBlockDeepak opened 5 months ago

DeepBlockDeepak commented 5 months ago

The objective is to refactor the current preprocessing and feature engineering workflow so that the custom feature engineering steps run inside a scikit-learn pipeline, making the codebase more modular, maintainable, and efficient.

Goals

  1. Integrate Custom Transformers into Preprocessing Pipeline
  2. Update Main Workflow to Use Refactored Pipeline
  3. Validation and Testing
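
A minimal sketch of what the first two goals could look like, not the repository's actual code. The `FamilySizeAdder` transformer, the column lists, the choice of `LogisticRegression`, and the toy rows are all assumptions made for illustration; only the scikit-learn classes themselves are real.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FamilySizeAdder(BaseEstimator, TransformerMixin):
    """Hypothetical custom step: adds FamilySize = SibSp + Parch + 1."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        X = X.copy()  # do not mutate the caller's DataFrame
        X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
        return X


# Assumed column groups; the real project may use different ones.
numeric_cols = ["Age", "Fare", "FamilySize"]
categorical_cols = ["Sex", "Embarked"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Columns not listed here (e.g. Name) are dropped: remainder="drop" is the default.
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("features", FamilySizeAdder()),   # custom feature engineering
    ("preprocess", preprocess),        # imputation, scaling, encoding
    ("clf", LogisticRegression(max_iter=1000)),
])

# Made-up rows in the shape of the Kaggle Titanic data, for demonstration only.
train = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E", "Passenger F"],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 27.0],
    "SibSp": [1, 1, 0, 0, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 2],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86, 11.13],
    "Embarked": ["S", "C", "S", "S", "S", "S"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

X_train = train.drop(columns="Survived")
y_train = train["Survived"]

model.fit(X_train, y_train)
predictions = model.predict(X_train)
```

With this layout the main workflow only calls `model.fit` and `model.predict`, and the same feature engineering is applied to training and test data without separate preprocessing calls.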
DeepBlockDeepak commented 4 months ago

Runtime Issue

During model training, the pipeline raised a ValueError caused by a type mismatch, which prevented the model from being fitted:

ValueError: could not convert string to float: 'Boulos, Mrs. Joseph (Sultana)'

This error pointed to a fundamental issue in the preprocessing pipeline: numeric data was inadvertently being cast to object dtype, and a raw text column (the passenger name) was reaching the model.

Every column, including those intended to be numeric (num__*), was cast to object dtype. This is a serious problem because the models expect numerical input.