The problem statement for the data science project using the Titanic dataset can be defined as follows:
The goal of this project is to analyze the Titanic dataset, which contains information about the ship's passengers, and build a predictive model that accurately predicts whether a passenger survived based on their attributes.
By developing this model, we aim to gain insights into the factors that influenced survival on the Titanic.
Specifically, the problem involves the following steps:
Data Analysis and Exploration: Explore and analyze the dataset to understand its structure, identify relevant features, and uncover any patterns or relationships.
Data Preprocessing: Preprocess the dataset by handling missing values, encoding categorical variables, and normalizing or scaling numerical features.
Feature Selection and Engineering: Select the most relevant features that are likely to have a significant impact on survival prediction. Optionally, create new features or transform existing ones to improve the model's performance.
Model Building: Select an appropriate machine learning algorithm, such as logistic regression, decision trees, or random forests, and train the model using the preprocessed dataset.
Model Evaluation: Evaluate the performance of the model using appropriate evaluation metrics, such as accuracy, precision, recall, or F1 score, to measure its effectiveness in predicting survival.
Hyperparameter Tuning: Fine-tune the model by adjusting the hyperparameters to optimize its performance. This can be done using techniques like cross-validation and grid search.
Model Deployment: Once the model is trained and optimized, deploy it to make predictions on new, unseen data.
By addressing this problem statement, we aim to build a reliable, accurate model that classifies whether a Titanic passenger survived based on their attributes. This can provide valuable insight into the factors that contributed to survival and may carry over to similar scenarios or datasets.
Import Libraries
Import the necessary libraries for data manipulation, visualization, and modeling.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
Load the Data
Load the Titanic dataset from the specified file path using Pandas.
train_data = pd.read_csv(r'C:\Users\SHOPINVERSE\Downloads\titanic\train.csv')
test_data = pd.read_csv(r'C:\Users\SHOPINVERSE\Downloads\titanic\test.csv')
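A quick sanity check before exploring: confirm both frames loaded with the expected shapes (the standard Kaggle split has 891 training rows and 418 test rows). This check is an addition to the original walkthrough.

# Confirm the expected number of rows and columns in each frame
print(train_data.shape)
print(test_data.shape)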
Explore the Data
Perform exploratory data analysis to understand the dataset and gain insights.
print(train_data.head())
print(train_data.info())
print(train_data.describe())

   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
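Before preprocessing, it is worth pinpointing exactly which columns contain missing values; the info() output above points at Age, Cabin, and Embarked. A minimal sketch, added here for convenience:

# Count missing values per column in the training set
print(train_data.isnull().sum())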
Data Preprocessing
Clean and preprocess the data to handle missing values and transform categorical variables.
# Fill missing values with the median age
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())
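The info() output also showed two missing Embarked values, and the Kaggle test set is known to contain one missing Fare. A hedged sketch of how those gaps could be filled in the same spirit; the mode/median choices here are assumptions, not part of the original notebook:

# Fill the two missing Embarked values with the most common port
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
# Fill any missing Fare values in the test set with the median fare
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())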
# Encode categorical variables
label_encoder = LabelEncoder()
train_data['Sex'] = label_encoder.fit_transform(train_data['Sex'])
test_data['Sex'] = label_encoder.transform(test_data['Sex'])
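LabelEncoder is fine for Sex because it is binary; for a multi-valued column such as Embarked, one-hot encoding avoids imposing an artificial ordering. A minimal sketch of that alternative (not used by the model below):

# One-hot encode Embarked into indicator columns (Embarked_C, Embarked_Q, Embarked_S)
train_data = pd.get_dummies(train_data, columns=['Embarked'], prefix='Embarked')
test_data = pd.get_dummies(test_data, columns=['Embarked'], prefix='Embarked')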
Feature Selection and Engineering (Optional)
Select relevant features or engineer new features based on domain knowledge or data analysis.
# Extract title from the Name column
train_data['Title'] = train_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test_data['Title'] = test_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
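Most extracted titles are Mr, Mrs, Miss, or Master; the rest are rare. A common follow-up, sketched here under the assumption that infrequent titles are grouped under a single label (the 'Rare' bucket is not part of the original notebook):

# Group infrequent titles under a single 'Rare' label
common_titles = ['Mr', 'Mrs', 'Miss', 'Master']
train_data['Title'] = train_data['Title'].where(train_data['Title'].isin(common_titles), 'Rare')
test_data['Title'] = test_data['Title'].where(test_data['Title'].isin(common_titles), 'Rare')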
# Create a new feature 'FamilySize' by adding SibSp and Parch
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1
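A feature that often accompanies FamilySize is a binary flag for passengers travelling alone; this is an assumed extension, not part of the original feature set:

# Flag passengers travelling alone (FamilySize of 1 means no relatives aboard)
train_data['IsAlone'] = (train_data['FamilySize'] == 1).astype(int)
test_data['IsAlone'] = (test_data['FamilySize'] == 1).astype(int)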
Visualize the Data
Create visualizations to gain insights and understand the relationships between variables.
# Survival count by sex
sns.countplot(x='Sex', hue='Survived', data=train_data)
plt.title('Survival Count by Sex')
plt.show()

# Survival rate by passenger class
sns.barplot(x='Pclass', y='Survived', data=train_data)
plt.title('Survival Rate by Passenger Class')
plt.show()
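Age is the other heavily used numeric feature, so its distribution by outcome is worth a look too. A minimal sketch in the same seaborn style (this plot is an addition, assuming seaborn 0.11+ for histplot):

# Age distribution split by survival outcome
sns.histplot(data=train_data, x='Age', hue='Survived', bins=30, kde=True)
plt.title('Age Distribution by Survival')
plt.show()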
Model Training and Evaluation
Split the training data into training and validation sets, train a machine learning model, and evaluate its performance.
# Select features for modeling
features = ['Pclass', 'Sex', 'Age', 'Fare', 'FamilySize']
target = 'Survived'
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(train_data[features], train_data[target], test_size=0.2, random_state=42)
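Since only about 38% of passengers survived (see the describe() output above), a stratified split keeps that class balance identical in both sets. The stratify argument below is a suggested variant, not part of the original call:

# Stratified variant of the same split, preserving the ~38% survival rate
X_train, X_val, y_train, y_val = train_test_split(
    train_data[features], train_data[target],
    test_size=0.2, random_state=42, stratify=train_data[target])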
# Train a random forest classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Make predictions on the validation set
y_pred = model.predict(X_val)
# Evaluate the model's performance
accuracy = accuracy_score(y_val, y_pred)
print("Validation Accuracy:", accuracy)

Validation Accuracy: 0.8044692737430168
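The problem statement also names precision, recall, and F1 score as evaluation metrics. A short sketch of how those can be read off for the same predictions, using sklearn's classification_report (an addition, not part of the original run):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1 for the validation predictions
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))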
Cross-Validation
Perform cross-validation to get a more reliable estimate of the model's performance.
This helps in assessing the model's generalization ability and reduces the dependence on a single train-test split.
from sklearn.model_selection import cross_val_score
# Perform cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-Validation Accuracy: {:.4f}".format(scores.mean()))

Cross-Validation Accuracy: 0.8034
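The Hyperparameter Tuning step calls for grid search; here is a sketch of what that could look like for this random forest. The parameter grid itself is an illustrative assumption:

from sklearn.model_selection import GridSearchCV

# Search a small grid of random forest hyperparameters with 5-fold cross-validation
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 5, 10]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)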
Feature Selection
Evaluate the importance of each feature using the feature_importances_ attribute of the trained model.
Remove less informative or redundant features to reduce noise and improve model performance.
# Get feature importances
importance = model.feature_importances_
# Create a dataframe of feature importances
feature_importance = pd.DataFrame({'Feature': features, 'Importance': importance})
# Sort the features by importance in descending order
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)
# Print the feature importances
print(feature_importance)

      Feature  Importance
3        Fare    0.305203
1         Sex    0.280227
2         Age    0.258801
0      Pclass    0.083730
4  FamilySize    0.072040
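FamilySize contributes the least here, so one way to act on the advice above is to retrain without it and compare validation accuracy. A minimal sketch reusing the earlier split; the reduced feature list is an assumption, and the score may move in either direction:

# Retrain on the reduced feature set and compare validation accuracy
reduced_features = ['Pclass', 'Sex', 'Age', 'Fare']
reduced_model = RandomForestClassifier(random_state=42)
reduced_model.fit(X_train[reduced_features], y_train)
reduced_accuracy = accuracy_score(y_val, reduced_model.predict(X_val[reduced_features]))
print("Validation Accuracy (without FamilySize):", reduced_accuracy)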