
Titanic Prediction #4

Open Tolajoy opened 1 year ago

Tolajoy commented 1 year ago

The problem statement for the data science project using the Titanic dataset can be defined as follows:

The goal of this project is to analyze the Titanic dataset, which contains information about the ship's passengers, and to build a model that accurately predicts whether a passenger survived based on their attributes.

By developing this model, we aim to gain insights into the factors that influenced survival on the Titanic.

Specifically, the problem involves the following steps:

Data Analysis and Exploration: Explore and analyze the dataset to understand its structure, identify relevant features, and uncover any patterns or relationships.

Data Preprocessing: Preprocess the dataset by handling missing values, encoding categorical variables, and normalizing or scaling numerical features.

Feature Selection and Engineering: Select the most relevant features that are likely to have a significant impact on survival prediction. Optionally, create new features or transform existing ones to improve the model's performance.

Model Building: Select an appropriate machine learning algorithm, such as logistic regression, decision trees, or random forests, and train the model using the preprocessed dataset.

Model Evaluation: Evaluate the performance of the model using appropriate evaluation metrics, such as accuracy, precision, recall, or F1 score, to measure its effectiveness in predicting survival.

Hyperparameter Tuning: Fine-tune the model by adjusting the hyperparameters to optimize its performance. This can be done using techniques like cross-validation and grid search.

Model Deployment: Once the model is trained and optimized, deploy it to make predictions on new, unseen data.

By addressing this problem statement, we aim to build a reliable and accurate predictive model that classifies whether a passenger on the Titanic survived based on their attributes. This can provide valuable insight into the factors that contributed to survival and can be applied to similar scenarios or datasets.

Import Libraries

Import the necessary libraries for data manipulation, visualization, and modeling.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Load the Data

Load the Titanic training and test sets from their file paths using pandas.

train_data = pd.read_csv(r'C:\Users\SHOPINVERSE\Downloads\titanic\train.csv')
test_data = pd.read_csv(r'C:\Users\SHOPINVERSE\Downloads\titanic\test.csv')

Explore the Data

Perform exploratory data analysis to understand the dataset and gain insights.

print(train_data.head())
print(train_data.info())
print(train_data.describe())

   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

            Parch        Fare
count  891.000000  891.000000
mean     0.381594   32.204208
std      0.806057   49.693429
min      0.000000    0.000000
25%      0.000000    7.910400
50%      0.000000   14.454200
75%      0.000000   31.000000
max      6.000000  512.329200

Data Preprocessing

Clean and preprocess the data to handle missing values and transform categorical variables.

# Fill missing values with the median age
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())

# Encode categorical variables
label_encoder = LabelEncoder()
train_data['Sex'] = label_encoder.fit_transform(train_data['Sex'])
test_data['Sex'] = label_encoder.transform(test_data['Sex'])
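
The info() output above also shows missing values in Embarked (889 non-null) and Cabin (204 non-null), and the Kaggle test set is known to contain one missing Fare value; none of these are handled above. A minimal sketch of how they could be handled, assuming the same train_data/test_data frames are in scope (Cabin is simply dropped here, since it is mostly empty):

# Fill the two missing Embarked values with the most common port
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])

# The test set has one missing Fare; fill it with the median fare
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())

# Cabin is mostly missing (204 of 891 in train), so drop it rather than impute
train_data = train_data.drop(columns=['Cabin'])
test_data = test_data.drop(columns=['Cabin'])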

Feature Selection and Engineering (Optional)

Select relevant features or engineer new features based on domain knowledge or data analysis.

# Extract title from the Name column
train_data['Title'] = train_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test_data['Title'] = test_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Create a new feature 'FamilySize' by adding SibSp and Parch
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1
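
Note that Title is extracted above but never encoded or added to the feature list used for modeling below. If it were to be used, a sketch along these lines could consolidate the rare titles and encode the result (the grouping is an illustrative assumption, not part of the original notebook):

# Group uncommon titles into a single 'Rare' bucket (illustrative grouping)
common_titles = ['Mr', 'Mrs', 'Miss', 'Master']
for df in (train_data, test_data):
    df['Title'] = df['Title'].where(df['Title'].isin(common_titles), 'Rare')

# Encode Title the same way Sex was encoded
title_encoder = LabelEncoder()
train_data['Title'] = title_encoder.fit_transform(train_data['Title'])
test_data['Title'] = title_encoder.transform(test_data['Title'])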

Visualize the Data

Create visualizations to gain insights and understand the relationships between variables.

# Survival count by sex
sns.countplot(x='Sex', hue='Survived', data=train_data)
plt.title('Survival Count by Sex')
plt.show()

# Survival rate by passenger class
sns.barplot(x='Pclass', y='Survived', data=train_data)
plt.title('Survival Rate by Passenger Class')
plt.show()
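
To see the relationships among the numeric variables at a glance, a correlation heatmap is one more view that could be added here; a sketch using only columns already created above (Sex is numeric at this point because it was label-encoded during preprocessing):

# Correlation heatmap of the numeric columns used for modeling
numeric_cols = ['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'FamilySize']
sns.heatmap(train_data[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()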

Model Training and Evaluation

Split the training data into training and validation sets, train a machine learning model, and evaluate its performance.


# Select features for modeling
features = ['Pclass', 'Sex', 'Age', 'Fare', 'FamilySize']
target = 'Survived'

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_data[features], train_data[target], test_size=0.2, random_state=42
)

# Train a random forest classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model's performance
accuracy = accuracy_score(y_val, y_pred)
print("Validation Accuracy:", accuracy)

Validation Accuracy: 0.8044692737430168
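
The problem statement also lists precision, recall, and the F1 score as evaluation metrics, but only accuracy is computed above. A sketch of how the remaining metrics could be reported on the same validation split:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1 on the validation set
print(classification_report(y_val, y_pred))

# Confusion matrix: rows are true labels, columns are predicted labels
print(confusion_matrix(y_val, y_pred))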

Cross-Validation

Perform cross-validation to get a more reliable estimate of the model's performance.

This helps in assessing the model's generalization ability and reduces the dependence on a single train-test split.

from sklearn.model_selection import cross_val_score

# Perform cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-Validation Accuracy: {:.4f}".format(scores.mean()))

Cross-Validation Accuracy: 0.8034
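
The hyperparameter tuning step in the problem statement also mentions grid search, which the notebook never reaches. A minimal GridSearchCV sketch over a few common random forest hyperparameters (the grid itself is an assumption chosen for illustration):

from sklearn.model_selection import GridSearchCV

# Small illustrative grid over common random forest hyperparameters
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy: {:.4f}".format(grid_search.best_score_))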

Feature Selection

Evaluate the importance of each feature using the feature_importances_ attribute of the trained model.

Remove less informative or redundant features to reduce noise and improve model performance.

# Get feature importances
importance = model.feature_importances_

# Create a dataframe of feature importance
feature_importance = pd.DataFrame({'Feature': features, 'Importance': importance})

# Sort the features by importance in descending order
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

# Print the feature importances
print(feature_importance)

      Feature  Importance
3        Fare    0.305203
1         Sex    0.280227
2         Age    0.258801
0      Pclass    0.083730
4  FamilySize    0.072040
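
The feature removal described above is never actually carried out. A sketch, assuming one drops the least important feature from the table (FamilySize) and retrains to compare validation accuracy:

# Retrain without the least important feature and compare validation accuracy
reduced_features = ['Pclass', 'Sex', 'Age', 'Fare']
model_reduced = RandomForestClassifier(random_state=42)
model_reduced.fit(X_train[reduced_features], y_train)
reduced_accuracy = accuracy_score(y_val, model_reduced.predict(X_val[reduced_features]))
print("Validation Accuracy (reduced features):", reduced_accuracy)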
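
Finally, the deployment step from the problem statement is never reached: the trained model is not used to predict on test_data. A sketch of how that could look, assuming the test set has received the same preprocessing as the training set (in particular, Fare must contain no missing values, as handled in the preprocessing sketch above) and assuming Kaggle's usual submission format is the desired output:

# Predict survival for the unseen test passengers
test_pred = model.predict(test_data[features])

# Write predictions in the standard Kaggle submission format (an assumed
# output format; the original issue does not specify one)
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'],
                           'Survived': test_pred})
submission.to_csv('submission.csv', index=False)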