UBC-MDS / Telco_Customer_Churn_Prediction_Group12

Data Analytics project performed by Group 12 for DSCI 522

MIT License

Preliminary EDA? #3

Closed jcasoli closed 2 years ago

jcasoli commented 2 years ago

Hey all, I thought we should get on the same page as to what kind of preliminary EDA we are planning on doing. Part of the proposal requires a description of this and so I can't write about it until I know what we are doing :)

Some of the things that come to mind as interesting to me:

Is there class imbalance of some sort?
What kind of transformations will we need to do? Does our data need imputation?
Trying to get an initial feel for which features are correlated with each target class

Any other thoughts?
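As a rough illustration of the first two questions, here is a minimal sketch that checks class balance and missing values with pandas. The tiny data frame and its column names (`tenure`, `InternetService`, `Churn`) are made up for the example and only stand in for the real Telco data.

```python
import pandas as pd

# Made-up rows standing in for the Telco data; column names are assumptions.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22],
    "InternetService": ["DSL", "DSL", "Fiber optic", "No", "Fiber optic", "DSL"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})

# Class imbalance: proportion of rows in each target class.
class_share = df["Churn"].value_counts(normalize=True)

# Imputation need: number of missing values in each column.
missing_per_column = df.isna().sum()

print(class_share)
print(missing_per_column)
```

If one class dominates `class_share`, that is worth mentioning in the proposal; if every entry of `missing_per_column` is zero, imputation is probably unnecessary.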

Anupriya-Sri commented 2 years ago

Hi, I was going through the data and realized that many of the variables are categorical. While there are no missing values, some categories mean the same thing, such as "No" and "No internet service". So, we will have to do certain transformations before conducting EDA: (1) Data cleaning (imputation is not required because there are no missing values) (2) Binary transformation and conversion to numeric format: this will help in estimating the correlation coefficients

Anything else?
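The two transformations described above could look something like the sketch below. It assumes a column such as `OnlineSecurity` that takes the values "Yes", "No", and "No internet service"; the data is invented for illustration and this is not necessarily how the group's EDA files do it.

```python
import pandas as pd

# Invented example rows; the real dataset has many more columns.
df = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service", "Yes", "No", "No internet service"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# (1) Data cleaning: treat "No internet service" as plain "No".
cleaned = df.replace("No internet service", "No")

# (2) Binary transformation: map Yes/No to 1/0 so the columns are numeric.
binary = cleaned.apply(lambda col: col.map({"Yes": 1, "No": 0}))

# With numeric columns, a correlation coefficient can be estimated.
corr = binary["OnlineSecurity"].corr(binary["Churn"])
print(binary)
print(corr)
```

Mapping with an explicit dictionary has the side benefit that any unexpected category turns into `NaN`, which makes leftover cleaning problems easy to spot.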

Anupriya-Sri commented 2 years ago

Another point to consider is about splitting the data into train and test sets. I think EDA can be done without splitting and we can split the data once we start transforming for processing. But let me know your thoughts on this.

jcasoli commented 2 years ago

Thanks @Anupriya-Sri !

Noting here that we agreed to split data before EDA.

adammorphy commented 2 years ago

Additionally, remember to add random_state=* inside train_test_split() for reproducibility. We don't want to start any EDA on a non-reproducible training set.
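A minimal sketch of a reproducible split is shown here. The seed `123`, the `test_size=0.2`, and the toy data frame are placeholder assumptions, not values agreed on in this thread.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the full dataset.
df = pd.DataFrame({"tenure": range(20), "Churn": ["Yes", "No"] * 10})

# Fixing random_state makes the split identical on every run.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)

# Splitting again with the same seed gives the same training rows.
train_again, _ = train_test_split(df, test_size=0.2, random_state=123)
print(train_df.index.equals(train_again.index))
```

EDA would then be run on `train_df` only, so that nothing learned from `test_df` influences later modelling decisions.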

Anupriya-Sri commented 2 years ago

Noted the above points and created a pull request to merge EDA files into the main branch. This has EDA for both numerical as well as categorical features. I have done minor data wrangling before performing the EDA.