jcasoli closed 2 years ago
Hi, I was going through the data and realized that it has multiple variables in categorical format. While there are no missing values, there are multiple entries that mean the same thing, such as: No, No internet service. So, we will have to do certain transformations before conducting EDA: (1) Data cleaning (imputation not required because there are no missing values) (2) Binary transformation and conversion to numeric format: this will help in estimating the correlation coefficients
Anything else?
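For reference, a minimal sketch of what steps (1) and (2) could look like with pandas. The toy data and the column names (`online_security`, `target`) are hypothetical stand-ins, not the real schema, and the sketch assumes every binary column uses only the labels Yes, No, and No internet service.

```python
import pandas as pd

# Toy data; column names are hypothetical stand-ins for the real columns.
df = pd.DataFrame({
    "online_security": ["Yes", "No", "No internet service", "Yes"],
    "target": ["No", "Yes", "Yes", "No"],
})


def to_binary(series: pd.Series) -> pd.Series:
    """Collapse equivalent labels, then map Yes/No to 1/0."""
    # Step (1): treat "No internet service" as a plain "No".
    cleaned = series.replace({"No internet service": "No"})
    # Step (2): convert to numeric. Any unexpected label becomes NaN,
    # and astype("int64") then raises instead of hiding it.
    return cleaned.map({"Yes": 1, "No": 0}).astype("int64")


# Apply the conversion column by column, then estimate correlations.
encoded = df.apply(to_binary)
corr = encoded.corr()
print(corr)
```

Failing loudly on unmapped labels is a deliberate choice here: a silent NaN would quietly shrink the sample used for each correlation coefficient.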
Another point to consider is about splitting the data into train and test sets. I think EDA can be done without splitting and we can split the data once we start transforming for processing. But let me know your thoughts on this.
Thanks @Anupriya-Sri !
Noting here that we agreed to split data before EDA.
Additionally, remember to add `random_state=*` inside `train_test_split()` for reproducibility. We don't want to start any EDA on a non-reproducible training set.
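A small sketch of the reproducible split, assuming scikit-learn's `train_test_split`. The toy DataFrame, the seed `123`, and `test_size=0.2` are illustrative assumptions, not values agreed in this thread.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset.
df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})

# 123 is an arbitrary seed chosen for illustration; any fixed integer works.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)

# Re-running with the same seed yields exactly the same rows in each set.
train_again, test_again = train_test_split(df, test_size=0.2, random_state=123)
print(train_df.equals(train_again))
```

With the seed fixed, anyone re-running the EDA notebook works from the same training rows, so plots and summary statistics stay comparable across runs.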
Noted the above points and created a pull request to merge EDA files into the main branch. This has EDA for both numerical as well as categorical features. I have done minor data wrangling before performing the EDA.
Hey all, I thought we should get on the same page as to what kind of preliminary EDA we are planning on doing. Part of the proposal requires a description of this and so I can't write about it until I know what we are doing :)
Some of the things that come to mind as interesting to me:
Any other thoughts?