Bar charts were created for the categorical variables and histograms for the numerical variables to show their distributions. Correlation heatmaps based on two different metrics (Pearson and Spearman) were generated to investigate the relationships between numerical variables. A scatter plot of pdays vs previous was also created.
Judging from the proportion of each class in the target, the dataset is imbalanced.
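The class proportions can be checked with a one-line count. This is a minimal sketch on made-up toy data; the target column name `y` and the 80/20 split are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy stand-in for the dataset. The column name "y" and the class
# counts are hypothetical and only illustrate the check.
df = pd.DataFrame({"y": ["no"] * 8 + ["yes"] * 2})

# Share of each class in the target (fractions that sum to 1).
class_share = df["y"].value_counts(normalize=True)
print(class_share)
```

A split that is far from even, as in this toy frame, is what signals an imbalanced target.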
job, education, contact and poutcome contain unknown values. Note that these are not null values but the literal string "unknown", so null checks will not detect them. We do not have enough information about the dataset to impute these values properly.
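Because the missing entries are strings, they have to be counted by comparing against "unknown" rather than with a null check. A minimal sketch, assuming a pandas DataFrame; the rows below are invented purely to demonstrate the comparison.

```python
import pandas as pd

# Invented toy rows; only the column names come from the text above.
df = pd.DataFrame({
    "job": ["admin", "unknown", "services", "technician"],
    "education": ["primary", "secondary", "unknown", "tertiary"],
    "contact": ["unknown", "unknown", "cellular", "unknown"],
    "poutcome": ["unknown", "unknown", "unknown", "success"],
})
cols = ["job", "education", "contact", "poutcome"]

# A null check finds nothing, because "unknown" is an ordinary string.
null_counts = df[cols].isna().sum()

# Fraction of rows per column whose value is the string "unknown".
unknown_share = (df[cols] == "unknown").mean()
print(unknown_share)
```

Comparing the per-column shares is one way to decide which columns lose too many rows if the unknowns are dropped.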
Of the columns that contain unknown values, contact and poutcome should be dropped entirely because too large a share of their values is unknown. Dropping only the affected rows is not an option, since it would remove too many examples relative to the size of the data.
job and education can be kept. For these features, we can simply drop the rows where the value is unknown.
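The two cleaning steps described above can be sketched as follows. This assumes a pandas DataFrame; the toy rows and the extra `age` column are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

# Hypothetical toy rows; "age" stands in for the remaining features.
df = pd.DataFrame({
    "age": [30, 41, 52, 27],
    "job": ["admin", "unknown", "services", "technician"],
    "education": ["primary", "secondary", "unknown", "tertiary"],
    "contact": ["unknown", "unknown", "cellular", "unknown"],
    "poutcome": ["unknown", "unknown", "unknown", "success"],
})

# Step 1: drop the columns with too many unknowns.
reduced = df.drop(columns=["contact", "poutcome"])

# Step 2: keep only rows where neither job nor education is "unknown".
keep = (reduced[["job", "education"]] != "unknown").all(axis=1)
cleaned = reduced[keep].reset_index(drop=True)
print(cleaned)
```

Dropping the columns first means the row filter only considers job and education, so rows are not lost because of unknowns in contact or poutcome.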
The distributions of pdays and previous are heavily skewed. The two variables are also correlated, with a Spearman correlation of 0.99 and a Pearson correlation of 0.44.
However, visual inspection of the scatter plot suggests that pdays and previous are not correlated strongly enough to be an issue. We can keep them both as features.
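A large gap between the two coefficients is possible because Spearman measures monotonic association on ranks, while Pearson measures linear association on the raw values. The sketch below shows the effect on synthetic data; the columns `x` and `y` are invented and say nothing about how pdays and previous actually relate.

```python
import pandas as pd

# Synthetic data: y grows monotonically with x, but far from linearly.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = 2 ** toy["x"]

# Pearson works on the raw values, Spearman on their ranks.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Here the ranks agree perfectly, so the Spearman coefficient is 1, while the Pearson coefficient is noticeably lower. This is why a scatter plot is a useful complement to either number.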
Summary and Recommendations from EDA