EDA Elements - Githubissues

chuangw6 commented 3 years ago

For the EDA process, please add the items that you find necessary to include in EDA.

huan-ds commented 3 years ago

I wrote this in the introduction for EDA: "The data set is clean, as the missing values have been removed and the continuous values have been scaled. The features and target variable are all continuous, so the exploratory data analysis will focus on the distributions of the features and the correlations between them."

charlessuresh commented 3 years ago

Hey guys! I’ve done some preliminary EDA, but it is by no means complete and I need some help. I’m also not very well versed in Altair (I’m more comfortable with seaborn). Here are some of the issues I’m facing:

How do I set custom X-axis labels for the repeated histograms and the pairplot of quantitative features? Currently every X-axis label on the histograms contains the word "binned", which I would like to remove.
How do I show X-axis labels only on the bottom-most row of the pairplot? Currently the X-axis labels are repeated on every row.
How do I show Y-axis labels only on the left-most column of the pairplot? Currently the Y-axis labels are repeated on every column.

huan-ds commented 3 years ago

For the latest version, I suggest removing the following graphs:

Scatter Plot of 'Age' with all Numerical Columns
Histogram of Numerical Columns

The "Pairplot of Numerical Columns" already contains most of this information, and we don't want to put redundant information in the final version.

charlessuresh commented 3 years ago

For EDA of numerical features, this is what I'm used to plotting:

1. Histogram of individual features: to check whether each feature is normally distributed or follows some other distribution. Is it unimodal or bimodal? Is there any skewness?
2. Boxplot: to check for outliers and skewness.

I sometimes replace 1 and 2 with a violin plot, since it combines the functionality of a density plot and a boxplot.

3. Pairplot/scatterplot + correlation plot: to understand the interaction between pairs of numerical features.

If you guys feel that the histogram is unnecessary, I'm okay with removing it too.
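As a numerical companion to the plots listed above, here is a minimal sketch of the same checks in pandas: skewness per feature, outliers by the usual 1.5 × IQR boxplot rule, and pairwise Pearson correlations. The column names and toy values are made up for illustration, and `iqr_outliers` is a hypothetical helper, not something from the project.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Toy data; names and values are placeholders.
df = pd.DataFrame({
    "height": [1, 2, 2, 3, 3, 3, 4, 4, 5, 20],
    "weight": [2, 4, 4, 6, 6, 6, 8, 8, 10, 40],
})

# Skewness and outlier count per feature (what the histogram and boxplot show visually).
summary = pd.DataFrame({
    "skew": df.skew(),
    "n_outliers": df.apply(lambda s: len(iqr_outliers(s))),
})

# Pairwise Pearson correlations (what the pairplot / correlation plot shows visually).
corr = df.corr()

print(summary)
print(corr)
```

A positive `skew` indicates a long right tail, and a nonzero `n_outliers` flags features whose boxplot would show points beyond the whiskers.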

chuangw6 commented 3 years ago

I agree with Charles. We can include these plots in the EDA, since they give us different perspectives on the dataset.

huan-ds commented 3 years ago

Issue solved in milestone 1.

UBC-MDS / Abalone_Age_Prediction

EDA Elements #3