JoshVarty / KaggleUtils

A collection of utilities I use for EDA, feature engineering etc.
MIT License

Summary information on unbalanced categories, NAs and unique values #1

Closed JoshVarty closed 5 years ago

JoshVarty commented 5 years ago

Something similar to the following, taken from a Kaggle kernel:

# https://www.kaggle.com/artgor/is-this-malware-eda-fe-and-lgb-updated
import pandas as pd

stats = []
for col in train.columns:  # train: the training DataFrame
    stats.append((
        col,
        train[col].nunique(),  # distinct values, NaN excluded
        train[col].isnull().sum() * 100 / train.shape[0],  # % missing
        train[col].value_counts(normalize=True, dropna=False).values[0] * 100,  # % in the largest category, NaN counted
        train[col].dtype,
    ))

stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique_values', 'Percentage of missing values', 'Percentage of values in the biggest category', 'type'])
stats_df.sort_values('Percentage of missing values', ascending=False)
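
One way this could be packaged as a reusable utility is sketched below. The function name `summarize_columns`, the shortened column names, and the toy `example` frame are all hypothetical, chosen for illustration only; this is not an existing KaggleUtils API, just the snippet above wrapped in a function under those assumptions.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: unique values, % missing, % in the largest category, dtype.

    Hypothetical helper; the name and output column names are assumptions.
    """
    n_rows = len(df)
    rows = []
    for col in df.columns:
        series = df[col]
        if n_rows:
            # value_counts sorts descending, so the first entry is the most
            # frequent value; dropna=False counts NaN as its own category.
            top_share = series.value_counts(normalize=True, dropna=False).iloc[0] * 100
        else:
            top_share = float("nan")  # empty frame: no largest category
        rows.append({
            "feature": col,
            "n_unique": series.nunique(),  # NaN is not counted as a value
            "pct_missing": series.isnull().mean() * 100,
            "pct_top_value": top_share,
            "dtype": series.dtype,
        })
    summary = pd.DataFrame(
        rows,
        columns=["feature", "n_unique", "pct_missing", "pct_top_value", "dtype"],
    )
    # Columns with the most missing data come first.
    return summary.sort_values("pct_missing", ascending=False).reset_index(drop=True)


# Toy data, made up for illustration.
example = pd.DataFrame({
    "a": [1, 1, 1, 2],
    "b": ["x", None, None, None],
    "c": [0.5, 1.5, 2.5, 3.5],
})
summary = summarize_columns(example)
print(summary)
```

Returning a sorted DataFrame rather than printing keeps the result usable for further filtering, for example selecting columns whose `pct_top_value` exceeds some threshold as candidates for dropping.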