MShooshtari / Kaggle_House_Prices

Link to the Kaggle Competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

Normalizing a column with many zeros #1

Closed. MShooshtari closed this issue 5 years ago.

MShooshtari commented 5 years ago

Hey guys, what is the best way to normalize a column that has many zeros? This happens when many houses do not have that feature, for example:

ScreenPorch: Screen porch area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
BsmtFinSF1: Type 1 finished square feet
LowQualFinSF: Low quality finished square feet (all floors)
MiscVal: $Value of miscellaneous feature

bhowmickpb commented 5 years ago

I think the best option would be: if we find that most of the values in a specific column are zero, then we can drop that feature from our model. I did a small analysis to see what proportion of the data is zero for those columns:

Column          Zero proportion
ScreenPorch     0.920548
OpenPorchSF     0.449315
EnclosedPorch   0.857534
3SsnPorch       0.983562
BsmtFinSF1      0.319863
LowQualFinSF    0.982192
MiscVal         0.964384

I will drop 5 columns from the dataframe because they are unlikely to influence the sale price!
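
For reference, a minimal sketch that reproduces those proportions (my reconstruction, not the original analysis; it assumes train.csv from the competition is in the working directory):

import pandas as pd

train_df = pd.read_csv('train.csv')
cols = ['ScreenPorch', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
        'BsmtFinSF1', 'LowQualFinSF', 'MiscVal']

# fraction of rows equal to zero in each column
zero_ratio = (train_df[cols] == 0).mean()
print(zero_ratio.sort_values(ascending=False))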

selkurdy commented 5 years ago

To save time exploring the data, I suggest that you all pip install the pandas-profiling module. It generates an HTML report that summarizes most of what we might need to investigate further. I will attach a Jupyter notebook with my attempt at creating a dict, as Mahdi was doing; it also demonstrates how pandas-profiling works. Please give it a try, it will save you some time. I am finding it impossible to attach an ipynb file, and I can't attach an md file either; this system is not flexible!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
import seaborn as sns
import pandas_profiling as pf

# load the training data and inspect the first rows (transposed)
h = pd.read_csv('train.csv')
h.head().T

# group the column names by dtype
ctypeint = h.select_dtypes(include='int64').columns
ctypefloat = h.select_dtypes(include='float64').columns
ctypeobj = h.select_dtypes(include='O').columns
print(f'int64: {ctypeint}  \n\n float64:\n {ctypefloat} \n\n object:\n {ctypeobj}')

# generate the profiling report and write it to HTML
house_profile = pf.ProfileReport(h)
house_profile.to_file(outputfile='houseprofile.html')

This creates a dict of the column names grouped by dtype:

coltypes = {'i': ctypeint, 'f': ctypefloat, 'o': ctypeobj}
selkurdy commented 5 years ago

print(f"o: {len(coltypes['o'])}  f: {len(coltypes['f'])} i: {len(coltypes['i'])}")

o: 43 f: 3 i: 35

selkurdy commented 5 years ago

Mahdi, why do you think it is significant to encode the categorical data as ordinal? Do you plan on doing any mathematical comparisons? I have found reasonable success with frequency encoding: it converts categorical data to numerics based on each category's frequency, and you can, if you think it appropriate, do math on it too. Just a suggestion.
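
For illustration, a minimal sketch of frequency encoding, reusing the h dataframe and ctypeobj columns from my snippet above (one possible implementation, not a prescription):

# replace each category with the fraction of rows in which it occurs
h_freq = h.copy()
for col in ctypeobj:
    freq = h[col].value_counts(normalize=True)
    h_freq[col] = h[col].map(freq)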

selkurdy commented 5 years ago

I agree with Pallab that some columns need to be dropped. For the numeric columns, missing values should be filled with the median, not the mean, because most of the distributions are skewed.

As for the categorical data, we can fill the missing values with the mode, as found on Stack Overflow:

df = df.fillna(df.mode().iloc[0])
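
Putting the two suggestions together, a minimal sketch (df stands for the training dataframe; selecting columns by dtype is my assumption):

# numeric columns: fill missing values with the column median (robust to skew)
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# categorical columns: fill missing values with the most frequent value
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].fillna(df[obj_cols].mode().iloc[0])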

MShooshtari commented 5 years ago

> I think the best option would be: if we find that most of the values in a specific column are zero, then we can drop that feature from our model. [...] I will drop 5 columns from the dataframe because they are unlikely to influence the sale price!

Nice idea. I guess the same applies to any dominant value in a column: if a single value repeats 95% of the time (or whatever similar threshold), we should remove that column, correct?

MShooshtari commented 5 years ago

> To save time exploring the data, I suggest that you all pip install the pandas-profiling module. It generates an HTML report that summarizes most of what we might need to investigate further. [...]


Thanks, I will give it a try :)

MShooshtari commented 5 years ago

Using the following code, I noticed that many of the columns have this issue, and I removed them:

drop_thresh = 0.8  # drop a column if NaNs or a single repeated value exceed this share
drop_list = []
for feature in train_df:
    col_df = train_df[feature]
    # fraction of missing values in the column
    count_nan = col_df.isnull().sum()
    nan_ratio = count_nan / len(col_df)
    # fraction of rows taken by the most frequent value
    repeats = train_df.pivot_table(index=[feature], aggfunc='size').sort_values()
    max_repeat_ratio = repeats.max() / len(col_df)
    if nan_ratio > drop_thresh or max_repeat_ratio > drop_thresh:
        drop_list.append(feature)

train_df_repeats_removed = train_df.drop(drop_list, axis=1)

print(train_df.shape)
print(train_df_repeats_removed.shape)
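
As a side note, the dominance ratio can also be computed with value_counts instead of a per-column pivot table; both ignore NaN values, so the result is the same:

# equivalent: share of the most frequent value in the column
max_repeat_ratio = train_df[feature].value_counts().max() / len(train_df)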

I am closing this issue as I got my answer :)