102207429 / Expedia

Data Science and Big Data Analytics

features creation #7

Open GregoryWu opened 7 years ago

GregoryWu commented 7 years ago

Maybe we can create some features ourselves. For example:

    for df in combine:
        # share of the total area that is living area
        df['life_pct'] = df['life_sq'] / df['full_sq'].astype(float)
        # share of the total area taken up by the kitchen
        df['rel_kitch'] = df['kitch_sq'] / df['full_sq'].astype(float)
        # floor of the flat relative to the number of floors in the building
        df['rel_floor'] = df['floor'] / df['max_floor'].astype(float)
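
The ratios above divide by full_sq and max_floor. A minimal runnable sketch follows, assuming those columns may contain zeros in the raw data (the thread does not say whether they do). The toy values and the helper `safe_ratio` are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: column names follow the snippet above, values are made up.
train = pd.DataFrame({
    "full_sq": [50, 0, 80],
    "life_sq": [30, 10, 60],
    "kitch_sq": [10, 5, 8],
    "floor": [3, 1, 9],
    "max_floor": [9, 0, 9],
})
combine = [train]


def safe_ratio(num, den):
    """Divide two columns, giving NaN where the denominator is 0 (hypothetical helper)."""
    return num / den.replace(0, np.nan).astype(float)


for df in combine:
    # Assigning a column mutates the frame that is stored in `combine`.
    df["life_pct"] = safe_ratio(df["life_sq"], df["full_sq"])
    df["rel_kitch"] = safe_ratio(df["kitch_sq"], df["full_sq"])
    df["rel_floor"] = safe_ratio(df["floor"], df["max_floor"])
```

Rows with a zero denominator end up as NaN instead of inf, which most tree-based models and imputers can handle explicitly.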

And for the timestamp and sub_area columns:

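    for df in combine:
        # number of rows in the same month, keyed as year*100 + month (e.g. 201401)
        month_year = df.timestamp.dt.month + df.timestamp.dt.year * 100
        month_year_cnt_map = month_year.value_counts().to_dict()
        df['month_year_cnt'] = month_year.map(month_year_cnt_map)

        # number of rows in the same week, keyed as year*100 + week number
        # (dt.weekofyear was removed in pandas 2.0; dt.isocalendar().week replaces it)
        week_year = df.timestamp.dt.isocalendar().week.astype(int) + df.timestamp.dt.year * 100
        week_year_cnt_map = week_year.value_counts().to_dict()
        df['week_year_cnt'] = week_year.map(week_year_cnt_map)

        df['dow'] = df.timestamp.dt.dayofweek  # 0 = Monday
        df['month'] = df.timestamp.dt.month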
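
To see what the count features produce, here is a small sketch on made-up dates. One caveat it illustrates: combining the calendar year with the ISO week number gives 2014-12-29 (ISO week 1 of 2015) the same key as early January 2014. Using the ISO year from `dt.isocalendar()` avoids that collision; this is a variation on the snippet above, not something the thread proposes.

```python
import pandas as pd

# Made-up timestamps, only for illustration.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2014-01-02", "2014-01-20", "2014-02-03", "2014-12-29"]
    )
})

# Year and month packed into one integer key, e.g. 201401.
month_year = df["timestamp"].dt.year * 100 + df["timestamp"].dt.month
# Map each row's key to the number of rows sharing that key.
df["month_year_cnt"] = month_year.map(month_year.value_counts())

# ISO year and ISO week taken together, so weeks that straddle
# New Year do not collide with a week of the wrong year.
iso = df["timestamp"].dt.isocalendar()
week_year = (iso["year"] * 100 + iso["week"]).astype("int64")
df["week_year_cnt"] = week_year.map(week_year.value_counts())
```

Passing the `value_counts()` Series straight to `map` does the same job as converting it to a dict first.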

For sub_area, use the frequency count of each area, so that a nominal value that occurs more often gets a higher weight:

    bairro = df.sub_area
    bairro_cnt_map = bairro.value_counts().to_dict()
    df['bairro_cnt'] = bairro.map(bairro_cnt_map)
    # the positional axis argument (df.drop('sub_area', 1)) no longer works in pandas 2.0
    df = df.drop(columns='sub_area')
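
A runnable sketch of this frequency encoding over a list of frames, on made-up data. Two things here are assumptions of the sketch rather than part of the snippet above: the counts are learned on the training frame only, and unseen areas get a count of 0. The names `train`, `test` and `area_counts` are hypothetical.

```python
import pandas as pd

# Made-up data, only for illustration.
train = pd.DataFrame({"sub_area": ["A", "B", "A", "C", "A"]})
test = pd.DataFrame({"sub_area": ["B", "D"]})
combine = [train, test]

# Counts learned on the training frame only (assumption of this sketch).
area_counts = train["sub_area"].value_counts()

for df in combine:
    # Areas never seen in training map to NaN, replaced here by 0.
    df["bairro_cnt"] = df["sub_area"].map(area_counts).fillna(0).astype(int)
    # inplace=True changes the frame held in `combine`; writing
    # `df = df.drop(...)` would only rebind the loop variable.
    df.drop(columns="sub_area", inplace=True)
```

Inside a `for df in combine` loop, column assignment and in-place operations reach the original frames, while reassigning `df` does not.
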
102207429 commented 7 years ago

I like the idea of counting the frequency of each area. But I'm a little confused about what "df" means...

GregoryWu commented 7 years ago

df stands for DataFrame. That's just pseudocode.

102207429 commented 7 years ago

Oh, I see. Thanks.