data cleaning work - Githubissues

Talk-Data-2-Me / SLMR

https://www.kaggle.com/c/springleaf-marketing-response

data cleaning work #5

Open xiaoxiding opened 8 years ago

xiaoxiding commented 8 years ago

Categorical: create dummy variables (for levels A, B, C: Cat_A = 0 or 1; Cat_B = 0 or 1; Cat_C = 0 or 1). Lin has code. Dummy: 0 or 1 (no missing).

Numeric: treat missing values (fill with 0 or the median). Lin has code. Create a missing indicator (optional). Cap and floor (1%, 99%): Lin to research whether tree models need cap and floor; Evgeny to search for code to do it.
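A minimal sketch of the numeric treatment described above, assuming pandas is used. The function name `impute_numeric`, the `_missing` suffix, and the toy column `x` are hypothetical and not taken from Lin's code.

```python
import numpy as np
import pandas as pd

def impute_numeric(df, cols, strategy="median"):
    """Fill missing values in numeric columns and add a 0/1 missing indicator.

    Hypothetical helper: strategy is "median" or "zero".
    """
    out = df.copy()
    for col in cols:
        # Build the indicator first, before the missing values are overwritten.
        out[col + "_missing"] = out[col].isna().astype(int)
        fill_value = out[col].median() if strategy == "median" else 0
        out[col] = out[col].fillna(fill_value)
    return out

# Toy data, not from the Springleaf set.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0, 5.0]})
result = impute_numeric(toy, ["x"])
print(result)
```

The indicator keeps the information that a value was missing, which can still be useful to a model after the gap has been filled.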

Variable list (should be a reasonable list): loop through it. Evgeny has the categorical list.

New list of all variables: contains dummy and numeric
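One way the combined list could be assembled, as a rough sketch. The column names `Cat`, `num1`, and `num2` are toy placeholders, and `categorical_vars` stands in for Evgeny's list.

```python
import pandas as pd

# Toy frame with one categorical and two numeric columns (placeholder names).
toy = pd.DataFrame({
    "Cat": ["A", "B", "C", "A"],
    "num1": [1.0, 2.0, 3.0, 4.0],
    "num2": [10, 20, 30, 40],
})
categorical_vars = ["Cat"]  # stand-in for the real categorical list

# get_dummies replaces each listed column with one 0/1 column per level
# and leaves the remaining columns untouched.
encoded = pd.get_dummies(toy, columns=categorical_vars, dtype=int)

# The new variable list then contains both the numeric and the dummy columns.
all_vars = list(encoded.columns)
print(all_vars)
```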

HappyAWolf commented 8 years ago

In terms of creating dummy variables (or indicator variables) for the categorical features, we can use the get_dummies function in pandas (pd.get_dummies) directly.

Official documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
More examples:
http://stackoverflow.com/questions/29221894/pandas-get-dummies-vs-categorical
http://stackoverflow.com/questions/24109779/running-get-dummies-on-several-dataframe-columns
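A small sketch of `pd.get_dummies` on a toy column, to show how it could also cover the "no missing" requirement for dummies. The `dummy_na=True` option adds a separate indicator column for missing values; whether we want that for the real data is an open choice, and the column name `Cat` is a placeholder.

```python
import pandas as pd

# Toy categorical column with one missing value (placeholder data).
toy = pd.DataFrame({"Cat": ["A", "B", None, "C"]})

# dummy_na=True gives missing values their own 0/1 column,
# so every row has exactly one dummy set to 1 and no NaN remains.
dummies = pd.get_dummies(toy, columns=["Cat"], dummy_na=True, dtype=int)
print(dummies)
```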

evgenymun commented 8 years ago

Cool

evgenymun commented 8 years ago

Here is the list I found in one of the examples. We will have to double-check it: 'VAR_0001', 'VAR_0005', 'VAR_0008', 'VAR_0009', 'VAR_0010', 'VAR_0011', 'VAR_0012', 'VAR_0043', 'VAR_0044', 'VAR_0073', 'VAR_0075', 'VAR_0156', 'VAR_0157', 'VAR_0158', 'VAR_0159', 'VAR_0166', 'VAR_0167', 'VAR_0168', 'VAR_0169', 'VAR_0176', 'VAR_0177', 'VAR_0178', 'VAR_0179', 'VAR_0196', 'VAR_0200', 'VAR_0202', 'VAR_0204', 'VAR_0205', 'VAR_0214', 'VAR_0216', 'VAR_0217', 'VAR_0222', 'VAR_0226', 'VAR_0229', 'VAR_0230', 'VAR_0232', 'VAR_0236', 'VAR_0237', 'VAR_0239', 'VAR_0274', 'VAR_0283', 'VAR_0305', 'VAR_0325', 'VAR_0342', 'VAR_0352', 'VAR_0353', 'VAR_0354', 'VAR_0404', 'VAR_0466', 'VAR_0467', 'VAR_0493', 'VAR_1934'
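A possible way to double-check the list against the loaded data, sketched on a toy frame. The helper `check_categorical_list` is hypothetical, the values are made up, and `VAR_9999` is an invented column used only to show the "unlisted" case.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_categorical_list(df, candidates):
    """Compare a candidate categorical list with the columns of df (hypothetical helper)."""
    present = [c for c in candidates if c in df.columns]
    return {
        # Listed, but not found in the data at all.
        "not_in_data": [c for c in candidates if c not in df.columns],
        # Listed, but stored as numbers: may be coded categories, review by hand.
        "listed_but_numeric": [c for c in present if is_numeric_dtype(df[c])],
        # Text columns that the list does not mention.
        "text_but_unlisted": [
            c for c in df.columns
            if c not in candidates and not is_numeric_dtype(df[c])
        ],
    }

# Toy frame with made-up values, not the real Springleaf data.
toy = pd.DataFrame({
    "VAR_0001": ["H", "R"],
    "VAR_0005": [1, 2],
    "VAR_9999": ["x", "y"],
})
report = check_categorical_list(toy, ["VAR_0001", "VAR_0005", "VAR_0008"])
print(report)
```

A numeric dtype does not prove a variable is not categorical, so the second group is only a prompt for a manual look.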

evgenymun commented 8 years ago

Is this what we need for cap and floor? http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.scoreatpercentile.html#scipy.stats.scoreatpercentile

evgenymun commented 8 years ago

NumPy also has it:

    import numpy as np

    a = np.array([1, 2, 3, 4, 5])
    p = np.percentile(a, 50)  # returns the 50th percentile, i.e. the median
    print(p)  # 3.0
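Building on the percentile call, cap and floor at 1% and 99% could look roughly like this. The function name `cap_and_floor` is hypothetical, and `np.nanpercentile` is used on the assumption that some columns may still contain missing values when this step runs.

```python
import numpy as np
import pandas as pd

def cap_and_floor(series, lower_pct=1, upper_pct=99):
    """Clip values below the lower percentile and above the upper percentile."""
    # nanpercentile ignores NaN when computing the cut-offs.
    lo, hi = np.nanpercentile(series, [lower_pct, upper_pct])
    return series.clip(lower=lo, upper=hi)

# Toy data: the values 1.0 through 101.0.
values = pd.Series(np.arange(1, 102, dtype=float))
capped = cap_and_floor(values)
print(capped.min(), capped.max())  # 2.0 100.0
```

On whether trees need this: tree splits depend on the ordering of values rather than their magnitude, so capping is likely to matter less for tree models than for linear ones, but that is worth confirming on our data.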