Triamus / play

play repo for experiments (mainly with git)
data quality + data preprocessing from Azure ML course #13

Open Triamus opened 7 years ago

Triamus commented 7 years ago

http://datascience.codata.org/articles/10.5334/dsj-2015-002/

https://dzone.com/articles/how-to-rock-data-quality-checks-in-the-data-lake

http://bubbles.databrewery.org/

http://www.stiivi.com/about.html

http://www.bigdataeverywhere.com/files/denver/BDE_Data_Governance_KAMREDDY.pdf

- scatterplot matrix
- convert categorical features to binary numeric features (indicator/dummy variables)
- repeated values can cause bias because they carry overstated weight
- missing values: remove the row, substitute a specific value, interpolate, forward/backward fill, impute
- R: !duplicated() to keep only the first occurrence of repeated rows
- visualizing outliers with a scatterplot matrix (pairs plot)
- treatment of outliers: censor, trim, interpolate, substitute
- scaling of numeric variables: treat outliers before scaling, e.g. before z-value scaling
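
A minimal sketch of a few of the steps noted above (dropping repeated rows, forward fill, indicator variables, censoring outliers, z-value scaling), using only the Python standard library. All function names here are hypothetical and chosen for illustration; this is not code from the course, and the censoring limits in the usage note are arbitrary assumptions.

```python
from statistics import mean, pstdev


def drop_duplicates(rows):
    """Keep the first occurrence of each row (similar in spirit to R's !duplicated())."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def forward_fill(values):
    """Replace None with the last observed value; leading None values stay None."""
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def one_hot(categories):
    """Turn a categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(categories))
    return {level: [1 if c == level else 0 for c in categories] for level in levels}


def censor(values, lower, upper):
    """Censor outliers by clipping every value into [lower, upper]."""
    return [min(max(value, lower), upper) for value in values]


def z_scale(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot z-scale a constant column")
    return [(value - mu) / sigma for value in values]
```

One possible order of use, following the note that outliers should be treated before scaling: `z_scale(censor(column, lower, upper))`, where `lower` and `upper` are limits you choose for your data. Censoring first keeps a single extreme value from inflating the standard deviation used for scaling.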