JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

Naming Convention for FeatureNormalizer #27

Open asbisen opened 7 years ago

asbisen commented 7 years ago

FeatureNormalizer transforms the matrix X using (X - μ)/σ, which corresponds to StandardScaler in Scikit-Learn, whereas the Normalizer in Scikit-Learn scales each sample to unit norm. I was wondering if we should rename FeatureNormalizer to FeatureStandardizer or something to that effect.
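To make the distinction concrete, here is a small NumPy sketch (with made-up data) of the two operations being contrasted: standardization centers and scales each feature, while normalization in the Scikit-Learn sense rescales each sample to unit norm.

```python
import numpy as np

# Hypothetical 3x2 data matrix: rows are observations, columns are features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Standardization, i.e. (X - mu) / sigma per feature,
# like StandardScaler (and FeatureNormalizer):
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization in the Scikit-Learn Normalizer sense:
# scale each sample (row) to unit L2 norm.
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

print(standardized.mean(axis=0))           # each feature now has mean ~0
print(np.linalg.norm(normalized, axis=1))  # each row now has norm ~1
```

The two produce very different matrices, which is why reusing the word "normalize" for both invites confusion.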

Also, is there a reason FeatureNormalizer expects the matrix with features represented as rows rather than columns?

And for the last issue, I don't know whether Scikit-Learn or MLDataUtils has the correct behavior, but there is a slight inconsistency in how the two compute the standard deviation: Scikit-Learn's StandardScaler divides the sum of squared deviations by n, while we divide it by n-1.
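The n vs n-1 difference is the population versus sample standard deviation, which NumPy exposes via the `ddof` argument. A quick illustration (with made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)

# Dividing the sum of squared deviations by n (ddof=0) gives the
# population standard deviation, which is what StandardScaler uses.
pop_std = np.std(x, ddof=0)

# Dividing by n-1 (ddof=1) gives the sample standard deviation,
# which is the convention described for MLDataUtils above.
sample_std = np.std(x, ddof=1)

# The two differ by a constant factor of sqrt(n / (n - 1)).
print(pop_std, sample_std)
```

For small n the factor sqrt(n/(n-1)) is noticeable, so standardized outputs from the two libraries will not match exactly even on identical data.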

Reference: Scikit-Learn Standardize
Reference: Scikit-Learn Normalize

Evizero commented 7 years ago

Hi! All good feedback. The FeatureNormalizer is quite old and a little outdated; I will rewrite it at some point. I think it would be a good idea to give results consistent with either Scikit-learn or Caret (the R package), neither of which I checked when I wrote this.

The row vs column thing has to do with Julia's column-major memory layout, but after a rewrite it will be possible to choose the observation dimension, similar to how LossFunctions allows it.