JuliaML / MLPreprocessing.jl


WIP feature scalings #1

Closed abieler closed 7 years ago

abieler commented 7 years ago

Got started on the feature scaling. A few changes I suggest:

1) Rename rescale to standardize.
2) Rename FeatureNormalizer to StandardScaler.
3) Use transform instead of predict for rescaling data.
4) Add fixedrange as a scaling function that scales the data to a fixed range (lower:upper), defaulting to (0:1).
5) Add FixedRangeScaler.

This is all still not fully functional, and I'll keep working on it over the next few days.
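
Roughly, the usage I have in mind (nothing here is final, just to illustrate the proposed names):

X = rand(100, 4)                 # toy data: 100 observations, 4 features

standardize!(X)                  # was rescale: zero mean / unit variance per feature
fixedrange!(X, 0, 1)             # scale each feature to a fixed range, default (0:1)

scaler = fit(StandardScaler, X)  # was FeatureNormalizer
Xt = transform(X, scaler)        # was predict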

Evizero commented 7 years ago

Very cool! Thanks for working on this. My main feedback at this point is to try and make sure that the types are concrete and inferable, i.e. all member variables of an immutable type should be concrete:

immutable Foo
  x::Vector # not concrete
  obsdim::ObsDim.Constant # not concrete
end
immutable Foo2{T,N}
  x::Vector{T} # better
  obsdim::ObsDim.Constant{N} # better
end
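
A quick way to verify this is to check the field types (isleaftype was renamed to isconcretetype in later Julia versions):

isleaftype(fieldtype(Foo, :x))               # false: Vector without an element type is abstract
isleaftype(fieldtype(Foo2{Float64, 1}, :x))  # true:  Vector{Float64} is concrete
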
abieler commented 7 years ago

Question on how to handle DataFrames: for the lower-level functions such as center! and standardize!, the user can specify the column names to which the transformation is applied. The question now is how to implement StandardScaler and UnitRangeScaler, such as:

immutable StandardScaler{T,U,M}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
end

which for Arrays assumes that every column is scaled. Add an additional Vector containing column names, which is ignored for Arrays? Or compute the parameters such that the transformation has no effect on the unselected columns, e.g. offset = 0 and scale = 1 for columns that are not picked (which leads to unnecessary computations...)? Other ideas?

Evizero commented 7 years ago

Maybe an opt_in field, which for arrays could be a Vector{Bool} indicating whether a feature is to be scaled; for a DataFrame it could be a Vector{Symbol}.

So something like:

immutable StandardScaler{T,U,M,I}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
    opt_in::Vector{I}
end

This could allow for selective scaling even for arrays.
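
For illustration only (constructors and the ObsDim instance here are assumptions based on the sketch above, not a final API), constructing such a scaler could look like:

offset = [1.0, 0.0, 3.5]
scale  = [2.0, 1.0, 0.7]

# Matrix case: a Bool mask over the three features
StandardScaler(offset, scale, ObsDim.Constant{1}(), [true, false, true])

# DataFrame case: the column names to scale
StandardScaler(offset, scale, ObsDim.Constant{1}(), [:age, :height, :weight])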

abieler commented 7 years ago

Yes, I was thinking along those lines too, but with indices instead of a Bool mask, which would be more convenient for larger data sets with only a few columns to scale. Agree?

Evizero commented 7 years ago

Sure, I trust your instincts on that one

abieler commented 7 years ago

Finally had some time to get back to this. I think functionality-wise it is about where I want it to be. What do you think? Basic usage would be:

scaler = fit(StandardScaler, X[, mu, sigma; obsdim, operate_on])
scaler = fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])

transform(X, scaler)
transform!(X, scaler)

This works for both Matrix and DataFrame. The Vector operate_on picks (by index or column name) the "columns" on which the scaling occurs; it defaults to all columns for a Matrix and to all numeric columns for a DataFrame.
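
For example (toy data and made-up column names, just to show the calls above in context):

using DataFrames

X  = rand(100, 3)                                  # 100 observations (rows), 3 features
df = DataFrame(age = rand(20:80, 100), height = 150 .+ 40 .* rand(100))

scaler = fit(StandardScaler, X; obsdim = 1, operate_on = [1, 3])  # scale features 1 and 3
Xt = transform(X, scaler)                          # scaled copy
transform!(X, scaler)                              # scale X in place

dfscaler = fit(FixedRangeScaler, df, 0, 1; operate_on = [:age, :height])
dft = transform(df, dfscaler)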

StandardScaler, fit, and transform are in line with scikit-learn naming. FixedRangeScaler is called MinMaxScaler in scikit-learn, which I don't really like as a name.

Further, they have a fit_transform function that does both steps in one call, which is neat and easy to implement.

The name operate_on is not very satisfying to me, but acceptable for now or until someone comes up with a better name :)

Evizero commented 7 years ago

Hi! Very cool. I will review this as soon as I find some spare time.

Evizero commented 7 years ago

Just a quick comment: there is StatsModels.jl, which deals with things like contrast coding (e.g. one-hot encoding for columns). I suspect we might be better off leaving that task to the statisticians and instead focusing on the ML-specific parts when it comes to data frames.

abieler commented 7 years ago

Regarding one-hot encoding: do you suggest leaving it out of MLPreprocessing completely, or implementing their methods to be used here?

Evizero commented 7 years ago

Regarding one-hot encoding: do you suggest leaving it out of MLPreprocessing completely

I suggest leaving it out completely. In general it's fair to assume that users convert their DataFrame to an Array themselves. I do like that we provide feature scaling for DataFrames, but in general we should focus on arrays.

abieler commented 7 years ago

Getting back to the one-hot encoding topic: StatsModels currently only works with DataTables, not DataFrames, so it is not compatible with MLPreprocessing at this point. Furthermore, it achieves something slightly different by going from a DataTable to a Matrix. I personally prefer doing the encoding (as much preprocessing as possible, really) on the DataFrame level and then doing a straightforward convert(Matrix, DataFrame). So the onehot!() implementation here is different from what is in StatsModels. I feel like this functionality should be present somewhere in the JuliaML universe, or maybe in DataFrames or DataTables directly. What do you think?
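
To make the workflow I mean concrete, a hand-rolled illustration (this is not the onehot!() from this PR, just the general idea with the era's df[:col] indexing):

using DataFrames

df = DataFrame(x = randn(6), color = ["red", "blue", "red", "green", "blue", "red"])

for lvl in unique(df[:color])                       # one 0/1 indicator column per level
    df[Symbol("color_", lvl)] = Float64.(df[:color] .== lvl)
end

X = convert(Matrix, df[[:x, :color_red, :color_blue, :color_green]])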

Evizero commented 7 years ago

currently only works with DataTables

I know, but still, I want to avoid duplicated effort on the table front, especially because I am not a statistician and am only somewhat aware of all the nuances that come with it. For example, one-hot encoding is rarely used for categorical variables, since it introduces a redundant feature that is better off being omitted; the "bias" is enough to capture that information (so the more widely used form of encoding is "dummy encoding").

I personally prefer doing the encoding (as much preprocessing as possible, really) on the DataFrame level and then doing a straightforward convert(Matrix, DataFrame)

I think that is rather atypical. Coming from R, what one usually does there is the following (rough Julia sketch after the list):

  1. Somehow end up with a dataframe containing the variables one needs.
  2. Specify a "formula" for how to translate that dataframe into a feature matrix and targets.
  3. Do the ML-specific preprocessing on the feature matrix.
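
In Julia terms, step 2 is roughly what the formula machinery is for; a hedged sketch (the exact StatsModels API has shifted between versions):

using DataFrames, StatsModels

df = DataFrame(y = randn(10), x1 = randn(10), x2 = randn(10))  # 1. dataframe with the variables

f  = @formula(y ~ x1 + x2)        # 2. formula: dataframe -> feature matrix + targets
mf = ModelFrame(f, df)
X  = ModelMatrix(mf).m            # design matrix (contrast coding of categoricals happens here)
y  = df[:y]

# 3. ML-specific preprocessing on X, e.g. scaler = fit(StandardScaler, X)
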
Evizero commented 7 years ago

If you feel strongly about it, we can obviously still consider including such functionality. But in that case it would be easier to review/discuss in its own dedicated PR.

abieler commented 7 years ago

Alright. Removed the encoding part and switched back the argument order for fit and fit_transform.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-30.5%) to 65.588% when pulling c9a396aed47c81a66c5f9a5e9621708e9ba064a7 on fixedrange into 25f6d6053b4b9ab5a626b6f449588db409bc4a2d on master.

Evizero commented 7 years ago

Looking great! One last thing I noticed: the obsolete file featurenormalizer.jl still needs to be deleted, I think. Other than that, this looks awesome and ready to merge.

One discussion point we should consider in the future is whether we should switch from the corrected std to the uncorrected one (i.e. std(x, corrected=false)).

abieler commented 7 years ago

Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?

abieler commented 7 years ago

Removed feature_normalizer. I'll start on the README next.

Evizero commented 7 years ago

Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?

Yes. My thoughts are mostly motivated by consistency with what other frameworks seem to do.
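
For reference, the difference in Base (std moved to the Statistics stdlib in Julia 1.0):

x = randn(1000)
std(x)                    # corrected:   divides the sum of squared deviations by N - 1
std(x, corrected=false)   # uncorrected: divides by N, which is what e.g. scikit-learn's StandardScaler uses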

abieler commented 7 years ago

Alright. This works now in v0.5 and v0.6. Ready to merge from my side.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-28.3%) to 67.781% when pulling 47891878dcf64dcb591793d1772993c6ebbd637e on fixedrange into 25f6d6053b4b9ab5a626b6f449588db409bc4a2d on master.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-28.3%) to 67.781% when pulling 686a308b1f5f40486f016426565733c8cc18789c on fixedrange into 25f6d6053b4b9ab5a626b6f449588db409bc4a2d on master.

Evizero commented 7 years ago

Awesome! This is a really big improvement!