Closed · abieler closed this 7 years ago
Very cool! Thanks for working on this. My main feedback at this point is to make sure that the types are concrete and inferrable, i.e. all member variables of an immutable type should be concrete:
immutable Foo
    x::Vector                  # not concrete
    obsdim::ObsDim.Constant{}  # not concrete
end
immutable Foo2{T,N}
    x::Vector{T}                # better
    obsdim::ObsDim.Constant{N}  # better
end
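The difference can be checked directly. A small sketch in current syntax (`immutable` was renamed to `struct` in later Julia versions); the type names here just mirror the example above:

```julia
# Field declared with an abstract type: the compiler cannot infer
# the element type from a Foo instance alone.
struct Foo
    x::Vector  # Vector{T} where T -- not concrete
end

# Parameterized field: concrete once T is fixed at construction.
struct Foo2{T}
    x::Vector{T}
end

isconcretetype(fieldtype(Foo, :x))            # false
isconcretetype(fieldtype(Foo2{Float64}, :x))  # true
```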
Question on how to handle DataFrames: for the lower-level functions such as center! and standardize!, the user can specify column names on which the transformation is applied. The question now is about the implementation of StandardScaler and UnitRangeScaler, such as:
immutable StandardScaler{T,U,M}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
end
which for Arrays assumes that every column is scaled. Should we add an additional Vector containing column names, which is ignored for Arrays? Or compute the parameters such that the transformation has no effect on the unselected columns, e.g. offset=0 and scale=1 for columns that are not picked (which leads to unnecessary computations)? Other ideas?
Maybe opt_in, which for arrays could be a Vector{Bool} indicating if a feature is to be scaled; for a DataFrame it could be a Vector{Symbol}. So:
immutable StandardScaler{T,U,M,I}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
    opt_in::Vector{I}
end
This could allow for selective scaling even for arrays.
Yes, I was thinking along those lines too, but with indices instead of a Bool, which would be more convenient for larger data sets with only a few columns to scale. Agree?
Sure, I trust your instincts on that one
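The index-based opt_in idea could be sketched like this (all names here are hypothetical, not the package's API):

```julia
# Hypothetical sketch: standardize only the opted-in columns of a matrix,
# with opt_in given as column indices (more convenient than a Bool mask
# when only a few of many columns are scaled).
function standardize_cols!(X::Matrix{Float64}, opt_in::Vector{Int})
    n = size(X, 1)
    for j in opt_in
        col = view(X, :, j)
        mu = sum(col) / n
        sigma = sqrt(sum(abs2, col .- mu) / (n - 1))  # corrected std
        sigma == 0 && continue                        # skip constant columns
        col .= (col .- mu) ./ sigma
    end
    return X
end

X = [1.0 10.0; 2.0 10.0; 3.0 10.0]
standardize_cols!(X, [1])  # column 2 is left untouched
```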
Finally had some time to get back to this. I think functionality-wise it is about where I want it to be. What do you think? Basic usage would be:
scaler = fit(StandardScaler, X[, mu, sigma; obsdim, operate_on])
scaler = fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
transform(X, scaler)
transform!(X, scaler)
This works for both Matrix and DataFrames.
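A stripped-down sketch of what such a fit/transform pair might compute for the Matrix case (hypothetical simplified names, ignoring obsdim and operate_on):

```julia
# Hypothetical simplified scaler: per-column mean and (corrected) std.
struct SimpleStandardScaler
    offset::Vector{Float64}  # per-column mean
    scale::Vector{Float64}   # per-column std
end

function simple_fit(X::Matrix{Float64})
    n, d = size(X)
    offset = vec(sum(X, dims=1)) ./ n
    scale = [sqrt(sum(abs2, X[:, j] .- offset[j]) / (n - 1)) for j in 1:d]
    return SimpleStandardScaler(offset, scale)
end

# Non-mutating transform; a transform! variant would update X in place.
simple_transform(X::Matrix{Float64}, s::SimpleStandardScaler) =
    (X .- s.offset') ./ s.scale'
```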
The Vector operate_on picks (by index or column name) on which "columns" the scaling occurs; it defaults to all columns for a Matrix and to all numeric-type columns for a DataFrame.
StandardScaler, fit, and transform are in line with scikit-learn naming. FixedRangeScaler is called MinMaxScaler in scikit-learn, which I don't really like as a name.
Further, they have a fit_transform function which does both steps in one call, which is neat and easy to implement.
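fit_transform is essentially a composition of the two steps. A hypothetical sketch for the fixed-range case (all names here are illustrative, not the package's API):

```julia
# Hypothetical fit_transform for a fixed-range (min-max style) scaler.
struct RangeParams
    lo::Vector{Float64}  # per-column minimum
    hi::Vector{Float64}  # per-column maximum
end

function fit_transform_range(X::Matrix{Float64}; lower=0.0, upper=1.0)
    lo = vec(minimum(X, dims=1))
    hi = vec(maximum(X, dims=1))
    # Map each column linearly from [lo, hi] to [lower, upper].
    Xt = lower .+ (X .- lo') ./ (hi' .- lo') .* (upper - lower)
    return Xt, RangeParams(lo, hi)
end
```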
The name operate_on is not very satisfying to me, but acceptable for now, or until someone comes up with a better name :)
Hi! Very cool. I will review this as soon as I find some spare time.
Just a quick comment: there is StatsModels.jl, which deals with things like contrast coding (e.g. one-hot for columns). I suspect we might be better off leaving that task to the statisticians and instead focusing on the ML-specific parts when it comes to data frames.
Regarding one-hot encoding: do you suggest leaving it out of MLPreprocessing completely, or implementing their methods to be used here?
Regarding one-hot encoding. do you suggest leaving it off completely of MLPreprocessing
I suggest leaving it out completely. In general it's fair to assume that users convert their DataFrame to an Array themselves. I do like that we provide feature scaling for DataFrames, but in general we should focus on arrays.
Getting back to the one-hot encoding topic: StatsModels currently only works with DataTables, not DataFrames, so it is not compatible with MLPreprocessing at this point. Further, what it achieves is slightly different, going from a DataTable to a Matrix. I personally prefer doing the encoding (as much preprocessing as possible, really) at the DataFrame level and then doing a straightforward convert(Matrix, DataFrame). So the onehot!() implementation here is different from what is in StatsModels. I feel like this functionality should be present somewhere in the JuliaML universe, or maybe in DataFrames or DataTables directly. What do you think?
currently only works with DataTables
I know, but still. I want to avoid duplicated effort on the table front, especially because I am not a statistician and am only somewhat aware of all the nuances that come with that. For example, one-hot encoding is rarely used for categorical variables, since it introduces a redundant feature that is better off being omitted; the "bias" is enough to capture that information (so the more widely used form of encoding is "dummy encoding").
I personally prefer doing the encoding (as much preprocessing as possible really) on the DataFrame level and then do a straightforward convert(Matrix, DataFrame)
I think that is rather untypical. Coming from R, what one usually does there is
If you feel strongly about it we can obviously still consider including such functionality. But in that case it would be easier to review/discuss in its own dedicated PR
Alright. Removed the encoding part and switched back the argument order for fit and fit_transform.
Looking great! One last thing I noticed: the obsolete file featurenormalizer.jl still needs to be deleted, I think. Other than that this looks awesome and ready to merge.
One discussion point we should consider in the future is whether we should switch from the corrected std to the uncorrected one (i.e. std(x, corrected=false)).
Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?
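For reference, the distinction as exposed by Julia's Statistics standard library:

```julia
using Statistics

x = [1.0, 2.0, 3.0]
std(x)                    # corrected, divides by N-1: exactly 1.0 here
std(x, corrected=false)   # uncorrected, divides by N: sqrt(2/3) ≈ 0.8165
```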
Removed feature_normalizer. I'll start on the README next.
Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?
Yes. My thoughts are mostly motivated by consistency with what other frameworks seem to do.
Alright. This works now in v0.5 and v0.6. Ready to merge from my side.
Awesome! This is a really big improvement!
Got started on the feature scaling. A few changes I suggest:
1) rename rescale to standardize
2) rename FeatureNormalizer to StandardScaler
3) use transform instead of predict for rescaling data
4) add fixedrange as a scaling function which scales the data to a fixed range (lower:upper), which defaults to (0:1)
5) add FixedRangeScaler
This all is still not fully functional and I'll keep working on it in the next few days.