I like it on first glance. What happens if some element is NA? Does it error or does it propagate?
At the moment it errors at convert(). I planned on putting in a fallback that works on DataArrays directly if the column has NAs, though.
I think the most neutral solution for NA is that it just propagates. If a column contains NA then the resulting column is all NA. This way a user is made aware that NAs have to be dealt with before centering.
Not sure I understand this correctly... This would mean all data in columns containing NAs is replaced with NAs, hence the data is lost?
Yes, since mean(...) of a vector containing NA is in turn NA, and multiplying any number with NA is also NA, I would expect to end up with a column of NAs if I don't pay attention to missing data.
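For illustration, this is the propagation behavior in question, using the DataArrays-era NA this thread refers to (a sketch; exact output depends on the package version):

```julia
using DataArrays

x = @data([1.0, 2.0, NA, 4.0])

mean(x)       # NA -- a reduction over a vector containing NA yields NA
x .- mean(x)  # every element becomes NA, since arithmetic with NA propagates
```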
Hmm... I could see myself in a mentally bad place in the middle of a long interactive session, then losing data because I forgot to take care of an NA :) What about skipping the column and printing a warning to take care of NAs before rescaling?
I like it. The important thing to me is that NAs aren't silently ignored.
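A minimal sketch of this skip-and-warn behavior, assuming the DataArrays-era API (the function name and internals here are hypothetical, not the actual PR code):

```julia
using DataFrames, DataArrays

function rescale!(df::DataFrame)
    for name in names(df)
        col = df[name]
        eltype(col) <: Real || continue   # non-numeric columns are skipped
        if any(isna(col))
            warn("Column $name contains NA; skipping. Handle missing data before rescaling.")
            continue
        end
        df[name] = (col .- mean(col)) ./ std(col)   # standardize the column
    end
    df
end
```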
OK, good. I'll implement it that way and add some tests later tonight.
Thanks for working on this, by the way. This will be a really nice contribution.
Do you feel it would be worth giving the user a choice for the rescaled column types? I don't think this is necessary, as they probably convert the DataFrame to a matrix before feeding it to an ML algorithm. At that point the user can still modify the type.
You also mention a future MLPreprocessing package in the issues. I guess the feature scaling would be part of this. Have you put any more thought into this?
Very nice.
> I don't think this is necessary, as they probably convert the DataFrame to a matrix before feeding it to an ML algorithm.
I agree
> You also mention a future MLPreprocessing package in the issues. I guess the feature scaling would be part of this. Have you put any more thought into this?
Yes. Well, the code would move there, basically. I did a similar thing with MLDataPattern. I don't think I'll get to this soon, as other things are higher priority, but once I do I'll be sure to contact the original authors. The approach I took with MLDataPattern is to credit the original authors of a piece of functionality as the authors of the commits that move that code.
Do you consider this merge ready? Because then I'll set some time aside soon to review this one last time and merge it.
Yes, merge ready for me.
Thanks!
I was thinking about working on the FeatureNormalizer() next, including some other scaling types such as unit range and clipping. Say having:

- StandardScaler()
- UnitRangeScaler() -> pick a range [lower, upper] to be scaled to
- ClippingScaler() -> clip data lower or higher than a threshold

where StandardScaler() is what FeatureNormalizer() is now. Sounds reasonable?
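For concreteness, the three proposed transforms might compute something like the following (a sketch only; the names come from the proposal above, the implementations are assumptions):

```julia
standardize(x)        = (x .- mean(x)) ./ std(x)        # StandardScaler(): zero mean, unit variance
unitrange(x, lo, hi)  = lo .+ (hi - lo) .* (x .- minimum(x)) ./ (maximum(x) - minimum(x))  # UnitRangeScaler(): map to [lo, hi]
clipvalues(x, lo, hi) = map(v -> clamp(v, lo, hi), x)   # ClippingScaler(): clip values outside the thresholds
```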
Sounds great! Is working on preprocessing of general interest to you? You know we could create the new package JuliaML/MLPreprocessing.jl together if you are interested in being a part of it.
No pressure though, I am also happy with the current mode.
That does sound interesting! I would need some guidance though... My primary interest is using Julia to do ML :) So whatever helps improve that, I'm happy to work on.
> That does sound interesting! I would need some guidance though
Sure, I'd love to work together on that. I'll write something up as soon as I find some time. Looking forward to it!
I created https://github.com/JuliaML/MLPreprocessing.jl and invited you as collaborator. Let's continue the discussion over there.
> I was thinking about working on the FeatureNormalizer() next, including some other scaling types such as unit range and clipping. Say having:
> - StandardScaler()
> - UnitRangeScaler() -> pick a range [lower, upper] to be scaled to
> - ClippingScaler() -> clip data lower or higher than a threshold
>
> where StandardScaler() is what FeatureNormalizer() is now. Sounds reasonable?
This would be a fantastic next step (and solve #27).
Looks like the most performant way is to convert DataFrame columns to dense arrays, compute the changes on those, and assign them back to the DataFrame. Benchmarks come within ~30% of a pure dense matrix in terms of CPU time. The flip side is that this allocates a new vector for every column, even for float types. Currently, non-numeric columns are simply skipped.
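Roughly, that approach might look like this (a hypothetical sketch of the described strategy, not the benchmarked code):

```julia
using DataFrames, DataArrays

function standardize_columns!(df::DataFrame)
    for name in names(df)
        col = df[name]
        eltype(col) <: Real || continue       # non-numeric columns are simply skipped
        v = convert(Vector{Float64}, col)     # dense copy; allocates even for float columns
        m, s = mean(v), std(v)
        for i in eachindex(v)                 # compute the changes on the dense array
            v[i] = (v[i] - m) / s
        end
        df[name] = v                          # assign the result back to the DataFrame
    end
    df
end
```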