JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

rescale_dataframes #33

Closed abieler closed 7 years ago

abieler commented 7 years ago

Looks like the most performant way is to convert the dataframe columns to dense arrays, compute the changes on those, and assign them back to the dataframe. Benchmarks come within ~30% of a pure dense matrix in terms of CPU time. The flip side is that this allocates a new vector for every column, even for float types. Currently non-numeric columns are simply skipped.
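The column-wise approach described above can be sketched in plain Julia. This is only a minimal illustration: a `Dict` of column vectors stands in for a DataFrame (the DataFrames API of that era has since changed), and `rescale_cols!` is a hypothetical name, not the PR's actual function:

```julia
# Standardize each numeric column to zero mean and unit variance,
# skipping non-numeric columns, as described in the comment above.
using Statistics

function rescale_cols!(cols::Dict{Symbol,Any})
    for (name, col) in cols
        eltype(col) <: Real || continue       # skip non-numeric columns
        v = convert(Vector{Float64}, col)     # densify (allocates one vector per column)
        mu, sigma = mean(v), std(v)
        cols[name] = (v .- mu) ./ sigma       # assign the dense result back
    end
    return cols
end
```

For example, `rescale_cols!(Dict{Symbol,Any}(:a => [1.0, 2.0, 3.0], :b => ["x", "y", "z"]))` rescales `:a` and leaves the string column `:b` untouched.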

Evizero commented 7 years ago

I like it on first glance. What happens if some element is NA? Does it error or does it propagate?

abieler commented 7 years ago

At the moment it errors at convert(). I planned on putting in a fallback that works on DataArrays directly if the column has NAs, though.

Evizero commented 7 years ago

I think the most neutral solution for NA is that it just propagates. If a column contains NA, then the resulting column is all NA. This way the user is made aware that NAs have to be dealt with before centering.

abieler commented 7 years ago

Not sure I understand this correctly. This would mean all data in columns containing NAs is replaced with NAs, hence the data is lost?

Evizero commented 7 years ago

Yes. Since mean(...) of a vector containing NA is in turn NA, and multiplying any number by NA is also NA, I would expect to end up with a column of NAs if I don't pay attention to missing data.
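The propagation Evizero describes can be reproduced with Julia's later `missing` value (the successor of DataArrays' NA), whose arithmetic behaves the same way:

```julia
using Statistics

v = [1.0, missing, 3.0]
mu = mean(v)        # a single missing makes the whole mean missing
centered = v .- mu  # ...and subtracting it turns every entry missing
```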

abieler commented 7 years ago

Hmm... I could see myself in a mentally bad place in the middle of a long interactive session and then losing data because I forgot to take care of an NA :) What about skipping the column and printing a warning to take care of NAs before rescaling?
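The warn-and-skip behaviour proposed here could look roughly like the following. This is a hypothetical sketch using modern `missing` in place of NA; `center_column` is an illustrative name, and the PR's actual implementation may differ:

```julia
using Statistics

function center_column(col::AbstractVector)
    if any(ismissing, col)
        # refuse to rescale rather than silently wiping out the data
        @warn "Column contains missing values; skipping. Deal with them before rescaling."
        return col                # leave the data untouched
    end
    return col .- mean(col)       # otherwise center as usual
end
```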

Evizero commented 7 years ago

I like it. The important thing to me is that NAs aren't silently ignored.

abieler commented 7 years ago

OK, good. I'll implement it that way and add some tests later on tonight.

Evizero commented 7 years ago

Thanks for working on this, by the way. This will be a really nice contribution

abieler commented 7 years ago

Do you feel it would be worth giving the user a choice for the rescaled column types? I don't think this is necessary, as they will probably convert the DataFrame to a matrix before feeding it to an ML algorithm. At that point the user can still modify the type.

abieler commented 7 years ago

You also mention a future MLPreprocessing package in the issues. I guess the feature scaling would be part of this. Have you put any more thought into this?

Evizero commented 7 years ago

Very nice.

> I dont think this is necessary as they probably convert the DataFrame to a matrix before feeding it to an ML algorithm.

I agree.

> You also mention a future MLPreprocessing package in the issues. I guess the feature scaling would be part of this. Have you put any more thought into this?

Yes. Well, the code would basically move there. I did a similar thing with MLDataPattern. I don't think I'll get to this soon, as other things are higher priority, but once I do I'll be sure to contact the original authors. The approach I took with MLDataPattern is to flag the authors of some functionality as the authors of the commits that move that code.

Evizero commented 7 years ago

Do you consider this merge-ready? Because then I'll set some time aside soon to review it one last time and merge it.

abieler commented 7 years ago

Yes, merge-ready for me.

Evizero commented 7 years ago

Thanks!

abieler commented 7 years ago

I was thinking about working on the FeatureNormalizer() next, including some other scaling types such as unit range and clipping. Say, having:

StandardScaler() -> what FeatureNormalizer() is now
UnitRangeScaler() -> pick a range [lower, upper] to be scaled to
ClippingScaler() -> clip data lower or higher than a threshold

Sounds reasonable?
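The semantics of the three proposed scalers could be sketched as plain functions. This is illustration only: UnitRangeScaler and ClippingScaler did not exist yet, so the names and signatures below are hypothetical:

```julia
using Statistics

# StandardScaler: zero mean and unit variance (what FeatureNormalizer does now)
standard_scale(v) = (v .- mean(v)) ./ std(v)

# UnitRangeScaler: map the data linearly onto [lower, upper]
function unit_range_scale(v; lower = 0.0, upper = 1.0)
    lo, hi = extrema(v)
    lower .+ (v .- lo) .* (upper - lower) ./ (hi - lo)
end

# ClippingScaler: clip anything below lo or above hi to the threshold
clip_scale(v, lo, hi) = clamp.(v, lo, hi)
```

For instance, `unit_range_scale([0.0, 5.0, 10.0])` maps the data to `[0.0, 0.5, 1.0]`, and `clip_scale([-1.0, 0.5, 2.0], 0.0, 1.0)` yields `[0.0, 0.5, 1.0]`.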

Evizero commented 7 years ago

Sounds great! Is working on preprocessing of general interest to you? You know we could create the new package JuliaML/MLPreprocessing.jl together if you are interested in being a part of it.

No pressure though, I am also happy with the current mode.

abieler commented 7 years ago

That does sound interesting! I would need some guidance, though... My primary interest is using Julia to do ML :) So I'm happy to work on whatever helps improve that.

Evizero commented 7 years ago

> That does sound interesting! I would need some guidance though

Sure, I'd love to work together on that. I'll write something up as soon as I find some time. Looking forward to it.

Evizero commented 7 years ago

I created https://github.com/JuliaML/MLPreprocessing.jl and invited you as collaborator. Let's continue the discussion over there.

> I was thinking about working on the FeatureNormalizer() next. Including some other scaling types such as unit range and clipping. Say having: StandardScaler() UnitRangeScaler() -> pick range [lower, upper] to be scaled to ClippingScaler() -> clip data lower or higher than a threshold where StandardScaler() is what FeatureNormalizer() is now. Sounds reasonable?

This would be a fantastic next step (and solve #27).