Closed tumido closed 6 years ago
@durandom would you be interested in this type of PR? I also have some questions about the current codebase. I'm placing them inline as part of this PR's review.
cc @bronaghs
Also, the numpy issue shouldn't matter anymore, since the missing data handling can be done in Pandas directly. I'll fix that in the next iteration of this PR later today...
@MichaelClifford could you do a review too?
@durandom, I've rewritten the rest:
- `StandardScaler`s are removed since they make no sense over boolean values.
- `system_id` and `account` are left in, since the data are converted to boolean as well -> that way they do not bias the results, and it would be much more expensive to remove them than to just not pay attention to them.

@Ladas, here you go. This is the "Pandas only" approach I was talking about.
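For anyone following along, here is a minimal sketch of the idea, not the actual code in this PR. It assumes the boolean conversion is presence-based (a value that is set becomes `True`, a missing one becomes `False`); the sample data and the `rule_x` / `rule_y` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only system_id and account are names taken
# from this thread, the rule_* columns are invented for the example.
df = pd.DataFrame({
    "system_id": ["a1", "b2", "c3"],
    "account": [101, 102, 103],
    "rule_x": ["hit", None, "hit"],
    "rule_y": [None, None, "hit"],
})

# Missing data is handled in pandas directly: present -> True, missing -> False.
# (Assumption: this is how the conversion to boolean is done.)
features = df.notna()

# system_id and account are always present, so they collapse to constant
# all-True columns and add nothing to distances between rows.
constant_columns = [col for col in features.columns if features[col].nunique() == 1]
```

With every feature already in {`True`, `False`}, there is nothing on different scales for a `StandardScaler` to even out, and a constant column contributes zero to any distance between rows, which is why leaving `system_id` and `account` in is harmless under this assumption.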
I'm trying to make the clustering code readable. Let's simplify and optimize it.
Related: ~https://github.com/numpy/numpy/issues/11999~ (no longer relevant)