Clustering code cleanup

tumido commented 5 years ago

I'm trying to make the clustering code readable. Let's simplify and optimize it.

Related: ~https://github.com/numpy/numpy/issues/11999~ (no longer relevant)

tumido commented 5 years ago

@durandom would you be interested in this type of PR? I also have some questions about the current codebase; I'm placing them inline in this PR's review.

tumido commented 5 years ago

cc @bronaghs

tumido commented 5 years ago

Also, the numpy issue shouldn't matter anymore, since the missing data handling can be done in Pandas directly. I'll fix that in the next iteration of this PR later today.
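
For illustration, a minimal sketch of what handling missing data in Pandas directly could look like. The column names and values are hypothetical and not taken from this repository; the calls used (`isna`, `fillna`) are standard pandas `DataFrame` methods.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "system_id": ["a", "b", "c"],
        "rule_hit": [1.0, np.nan, 0.0],
    }
)

# Count missing values per column without dropping down to NumPy.
missing_per_column = df.isna().sum()

# Fill the gaps with a default value (assumption: 0.0 means "no hit").
filled = df.fillna({"rule_hit": 0.0})
```

Whether to fill, drop, or just flag missing values depends on what the clustering expects; the sketch only shows that the step can stay inside Pandas.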

durandom commented 5 years ago

@MichaelClifford could you do a review too?

tumido commented 5 years ago

@durandom, I've rewritten the rest:

Since PCA is not used much at this point, I've removed it. I can come up with a dynamic solution later; let's focus on static clustering for now.
The StandardScalers are removed as well, since scaling makes no sense over boolean values.
I've skipped the column removal for system_id and account: those columns are converted to boolean as well, so they do not bias the results, and removing them would be much more expensive than simply ignoring them.
It gives me pretty good results, and pretty fast. I still need to do some measurements, though.
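
To make the points above concrete, here is a rough sketch of one possible reading of the approach, not the actual code in this PR. The data, the column names `rule_x` and `rule_y`, and the choice to treat "value present" as `True` are all assumptions made for illustration. Grouping rows by identical boolean signature is a stand-in for the clustering step.

```python
import pandas as pd

# Hypothetical input: one row per system, values are rule results or missing.
raw = pd.DataFrame(
    {
        "system_id": ["s1", "s2", "s3", "s4"],
        "account": ["a1", "a1", "a2", "a2"],
        "rule_x": ["hit", None, "hit", None],
        "rule_y": [None, "hit", None, "hit"],
    }
)

# Convert every column to boolean (assumption: True where a value is present).
flags = raw.notna()

# In this sample, system_id and account are always present, so they become
# constant True columns and cannot separate one row from another.
assert flags["system_id"].all() and flags["account"].all()

# Stand-in for clustering: rows with identical boolean signatures
# receive the same label. No scaling is applied to the boolean values.
labels = flags.groupby(list(flags.columns), sort=False).ngroup()
```

Here `s1` and `s3` end up with one label and `s2` and `s4` with another. The identifier columns are left in place and do not change the grouping.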

@Ladas, here you go. This is the "Pandas only" approach I was talking about.

RedHatInsights / aiops-insights-clustering

Clustering code cleanup #9