amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0
470 stars 117 forks source link

Added the Scala KMeans++, GMM, and Fisher Vectors #191

Closed tomerk closed 8 years ago

tomerk commented 8 years ago

for issue #181, and paves the way for issue #174

etrain commented 8 years ago

This looks pretty good - most of my comments were on naming and comments. The unit test for GMM should be a little more robust, and I left an idea about how to do that. There's a minor but pretty crucial note about the use of += in there, and I'd like to take another pass with my "how could in-place ops be breaking things" hat on before we merge this.

tomerk commented 8 years ago

@etrain so I think the only things left are adding a reference to the GMM paper (what is the paper?) And figuring out the naming / name spaces.

The C GMM estimator is nodes.learning.GaussianMixtureModelEstimator I'm fine moving that around and/or renaming ScalaGMMEstimator, just let me know what you'd like each to be.

etrain commented 8 years ago

Let's put the let's have a nodes.learning.external.GaussianMixtureModelEstimator which is the C thing (and be sure to update all existing pipelines to use this) and then name the scala one just nodes.learning.GaussianMixtureModelEstimator. We've got this convention on the featurizers as well, so it makes sense to me organizationally.

etrain commented 8 years ago

Appendix B of the following paper has guidelines for training the GMM:

Jorge Sanchez, Florent Perronnin, Thomas Mensink, Jakob Verbeek. Image Classification with the Fisher Vector: Theory and Practice. International Journal of Computer Vision, Springer Verlag, 2013, 105 (3), pp.222-245. <10.1007/s11263-013-0636-x>.

tomerk commented 8 years ago

Okay @etrain how's this looking?

etrain commented 8 years ago

It doesn't look like you updated the existing pipelines that use the GMM/FV nodes to point to the external namespace?

tomerk commented 8 years ago

Ah, I can see where the confusion comes from. So, the pipelines never use the GaussianMixtureModelEstimator directly, they only use the GMMFisherVectorEstimator (already in external), which I have now changed to point at the external GMM Estimator. And, the ones that use FisherVector were already pointing at the external.FisherVector node.

etrain commented 8 years ago

Ah, I see - sorry, I should have verified :)

LGTM!