amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0
470 stars 117 forks source link

Fix residual mean computation #175

Closed shivaram closed 9 years ago

shivaram commented 9 years ago

The previous approach of computing per-partition mean only works when number of elements per-class is equal. To fix it we now compute the sum per partition and the count per partition.

Many thanks to @tomerk and @ericmjonas for providing a test case. FWIW it wasn't being caught by the unit tests as we had equal number of elements per class in them. I'll add a unit test with uneven number of examples per class.

cc @rolloff

etrain commented 9 years ago

This looks good, thanks @shivaram ! Do you want me to wait for a unit test to merge?

shivaram commented 9 years ago

Yeah I'll write one -- it'll probably be easier / better if I refactor this as a utils function as well

shivaram commented 9 years ago

I created computeMean in MatrixUtils and added a test case for that.

etrain commented 9 years ago

And the unit test for the solver you mention in the PR description goes here or in ml-matrix?

shivaram commented 9 years ago

It should go here but its more tricky to add -- we need to setup another test case and either verify it with Matlab or the unweighted solver or something like that. So I thought it'd be easier to test for just the computeMean here.

etrain commented 9 years ago

Ok - merging this and creating an issue for you to add such a unit test.