amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

BlockSolve should handle empty partitions #196

Closed. ericmjonas closed this issue 8 years ago

ericmjonas commented 8 years ago

After discussion with @tomerk, it seems it would be really useful for the various block solvers not to choke on empty partitions. Empty partitions can arise during cross-validation, when you filter your data RDD into a "train" RDD and a "test" RDD.
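To illustrate the scenario, here is a minimal plain-Python sketch (no Spark involved). It assumes partitions can be modeled as a list of lists and that the filter keeps the partition count fixed, as `RDD.filter` does. The helper names (`filter_partitions`, `partition_gram`, `total_gram`) and the toy data are hypothetical and are not KeystoneML code; the sketch only shows the guard a per-partition computation would need.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[List[float], int]  # (features, fold id) -- toy layout, an assumption
Matrix = List[List[float]]


def filter_partitions(partitions: List[List[Row]],
                      keep: Callable[[Row], bool]) -> List[List[Row]]:
    """Filter each partition separately; the number of partitions is unchanged."""
    return [[row for row in part if keep(row)] for part in partitions]


def partition_gram(part: List[Row], dim: int) -> Optional[Matrix]:
    """Return X^T X for one partition, or None when the partition is empty."""
    if not part:
        return None  # nothing to contribute; do not try to build a matrix
    gram = [[0.0] * dim for _ in range(dim)]
    for features, _ in part:
        for i in range(dim):
            for j in range(dim):
                gram[i][j] += features[i] * features[j]
    return gram


def total_gram(partitions: List[List[Row]], dim: int) -> Matrix:
    """Sum the per-partition Gram matrices, skipping empty partitions."""
    total = [[0.0] * dim for _ in range(dim)]
    for part in partitions:
        gram = partition_gram(part, dim)
        if gram is None:
            continue
        for i in range(dim):
            for j in range(dim):
                total[i][j] += gram[i][j]
    return total


# Toy data laid out so that each partition holds exactly one fold.
partitions = [
    [([1.0, 0.0], 0), ([0.0, 1.0], 0)],  # fold 0
    [([2.0, 1.0], 1)],                   # fold 1
]

# Holding out fold 1 as "test" leaves the second "train" partition empty.
train = filter_partitions(partitions, lambda row: row[1] != 1)
train_sizes = [len(part) for part in train]
train_gram = total_gram(train, dim=2)
```

Here `train` still has two partitions, but the second one contains no rows, so any per-partition step that assumes at least one row would fail without the `if not part` check.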

ericmjonas commented 8 years ago

Also, @tomerk suggests that the reshuffle that follows this warning can sometimes create empty partitions as well: "15/12/27 21:22:21 WARN BlockWeightedLeastSquaresEstimator: Partitions do not contain elements of the same class. Re-shuffling"

shivaram commented 8 years ago

Hmm, the reshuffle creates an RDD with exactly numClasses partitions afaik. Does this happen when you have no examples in a class? Anyway, we can make the rest of the algorithm work with empty partitions.
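A rough plain-Python sketch of how a class with no examples could produce an empty partition, assuming the reshuffle puts class `k` into partition `k` and always creates `num_classes` partitions. The function `reshuffle_by_class` and the toy rows are hypothetical and do not reflect the actual BlockWeightedLeastSquaresEstimator code.

```python
from typing import List, Tuple

LabeledRow = Tuple[List[float], int]  # (features, class label) -- toy layout


def reshuffle_by_class(rows: List[LabeledRow],
                       num_classes: int) -> List[List[LabeledRow]]:
    """Group rows so that partition k holds only rows of class k."""
    partitions: List[List[LabeledRow]] = [[] for _ in range(num_classes)]
    for features, label in rows:
        partitions[label].append((features, label))
    return partitions


# Three classes are expected, but every row carries label 0.
rows = [([1.0], 0), ([2.0], 0), ([3.0], 0)]
parts = reshuffle_by_class(rows, num_classes=3)

part_sizes = [len(part) for part in parts]
empty_classes = [k for k, part in enumerate(parts) if not part]
```

Under these assumptions, partitions 1 and 2 come out empty, so the steps after the reshuffle would need to tolerate partitions with no rows.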

ericmjonas commented 8 years ago

@shivaram In that case I was running up against https://github.com/amplab/keystone/issues/197 so all my data were being assigned the same class label. That said, empty partitions are still a problem.

shivaram commented 8 years ago

Ah yes, that makes sense. If you have it handy, could you paste the stack trace you get when you have empty partitions?