SNICScienceCloud / LDSA-Spark

A collection of Apache Spark notebooks for the LDSA course
Apache License 2.0

Questions about C3 #2

Open arminhaskovic opened 7 years ago

arminhaskovic commented 7 years ago

I have a question about the challenge part of C3 where you're supposed to build a 3NN classifier. Right now it checks the "votes" of the three closest neighbors and then by majority vote classifies the point that we want to classify. But what happens if the three closest neighbors all cast different votes and there is no majority vote? Should it just pick one of them at random or go back to 1NN for that classification or is it something else that is supposed to be done?

Is it also possible to make sure that all of the workers are receiving the jobs? I've been using the Spark API but it would be nice to somehow make sure that it works as intended.

mcapuccini commented 7 years ago

Hi @arminhaskovic! That's a good question. My gut feeling is that you would get the best result by returning the closest neighbour when a majority vote is not possible (that is, falling back to 1NN). You can also try other strategies and see what performs best; that's part of the challenge.
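As a rough illustration of that fallback, here is a minimal sketch in plain Python, independent of Spark. The function name `classify_knn`, the `(features, label)` layout of the training data, and the choice of Euclidean distance are all assumptions made for this example, not part of the C3 notebook.

```python
from collections import Counter
import math


def classify_knn(query, training, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.

    `training` is a list of (features, label) pairs (hypothetical layout).
    If the top vote count is shared by several labels, fall back to the
    label of the single nearest neighbour, i.e. 1NN.
    """
    # Sort by Euclidean distance to the query and keep the k closest points.
    nearest = sorted(training, key=lambda pair: math.dist(query, pair[0]))[:k]
    labels = [label for _, label in nearest]

    # Count the votes; most_common() orders labels by descending vote count.
    ranked = Counter(labels).most_common()
    top_label, top_votes = ranked[0]

    # There is a tie when the runner-up has as many votes as the leader.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes

    # labels[0] belongs to the closest neighbour because `nearest` is sorted.
    return labels[0] if tied else top_label
```

With `k=3`, the tie branch is only reached when all three neighbours have different labels. Other strategies, such as weighting votes by inverse distance, could be swapped in at the same place and compared on the validation data.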

To make sure that the computation is running in parallel, you can take a look at the Spark UI. You should be able to open the UI of the currently running application and navigate to the workers section, where you can see whether every worker is getting something to compute.