dsp-uga / Team-Marianne-p2

https://github.com/dsp-uga
MIT License

Accuracy on large dataset really low #14

Open ankit-vaghela30 opened 6 years ago

ankit-vaghela30 commented 6 years ago

Our accuracy on the small dataset is about 80%, but it is much lower on the large dataset.

ankit-vaghela30 commented 6 years ago

One possible cause comes to mind: for the large dataset we are not saving the hash_ids, only the features and labels. It is possible that the RDDs got shuffled, so our predictions are no longer in the same order as the test documents.

This is distributed programming, so each node stores and processes only part of the data, and the original order is not guaranteed to be preserved. I have a strong suspicion that this is the reason behind the low accuracy: we are predicting the labels correctly, but they are not written out in the right order.

I would like to run this again, storing the hash_ids this time, and compare the result with y_test.txt (on the autograder). I would really appreciate it if access to this file could be provided.
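
To illustrate the suspicion above, here is a minimal sketch in plain Python (not Spark, and not our actual pipeline). It assumes every prediction is correct and only the row order changes. The document IDs, the number of classes, and the helper names `positional_accuracy` and `keyed_accuracy` are all made up for the example.

```python
import random


def positional_accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


def keyed_accuracy(keyed_predictions, keyed_labels):
    """Accuracy after aligning (hash_id, prediction) pairs with labels by hash_id."""
    label_by_id = dict(keyed_labels)
    matches = sum(label_by_id[h] == p for h, p in keyed_predictions)
    return matches / len(label_by_id)


# Hypothetical data: 1000 documents, 9 classes, every prediction correct.
rng = random.Random(0)
hash_ids = [f"doc{i:04d}" for i in range(1000)]
labels = [rng.randint(1, 9) for _ in hash_ids]
keyed_labels = list(zip(hash_ids, labels))

# Simulate the pipeline returning the rows in a different order.
keyed_predictions = list(keyed_labels)
rng.shuffle(keyed_predictions)

# Dropping the hash_ids leaves only a misordered list of predictions.
shuffled_only = [p for _, p in keyed_predictions]

print("compared by position:", positional_accuracy(shuffled_only, labels))
print("compared by hash_id: ", keyed_accuracy(keyed_predictions, keyed_labels))
```

Compared by position, the score falls to roughly chance level even though every prediction is right. Compared by hash_id, it is 1.0. In Spark the same idea would mean carrying (hash_id, features) pairs through the pipeline and joining the predictions with the labels on hash_id, assuming the true labels are available.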