Accuracy on large dataset really low

One issue comes to mind: for the large dataset, we are not saving the hash_ids, only the features and labels. It is possible that the RDDs got shuffled, so our predictions are no longer in the same order as the test set.

This is distributed programming, so each node stores and processes only part of the data, and the original order is not necessarily maintained. I have a strong suspicion that this is the reason behind the low accuracy. I think we are predicting the labels correctly; they are just not in the right order.

I would like to run this again, storing the hash_ids this time, and compare the result with y_test.txt (the file used by the autograder). I would really appreciate it if access to this file could be provided.
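To make the idea concrete, here is a minimal sketch in plain Python of what I mean by keeping the hash_ids with the predictions. This is not our actual Spark code: `predict`, `predict_keyed`, `restore_order`, and the sample ids are hypothetical names made up for illustration, and the shuffle only simulates records coming back from the nodes in a different order.

```python
import random


def predict(features):
    # Hypothetical stand-in for the real classifier (assumption, not our model).
    return sum(features) % 2


def predict_keyed(records):
    # Keep each hash_id next to its prediction instead of dropping it.
    return [(hash_id, predict(features)) for hash_id, features in records]


def restore_order(keyed_predictions, ordered_ids):
    # Put predictions back in the order given by the list of hash_ids.
    lookup = dict(keyed_predictions)
    return [lookup[hash_id] for hash_id in ordered_ids]


# Made-up records: (hash_id, features).
records = [("a1", [1, 2]), ("b2", [2, 2]), ("c3", [0, 1]), ("d4", [4, 4])]
ordered_ids = [hash_id for hash_id, _ in records]
expected = [predict(features) for _, features in records]

# Simulate the data arriving in a different order than the input file.
shuffled = records[:]
random.Random(0).shuffle(shuffled)

keyed = predict_keyed(shuffled)
restored = restore_order(keyed, ordered_ids)
```

If the ordering really is the problem, comparing `restored` against the true labels should give a much higher accuracy than comparing the predictions in whatever order they came back.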

dsp-uga / Team-Marianne-p2
