nirveshk opened this issue 6 years ago
I think this is an interesting one. I recall that Prof. Andy asked me why I picked a 75:25 ratio for the training and testing datasets in my machine learning project for HS614, and to be honest, I just took the ratio without any principled reason.
To make this ambitious issue implementable, I'd like to suggest adding the following information/assumptions to the frame:
I think it's a great idea, Nirvesh. But as Helen suggested, make it a bit challenging; otherwise it's pretty straightforward. In fact, it's just a couple of if-else conditions once you find the length of the dataset. So think of a way to make it more challenging/interesting and applicable to linear regression. For example, think about cross-validation vs. a train/test split: when is it appropriate to just go ahead with CV, skipping the train/test split, and present those results? We discussed this briefly in 614. Or go with any other idea you can find.
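For concreteness, here is a minimal sketch of k-fold cross-validation index generation in pure Python (no sklearn; the function name and fold logic are illustrative assumptions, not anything agreed on in this thread). It shows why CV can stand in for a single train/test split: every point gets used for testing exactly once.

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k folds.

    Illustrative sketch: folds are contiguous index ranges, so the
    caller should shuffle the data first if order matters.
    """
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        end = start + fold_size if i < k - 1 else n_samples
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

folds = list(kfold_indices(10, k=5))
# 5 folds; each test fold holds 2 points, and every point
# appears in exactly one test fold.
```

With something like this, instead of reporting one test score from a single split, you would average the model's score across the k folds.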
There's also a school of thought where the dataset is actually split into three: a training set, a verification (validation) set, and a test set. Any re-engineering of the model is only tried against the verification set, and once the model has been decided, it's evaluated on the completely unseen test set.
@hhan14 @rohitchadaram @choikwunyu Thank you guys for your valuable insights on this. I am actually doing some research as I write this to understand how I can perform it. @rohitchadaram, you made it sound very simple. I hope it is indeed that easy, and if it is, I can incorporate aspects of cross-validation to make it a little more challenging.
I think this is a good idea, but normally the ratio for a train/test split is between 70 and 80 percent for the training set and between 30 and 20 percent for the test set, so I don't think there is a rule of thumb for the perfect split. How will you suggest a perfect ratio to the user for an arbitrary dataset? The only other ratio I know is 90:10, and it is mostly used when you have millions of instances.
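Putting the ranges above together, the "couple of if-else conditions" version of this feature might look like the sketch below. The size thresholds are purely illustrative assumptions drawn from the comment above, not established cutoffs:

```python
def suggest_split_ratio(n_samples):
    """Return an assumed (train_fraction, test_fraction) based only on
    dataset size, per the rough ranges discussed in this thread."""
    if n_samples < 1_000:
        return 0.70, 0.30   # small data: keep a larger test set
    elif n_samples < 1_000_000:
        return 0.80, 0.20   # the common default range
    else:
        return 0.90, 0.10   # millions of rows: 10% is still plenty to test on

print(suggest_split_ratio(500))        # (0.7, 0.3)
print(suggest_split_ratio(50_000))     # (0.8, 0.2)
print(suggest_split_ratio(5_000_000))  # (0.9, 0.1)
```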
There are multiple ratios that can be chosen to split the data into training and test sets. I would like to propose a feature/algorithm that assesses the data in its entirety and suggests to the user a near-optimal ratio for dividing the data into training and test sets, so that both sets have enough data to train and test the model.
This might seem like a lot of work, as the algorithm will have to loop through every single data point and return the point where the data is ideally partitioned. I am not sure this is even possible, but I would like to give it a shot.
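One way the looping idea above could be interpreted: try several candidate ratios, score each one with repeated random splits, and prefer the ratio whose test error is most stable. Everything in this sketch is an assumption for illustration; the "model" is just predicting the mean, and the stability criterion is one of many you could choose.

```python
import random
import statistics

def mean_model_error(train, test):
    """Fit the trivial 'predict the mean' model and return test MSE."""
    pred = statistics.mean(train)
    return statistics.mean((y - pred) ** 2 for y in test)

def suggest_ratio(data, candidates=(0.6, 0.7, 0.75, 0.8, 0.9),
                  trials=20, seed=0):
    """Return the candidate train fraction whose test error varies
    least across repeated random splits (a hypothetical criterion)."""
    rng = random.Random(seed)
    data = list(data)
    best_frac, best_spread = None, float("inf")
    for frac in candidates:
        errors = []
        for _ in range(trials):
            shuffled = data[:]
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * frac)
            errors.append(mean_model_error(shuffled[:cut], shuffled[cut:]))
        spread = statistics.pstdev(errors)
        if spread < best_spread:
            best_frac, best_spread = frac, spread
    return best_frac
```

A real version would plug in the actual model (e.g. the linear regression from the course) in place of the mean predictor, but the search structure would be the same.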