EHWUSF / HS68_2018_Project_1


Train/Test split #11

Open nirveshk opened 6 years ago

nirveshk commented 6 years ago

There are multiple ratios that can be chosen to split the data into training and test sets. I would like to propose a feature/algorithm that assesses the data in its entirety and suggests to the user a near-optimal ratio for dividing it into training and test sets, so that both sets have enough data to train and test the model.

This might seem like a lot of work, as the algorithm would have to loop through every single data point and return a point where the data is partitioned ideally. I am not sure if this is even possible, but I would like to give it a shot.
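As a rough illustration of the idea (not an implementation from this repo), a minimal sketch could try a few candidate split ratios and report how the held-out score changes; the function name `suggest_split_ratio`, the candidate ratios, and the 0.01 tolerance below are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def suggest_split_ratio(X, y, candidate_test_sizes=(0.1, 0.2, 0.25, 0.3, 0.4)):
    """Try several test-set fractions and suggest one (illustrative heuristic)."""
    scores = {}
    for test_size in candidate_test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=0
        )
        model = LinearRegression().fit(X_train, y_train)
        scores[test_size] = model.score(X_test, y_test)  # R^2 on the held-out set

    # Heuristic: keep as much training data as possible (smallest test fraction)
    # while staying within 0.01 of the best observed held-out score.
    best = max(scores.values())
    for test_size in sorted(scores):
        if scores[test_size] >= best - 0.01:
            return test_size, scores
```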

hhan14 commented 6 years ago

I think this is an interesting one. I recall being asked by Prof. Andy why I picked a 75:25 ratio for the training and test datasets in my machine learning project for HS614, and to be honest, I just used that ratio without any principled reason.

To make this ambitious issue implementable, I'd like to suggest that you take the following information/assumptions into account:

  1. It seems like sklearn already has a function, "sklearn.model_selection.train_test_split", which you can refer to (see the short usage sketch after this list). If test_size is set to "None", the value is set to the complement of the train size; by default the value is 0.25. The default will change in version 0.21: it will remain 0.25 only if train_size is unspecified, otherwise it will complement the specified train_size. [link: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html]
  2. I think that how well the trained model carries over to the test dataset mainly depends on aspects of the data such as outliers. Whether you can handle the variance of the dataset by reducing/balancing the distribution of outliers when dividing the training/test data will be the key. Also, different criteria for assessing the complexity of a dataset should be applied depending on its size.
  3. Last, I'm sharing a discussion of the best train/test split ratio on the web for your information. [link: https://www.researchgate.net/post/What_is_the_best_way_to_divide_a_dataset_into_training_and_test_sets]
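For reference on point 1, this is roughly how that sklearn function is typically called; the toy arrays are made up, and the 0.25 default is as documented for the sklearn version current at the time of this discussion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix (10 samples, 2 features)
y = np.arange(10)                  # toy targets

# With no sizes given, the default currently puts 25% of the data in the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Equivalent explicit form; if only one size is given, the other is its complement.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=42
)
```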
rohitchadaram commented 6 years ago

I think it's a great idea, Nirvesh. But as Helen suggested, make it a bit more challenging; otherwise it's pretty straightforward, in fact just a couple of if-else conditions once you find the length of the dataset. So think of a way to make it more challenging/interesting and applicable to linear regression. For example, think about cross-validation vs. a train/test split: when it is appropriate to just go ahead with CV, skip the train/test split, and present the results (we discussed this briefly in 614), or any other idea you can find.
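To illustrate the comparison Rohit mentions, here is a small sketch (assuming a linear regression task on synthetic data) of scoring a model with a single train/test split versus k-fold cross-validation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data just for the example.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Option A: single train/test split, one held-out score.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
single_score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# Option B: 5-fold cross-validation, every point is held out exactly once.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(f"single split R^2: {single_score:.3f}")
print(f"5-fold CV R^2:    {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```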

choikwun commented 6 years ago

There's also a school of thought where the dataset is actually split into three: a training set, a verification (validation) set, and a test set. Any re-engineering of the model is only tried on the verification set, and once the model has been decided, it's used on the completely unseen test set.
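sklearn has no single call for a three-way split, but one common way to get train/verification/test sets is to call train_test_split twice; the 60/20/20 proportions and toy data below are just an example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # toy feature matrix
y = np.arange(100)                  # toy targets

# First carve off the final, untouched test set (20% of the data).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Then split the remainder into training and verification sets
# (0.25 of the remaining 80% gives a 60/20/20 split overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)
```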

nirveshk commented 6 years ago

@hhan14 @rohitchadaram @choikwunyu Thank you guys for your valuable insights on this. I am actually doing some research as I write this to understand how I can do it. @rohitchadaram, you made it sound very simple; I hope it really is that easy, and if it is, I can incorporate aspects of cross-validation to make it a little more challenging.

kamehta2 commented 6 years ago

I think this is a good idea, but normally the train/test split ratio is between 70:30 and 80:20 (training:test). So I don't think there is a single rule of thumb for the perfect train/test split; how will you suggest a perfect ratio to the user for any dataset? The only other ratio I know of is 90:10, and it is mostly used when you have millions of instances.