WebClub-NITK / Hacktoberfest-2k20

Repository for Hacktoberfest 2020 Meetup at NITK Surathkal

Breast cancer prediction #46 file added #57

Closed. RohanSahana closed this pull request 3 years ago

RohanSahana commented 3 years ago

Resolves issue #46

Description

Added breast cancer prediction using Logistic Regression, Random Forest, and SVM (linear and RBF kernels).

Technical Specifications

Each classifier scores over 95%, which suggests that the models perform well.

How to run

Open the notebook files in Jupyter Notebook or Google Colab.

Checklist

amukh18 commented 3 years ago

@RohanSahana

X_train = X
Y_train = Y

These lines defeat the purpose of the train_test_split() function. You can also see that the score (accuracy) of your model becomes 100%, which is unrealistic for a properly tested machine learning model. By including these lines, your training data regains the examples that you set aside for testing. So when you fit your model, it also trains on these examples, which should not be in X_train and Y_train. When you then test your model on X_test and Y_test, you are only making predictions on examples it has already trained on, which is why it scores 100%. You also seem to have an unrelated file in your svm folder.
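To illustrate the leakage described above, here is a minimal sketch. It assumes scikit-learn's bundled breast cancer dataset as a stand-in for the data used in this PR, and a random forest with illustrative settings; neither is taken from the submitted notebooks. It fits the same model twice, once on the training split only and once on the full X and Y, and scores both on the same held-out test split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled dataset stands in for the PR's data.
X, Y = load_breast_cancer(return_X_y=True)

# Set aside 25% of the rows for testing (split settings are illustrative).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)


def fit_and_score(train_X, train_Y):
    """Fit a fresh random forest and return its accuracy on the test split."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train_X, train_Y)
    return model.score(X_test, Y_test)


# Honest evaluation: the model never sees the test rows during fitting.
honest_score = fit_and_score(X_train, Y_train)

# Leaky evaluation: equivalent to X_train = X; Y_train = Y, so every
# test row is also a training row.
leaky_score = fit_and_score(X, Y)

print(f"trained on the training split only: {honest_score:.3f}")
print(f"trained on all data (leaky):        {leaky_score:.3f}")
```

The second number measures how well the model reproduces labels it has already seen, so it says little about how the model will behave on new patients.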

RohanSahana commented 3 years ago

The model should be trained on 100% of the data, because the values we will predict in the future are unknown and training on everything gives the best results. train_test_split() is only for our own satisfaction, to check the performance of our model. In a real-life problem, training on 100% of the data is best, just like teaching a student the whole syllabus before the test. What matters is the score on the test dataset. Also, there is no unrelated file in the svm folder; it contains two files. One is the model using a linear SVM classifier and the other uses an RBF SVM classifier.

amukh18 commented 3 years ago

@RohanSahana Therefore running the prediction on your test examples as you have done would be redundant in this case. Perhaps the problem statement was a little vague. You can remove the two lines I specified in my previous comment, train your models on the training set you made after splitting, run predictions on the test set from the same split, print the score, and then repost your files. You can ignore my comment about your svm folder. Your files are all right.
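The requested workflow could look roughly like the sketch below. It is not the code from the PR: the dataset (scikit-learn's bundled breast cancer data), the split settings, the feature scaling step, and the hyperparameters are all assumptions chosen for illustration. Only the four classifier types come from the PR description.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled dataset stands in for the PR's data.
X, Y = load_breast_cancer(return_X_y=True)

# Split once and do NOT reassign X_train / Y_train afterwards.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0, stratify=Y
)

# The four classifiers named in the PR description. Scaling is an added
# assumption; it helps logistic regression and the SVMs converge.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (linear)": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, Y_train)                 # train on the training split only
    scores[name] = model.score(X_test, Y_test)  # accuracy on unseen rows
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Because each pipeline fits its scaler on the training split only, no information from the test rows reaches the model before scoring.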