ageron / handson-ml

⛔️ DEPRECATED – See https://github.com/ageron/handson-ml3 instead.
Apache License 2.0
25.17k stars · 12.92k forks

Chapter 3: MNIST Classification #344

Open anyuese opened 5 years ago

anyuese commented 5 years ago

Hello, I am very new to sklearn, and I have a question from Chapter 3. The book shows this code: [screenshot of the book's code]

I know the goal is to get the decision scores, but why not just use sgd_clf.decision_function()?

ageron commented 5 years ago

Good question @anyuese. Because cv=3, the cross_val_predict() function will split the dataset into 3 distinct parts (called "folds"), then it will create 3 clones of sgd_clf, and it will train all of them like this: the k-th clone will be trained on all folds except for the k-th fold, and it will be used to make predictions for the k-th fold. This means almost 3 times more computing is required when calling cross_val_predict() compared to just calling sgd_clf.decision_function(). Not quite 3 times, since each clone is trained on just 2/3rds of the training set. But the benefit is that the predictions will be "realistic", in the sense that the model will not have been trained on the data it is making predictions for. So you get a more precise idea of how well your model is going to perform once it is in production and is fed new data. I hope this is clear! Note that it is all explained in the book, so don't hesitate to go back and read through the part about K-fold cross-validation, if needed. Cheers!
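To make the difference concrete, here is a minimal sketch on a synthetic dataset (standing in for MNIST): the first call produces out-of-fold scores, where each sample is scored by a clone that never saw it during training; the second produces in-sample scores, which tend to be optimistic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data (stand-in for the MNIST "5-detector")
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

sgd_clf = SGDClassifier(random_state=42)

# Out-of-fold scores: 3 clones are trained, each scoring the fold it never saw
y_scores_cv = cross_val_predict(sgd_clf, X, y, cv=3, method="decision_function")

# In-sample scores: one model scores the very data it was trained on
sgd_clf.fit(X, y)
y_scores_in = sgd_clf.decision_function(X)
```

The out-of-fold scores are the ones you want for evaluating precision/recall trade-offs, precisely because they mimic performance on unseen data.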

anyuese commented 5 years ago

Thank you! I have another problem: [screenshot of the code] When using knn.fit() and knn.predict(), it only takes a little time, but when using cross-validation prediction and then f1_score(), it takes a very long time. Since cv=3, I would expect the computation to be around 3 times that of a single fit/predict, and computing the F1 score should be cheap. But actually it's not, and I don't know why.

ageron commented 5 years ago

Yes, KNN can be very slow. Try running the code on 1/10th of the dataset to see if it runs smoothly. Normally the cross-validation functions should be close to 3 times slower when cv=3.
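A quick way to follow this advice is to slice off a fraction of the training set before running the expensive call. The sketch below uses a synthetic dataset as a stand-in for MNIST (the real `X_train`, `y_train` come from `fetch_openml("mnist_784")` in the book):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the MNIST training set
X_train, y_train = make_classification(n_samples=2000, n_features=20,
                                       random_state=42)

# Work on 1/10th of the data first, to check the code runs in reasonable time
n_small = len(X_train) // 10
X_small, y_small = X_train[:n_small], y_train[:n_small]

knn_clf = KNeighborsClassifier()
y_train_pred = cross_val_predict(knn_clf, X_small, y_small, cv=3)
```

KNN has essentially no training cost but a high prediction cost (each prediction searches the whole training fold), which is why cross_val_predict, which predicts for every sample, feels so much slower than a single fit.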

qy-yang commented 5 years ago

Hi @ageron,

Thanks for your explanation above. I have a question regarding the line precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores). Why is the length of thresholds 59,698 instead of 60,000? A naive way to think about this is that every y_score could be used as a possible threshold, giving 60,000 different sets of predictions.

Thank you in advance. Regards, QY

ageron commented 5 years ago

Hi @qy-yang , Great question! I haven't checked, but I suppose these are all the distinct scores.

AlessandroMiola commented 3 years ago

Hi @qy-yang, @ageron, I had the same doubt as @qy-yang. For me, given that the scores in this specific example are all distinct (when I run len(y_scores) I get 60000), the point is the one specified here. Basically, the output is omitted for all thresholds that result in full recall, which causes thresholds to be shorter than y_scores.
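This behavior is easy to see on a toy example. In the sketch below, the scores are all distinct, yet precision_recall_curve drops the lowest threshold (0.1), because it would yield the same full recall as the threshold just above it; the curve also appends one extra (precision=1, recall=0) point, so len(precisions) == len(thresholds) + 1:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])  # 5 distinct scores

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# Only 4 thresholds come back: 0.1 is dropped because both 0.1 and 0.35
# classify every positive correctly (recall = 1.0), so it adds no information.
print(thresholds)      # [0.35 0.4  0.8  0.9 ]
print(recalls)         # starts at 1.0, ends at 0.0
print(len(precisions)) # len(thresholds) + 1
```

So with 60,000 distinct scores in the MNIST example, the 302 lowest-scoring thresholds all give recall = 1.0 and are collapsed into one, leaving 59,698.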