shmoradims opened 6 years ago
One topic of note: we should be pushing a full retrain after your CV/TrainTest.
The full retraining on the entire dataset gives you a better model. The CV/TrainTest run gives you your pipeline's metrics (accuracy, AUC, NDCG, etc.); the model retrained on 100% of the dataset is the one to launch in production.
Hence we should form our examples as:
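For context, here is a minimal sketch of that flow, assuming the v1.0-style API (`SentimentData`, the file path, and the trainer choice are placeholders, not the actual sample code):

```csharp
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 0);

IDataView fullData = mlContext.Data.LoadFromTextFile<SentimentData>(
    "sentiment-data.tsv", hasHeader: true);   // hypothetical path

var pipeline = mlContext.Transforms.Text
    .FeaturizeText("Features", nameof(SentimentData.Text))
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression());

// CV (or a TrainTest run) is only used to read the pipeline's metrics.
var cvResults = mlContext.BinaryClassification.CrossValidate(fullData, pipeline, numberOfFolds: 5);
var averageAuc = cvResults.Average(r => r.Metrics.AreaUnderRocCurve);

// The model that ships is a final Fit() on 100% of the data, not a CV fold's model.
ITransformer finalModel = pipeline.Fit(fullData);
mlContext.Model.Save(finalModel, fullData.Schema, "finalModel.zip");

// Placeholder input schema used above, for illustration only.
public class SentimentData
{
    [LoadColumn(0)] public bool Label { get; set; }
    [LoadColumn(1)] public string Text { get; set; }
}
```

The CV results are only read for their metrics; the saved model comes from the final `Fit` on all of the data.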
@justinormont - Good point. You mean that in cases where we have two datasets (a training dataset and a test dataset) we should also build the full dataset (merging both) and show how you must do a final train on 100% of the data once you are happy with the metrics, right?
But since that is the last step of the process, done after the iterations spent improving the metrics, do you think we should implement that code in the samples or only explain the process in the guidance?
We could certainly implement it in the sample, and while you are still iterating you would simply comment out the code for the final train?
Thoughts?
Cesar, I know what Justin is talking about. I noticed this issue too, where CV models are used as the final model, instead of doing step 8 above. I know how to fix it.
Sure, I understand the issue. I'm just saying that if you are still iterating, the sample code shouldn't run the full-dataset training at the end of the process; that's why you might want to comment out that last step until you need it. Let's chat about it offline. 👍
@justinormont, I have the following suggestions:
1) For cases where we have only one dataset, we can stick to your 9-step plan above. Do CV for evaluation and tuning, then train the final model on the one dataset, which is the full dataset.
2) I suggest we do not push for full retraining when the dataset already comes as separate train and test sets, as most Kaggle and public datasets do. In competitions, people combine the train and test sets because there is another, private test set for the final evaluation. So for our samples I think we should skip step #8. Mixing the train and test sets should be reserved for data scientists who know what they're doing, and for cases where there is a second test set for the final evaluation. I suggest keeping the samples' ML at the 100 level to match our audience, or we risk confusing them.
@shmoradims - I agree. Another variation between #1 and #2, when you have a single dataset, is to split the original full dataset 80%-20% in memory and then train with the 80% DataView and test with the 20% DataView. This is also simple to do and takes significantly less time than CV.
Yes, instead of CV, we can split the data to 80% train, 20% test ourselves, and not mix it back again.
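A hedged sketch of that variant, reusing the `mlContext`, `fullData`, and `pipeline` names from the sketch further up (the same `Evaluate` call applies when the test set comes as a separate file instead of from the split):

```csharp
// Hypothetical: hold out 20% in memory instead of running CV.
var split = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.2);

// Train on the 80% DataView only.
ITransformer model = pipeline.Fit(split.TrainSet);

// Evaluate on the held-out 20% DataView; it is never merged back into the training data.
var metrics = mlContext.BinaryClassification.Evaluate(model.Transform(split.TestSet));
Console.WriteLine($"Accuracy: {metrics.Accuracy}, AUC: {metrics.AreaUnderRocCurve}");
```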
I removed the "Titanic" sample review because there are licensing issues with its dataset, as I was told by LCA, so we're removing this sample, which at the end of the day was not very practical or enterprise-oriented.
@shmoradims - is it possible to add F# to the mix as well? Both are similar, but having an OK on the F# versions too will surely help. F# does have a slightly different style from a coding point of view, so the reviewer can validate that as well. cc/ @CESARDELATORRE @dsyme
@shmoradims - Cesar told me to ask you if you can do a quick review of the current samples already migrated to 0.11. From a data science and algorithms point of view they should be pretty similar to what we had in 0.8.
However, a few metrics are not good enough, such as in Sentiment Analysis (it should be higher) and Iris classification (it is actually 1, which looks like overfitting; it should be lower). Could you review them when possible, please?
Hi, I noticed that Normalization has been removed from the samples… perhaps one should explain the reasoning for this.
@PeterPann23 - What specific sample had normalization removed?
Have a look at the v1.0.0-preview-All-Samples and search for the API... I do not find much direct use. Bike-sharing mentions it in comments only.
@PeterPann23 - Normalization can be applied depending on each specific sample. Sometimes it makes sense, sometimes it doesn't; that's why I'm asking which specific sample you think should have normalization for certain columns.
I guess if it is not needed, one should definitely say so in the samples. I was under the assumption it was always needed so that the runtime data matches the static file.
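For reference, a hedged sketch of what adding normalization to one of the samples could look like with the v1.0-style API (the column names here are placeholders); it mostly matters for trainers that are sensitive to feature scale, such as the linear/SDCA ones, and much less for tree-based trainers:

```csharp
// Hypothetical pipeline: normalize the concatenated numeric features before a linear trainer.
var normalizedPipeline = mlContext.Transforms
    .Concatenate("Features", "NumColumnA", "NumColumnB")        // placeholder numeric columns
    .Append(mlContext.Transforms.NormalizeMinMax("Features"))   // min-max scale the Features column
    .Append(mlContext.Regression.Trainers.Sdca());
```

Because the normalizer is fitted as part of the pipeline and saved with the model, the same scaling is applied automatically at prediction time, so prediction input does not need to be pre-normalized to match the training file.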
Status

| Sample | DS Review |
| --- | --- |
| BinaryClassification_CreditCardFraudDetection | |
| BinaryClassification_SentimentAnalysis | |
| Clustering_CustomerSegmentation | |
| Clustering_Iris | |
| MulticlassClassification_Iris | |
| Regression_BikeSharingDemand | Ok. |
| Regression_TaxiFarePrediction | |
| MatrixFactorization_MovieRecommendation | MF using MFTrainer. Evaluation done as regressions. |
| MulticlassClassification-GitHubLabeler | |
| Regression-SalesForecast (eShopDashboardML) | |
| AnomalyDetection-Sales | |