dotnet / machinelearning-samples

Samples for ML.NET, an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

Review samples for correct data science approach and ML.NET API usage #81

Open · shmoradims opened 6 years ago

shmoradims commented 6 years ago

Status

| Folder | Sample | Data Science Review | API Review |
|---|---|---|---|
| C#\getting_started | BinaryClassification_CreditCardFraudDetection | OK (#96) | OK (v0.8) |
| C#\getting_started | BinaryClassification_SentimentAnalysis | OK | OK (v0.8) |
| C#\getting_started | Clustering_CustomerSegmentation | OK (#95) | OK (v0.8) |
| C#\getting_started | Clustering_Iris | OK (#109) | OK (v0.8) |
| C#\getting_started | MulticlassClassification_Iris | OK | OK (v0.8) |
| C#\getting_started | Regression_BikeSharingDemand | OK | OK (v0.8) |
| C#\getting_started | Regression_TaxiFarePrediction | OK (#95) | OK (v0.8) |
| C#\getting_started | DeepLearning_ImageClassification_TensorFlow | Pixel data preprocessing needed or not? Also, info/GitHub page for the Inception model used should be included | OK (v0.8) #155 |
| C#\getting_started | DeepLearning_TensorFlowEstimator | Same as above | Still 0.7 |
| C#\getting_started | MatrixFactorization_MovieRecommendation | OK | Still 0.7 |
| C#\end-to-end-apps | MulticlassClassification-GitHubLabeler | OK (#96) | OK (v0.8) |
| C#\end-to-end-apps | Recommendation-MovieRecommender | OK | Still 0.7 |
| C#\end-to-end-apps | Regression-SalesForecast | OK | OK (v0.8) |
| C#\getting_started | AnomalyDetection-Sales | OK | OK (v0.11) |

DS Review

- BinaryClassification_CreditCardFraudDetection
- BinaryClassification_SentimentAnalysis
- Clustering_CustomerSegmentation
- Clustering_Iris
- MulticlassClassification_Iris
- Regression_BikeSharingDemand: OK.
- Regression_TaxiFarePrediction
- MatrixFactorization_MovieRecommendation: MF using MFTrainer; evaluation is done as regression.
- MulticlassClassification-GitHubLabeler
- Regression-SalesForecast (eShopDashboardML)
- AnomalyDetection-Sales

justinormont commented 6 years ago

One topic of note: we should be pushing a full retrain after the CV/TrainTest.

The full retraining on the entire dataset gives you a better model. The CV/TrainTest gives you your pipeline's metrics (accuracy, AUC, NDCG, etc.). The full retrain, on 100% of the dataset, is the model to launch in production.

Hence we should form our examples as (see the sketch after this list):

  1. Input dataset (where is my data)
  2. Loader function (define columns)
  3. Feature engineering (process raw input data)
  4. Learner (define my pipeline's model type & hyperparameters)
  5. TrainTest / CV (get metrics for my pipeline)
  6. (Look at metrics)
  7. (Iterate to improve metrics -- goto step 3 or 4)
  8. Retrain a model for production on 100% of data
  9. Productionize model (point user to samples on how to host trained models in an App / web service)
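To make step 8 concrete, here is a minimal sketch of that flow against the later ML.NET 1.x API shape (the schema, file paths, and trainer choice are illustrative assumptions, not taken from any specific sample):

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema, for illustration only.
public class ModelInput
{
    [LoadColumn(0)] public float Feature1 { get; set; }
    [LoadColumn(1)] public float Feature2 { get; set; }
    [LoadColumn(2)] public bool Label { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Steps 1-2: load the full dataset ("data.csv" is a placeholder path).
        IDataView fullData = mlContext.Data.LoadFromTextFile<ModelInput>(
            "data.csv", hasHeader: true, separatorChar: ',');

        // Steps 3-4: feature engineering + learner.
        var pipeline = mlContext.Transforms
            .Concatenate("Features", nameof(ModelInput.Feature1), nameof(ModelInput.Feature2))
            .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression());

        // Steps 5-6: cross-validation measures the *pipeline*, not any single model.
        var cvResults = mlContext.BinaryClassification.CrossValidate(
            fullData, pipeline, numberOfFolds: 5);
        Console.WriteLine($"Avg AUC: {cvResults.Average(r => r.Metrics.AreaUnderRocCurve):0.###}");

        // Step 8: once the metrics look good, retrain on 100% of the data.
        ITransformer productionModel = pipeline.Fit(fullData);

        // Step 9: persist the production model for hosting in an app / web service.
        mlContext.Model.Save(productionModel, fullData.Schema, "model.zip");
    }
}
```

While still iterating (step 7), the final `Fit`/`Save` lines can be commented out to keep the loop fast, which is the workflow question discussed below.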
CESARDELATORRE commented 6 years ago

@justinormont - Good point. You mean, then, that in the cases where we have two datasets (a training dataset and a test dataset), we should also have the full dataset (merging both) and show that you must also do a final training run on 100% of the data once you are happy with the metrics, right?

But since that is the last step in the process, done after the iterations to improve the metrics, do you think we should implement that code in the samples, or only explain the process in the guidance?

We could certainly implement it in the sample, and while you are still iterating you'd simply comment out the code for the final training run?

Thoughts?

shmoradims commented 6 years ago

Cesar, I know what Justin is talking about. I noticed this issue too, where CV models are used as the final model, instead of doing step 8 above. I know how to fix it.

CESARDELATORRE commented 6 years ago

Sure, I understand the issue. I'm just saying that if you are still iterating, the sample code shouldn't run the full-dataset training at the end of the process; that's why you might want to comment out that last step until you're ready. Let's chat about it offline. 👍

shmoradims commented 6 years ago

@justinormont, I have the following suggestions:

1) For cases where we have only one dataset, we can stick to your 9-step plan above: do CV for evaluation and tuning, then train the final model on the one dataset, which is the full dataset.

2) I suggest we do not push for full retraining if the dataset already comes as separate train and test sets, as most Kaggle and public datasets do. For competitions, people combine the train and test sets because there's another private test set for the final evaluation. So for our samples, I think we should skip step #8. Mixing train and test sets should be reserved for data scientists who know what they're doing, and for cases where there's a second test set for the final evaluation. I suggest keeping the samples' ML at the 100 level to match our audience, or we risk confusing them.

CESARDELATORRE commented 6 years ago

@shmoradims - I agree. Another variation between #1 and #2, when you have a single dataset, is to split the original full dataset 80%/20% in memory and then train with the 80% DataView and test with the 20% DataView. This is also simple to do and takes significantly less time than CV.

shmoradims commented 6 years ago

Yes, instead of CV, we can split the data into 80% train / 20% test ourselves, and not mix it back together again.
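In ML.NET 1.x terms, that split is roughly the following (a minimal sketch reusing the hypothetical `ModelInput` and `pipeline` from the earlier sketch; all names are illustrative):

```csharp
using System;
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);
IDataView fullData = mlContext.Data.LoadFromTextFile<ModelInput>(
    "data.csv", hasHeader: true, separatorChar: ',');

// Split once: 80% train / 20% test, instead of k-fold CV.
var split = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.2);

// `pipeline` is assumed to be the same Concatenate + SdcaLogisticRegression
// estimator as in the earlier sketch. Train only on the 80%, evaluate on the
// held-out 20%, and do not mix them back together.
ITransformer model = pipeline.Fit(split.TrainSet);
var metrics = mlContext.BinaryClassification.Evaluate(model.Transform(split.TestSet));
Console.WriteLine($"AUC on the held-out 20%: {metrics.AreaUnderRocCurve:0.###}");
```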

CESARDELATORRE commented 6 years ago

I removed the "Titanic" sample review because there are licensing issues with its dataset, as I was told by LCA, so we're removing this sample, which at the end of the day was not very practical or enterprise-oriented anyway.

kunjee17 commented 6 years ago

@shmoradims is it possible to add F# into the mix as well? The C# and F# samples are similar, but having an OK next to the F# ones would surely help. F# does have a slightly different coding style, so a reviewer could validate that as well. cc/ @CESARDELATORRE @dsyme

prathyusha12345 commented 5 years ago

@shmoradims - Cesar told me to ask you if you can do a quick review of the current samples already migrated to 0.11. In terms of data science, algorithms, etc., they should be pretty similar to when we had them in 0.8.

However, a few metrics are not good enough, like in Sentiment Analysis (should be higher) and Iris classification (it is actually 1, which looks like overfitting; it should be lower). Could you review them when possible, please?

PeterPann23 commented 5 years ago

Hi, I noticed that Normalization has been removed from the samples… perhaps one should explain the reasoning for this.

CESARDELATORRE commented 5 years ago

@PeterPann23 - What specific sample had normalization removed?

PeterPann23 commented 5 years ago

Have a look at the v1.0.0-preview-All-Samples and search for the API... I do not find much direct use. Bike-sharing mentions it in comments only.

CESARDELATORRE commented 5 years ago

@PeterPann23 - Whether normalization should be applied depends on each specific sample. Sometimes it makes sense, sometimes it doesn't; that's why I'm asking which specific sample you think should have normalization for certain columns.
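For context, adding explicit normalization in ML.NET is a single transform appended to the pipeline. A minimal sketch (column names are illustrative; note also that several ML.NET trainers normalize features internally by default, which is one reason a sample may omit the explicit step):

```csharp
// Hypothetical pipeline fragment: min-max scale the "Features" column into
// [0, 1] before the trainer sees it. Trainers like SDCA often normalize
// implicitly, so the explicit step is not always required.
var pipeline = mlContext.Transforms
    .Concatenate("Features", "Feature1", "Feature2")
    .Append(mlContext.Transforms.NormalizeMinMax("Features"))
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression());
```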

PeterPann23 commented 5 years ago

I guess if it's not needed, one should definitely say so in the samples. I was under the assumption that it was always needed for the runtime data to match the static file.