Everything is showing True in Is Toxic.

dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.

Creative Commons Attribution 4.0 International


Everything is showing True in Is Toxic. #308

Closed: piyushag-git closed this issue 4 years ago

piyushag-git commented 4 years ago

Problem encountered on: https://dotnet.microsoft.com/learn/ml-dotnet/get-started-tutorial/evaluate
Operating System: Windows

Provide details about the problem you are experiencing. Include your operating system version, exact error message, code sample, and anything else that is relevant.

JakeRadMSFT commented 4 years ago

Hello, thanks for reporting! Can you expand on what you mean by "everything"? Have you tried specific examples from the test data source? Since the tutorial trains on a small dataset, it might not be that accurate for custom text that you enter.

Please train with a larger dataset or try examples from the training dataset.

CESARDELATORRE commented 4 years ago

I recommend trying the YELP sentiment analysis dataset. It's not used in the getting started tutorial because we have no legal approval for that dataset. It also doesn't come with a header in the file, so you'll need to edit the file and add the column names if you want to use it from Model Builder.

Try it out: https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment%20labelled%20sentences.zip

If using the CLI, here are the steps: https://docs.microsoft.com/en-us/dotnet/machine-learning/tutorials/mlnet-cli?tabs=windows
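The header edit mentioned above can also be scripted instead of done by hand. This is a minimal sketch, assuming the file is tab-separated with one sentence and one 0/1 label per line; the column names `Text` and `Sentiment` and the file name `yelp_labelled.txt` are illustrative assumptions, not names confirmed in this thread.

```python
from pathlib import Path
from typing import Sequence


def add_header(text: str, column_names: Sequence[str], delimiter: str = "\t") -> str:
    """Return `text` with a header line prepended, unless it is already there."""
    header = delimiter.join(column_names)
    lines = text.splitlines()
    if lines and lines[0] == header:
        return text  # header already present, leave the content untouched
    return header + "\n" + text


def add_header_to_file(src: str, dst: str, column_names: Sequence[str]) -> None:
    """Read `src`, prepend the header, and write the result to `dst`."""
    content = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(add_header(content, column_names), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical column names and sample rows, for illustration only.
    sample = "Good food.\t1\nBad service.\t0\n"
    print(add_header(sample, ["Text", "Sentiment"]), end="")
    # For a real file (name assumed):
    # add_header_to_file("yelp_labelled.txt", "yelp_with_header.tsv", ["Text", "Sentiment"])
```

Writing to a separate output file keeps the original download unchanged, so the step can be repeated safely.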

justinormont commented 4 years ago

We should increase the size of the sample used in the getting started guide.

The 250-line sampled dataset is currently sized for a unit test; a dataset that small is not valuable for example code, where a useful model is needed. Perhaps we should increase the size of the sample for use in the example. If the same dataset is used in unit tests, we should ensure the runtime of the unit tests is not adversely impacted. The only point of the unit tests is to see if something changed; they are not expected to make useful models.

Similar issues were mentioned in the main repo, in https://github.com/dotnet/machinelearning/issues/708#issuecomment-425706329, which says:

Try running on the full-sized dataset: https://aka.ms/tlc-resources/benchmarks/WikiDetoxAnnotated160kRows.tsv

The wikipedia-detox-250-line-data.tsv dataset is a 250-row sample of the original 160k rows. Training on a sample this small won't create a useful model.
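If the full 160k-row file is too slow to train on, an intermediate-sized sample is one option. The sketch below is only an illustration of drawing a reproducible random sample while keeping the header row; it assumes the first line of the TSV is a header, and the function name, sample size, and output file name are hypothetical.

```python
import random
from typing import List


def sample_rows(lines: List[str], n: int, seed: int = 0) -> List[str]:
    """Keep the header line and return n randomly chosen data rows in original order."""
    header, rows = lines[0], lines[1:]
    if n >= len(rows):
        return [header] + rows  # nothing to cut, return everything
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    keep = sorted(rng.sample(range(len(rows)), n))
    return [header] + [rows[i] for i in keep]


if __name__ == "__main__":
    # Synthetic stand-in data; a real run would read the downloaded TSV instead:
    # lines = open("WikiDetoxAnnotated160kRows.tsv", encoding="utf-8").read().splitlines()
    lines = ["Sentiment\tSentimentText"] + [f"{i % 2}\ttext {i}" for i in range(1000)]
    subset = sample_rows(lines, n=100, seed=42)
    print(len(subset) - 1, "data rows kept")
```

A plain random sample does not guarantee the same label balance as the full file; if the labels are heavily skewed, sampling each class separately may be a better choice.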

JakeRadMSFT commented 4 years ago

Resolving due to no activity. If you've tried a larger dataset and are still hitting this issue, please re-open. We're tracking the task of improving our sample dataset in #323.

Thanks.