dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Update default n-gram length for Text Transform to match default text recipe #2870

Closed daholste closed 4 years ago

daholste commented 5 years ago

@justinormont and the text team tuned the default n-gram lengths for the default text recipe in the internal repo.

These defaults are:

Word -- bigrams (w/ unigrams)

Character -- trigrams (w/o unigrams and bigrams)
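To make the two settings concrete, here is a minimal sketch of what "bigrams w/ unigrams" and "trigrams w/o unigrams and bigrams" mean in terms of the features produced. It is plain Python, not ML.NET code: the function names, the `use_all_lengths` flag, and the whitespace tokenizer are all hypothetical, chosen only for illustration, and ML.NET's real text transform does more (normalization, dictionaries, weighting).

```python
from typing import List


def ngrams(tokens: List[str], n: int, use_all_lengths: bool, sep: str) -> List[str]:
    """Return n-grams over tokens.

    If use_all_lengths is True, lengths 1..n are all emitted;
    otherwise only n-grams of exactly length n are emitted.
    (Hypothetical helper, not an ML.NET API.)
    """
    lengths = range(1, n + 1) if use_all_lengths else [n]
    out = []
    for k in lengths:
        for i in range(len(tokens) - k + 1):
            out.append(sep.join(tokens[i:i + k]))
    return out


def word_features(text: str) -> List[str]:
    # Word setting from the issue: bigrams, with unigrams included.
    # Assumes simple lowercase + whitespace tokenization.
    return ngrams(text.lower().split(), n=2, use_all_lengths=True, sep=" ")


def char_features(text: str) -> List[str]:
    # Character setting from the issue: trigrams only,
    # without unigrams and bigrams.
    return ngrams(list(text.lower()), n=3, use_all_lengths=False, sep="")


if __name__ == "__main__":
    print(word_features("The quick fox"))
    print(char_features("fox"))
```

Under these assumptions, the word setting emits every single word plus every adjacent word pair, while the character setting emits only three-character windows.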

One chart from his findings: [image]

The line w/ the light blue call-out represents the current ML.NET defaults (Unigram + Trichar).

The line w/ the light green call-out is the requested change (Bigram + Trichar).

The line w/ the pink call-out shows that Trigram + Trichar is better in terms of accuracy, but with a time hit, and accuracy has a crossover at NumIterations > 8 for the Averaged Perceptron learner.

rogancarr commented 5 years ago

Related to #2802

zeahmed commented 5 years ago

@justinormont and @shauheen, do you want this to go in V1.0?

justinormont commented 5 years ago

That's up to @shauheen. I'd say yes, as there are strong accuracy upsides. You'll notice the large jump in accuracy (y-axis) when we move from the blue line to the green line in the above graph.

The power of defaults should never be underestimated.

Related: https://github.com/dotnet/machinelearning/issues/2305

najeeb-kazmi commented 4 years ago

Tracking in #4749