dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.91k stars 1.86k forks source link

Add Tiktoken's missing model names #7111

Closed tarekgh closed 2 months ago

tarekgh commented 2 months ago

Adding gpt-35-turbo model names used by Azure deployment.

tarekgh commented 2 months ago

CC @ericstj @luisquintanilla

stephentoub commented 2 months ago

Just thinking out loud, I'm wondering if rather than failing if we don't match a model name, if we should instead just default to the most recent vocab data we have.

ericstj commented 2 months ago

Just thinking out loud, I'm wondering if rather than failing if we don't match a model name, if we should instead just default to the most recent vocab data we have.

I don't think that's right. Someone might have had a typo and actually needs different data. For example: gpt-2 instead of gpt2

stephentoub commented 2 months ago

Just thinking out loud, I'm wondering if rather than failing if we don't match a model name, if we should instead just default to the most recent vocab data we have.

I don't think that's right. Someone might have had a typo and actually needs different data. For example: gpt-2 instead of gpt2

Yup, that's the downside. The upside is as new models are added, there's a higher liklihood they'll just work, and typos on the models folks mostly care about right now would just work.

Maybe what I'm really wondering about is having this be so loosey-goosey with supplying an arbitrary string name. Is there something better we can think of?

ericstj commented 2 months ago

We previously discussed allowing for the encoding names to be used as well. That could help if some new model came along with an existing encoding we knew about. If it's brand new then they need to provide the files and other info.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.44%. Comparing base (4d5317e) to head (f0aab77). Report is 1 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #7111 +/- ## ========================================== - Coverage 68.44% 68.44% -0.01% ========================================== Files 1263 1263 Lines 254834 254838 +4 Branches 26334 26334 ========================================== + Hits 174429 174430 +1 - Misses 73695 73697 +2 - Partials 6710 6711 +1 ``` | [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7111/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | | |---|---|---| | [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7111/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.44% <100.00%> (-0.01%)` | :arrow_down: | | [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7111/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `62.84% <100.00%> (-0.01%)` | :arrow_down: | | [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7111/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.57% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7111?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | | |---|---|---| | [src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7111?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FTiktoken.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuLmNz) | `69.55% <100.00%> (+0.25%)` | :arrow_up: | | [test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7111?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FTitokenTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9UaXRva2VuVGVzdHMuY3M=) | `100.00% <ø> (ø)` | | ... and [1 file with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7111/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)
tarekgh commented 2 months ago

I agree with @ericstj. I am seeing it is better to fail instead of returning something that may not match what the user want. Also, we can enhance the exception message to point at the table we are mapping so users can know the supported model names and which encoding files are used. This should make it easy for users to try one other supported name and move on.