dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

[Tokenizers] Implement WordPiece algorithm #6988

Open ericstj opened 4 months ago

ericstj commented 4 months ago

The WordPiece algorithm should be added to Microsoft.ML.Tokenizers. WordPiece is the basis for BERTTokenizer-based models, and it is needed for E5.

Reference implementations are available in https://github.com/microsoft/BlingFire (MIT license) and https://github.com/huggingface/tokenizers (Apache license).

The paper it is based on: https://arxiv.org/abs/1609.08144 (PDF: https://arxiv.org/pdf/1609.08144.pdf)
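
For context, the encoding side of WordPiece as used by BERT-style tokenizers is a greedy longest-match-first lookup against a fixed vocabulary. Below is a minimal sketch of that step in Python. It is not the proposed Microsoft.ML.Tokenizers API: the function name `wordpiece_tokenize`, its parameters, and the toy vocabulary are hypothetical, and the `##` continuation prefix, `[UNK]` token, and 100-character word limit are assumed from the defaults commonly used by BERT vocabularies. It also assumes the input has already been split into words; normalization and pre-tokenization are out of scope here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##", max_chars=100):
    """Split one pre-tokenized word into WordPiece tokens (greedy longest match).

    All names and defaults here are illustrative assumptions, not a real API.
    """
    # Very long words are mapped to the unknown token instead of being searched.
    if len(word) > max_chars:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            # Pieces that do not start the word carry the continuation prefix.
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        # If no piece matches at this position, the whole word is unknown.
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"un", "##aff", "##able", "[UNK]"}

print(wordpiece_tokenize("unaffable", TOY_VOCAB))
print(wordpiece_tokenize("xyz", TOY_VOCAB))
```

With the toy vocabulary above, `unaffable` splits into `un`, `##aff`, `##able`, while `xyz` has no matching piece and falls back to `[UNK]`. A production implementation would likely replace the inner loop with a trie or similar structure to avoid repeated substring lookups.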