Open ericstj opened 4 months ago
The WordPiece algorithm should be added to Microsoft.ML.Tokenizers. WordPiece is the basis for BERTTokenizer-based models, and it is needed for E5.
Reference implementations are available in https://github.com/microsoft/BlingFire (MIT license) and https://github.com/huggingface/tokenizers (Apache license).
The paper it is based on: https://arxiv.org/abs/1609.08144 (PDF: https://arxiv.org/pdf/1609.08144.pdf)
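
For context, here is a rough sketch of the greedy longest-match-first lookup that BERT-style WordPiece tokenizers perform on a single pre-split word. It is illustrative only and is not taken from BlingFire, huggingface/tokenizers, or Microsoft.ML.Tokenizers: the function name `wordpiece_tokenize`, its parameters, and the toy vocabulary are all hypothetical, and the `##` continuation prefix, `[UNK]` token, and 100-character limit are assumed to follow the usual BERT conventions.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##", max_chars=100):
    """Split one word into WordPiece tokens by greedy longest-match-first.

    Hypothetical helper for illustration; `vocab` is any set of known pieces.
    """
    # Assumption: overly long words map straight to the unknown token.
    if len(word) > max_chars:
        return [unk_token]

    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces that do not begin the word carry the continuation prefix.
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary, made up for this example.
TOY_VOCAB = {"un", "##aff", "##able", "token", "##izer", "##s"}

print(wordpiece_tokenize("unaffable", TOY_VOCAB))
print(wordpiece_tokenize("tokenizers", TOY_VOCAB))
print(wordpiece_tokenize("xyz", TOY_VOCAB))
```

A real implementation would also need the pre-tokenization step (whitespace and punctuation splitting, optional lowercasing and accent stripping) and a vocabulary loaded from the model's vocab file, which this sketch leaves out.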