dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.92k stars 1.86k forks source link

[Tokenizers] Port BERTTokenizers #6991

Open ericstj opened 4 months ago

ericstj commented 4 months ago

Porting BERTTokenizers enables several text embedding generation models. Requires https://github.com/dotnet/machinelearning/issues/6988.

https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#text-embeddings. https://github.com/huggingface/transformers/blob/v4.37.0/src/transformers/models/bert/tokenization_bert.py#L137

cc @luisquintanilla

We already have some BERT implementation which may be sufficient.