dsyme closed this 11 months ago
@dsyme there are valid reasons why HuggingFace dedicates an entire library ONLY to a wide variety of tokenization approaches. Hopefully some of us have sufficient experience with that.
The .NET community has a number of excellent NLP projects. However, ONLY one is truly transformer-based (Seq2SeqSharp), and transformers are where current and future NLP AI is heading; that is the main reason HuggingFace exists and has become an industry standard.
I suggest that, given sufficient feedback, we have two tokenization libraries, NOT ONE SINGLE common tokenization library: one dedicated to adhering as closely as possible to HuggingFace, and the other called, e.g., the common tokenization library.
We are sorting this out here now; please join the discussion.
The tokenizers considered in TorchText.Data Utils.cs are
Please take a look at https://github.com/microsoft/BlingFire for tokenization.
@NiklasGustafsson BlingFire is a good source of HuggingFace tokenizers [Maintained by Microsoft!!!] => Thanks @aorgish for sharing.
This will be addressed outside the scope of TorchSharp.
@NiklasGustafsson see https://github.com/fslaborg/FsLab/discussions/6