dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License

Proposal for common .NET tokenization library #248

Closed dsyme closed 11 months ago

dsyme commented 3 years ago

@NiklasGustafsson see https://github.com/fslaborg/FsLab/discussions/6

GeorgeS2019 commented 3 years ago

@dsyme there are valid reasons why HuggingFace dedicates an entire library ONLY to tokenization, in a wide variety of forms. Hopefully some of us have sufficient experience with that.

The .NET community has a number of excellent NLP projects. However, ONLY one is truly transformer-based (Seq2SeqSharp), and transformers are where current and future NLP AI is heading, which is the main motivation behind HuggingFace and its rise to an industry standard.

GeorgeS2019 commented 3 years ago

I suggest that, with sufficient feedback, we have two tokenization libraries, NOT ONE SINGLE common tokenization library: one dedicated to adhering as closely as possible to HuggingFace, and the other called, e.g., the common tokenization library.

GeorgeS2019 commented 3 years ago

We are now sorting this out here; please join the discussion.

The tokenizers considered in TorchText.Data Utils.cs are:

aorgish commented 3 years ago

Please take a look at https://github.com/microsoft/BlingFire for tokenization.

GeorgeS2019 commented 3 years ago

@NiklasGustafsson BlingFire is a good source of HuggingFace tokenizers [Maintained by Microsoft!!!] => Thanks @aorgish for sharing.

NiklasGustafsson commented 11 months ago

This will be addressed outside the scope of TorchSharp.