dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License

Is TorchText implemented? #1381

Open przemyslawbak opened 2 weeks ago

przemyslawbak commented 2 weeks ago

The TorchSharp text classification example uses TorchText to load the dataset.

I am not sure what I am doing wrong, but I cannot find any using directive that imports this library.

For the TorchSharp MNIST example, I did manage to find and install the proper NuGet package to use torchvision.

Is TorchText implemented for .NET?

If not, how can I load data from a CSV file instead? I do not know what data type should be used for var reader in the example. I'm confused.

yueyinqiu commented 2 weeks ago

I think we don't have torchtext support currently, and I've found the class in Examples.Utils.

NiklasGustafsson commented 2 weeks ago

We do not have that implemented.

Maybe @luisquintanilla can comment on some of the text-based preprocessing primitives we've added to ML.NET -- there are a few new tokenizers there, which should be usable with TorchSharp.

GeorgeS2019 commented 5 days ago

@LittleLittleCloud

Could you share your view on which recent developments in ML.NET regarding deep NLP could be relevant for advancing a TorchText project using TorchSharp?

References


TorchText from PyTorch

# PyTorch TorchText

- [torchtext.nn](https://pytorch.org/text/stable/nn_modules.html#)
- [torchtext.data.functional](https://pytorch.org/text/stable/data_functional.html)
- [torchtext.data.metrics](https://pytorch.org/text/stable/data_metrics.html)
- [torchtext.data.utils](https://pytorch.org/text/stable/data_utils.html)
- [torchtext.datasets](https://pytorch.org/text/stable/datasets.html)
- [torchtext.vocab](https://pytorch.org/text/stable/vocab.html)
- [torchtext.utils](https://pytorch.org/text/stable/utils.html)
- [torchtext.transforms](https://pytorch.org/text/stable/transforms.html)
- [torchtext.functional](https://pytorch.org/text/stable/functional.html)
- [torchtext.models](https://pytorch.org/text/stable/models.html)

# Tutorials

- Text classification with [XLM-RoBERTa model](https://pytorch.org/text/stable/tutorials/sst2_classification_non_distributed.html)
- [T5-Base Model](https://pytorch.org/text/stable/tutorials/t5_demo.html) for Summarization, Sentiment Classification, and Translation

---

# Tokenizers/Transform from PyTorch

https://pytorch.org/text/stable/transforms.html

## Tokenizers

- [ ] SentencePieceTokenizer
- [ ] GPT2BPETokenizer
- [ ] CLIPTokenizer
- [ ] RegexTokenizer
- [ ] BERTTokenizer
- [ ] CharBPETokenizer

## Transform

- VocabTransform
- PadTransform
- StrToIntTransform

## Utils

- ToTensor
- LabelToIndex
- Truncate
- AddToken
- Sequential

Microsoft.ML.Tokenizers

## Microsoft.ML.Tokenizers

- Microsoft.ML.Tokenizers
- Microsoft.ML.Tokenizers.Data.Cl100kBase
- Microsoft.ML.Tokenizers.Data.Gpt2
- Microsoft.ML.Tokenizers.Data.O200kBase
- Microsoft.ML.Tokenizers.Data.P50kBase
- Microsoft.ML.Tokenizers.Data.R50kBase

----

# [Microsoft.ML.Tokenizers](https://github.com/dotnet/machinelearning/tree/main/src/Microsoft.ML.Tokenizers)

## Models

- BPETokenizer.cs
- BertTokenizer.cs
- CodeGenTokenizer.cs
- EnglishRobertaTokenizer.cs
- LlamaTokenizer.cs
- Phi2Tokenizer.cs
- SentencePieceTokenizer.cs
- TiktokenTokenizer.cs
- WordPieceTokenizer.cs

---

- Merge.cs
- ModelSourceGenerationContext.cs
- Pair.cs
- Symbol.cs
- Word.cs
- Cache.cs

## Normalizers

- BertNormalizer.cs
- LowerCaseNormalizer.cs
- Normalizer.cs
- SentencePieceNormalizer.cs
- UpperCaseNormalizer.cs

## PreTokenizers

- PreTokenizer.cs
- RegexPreTokenizer.cs
- RobertaPreTokenizer.cs