dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License

Missing TorchText.NN feature #257

Closed: GeorgeS2019 closed this issue 2 years ago

GeorgeS2019 commented 3 years ago

The recent SequenceToSequence.cs example is an excellent port of [Tutorial 1]: SEQUENCE-TO-SEQUENCE MODELING WITH NN.TRANSFORMER AND TORCHTEXT.
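
For context, the PyTorch side of that tutorial builds roughly the model sketched below. This is a minimal illustration (made-up hyperparameters, positional encoding omitted), not the actual code of SequenceToSequence.cs:

import math
import torch
import torch.nn as nn

class TransformerModel(nn.Module):
    """Word-level language model built around nn.TransformerEncoder."""
    def __init__(self, ntoken, d_model=200, nhead=2, d_hid=200, nlayers=2, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(ntoken, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.encoder = nn.TransformerEncoder(encoder_layer, nlayers)
        self.decoder = nn.Linear(d_model, ntoken)
        self.d_model = d_model

    def forward(self, src, src_mask=None):
        x = self.embedding(src) * math.sqrt(self.d_model)  # positional encoding omitted in this sketch
        x = self.encoder(x, src_mask)
        return self.decoder(x)                              # per-position logits over the vocabulary

model = TransformerModel(ntoken=28782)                      # vocabulary size is illustrative
logits = model(torch.randint(0, 28782, (35, 20)))           # (seq_len, batch) token ids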

Together with the SequenceToSequence modeling example, a number of TorchText classes were implemented, organized according to TorchText namespaces.

It would make TorchSharp more complete, in terms of TorchText, if the following feature and example were implemented.

NiklasGustafsson commented 3 years ago

@GeorgeS2019 -- I have been thinking about the side packages (TorchText, TorchVision, etc.) for a while. Because I didn't have a good idea of the full scenarios, I just added the functionality to the Examples as utilities. I think it's a good idea to start working on these packages for real, though.

I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision.

That said, torchvision is another package that we need to get included, but I'll probably start working on the text functionality first.

GeorgeS2019 commented 3 years ago

@NiklasGustafsson The .NET Transformer framework Seq2SeqSharp has, like TorchText, integrated Transformers with multi-head attention.

The TorchText.NN module, which provides Transformers with multi-head attention, is in my view essential for TorchSharp.
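
For reference, that building block is exposed in Python roughly as below. This is a minimal sketch against the torchtext 0.x API; the sizes and tensor shapes are illustrative:

import torch
from torchtext.nn import MultiheadAttentionContainer, InProjContainer, ScaledDotProduct

embed_dim, nhead, bsz, seq_len = 16, 4, 8, 21
in_proj = InProjContainer(torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim))
mha = MultiheadAttentionContainer(nhead, in_proj, ScaledDotProduct(),
                                  torch.nn.Linear(embed_dim, embed_dim))

query = torch.rand(seq_len, bsz, embed_dim)           # (seq_len, batch, embed_dim)
key = value = torch.rand(seq_len, bsz, embed_dim)
attn_output, attn_weights = mha(query, key, value)    # attn_output: (seq_len, batch, embed_dim)

A TorchSharp port would presumably mirror this container-style composition (projection container, attention layer, output projection).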

@zhongkaifu FYI

GeorgeS2019 commented 3 years ago

@NiklasGustafsson

There are 6 NLP tutorials: 3 are NLP-from-scratch tutorials, independent of TorchText, and the other 3 use TorchText (which is designed to make NLP deep learning more industry-standard).

GeorgeS2019 commented 3 years ago

@NiklasGustafsson

"I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision."

Perhaps it is more important to focus on the LAST remaining TorchText tutorials listed above. The multi-head implementation discussed in the Tutorial 6 I quoted was, if I am not mistaken, implemented independently of TorchText.NN's multi-head attention.

There is a lengthy discussion of why the multi-head feature was moved to TorchText.NN, for various reasons, many of which are beyond my comprehension :-)

NiklasGustafsson commented 3 years ago

Thanks for those thoughts. The first of the two NLP tutorials that you list is, I believe, implemented here. The second one, the language translation example, should be a great one to tackle next.

NiklasGustafsson commented 3 years ago

The translation tutorial depends on the 'spacy' package for language tokenization, and I suspect there isn't anything similar for .NET. This speaks to a broader need to specify, design, and prioritize data-processing libraries for TorchSharp.

GeorgeS2019 commented 3 years ago

@NiklasGustafsson

FYI: Related discussions => Proposal - .NET tokenization library & Proposal for common .NET tokenization library

The data-processing step anticipates different types of tokenizers, and spaCy is only one of them.

TorchText.Data Utils.cs

from torchtext.data.utils import get_tokenizer

de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
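
For reference, each object returned by get_tokenizer is simply a callable that maps a string to a list of token strings (this assumes spaCy and the two language models above are installed):

print(de_tokenizer("Zwei junge Männer."))   # e.g. ['Zwei', 'junge', 'Männer', '.']
print(en_tokenizer("Two young men."))       # e.g. ['Two', 'young', 'men', '.']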

The tokenizers considered in TorchText.Data Utils.cs are: spacy, moses, toktok, revtok, and subword (see the get_tokenizer excerpt below).

I have not seen all of them ported or made available to .NET.

=> I am interested to learn from others what possible substitutes exist for the above tokenizers without resorting to Python.NET.

@zhongkaifu => any feedback?

I know sub-word tokenization is really useful for text generation tasks; an MT task, for example, can gain 2~3 BLEU points on average. Some NN frameworks have integrated sub-word tokenization; Marian, for instance, uses built-in SentencePiece for its data-processing step.

FYI: SentencePiece is being implemented in TorchText.Data => functional.py
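
For reference, the SentencePiece helpers in torchtext.data.functional are used roughly as below. This is a minimal sketch; corpus.txt and the m_user model prefix are illustrative names, and training assumes the corpus is large enough for the requested vocabulary size:

from torchtext.data.functional import (
    generate_sp_model, load_sp_model,
    sentencepiece_tokenizer, sentencepiece_numericalizer)

# Train a SentencePiece model on a plain-text corpus (writes m_user.model and m_user.vocab).
generate_sp_model('corpus.txt', vocab_size=2000, model_prefix='m_user')

sp_model = load_sp_model('m_user.model')
tokenize = sentencepiece_tokenizer(sp_model)           # iterables of sentences -> lists of sub-word strings
numericalize = sentencepiece_numericalizer(sp_model)   # iterables of sentences -> lists of sub-word ids

print(list(tokenize(["sentencepiece encode as pieces"])))
print(list(numericalize(["sentencepiece encode as pieces"])))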

    # Excerpt from torchtext's get_tokenizer(), showing the tokenizer-selection branches:
    if tokenizer == "spacy":
        try:
            import spacy
            try:
                spacy = spacy.load(language)
            except IOError:
                # Model shortcuts no longer work in spaCy 3.0+, try using fullnames
                # List is from https://github.com/explosion/spaCy/blob/b903de3fcb56df2f7247e5b6cfa6b66f4ff02b62/spacy/errors.py#L789
                OLD_MODEL_SHORTCUTS = spacy.errors.OLD_MODEL_SHORTCUTS if hasattr(spacy.errors, 'OLD_MODEL_SHORTCUTS') else {}
                if language not in OLD_MODEL_SHORTCUTS:
                    raise
                import warnings
                warnings.warn(f'Spacy model "{language}" could not be loaded, trying "{OLD_MODEL_SHORTCUTS[language]}" instead')
                spacy = spacy.load(OLD_MODEL_SHORTCUTS[language])
            return partial(_spacy_tokenize, spacy=spacy)
        except ImportError:
            print("Please install SpaCy. "
                  "See the docs at https://spacy.io for more information.")
            raise
        except AttributeError:
            print("Please install SpaCy and the SpaCy {} tokenizer. "
                  "See the docs at https://spacy.io for more "
                  "information.".format(language))
            raise
    elif tokenizer == "moses":
        try:
            from sacremoses import MosesTokenizer
            moses_tokenizer = MosesTokenizer()
            return moses_tokenizer.tokenize
        except ImportError:
            print("Please install SacreMoses. "
                  "See the docs at https://github.com/alvations/sacremoses "
                  "for more information.")
            raise
    elif tokenizer == "toktok":
        try:
            from nltk.tokenize.toktok import ToktokTokenizer
            toktok = ToktokTokenizer()
            return toktok.tokenize
        except ImportError:
            print("Please install NLTK. "
                  "See the docs at https://nltk.org  for more information.")
            raise
    elif tokenizer == 'revtok':
        try:
            import revtok
            return revtok.tokenize
        except ImportError:
            print("Please install revtok.")
            raise
    elif tokenizer == 'subword':
        try:
            import revtok
            return partial(revtok.tokenize, decap=True)
        except ImportError:
            print("Please install revtok.")
            raise
    raise ValueError("Requested tokenizer {}, valid choices are a "
                     "callable that takes a single string as input, "
                     "\"revtok\" for the revtok reversible tokenizer, "
                     "\"subword\" for the revtok caps-aware tokenizer, "
                     "\"spacy\" for the SpaCy English tokenizer, or "
                     "\"moses\" for the NLTK port of the Moses tokenization "
                     "script.".format(tokenizer))
NiklasGustafsson commented 2 years ago

@GeorgeS2019 -- in light of more recent discussions about tokenization, I'm closing this issue as outdated. If you disagree, please reopen with an explanation of what you think needs to be tracked.