@GeorgeS2019 -- I have been thinking about the side packages (TorchText, TorchVision, etc.) for a while. Because I didn't have a good idea of the full scenarios, I just added the functionality to the Examples, as utils. I think it's a good idea to start working on these packages for real, though.
I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision.
That said, torchvision is another package that we need to get included, but I'll probably start working on the text functionality first.
@NiklasGustafsson The .NET Transformer framework Seq2SeqSharp has, like TorchText, integrated Transformers with Multi-Head Attention.
The TorchText.NN module that provides Transformers with Multi-Head Attention is, in my view, essential for TorchSharp.
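To make concrete what that module computes, here is a minimal sketch of scaled dot-product attention, the core of multi-head attention, written against TorchSharp's tensor API. This is a sketch only; torch.matmul, Tensor.transpose, and torch.nn.functional.softmax are assumed to be available, and exact overloads may differ between TorchSharp versions.

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

static class AttentionSketch
{
    // query, key, value: (batch, seq_len, head_dim)
    public static Tensor ScaledDotProduct(Tensor query, Tensor key, Tensor value)
    {
        long dk = query.shape[query.shape.Length - 1];           // head dimension
        // Attention scores: (batch, seq_len, seq_len), scaled by sqrt(d_k).
        var scores = torch.matmul(query, key.transpose(1, 2)) / Math.Sqrt(dk);
        var weights = torch.nn.functional.softmax(scores, -1);   // attention weights
        return torch.matmul(weights, value);                     // (batch, seq_len, head_dim)
    }
}
```

In self-attention, query, key, and value are all projections of the same input; multi-head attention runs this computation in parallel over several smaller head dimensions and concatenates the results.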
@zhongkaifu FYI
@NiklasGustafsson
There are 6 NLP tutorials: 3 are NLP from scratch and independent of TorchText; the other 3 use TorchText (designed to make NLP deep-learning design more industry standard).
@NiklasGustafsson
I'm a little confused about Tutorial 6 -- it seems to be a vision tutorial, and I don't find any use of torchtext there, just torchvision.
Perhaps it is more important to focus on the LAST remaining TorchText tutorials as listed above. The multi-head implementation discussed in the Tutorial 6 I quoted was, if I am not wrong, implemented independently of TorchText.NN's multi-head attention.
There is a lengthy discussion of why the multi-head feature was moved to TorchText.NN, for various reasons, many of which are beyond my comprehension :-)
Thanks for those thoughts. The first of the two NLP tutorials that you list is, I believe, implemented here. The second one, the language translation example, should be a great one to tackle next.
The translation tutorial depends on the 'spacy' package for language tokenization, and I suspect there isn't something similar for .NET. This speaks to a broader need to specify, design, and prioritize data processing libraries for TorchSharp.
@NiklasGustafsson
FYI: Related discussions => Proposal - .NET tokenization library & Proposal for common .NET tokenization library
The data processing step anticipates different types of tokenizers, and spaCy is only one of them.
TorchText.Data Utils.cs:

```python
de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
```
The tokenizers considered in TorchText.Data Utils.cs are spacy, moses, toktok, revtok, and subword (see the code excerpt below).
I have not seen all of them ported or made available to .NET.
=> Interested to learn from others what possible substitutes there are for the above list of tokenizers without resorting to Python.NET (a minimal sketch of one possible substitute follows below).
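As a starting point, here is a minimal, dependency-free sketch of what a .NET substitute for the simplest case (a basic_english-style tokenizer) could look like. SimpleTokenizers and BasicEnglish are illustrative names, not an existing library API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SimpleTokenizers
{
    // Roughly what torchtext's basic_english tokenizer does:
    // lower-case, pad punctuation with spaces, then split on whitespace.
    public static IReadOnlyList<string> BasicEnglish(string line)
    {
        line = Regex.Replace(line.ToLowerInvariant(), @"([.,!?;:()])", " $1 ");
        return line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}

// Example: SimpleTokenizers.BasicEnglish("Hello, world!") -> ["hello", ",", "world", "!"]
```

The spacy, moses, and sub-word tokenizers are a different matter; they carry trained models or language-specific rules that have no obvious pure-.NET equivalents.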
@zhongkaifu => any feedback?
I know sub-word tokenization is really useful for text generation tasks; an MT task, for example, can gain 2~3 BLEU points on average. Some NN frameworks have integrated sub-word tokenization; Marian, for instance, uses a built-in SentencePiece for its data processing step.
FYI: SentencePiece is being implemented in TorchText.Data => functional.py
if tokenizer == "spacy":
try:
import spacy
try:
spacy = spacy.load(language)
except IOError:
# Model shortcuts no longer work in spaCy 3.0+, try using fullnames
# List is from https://github.com/explosion/spaCy/blob/b903de3fcb56df2f7247e5b6cfa6b66f4ff02b62/spacy/errors.py#L789
OLD_MODEL_SHORTCUTS = spacy.errors.OLD_MODEL_SHORTCUTS if hasattr(spacy.errors, 'OLD_MODEL_SHORTCUTS') else {}
if language not in OLD_MODEL_SHORTCUTS:
raise
import warnings
warnings.warn(f'Spacy model "{language}" could not be loaded, trying "{OLD_MODEL_SHORTCUTS[language]}" instead')
spacy = spacy.load(OLD_MODEL_SHORTCUTS[language])
return partial(_spacy_tokenize, spacy=spacy)
except ImportError:
print("Please install SpaCy. "
"See the docs at https://spacy.io for more information.")
raise
except AttributeError:
print("Please install SpaCy and the SpaCy {} tokenizer. "
"See the docs at https://spacy.io for more "
"information.".format(language))
raise
elif tokenizer == "moses":
try:
from sacremoses import MosesTokenizer
moses_tokenizer = MosesTokenizer()
return moses_tokenizer.tokenize
except ImportError:
print("Please install SacreMoses. "
"See the docs at https://github.com/alvations/sacremoses "
"for more information.")
raise
elif tokenizer == "toktok":
try:
from nltk.tokenize.toktok import ToktokTokenizer
toktok = ToktokTokenizer()
return toktok.tokenize
except ImportError:
print("Please install NLTK. "
"See the docs at https://nltk.org for more information.")
raise
elif tokenizer == 'revtok':
try:
import revtok
return revtok.tokenize
except ImportError:
print("Please install revtok.")
raise
elif tokenizer == 'subword':
try:
import revtok
return partial(revtok.tokenize, decap=True)
except ImportError:
print("Please install revtok.")
raise
raise ValueError("Requested tokenizer {}, valid choices are a "
"callable that takes a single string as input, "
"\"revtok\" for the revtok reversible tokenizer, "
"\"subword\" for the revtok caps-aware tokenizer, "
"\"spacy\" for the SpaCy English tokenizer, or "
"\"moses\" for the NLTK port of the Moses tokenization "
"script.".format(tokenizer))
@GeorgeS2019 -- in light of more recent discussions about tokenization, I'm closing this issue as outdated. If you disagree, please reopen with an explanation of what you think needs to be tracked.
The recent SequenceToSequence.cs example is an excellent implementation of [Tutorial 1]: SEQUENCE-TO-SEQUENCE MODELING WITH NN.TRANSFORMER AND TORCHTEXT.
Together with the SequenceToSequence modeling example, a number of TorchText classes were implemented, organized according to TorchText namespaces.
It would make TorchSharp more complete in terms of TorchText if the following feature and example were implemented:
[Tutorial 6]: Transformers and Multi-Head Attention for TorchText.NN multiheadattention.py
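A rough sketch of how the multi-head attention module from that tutorial could be shaped in TorchSharp-style C# is below: separate query/key/value projections, a scaled dot-product step per head, and an output projection. All class, method, and parameter names here are illustrative assumptions, not an existing TorchSharp or TorchText.NN API, and the TorchSharp calls used (nn.Linear, torch.matmul, reshape/transpose, softmax) may differ in signature between versions.

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

class MultiheadAttentionSketch
{
    private readonly long _numHeads, _headDim;
    private readonly Modules.Linear _qProj, _kProj, _vProj, _outProj;

    public MultiheadAttentionSketch(long embedDim, long numHeads)
    {
        if (embedDim % numHeads != 0)
            throw new ArgumentException("embedDim must be divisible by numHeads");
        _numHeads = numHeads;
        _headDim = embedDim / numHeads;
        _qProj = nn.Linear(embedDim, embedDim);
        _kProj = nn.Linear(embedDim, embedDim);
        _vProj = nn.Linear(embedDim, embedDim);
        _outProj = nn.Linear(embedDim, embedDim);
    }

    // query/key/value: (batch, seq_len, embed_dim)
    public Tensor Forward(Tensor query, Tensor key, Tensor value)
    {
        // Project, then split the embedding dimension into heads:
        // (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
        Tensor Split(Tensor t) =>
            t.reshape(t.shape[0], t.shape[1], _numHeads, _headDim).transpose(1, 2);

        var q = Split(_qProj.forward(query));
        var k = Split(_kProj.forward(key));
        var v = Split(_vProj.forward(value));

        // Scaled dot-product attention, applied per head.
        var scores = torch.matmul(q, k.transpose(2, 3)) / Math.Sqrt(_headDim);
        var weights = torch.nn.functional.softmax(scores, -1);
        var context = torch.matmul(weights, v);

        // Merge the heads back and apply the output projection.
        var merged = context.transpose(1, 2)
                            .reshape(query.shape[0], query.shape[1], _numHeads * _headDim);
        return _outProj.forward(merged);
    }
}
```

This mirrors the structure of torchtext.nn's MultiheadAttentionContainer (in-projections, a ScaledDotProduct step, out-projection), which is what multiheadattention.py implements.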