I'm wondering what a good approach here might be. The BytePair feature that we offer also does tokenization, but only internally to the featurizer. That way we keep the original tokens intact, which is a requirement for our entity detection stack. For example, "My name is Vincent" might get tokenized into [My, name, is, Vin, cent]. As far as entity detection goes though, we want to return Vincent, not [Vin, cent].
If we use this technique as a tokenizer we might see a benefit for intents, but it would break entities as well as some lexical features later on. Instead, it might be worth an experiment to see if we can use these tokens in a featurizer internally instead. But since the feature might become heavy, it would be nice to get some confirmation that this idea has merit, i.e. that it improves a pipeline in a way that the other components can't. Have you done any work on this?
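To make the subword issue concrete, here is a minimal sketch using the bpemb package (the vocabulary size is an arbitrary choice, and the pieces shown in the comment are only what I'd expect, not guaranteed output):

from bpemb import BPEmb

# load the pretrained English byte-pair model with a 10k vocabulary
bpemb_en = BPEmb(lang="en", vs=10000, dim=100)

# rare words such as names tend to get split into pieces,
# e.g. something like ['▁my', '▁name', '▁is', '▁vin', 'cent']
print(bpemb_en.encode("My name is Vincent"))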
After talking with a colleague about this we wondered: have you ever worked with the ConveRT tokeniser? For English it already tokenizes into subtokens, and it can still be used by countvectors/DIET to generate internal representations.
I am actually working on Indian languages and the lookup table doesn't seem to work. I have tried both the WhiteSpace and Stanza tokenisers, which is why I wanted a custom pretrained tokeniser built on my own data. I didn't find a way to train a polyai model (ConveRT), and there is no mention of other languages.
Which Indian language specifically? I'm also looking at this library.
I am working on Hindi right now and will expand to Tamil, Telugu and Kannada.
Have you tried any non-SentencePiece tokenizers for those languages? I've googled and found a few, but since I don't speak the languages I can't judge their quality. Have you seen this package or this one?
These are trivial tokenizers that do word and sentence tokenization. They won't be much different from whitespace tokenization; the purna viram and deerga viram are the main differences from English, but those are only used for sentence boundaries.
I'm open to the sentencepiece tokenizer as an experimental feature but we will need to keep in mind that the scope is just to generate these tokens for the intents for now. I fear that it is going to be very tricky to get this to work for entities but I'm interested in the experiment.
I've got no experience with SentencePiece so just to check. @anuragshas are these models available pretrained as well? We might need to think about a general corpus for different languages.
I have tried something similar to ConveRT. It improved the entity F1 by 7 points, but the entity predicted was a subword and not the exact word, as you had said earlier, even though there is the code for alignment: train_utils.align_tokens(split_token_strings, token_end, token_start). With the WhitespaceTokenizer tokens it doesn't seem to work.
import os
from typing import Any, Dict, List, Text

import sentencepiece as spm

from rasa.nlu.tokenizers.tokenizer import Token
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.nlu.training_data import Message
import rasa.utils.train_utils as train_utils


class SentencePieceTokenizer(WhitespaceTokenizer):

    defaults = {
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Text will be tokenized with case sensitive as default
        "case_sensitive": True,
        # Specifies the path to a custom SentencePiece model file
        "model_file": None,
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        """Construct a new tokenizer using the SentencePiece framework."""
        super().__init__(component_config)

        model_file = self.component_config["model_file"]
        if model_file:
            if not os.path.exists(model_file):
                raise FileNotFoundError(
                    f"SentencePiece model {model_file} not found. Please check config."
                )
            self.model = spm.SentencePieceProcessor(model_file=model_file)

    def _tokenize(self, sentence: Text) -> Any:
        return self.model.encode(sentence, out_type=str)

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        """Tokenize the text using the SentencePiece model.

        SentencePiece adds a special char in front of (some) words and splits
        words into sub-words. To ensure the entity start and end values match
        the token values, tokenize the text first using the whitespace
        tokenizer. If individual tokens are split up into multiple tokens, add
        this information to the respective tokens.
        """
        # perform whitespace tokenization first
        tokens_in = super().tokenize(message, attribute)

        tokens_out = []
        for token in tokens_in:
            token_start, token_end, token_text = token.start, token.end, token.text
            # use the SentencePiece model to split the token into sub-words
            split_token_strings = self._tokenize(token_text)
            # clean tokens (remove the special char and empty tokens)
            split_token_strings = self._clean_tokens(split_token_strings)
            # align the sub-tokens with the original character offsets
            tokens_out += train_utils.align_tokens(
                split_token_strings, token_end, token_start
            )

        return tokens_out

    @staticmethod
    def _clean_tokens(tokens: List[Text]) -> List[Text]:
        """Remove the special "▁" char added by SentencePiece and drop empty tokens."""
        tokens = [string.replace("▁", "") for string in tokens]
        return [string for string in tokens if string]
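For reference, a rough usage sketch of the component above in isolation (Rasa 1.x-style imports as in the code; the model path is the one from my config below and the Hindi example sentence is made up):

from rasa.nlu.training_data import Message

tokenizer = SentencePieceTokenizer(
    {"model_file": "w2v_models/hi.xyz.bpe.vs10000.model"}
)

# raw SentencePiece pieces for a single whitespace token
print(tokenizer._tokenize("विन्सेंट"))

# full tokenization of a message; start/end should still be character offsets
# into the original text
msg = Message("मेरा नाम विन्सेंट है")
print([(t.text, t.start, t.end) for t in tokenizer.tokenize(msg, attribute="text")])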
BPEmb for English has pretrained SentencePiece models trained on Wikipedia with different vocabulary sizes. The 10000-vocab model would be a good one to test with.
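For example, the *.model file that BPEmb ships is a regular SentencePiece model and should be loadable directly (the file name below follows BPEmb's naming scheme and assumes you have downloaded it; BPEmb models are trained on lowercased text):

import sentencepiece as spm

# load the pretrained BPEmb English model and tokenize a lowercased sentence
sp = spm.SentencePieceProcessor(model_file="en.wiki.bpe.vs10000.model")
print(sp.encode("my name is vincent", out_type=str))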
Interesting!
I have tried something similar to ConveRT, it improved the entity F1 by 7 points but the entity predicted was a subword and not the exact word as you had said earlier, even though there is the code for alignment
Could you share some details on your config.yml file? Was it just using countvectors from the tokens and DIET? Also, what dataset was used?
BPEmb for English has the pretrained SentencePiece model on wikipedia with different vocab capacity. 10000 vocab model would be good to test with.
I didn't know, but I just checked and it indeed seems to depend on the same library.
Also a quick question about the model_file that you're using here: is that the same model file as from BPEmb, or are you training your own?
Here is my config.yml file:
pipeline:
  - name: rasa_nlu_examples.tokenizers.SentencePieceTokenizer
    lang: "hi"
    model_file: "w2v_models/hi.xyz.bpe.vs10000.model"
  - name: RegexFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 15
  - name: rasa_nlu_examples.featurizers.dense.BytePairFeaturizer
    lang: hi
    vs: 10000
    dim: 100
    model_file: "w2v_models/hi.xyz.bpe.vs10000.model"
    emb_file: "w2v_models/hi.xyz.bpe.vs10000.d100.w2v.bin"
  - name: DIETClassifier
    epochs: 200
  - name: EntitySynonymMapper

language: "hi"

policies:
  - name: TEDPolicy
    epochs: 1
    max_history: 3
    batch_size:
      - 32
      - 64
  - name: MappingPolicy
  - name: AugmentedMemoizationPolicy
  - name: TwoStageFallbackPolicy
    nlu_threshold: 0.3
    core_threshold: 0.3
    fallback_core_action_name: "action_default_fallback"
    fallback_nlu_action_name: "action_default_fallback"
    deny_suggestion_intent_name: "out_of_scope"
Also, what dataset was used?
The dataset is text translated into Hindi in the pharma domain. I am not allowed to share it publicly.
I didn't know, but I just checked and it indeed seems to depend on the same library
It depends only on the sentencepiece library, which is also a dependency of BPEmb. The *.model and *.vocab files are the SentencePiece model files; the *.bin file is a gensim KeyedVectors file which contains the GloVe-trained vectors in word2vec format.
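If you want to inspect that *.bin file, here is a quick sketch with gensim (assuming word2vec binary format, which is how BPEmb itself loads its embedding files):

from gensim.models import KeyedVectors

# load the subword embeddings referenced in the config above
kv = KeyedVectors.load_word2vec_format(
    "w2v_models/hi.xyz.bpe.vs10000.d100.w2v.bin", binary=True
)
print(kv.vector_size)  # should be 100 for the d100 file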
Also a quick question about the model_file that you're using here: is that the same model file as from BPEmb, or are you training your own?
I had trained my own model using something similar to this.
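Not the exact recipe I used, but training a SentencePiece model with the standard sentencepiece API looks roughly like this (corpus file, vocab size and model type are placeholder choices):

import sentencepiece as spm

# train a BPE model on a plain-text corpus with one sentence per line
spm.SentencePieceTrainer.train(
    input="hindi_corpus.txt",
    model_prefix="hi_bpe_vs10000",
    vocab_size=10000,
    model_type="bpe",
)

# the resulting hi_bpe_vs10000.model file can then be passed to the tokenizer
# above via its "model_file" option
sp = spm.SentencePieceProcessor(model_file="hi_bpe_vs10000.model")
print(sp.encode("मेरा नाम विन्सेंट है", out_type=str))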
Again, interesting! Is the dataset that you trained on more general than just your Rasa corpus? Also, did it work better than the standard BytePair embeddings?
You also mentioned a 7 point increase. Can you share anything about the size of your dataset? How many intents/entities/examples? Anything you can share about the domain?
One thing I wonder: could you use the WhiteSpaceTokenizer with your current rasa_nlu_examples.featurizers.dense.BytePairFeaturizer models? The intent performance might remain the same; looking at the current implementation, we just take the full text, not the tokens separately, to detect the intent.
It might be good to check. If the performance doesn't change too much, then we might focus on writing a tool that makes it easier to train your own byte-pair embeddings for Rasa.
Just a heads up. I can't make any promises on when it will be done. But I am now working on this.
This feature has been taken care of, at least partially, by our language model featurizer.
SentencePiece is generally used to create byte pairs in any language, and as far as I can tell there is no built-in support for this kind of tokenisation in Rasa. This library does use BPEmb, but only for pretrained embeddings, not for tokenisation. Since the Whitespace tokeniser doesn't always perform well, I would like to have support for it. I am willing to do a PR for this, but I don't know about the contribution steps here.