Hi @sara-tagger @degiz any progress?
Hi @sara-tagger @degiz How is this bug coming along? Any progress?
Hi @gongshaojie12 what do your regex patterns for RegexFeaturizer look like?
@koaning any thoughts on this?
I am unfamiliar with JiebaTokenizer, so it's hard for me to guess if there's an issue with mixing English and Chinese there. It might also be related to the BERT model you've added. Can you confirm if the issue goes away if you remove that component?
@gongshaojie12 this is just something to try; we've recently added support for spaCy 3.0 which also supports Chinese models. Can you confirm if the issue persists with the spaCy tokenizers for Chinese?
@gongshaojie12 locally in my notebook I can confirm that spaCy seems to have a reasonable way of splitting up the tokens. I don't speak Chinese, however, so feel free to correct me.
import spacy
text = "如何才能在下载和安装google app"
nlp = spacy.blank("zh")
for t in nlp(text):
    print(t, t.idx)
This is the output:
如何 0
才能 2
在 4
下载 5
和 7
安装 8
google 10
app 17
Hi @koaning Thank you for your reply. I rewrote the JiebaTokenizer class to solve this problem. Thanks!
Could you explain what you've changed? If you have any lessons to share we might be able to think about improvements to our components for any other users.
Hi @koaning The tokenize method in JiebaTokenizer does not remove the spaces after word segmentation is completed, while subsequent components remove spaces when extracting text features, which causes an inconsistency. So I rewrote the tokenize method to remove the spaces after word segmentation.
Based on my investigation using the provided examples, the problem is exactly what @gongshaojie12 found. I'll explain it here with an example:
JiebaTokenizer is meant for Chinese-only text. When multiple languages are used in the same sentence, the tokenizer adds an extra whitespace token between the Chinese and English tokens. For example, text: 如何才能在下载和安装google app
Tokens output by the tokenizer will be: ['如何', '才能', '在', '下载', '和', '安装', ' ', 'google', 'app']
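For reference, this is easy to reproduce with jieba directly (a minimal sketch, assuming jieba is installed; the offsets follow from the example sentence above):

import jieba

text = "如何才能在下载和安装google app"
# jieba.tokenize yields (word, start, end) tuples over the raw text;
# note the whitespace-only token between 'google' and 'app'.
for word, start, end in jieba.tokenize(text):
    print(repr(word), start, end)
# ...
# 'google' 10 16
# ' ' 16 17
# 'app' 17 20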
The extra whitespace added in between is the culprit. What @gongshaojie12 tried as a solution will work very well if you have no entities in the NLU pipeline. However, if you have entities and you remove the whitespace as a post-processing task after the tokenizer, the entity alignment will be messed up. This is because the start and end spans of entity annotations are recorded when the data is loaded, before the tokenizer processes the messages. @gongshaojie12 I would caution you about this potential pitfall. If you don't have any entities, you can definitely use your solution.
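To make the pitfall concrete, here is a toy illustration (my own example, not Rasa's actual alignment code): the entity span is recorded against the raw text, so once the whitespace token is dropped, the character at index 16 belongs to no token and a naive coverage check comes up short.

# Entity spans are recorded on the raw text before tokenization, so a
# dropped whitespace token leaves a character that no token covers.
tokens = [("如何", 0), ("才能", 2), ("在", 4), ("下载", 5), ("和", 7),
          ("安装", 8), ("google", 10), ("app", 17)]  # ' ' token removed
entity = {"start": 10, "end": 20}  # "google app" annotated on the raw text

covered = {i for word, start in tokens for i in range(start, start + len(word))}
print(sorted(set(range(entity["start"], entity["end"])) - covered))  # [16]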
I haven't been able to try the spaCy tokenizer on this, though, because downloading the spaCy model took a long time and then errored out. @koaning If you already have the zh_core_web_md model downloaded locally, could you check what tokens are created by the tokenizer?
I see two follow up issues:
We need to re-evaluate our tokenization approach for non-whitespace-splittable languages. Currently our recommendation is to use JiebaTokenizer for tokenization and LanguageModelFeaturizer with a Chinese model loaded. There, too, the entity misalignment problem can happen because of the mismatch between the expected tokens and the tokens created by Jieba. Moreover, it's not clear what our recommendation is for multilingual sentences / sentences with code-switching. We need to explore and evaluate our options better for this setting.
The entity misalignment issue happens because the entity spans are recorded before the tokens are created in the pipeline. This is wrong. As is evident above, tokenization for non-whitespace-tokenizable languages is slightly non-deterministic and depends on which tokenizer is used under the hood (e.g. jieba, a BERT-based tokenizer, etc.). Hence, the cleaner way to support such languages for entity recognition would be to record the entity annotation spans after the tokens have been created.
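As a rough sketch of what recording spans against the created tokens could look like (the Token tuple below is a stand-in for illustration, not Rasa's class):

from collections import namedtuple

Token = namedtuple("Token", ["text", "start", "end"])

def char_span_to_token_span(start, end, tokens):
    """Map a character-level annotation onto token indices after
    tokenization; return None when the span does not line up with
    token boundaries, instead of silently shifting it."""
    idx_start = next((i for i, t in enumerate(tokens) if t.start == start), None)
    idx_end = next((i for i, t in enumerate(tokens) if t.end == end), None)
    if idx_start is None or idx_end is None:
        return None
    return idx_start, idx_end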
@TyDunn Here are the two follow up issues:
https://github.com/RasaHQ/rasa/issues/8722 https://github.com/RasaHQ/rasa/issues/8723
Could you please create an ice box item and place them in there?
@dakshvar22 the medium model does the same as the tokenizer that I tried earlier. That's to my knowledge also how spaCy is designed, the tokenizer is the same across the blank/sm/md/lg/trf models. Interesting observation: spaCy depends on Jieba here, but it seems to add extra behavior.
import spacy
nlp = spacy.load('zh_core_web_md')
# Building prefix dict from the default dictionary ...
# Dumping model to file cache /tmp/jieba.cache
# Loading model cost 0.499 seconds.
# Prefix dict has been built successfully.
[t for t in nlp("如何才能在下载和安装google app")]
# [如何, 才能, 在, 下载, 和, 安装, google, app]
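Out of curiosity I also checked where the space goes (my own observation, so take it with a grain of salt): spaCy appears to attach inter-token whitespace to the preceding token's whitespace_ attribute instead of emitting a whitespace token, which would explain why no ' ' token shows up.

import spacy

nlp = spacy.blank("zh")  # same tokenizer behavior as the packaged models
doc = nlp("如何才能在下载和安装google app")
print([(t.text, t.whitespace_) for t in doc])
# the space should show up as ('google', ' ') rather than as its own token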
@koaning That's why I had recommended loading zh_core_web_md inside our JiebaTokenizer and seeing what tokens are produced. But if you are confident they are using Jieba underneath, the results would probably be the same except for the extra whitespace.
I would assume spaCy is filtering out the whitespace tokens produced, which is logical to do, but we can't do it because of the entity misalignment problem I described above.
@TyDunn Created the product ice box idea here. Let me know if the issue can be closed now according to the definition of done.
@dakshvar22 you are too quick! This was on my list of things to do today, but now I don't have to, thanks :)
Hello, I have the same issue here. Could you share more on how you rewrote JiebaTokenizer to solve it? Thanks
Hi @ljcljc, as follows:

import logging
from typing import List, Text

from rasa.nlu.tokenizers.jieba_tokenizer import JiebaTokenizer
from rasa.nlu.tokenizers.tokenizer import Token
from rasa.shared.nlu.training_data.message import Message

logger = logging.getLogger(__name__)


class JiebaTokenizerCustom(JiebaTokenizer):
    """A JiebaTokenizer that drops whitespace-only tokens after segmentation."""

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        import jieba

        text = message.get(attribute)
        # jieba.tokenize yields (word, start, end) over the raw text;
        # skip whitespace-only words so no ' ' token lands between
        # Chinese and English segments.
        tokens = [
            Token(word, start)
            for (word, start, end) in jieba.tokenize(text)
            if word.strip()
        ]
        return self._apply_token_pattern(tokens)
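If it helps, here's a quick sanity check of the class above on the offending sentence (a sketch against Rasa 2.x; the Message construction details may differ slightly between versions):

from rasa.shared.nlu.constants import TEXT
from rasa.shared.nlu.training_data.message import Message

tokenizer = JiebaTokenizerCustom(component_config={})
message = Message(data={TEXT: "如何才能在下载和安装google app"})
tokens = tokenizer.tokenize(message, TEXT)
print([(t.text, t.start) for t in tokens])
# expected: no whitespace token between 'google' and 'app'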
Thanks a lot for the code; it works for me now. One more question about this: I saw the reply by @dakshvar22 about the potential pitfall for entities. Do you have any concerns about that? I want to use this in production and would like to know about the possible issues it might have.
Hi @ljcljc Since I don't currently need slot filling, I didn't consider entity recognition. If you do research into entities, please share your thoughts, thank you!
I haven't done deep research into entity extraction yet, but I want to use this in my project and will let you know if I find any issues after deployment. Thanks for your work.
Rasa version: 2.2.3
Python version: 3.6.12
Operating system (windows, osx, ...): Windows and Linux
Issue: When the Chinese training data contains English words and spaces, the DIETClassifier cannot be used for training.
The DIETClassifier training error is caused by the space between google and app in the sentence "如何才能在下载和安装google app". If the space is removed, DIETClassifier trains normally. How can I solve this problem? Please help me, thanks!
Definition of done: within 2 working days: