Hi @sara-tagger @degiz any progress?
Hi @sara-tagger @degiz How is this bug coming along? Any progress?
Hi @gongshaojie12 what do your regex patterns for RegexFeaturizer look like?
@koaning any thoughts on this?
I am unfamiliar with JiebaTokenizer, so it's hard for me to guess if there's an issue with mixing English and Chinese there. It might also be related to the BERT model you've added. Can you confirm if the issue goes away if you remove that component?
@gongshaojie12 this is just something to try; we've recently added support for spaCy 3.0 which also supports Chinese models. Can you confirm if the issue persists with the spaCy tokenizers for Chinese?
@gongshaojie12 locally in my notebook I can confirm that spaCy seems to have a reasonable way of splitting up the tokens. I don't speak Chinese, however, so feel free to correct me.
import spacy
text = "如何才能在下载和安装google app"
nlp = spacy.blank("zh")
for t in nlp(text):
    print(t, t.idx)
This is the output:
如何 0
才能 2
在 4
下载 5
和 7
安装 8
google 10
app 17
Hi @koaning Thank you for your reply. I rewrote the JiebaTokenizer class to solve this problem. Thanks!
Could you explain what you've changed? If you have any lessons to share we might be able to think about improvements to our components for any other users.
Hi @koaning The tokenize method in JiebaTokenizer does not remove the spaces after word segmentation is completed, while subsequent components remove spaces when extracting text features, which causes an inconsistency. So I rewrote the tokenize method to remove the spaces after word segmentation.
Based on my investigation using the provided examples, the problem is exactly what @gongshaojie12 found. I'll explain it here with an example:
JiebaTokenizer is meant for Chinese-only text. When multiple languages are used in the same sentence, the tokenizer adds an extra whitespace token between the Chinese and English tokens. For example, text: 如何才能在下载和安装google app
Tokens output by the tokenizer will be: ['如何', '才能', '在', '下载', '和', '安装', ' ', 'google', 'app']
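For reference, this is easy to reproduce with jieba directly (a minimal sketch, assuming jieba is installed; the offsets follow from the example sentence above):

import jieba

text = "如何才能在下载和安装google app"
# jieba.tokenize yields (word, start, end) tuples over the raw text;
# note the whitespace-only token between 'google' and 'app'.
for word, start, end in jieba.tokenize(text):
    print(repr(word), start, end)
# ...
# 'google' 10 16
# ' ' 16 17
# 'app' 17 20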
The extra whitespace added in between is the culprit. What @gongshaojie12 tried as a solution will work very well if you have no entities in the NLU pipeline. However, if you have entities and you remove the whitespace as a post-processing task after the tokenizer, the entity alignment will be messed up. This is because the start and end spans of entity annotations are recorded when the data is loaded, before the tokenizer processes the messages. @gongshaojie12 I would caution you about this potential pitfall. If you don't have any entities, you can definitely use your solution.
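To make the pitfall concrete, here is a toy illustration (my own example, not Rasa's actual alignment code): the entity span is recorded against the raw text, so once the whitespace token is dropped, the character at index 16 belongs to no token and a naive coverage check comes up short.

# Entity spans are recorded on the raw text before tokenization, so a
# dropped whitespace token leaves a character that no token covers.
tokens = [("如何", 0), ("才能", 2), ("在", 4), ("下载", 5), ("和", 7),
          ("安装", 8), ("google", 10), ("app", 17)]  # ' ' token removed
entity = {"start": 10, "end": 20}  # "google app" annotated on the raw text

covered = {i for word, start in tokens for i in range(start, start + len(word))}
print(sorted(set(range(entity["start"], entity["end"])) - covered))  # [16]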
I haven't been able to try the spaCy tokenizer on this, though, because downloading the spaCy model took a long time and then errored out. @koaning If you already have the zh_core_web_md model downloaded locally, could you check what tokens are created by the tokenizer?
I see two follow up issues:
We need to re-evaluate our tokenization approach for non-whitespace-splittable languages. Currently our recommendation is to use JiebaTokenizer for tokenization and LanguageModelFeaturizer with a Chinese model loaded. There, too, the entity misalignment problem can happen because of the mismatch between the expected tokens and the tokens created by Jieba. Moreover, it's not clear what our recommendation is for multilingual sentences / sentences with code-switching. We need to explore and evaluate our options better for this setting.
The entity misalignment issue happens because the entity spans are recorded before the tokens are created in the pipeline. This is wrong. As is evident above, tokenization for non-whitespace-tokenizable languages is slightly non-deterministic and depends on which tokenizer is used under the hood (e.g. jieba, a BERT-based tokenizer, etc.). Hence, the cleaner way to support such languages for entity recognition would be to record the entity annotation spans after the tokens have been created.
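As a rough sketch of what recording spans against the created tokens could look like (the Token tuple below is a stand-in for illustration, not Rasa's class):

from collections import namedtuple

Token = namedtuple("Token", ["text", "start", "end"])

def char_span_to_token_span(start, end, tokens):
    """Map a character-level annotation onto token indices after
    tokenization; return None when the span does not line up with
    token boundaries, instead of silently shifting it."""
    idx_start = next((i for i, t in enumerate(tokens) if t.start == start), None)
    idx_end = next((i for i, t in enumerate(tokens) if t.end == end), None)
    if idx_start is None or idx_end is None:
        return None
    return idx_start, idx_end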
@TyDunn Here are the two follow up issues:
https://github.com/RasaHQ/rasa/issues/8722 https://github.com/RasaHQ/rasa/issues/8723
Could you please create an ice box item and place them in there?
@dakshvar22 the medium model does the same as the tokenizer that I tried earlier. That's to my knowledge also how spaCy is designed, the tokenizer is the same across the blank/sm/md/lg/trf models. Interesting observation: spaCy depends on Jieba here, but it seems to add extra behavior.
import spacy
nlp = spacy.load('zh_core_web_md')
# Building prefix dict from the default dictionary ...
# Dumping model to file cache /tmp/jieba.cache
# Loading model cost 0.499 seconds.
# Prefix dict has been built successfully.
[t for t in nlp("如何才能在下载和安装google app")]
# [如何, 才能, 在, 下载, 和, 安装, google, app]
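Out of curiosity I also checked where the space goes (my own observation, so take it with a grain of salt): spaCy appears to attach inter-token whitespace to the preceding token's whitespace_ attribute instead of emitting a whitespace token, which would explain why no ' ' token shows up.

import spacy

nlp = spacy.blank("zh")  # same tokenizer behavior as the packaged models
doc = nlp("如何才能在下载和安装google app")
print([(t.text, t.whitespace_) for t in doc])
# the space should show up as ('google', ' ') rather than as its own token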
@koaning That's why I had recommended loading zh_core_web_md inside our JiebaTokenizer and seeing what tokens are produced. But if you are confident they are using Jieba underneath, the results would probably be the same except for the extra whitespace.
I would assume spaCy is filtering out the whitespace tokens produced, which is logical to do, but we can't do it because of the entity misalignment problem I described above.
@TyDunn Created the product ice box idea here. Let me know if the issue can be closed now according to the definition of done.
@dakshvar22 you are too quick! This was on my list of things to do today, but now I don't have to, thanks :)
Hello, I have the same issue here. Could you share more on how you rewrote JiebaTokenizer to solve it? Thanks
Hi @ljcljc, as follows:

import logging
from typing import List, Text

from rasa.nlu.tokenizers.jieba_tokenizer import JiebaTokenizer
from rasa.nlu.tokenizers.tokenizer import Token
from rasa.shared.nlu.training_data.message import Message

logger = logging.getLogger(__name__)


class JiebaTokenizerCustom(JiebaTokenizer):
    """A JiebaTokenizer that drops whitespace-only tokens after segmentation."""

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        import jieba

        text = message.get(attribute)
        # jieba.tokenize yields (word, start, end) over the raw text;
        # skip whitespace-only words so no ' ' token lands between
        # Chinese and English segments.
        tokens = [
            Token(word, start)
            for (word, start, end) in jieba.tokenize(text)
            if word.strip()
        ]
        return self._apply_token_pattern(tokens)
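If it helps, here's a quick sanity check of the class above on the offending sentence (a sketch against Rasa 2.x; the Message construction details may differ slightly between versions):

from rasa.shared.nlu.constants import TEXT
from rasa.shared.nlu.training_data.message import Message

tokenizer = JiebaTokenizerCustom(component_config={})
message = Message(data={TEXT: "如何才能在下载和安装google app"})
tokens = tokenizer.tokenize(message, TEXT)
print([(t.text, t.start) for t in tokens])
# expected: no whitespace token between 'google' and 'app'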
Thanks a lot for the code; it works for me now. One more question about this: I saw the reply by @dakshvar22 about the potential pitfall for entities. Do you have any concerns about that? I want to use this in production and would like to know about the possible issues it might have.
Hi @ljcljc Since I don't currently need slot filling, I didn't consider entity recognition. If you do research into entities, please share your thoughts, thank you!
I haven't done deep research into entity extraction yet, but I want to use this in my project and will let you know if I find any issues after deployment. Thanks for your work.
Rasa version: 2.2.3
Python version: 3.6.12
Operating system (windows, osx, ...): Windows and Linux
Issue: When the Chinese training data contains English words and spaces, the DIETClassifier cannot be used for training.
The DIETClassifier training error is caused by the space between google and app in the sentence "如何才能在下载和安装google app". If the space is removed, DIETClassifier trains normally. How can I solve this problem? Please help me, thanks!
Definition of done: within 2 working days: