RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

tokenization failed in rasa 2.0 #7623

Closed · XiaoLiuAI closed this issue 3 years ago

XiaoLiuAI commented 3 years ago

Rasa version: 2.2.6

Rasa SDK version (if used & relevant):

Rasa X version (if used & relevant):

Python version: 3.7

Operating system (windows, osx, ...): osx, ubuntu

Issue: Currently, Rasa's hf_transformer implementation uses a whitespace tokenizer to produce tokens and then uses the transformers tokenizer to produce subtokens. This fails for languages like Chinese, which do not use whitespace. For intent classification this does not matter, but for NER, entities are anchored to tokens (split on whitespace) rather than subtokens, so almost no entities are extracted because the entity spans never align with the tokens.

I think this is a serious, fundamental flaw in the algorithm design.
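
To make the misalignment concrete, here is a minimal sketch (not Rasa's actual code; it assumes only the transformers library and the bert-base-chinese weights from the config below):

from transformers import BertTokenizer

text = "我想去北京"  # "I want to go to Beijing"; the entity is 北京 (Beijing)

# What the whitespace pre-tokenization step sees: a single token covering
# the whole sentence, because Chinese text contains no spaces.
print(text.split())  # ['我想去北京']

# What the transformers tokenizer produces: one subtoken per character.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
print(tokenizer.tokenize(text))  # ['我', '想', '去', '北', '京']

# The entity 北京 covers characters 3-4, but the only whitespace token
# covers characters 0-4, so the entity annotation never lines up with a
# token and the extractor silently returns nothing.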

Error (including full traceback):

Command or request that led to error:

Content of configuration file (config.yml) (if relevant):

language: zh
pipeline:
  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "bert"
    # Pre-Trained weights to be loaded
    # model_weights: "hfl/chinese-roberta-wwm-ext"
    model_weights: "bert-base-chinese"

    # An optional path to a specific directory to download and cache the pre-trained model weights.
    # The `default` cache_dir is the same as https://huggingface.co/transformers/serialization.html#cache-directory .
    cache_dir: null
  - name: LanguageModelTokenizer
    intent_tokenization_flag: true
    intent_split_symbol: "-"
  - name: LanguageModelFeaturizer

Content of domain file (domain.yml) (if relevant):

sara-tagger commented 3 years ago

Thanks for the issue, @rgstephens will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

koernerfelicia commented 3 years ago

Hi @XiaoLiuAI, you're correct that this setup is limiting and not ideal for languages that aren't whitespace-tokenized; this is one of the reasons the HFTransformersNLP component has been deprecated. We recommend you use the LanguageModelFeaturizer component instead. It behaves the same way, and you can specify the desired Hugging Face model with the same syntax HFTransformersNLP used. You can use any Tokenizer component in the pipeline before LanguageModelFeaturizer, so you should be able to select a tokenizer that fits your needs and doesn't do whitespace tokenization (see the sketch below). If you're unsure how to do this, please check the docs here, and feel free to ask a question in the forum if you get stuck (you can tag me there; my username is felicia).
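
For a concrete starting point, a minimal sketch of such a pipeline for Chinese might look like this (assuming Rasa's built-in JiebaTokenizer fits your needs; any non-whitespace tokenizer can be swapped in):

language: zh
pipeline:
  - name: JiebaTokenizer            # word-level Chinese tokenization, no whitespace assumption
  - name: LanguageModelFeaturizer
    model_name: "bert"
    model_weights: "bert-base-chinese"
    cache_dir: null
  - name: DIETClassifier            # intent classification + entity extraction
    epochs: 100

Because JiebaTokenizer produces real word tokens, entity annotations can align with them, which avoids the whitespace problem described above.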

rgstephens commented 3 years ago

Closing this issue, if there are any follow-up questions, please post on the forum.