Closed: XiaoLiuAI closed this issue 3 years ago
Hi @XiaoLiuAI, you're correct that this setup is limiting and not ideal for languages that aren't whitespace-tokenized: this is one of the reasons the HFTransformersNLP component has been deprecated.
We recommend using the LanguageModelFeaturizer component instead; it behaves the same way, and you can specify the desired Hugging Face model with the same syntax HFTransformersNLP used.
You can use any Tokenizer component in the pipeline before LanguageModelFeaturizer, so you should be able to select a Tokenizer that fits your needs and doesn't rely on whitespace tokenization. If you're unsure how to do this, please check the docs here, and feel free to ask a question in the forum if you get stuck (you can tag me there, my username is felicia).
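For example, a Chinese pipeline could pair a non-whitespace tokenizer such as Rasa's JiebaTokenizer with LanguageModelFeaturizer. A minimal sketch (the model_weights value here is just an illustration — substitute whichever Hugging Face model you need):

```yaml
# config.yml — sketch of a pipeline for a language without whitespace
language: zh
pipeline:
  - name: JiebaTokenizer            # word-level tokenization for Chinese
  - name: LanguageModelFeaturizer
    model_name: bert
    model_weights: bert-base-chinese  # assumed model; pick your own
  - name: DIETClassifier
    epochs: 100
```

Because tokenization now happens on word boundaries that exist in the language, entity spans can align with tokens and NER can work as expected.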
Closing this issue; if there are any follow-up questions, please post them on the forum.
Rasa version: 2.2.6
Python version: 3.7
Operating system (windows, osx, ...): osx, ubuntu
Issue: Currently, Rasa's hf_transformer implementation first applies whitespace tokenization and then uses the transformers tokenizer to produce subtokens. This fails for languages like Chinese, which do not use whitespace. For intent classification this does not matter, but for NER, entities are aligned to tokens (split by whitespace) rather than subtokens, so almost no entities are extracted because the entity spans rarely align with token boundaries.
I think this is a serious, fundamental algorithm design problem.
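To make the misalignment concrete, here is a small sketch (not Rasa's actual code) showing that whitespace tokenization of a Chinese sentence yields a single token, so a character-level entity annotation can never line up with a token boundary:

```python
# Sketch: why whitespace pre-tokenization breaks NER for Chinese.
# "我想去北京" ("I want to go to Beijing") contains no spaces, so a
# whitespace tokenizer returns the whole sentence as one token.
sentence = "我想去北京"
tokens = sentence.split()  # whitespace tokenization
print(tokens)              # a single token spanning the whole sentence

# Hypothetical entity annotation: "北京" (Beijing) at character span (3, 5).
entity_start, entity_end = 3, 5

# Compute the character span of each whitespace token.
token_spans = []
pos = 0
for tok in tokens:
    start = sentence.index(tok, pos)
    token_spans.append((start, start + len(tok)))
    pos = start + len(tok)

# An entity can only be extracted if its span matches token boundaries,
# which never happens here.
aligned = (entity_start, entity_end) in token_spans
print(aligned)
```

Running this prints `['我想去北京']` and then `False`: the annotated entity cannot be mapped onto any token, which is why NER silently extracts nothing.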