crownpku / Rasa_NLU_Chi

Turn Chinese natural language into structured data (Chinese natural language understanding)
Apache License 2.0
1.51k stars 422 forks

entities must span whole tokens. Wrong entity end. #122

Open xyiiinexg3 opened 2 years ago

xyiiinexg3 commented 2 years ago

Symptom

After running this command:

rasa train -c config/config.yml --data data/training_dataset_1660793545.json data/stories.md --out models/movie --domain config/domain.yml --num-threads 5 --augmentation 100 -vv

warnings like the following appear:

C:\Users\26282\miniconda3\envs\rasa2formovieQA\lib\site-packages\rasa\shared\utils\io.py:93: UserWarning: Failed to use example '郭富城表演过哪些喜剧电影' to train MITIE entity extractor. Example will be skipped.Error: Invalid entity {'end': 10, 'entity': 'genre', 'start': 8, 'value': '喜剧'} in example '郭富城表演过哪些喜剧电影': entities must span whole tokens. Wrong entity end.

As a result, when the trained model is run later, it cannot recognize genre entities (喜剧 "comedy", 动画 "animation", and so on).
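The message "entities must span whole tokens. Wrong entity end." suggests that the annotated end offset (10) does not coincide with the end of any token produced by the tokenizer. The following is a minimal sketch of that check, not Rasa's or MITIE's actual implementation. Both token lists are hypothetical segmentations written by hand for illustration; the real jieba output depends on the dictionaries in use and has not been verified here. The helper names `token_spans` and `check_entity` are likewise made up for this sketch.

```python
from typing import List, Tuple


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for consecutive tokens."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def check_entity(tokens: List[str], start: int, end: int) -> str:
    """Report whether an entity span lines up with token boundaries."""
    spans = token_spans(tokens)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    if start not in starts:
        return "wrong entity start"
    if end not in ends:
        return "wrong entity end"
    return "ok"


text = "郭富城表演过哪些喜剧电影"

# ASSUMPTION: hypothetical segmentation in which "喜剧电影" is a single token.
merged = ["郭富城", "表演", "过", "哪些", "喜剧电影"]
# ASSUMPTION: hypothetical segmentation in which "喜剧" and "电影" are separate.
split = ["郭富城", "表演", "过", "哪些", "喜剧", "电影"]

assert "".join(merged) == text and "".join(split) == text

# Entity from the warning: start=8, end=10, value="喜剧".
print(check_entity(merged, 8, 10))  # wrong entity end
print(check_entity(split, 8, 10))   # ok
```

If the tokenizer really does keep "喜剧电影" together, the entity "喜剧" ends in the middle of a token, which would match the warning. Printing the tokens jieba actually produces for the failing sentences would confirm or rule this out.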

Training data

{"text":"方中信表演动画电影有哪些","intent":"search_person_genre_movie","entities":[{"end":3,"entity":"person","start":0,"value":"方中信"},{"end":7,"entity":"genre","start":5,"value":"动画"}]}
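Before suspecting the tokenizer, it can help to rule out plain offset mistakes in the annotations. The sketch below checks that each entity's `value` equals `text[start:end]` for the training example quoted above. The function name `misaligned_entities` is hypothetical and is not part of Rasa.

```python
import json


def misaligned_entities(example: dict) -> list:
    """Return entities whose value differs from text[start:end]."""
    text = example["text"]
    return [
        ent
        for ent in example.get("entities", [])
        if text[ent["start"]:ent["end"]] != ent["value"]
    ]


# Training example copied from the issue.
example = json.loads(
    '{"text":"方中信表演动画电影有哪些",'
    '"intent":"search_person_genre_movie",'
    '"entities":['
    '{"end":3,"entity":"person","start":0,"value":"方中信"},'
    '{"end":7,"entity":"genre","start":5,"value":"动画"}]}'
)

print(misaligned_entities(example))  # []
```

The empty result means the character offsets in this example are consistent with the annotated values, so a rejection of this example would come from how the text is split into tokens rather than from miscounted offsets.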

config.yml (a user dictionary is configured for jieba tokenization), pipeline:

[image of the pipeline configuration; not preserved]

xyiiinexg3 commented 2 years ago

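I counted: of the entries in the genre dictionary, only four fail to be recognized: 动画 (animation), 恐怖 (horror), 喜剧 (comedy), and 科幻 (science fiction). Why is that?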