Closed tabergma closed 1 year ago
Hi @tabergma.
I am trying to extract any person's name from a sentence using `oov` and `CountVectorsFeaturizer`, following the Sara bot. It is not working as expected: some names are not extracted even when the sentence is given exactly as it appears in the training data, and I'm getting the following logs. Is it because of what you have mentioned above?
## intent:inform
- My name is [James](name)
- my name is [Leota](name)
- Ok, it is [Minna](name)
- Its [Donette](name)
- It is [Abel](name)
- My name is oov
- my name is oov
- Ok, it is oov
- Its oov
- It is oov
- oov
- [Louis](name)
- [Josephine](name)
- [Lenna](name)
- [Mitsue](name)
- [Sage](name)
- [Kris](name)
- [Kiley](name)
- [Graciela](name)
My config.yml
language: en
pipeline:
  - name: SpacyNLP
    case_sensitive: False
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    OOV_token: oov
    token_pattern: (?u)\b\w+\b
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
  - name: FormPolicy
Please let me know if I am doing anything wrong here.
@Akhil-YS Thanks for raising the issue. Next time, please ask your question in the forum. Thanks. From the logs, it looks like you are not using the latest version of Rasa. I recommend updating to the latest version and trying again. We have already fixed some issues that might cause this in your version.
Hi @tabergma. Thank you for your reply. I updated rasa to 1.10.11 as you suggested and tried again. I got the same output again. I will stop extending this here and tag you on the forum post. Could you please help me out with this? Thank you.
➤ Maxime Verger commented:
:bulb: Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS.
From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue!
:arrow_right: More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569.
Description of Problem: The `CountVectorsFeaturizer` checks whether the given `OOV_token` is present in the data. If it is not present, it logs a warning. When a special tokenizer is used, for example `ConveRTTokenizer`, tokens get split into sub-tokens. The OOV token might also be split into sub-tokens; for example, `ConveRTTokenizer` splits `oov` into `oo` and `v`. In that case, the check for whether the OOV token is present in the training data fails, because the token `oov` is not in the list [`oo`, `v`].

Overview of the Solution: Two ideas:
`CountVectorsFeaturizer` and take the start and end positions of tokens into account.
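
To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use Rasa or ConveRT: `toy_subword_tokenize` is a hypothetical stand-in that only mimics the `oov` → `oo` + `v` split described above, and `position_aware_contains_oov` is one possible reading of "take the start and end positions of tokens into account", not the actual fix in `CountVectorsFeaturizer`.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # character offset where the token begins
    end: int    # character offset just past the token


def toy_subword_tokenize(text: str) -> List[Token]:
    """Hypothetical sub-word tokenizer: splits on whitespace and
    breaks the word 'oov' into 'oo' + 'v' (assumed behaviour)."""
    tokens: List[Token] = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        if word == "oov":
            tokens.append(Token("oo", start, start + 2))
            tokens.append(Token("v", start + 2, start + 3))
        else:
            tokens.append(Token(word, start, start + len(word)))
        offset = start + len(word)
    return tokens


def naive_contains_oov(tokens: List[Token], oov_token: str) -> bool:
    """Looks for the OOV token among the raw (sub-)token texts."""
    return oov_token in [t.text for t in tokens]


def position_aware_contains_oov(tokens: List[Token], oov_token: str) -> bool:
    """Re-joins sub-tokens that touch (previous end == next start)
    into whole words before looking for the OOV token."""
    words: List[str] = []
    prev_end = None
    for t in tokens:
        if words and t.start == prev_end:
            words[-1] += t.text  # adjacent sub-token: extend the current word
        else:
            words.append(t.text)  # gap in the text: start a new word
        prev_end = t.end
    return oov_token in words


tokens = toy_subword_tokenize("My name is oov")
print(naive_contains_oov(tokens, "oov"))           # False
print(position_aware_contains_oov(tokens, "oov"))  # True
```

The naive check misses `oov` because it only sees `oo` and `v`, while the position-aware check recovers it because the two sub-tokens share a boundary at the same character offset.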