OOV token and special tokenizers

tabergma commented 4 years ago

Description of Problem: The CountVectorsFeaturizer checks whether the configured OOV_token is present in the training data and logs a warning if it is not. When a special tokenizer such as ConveRTTokenizer is used, tokens are split into sub-tokens, and the OOV token itself may be split as well: for example, ConveRTTokenizer splits oov into oo and v. In that case the check fails even though the OOV token is in the training data, because the token oov is not in the list [oo, v].
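
To make the failure concrete, here is a minimal sketch in plain Python. It does not use Rasa: `toy_subword_tokenize` is a hypothetical stand-in for a sub-word tokenizer (it only mimics the oov -> oo, v split described above), and `oov_token_seen` is an assumed, simplified version of the membership check, not the actual CountVectorsFeaturizer code.

```python
from typing import List


def toy_subword_tokenize(text: str) -> List[str]:
    """Hypothetical sub-word tokenizer: splits 'oov' into 'oo' and 'v'."""
    pieces: List[str] = []
    for word in text.split():
        if word == "oov":
            pieces.extend(["oo", "v"])  # mimic the split described in the issue
        else:
            pieces.append(word)
    return pieces


def oov_token_seen(oov_token: str, tokenized_examples: List[List[str]]) -> bool:
    """Simplified check: does any example contain the OOV token as a token?"""
    return any(oov_token in tokens for tokens in tokenized_examples)


examples = ["my name is oov", "it is oov"]

# Whitespace tokenization keeps 'oov' intact, so the check succeeds.
whitespace_tokens = [example.split() for example in examples]
found_with_whitespace = oov_token_seen("oov", whitespace_tokens)

# Sub-word tokenization yields ['oo', 'v'], so the same check fails.
subword_tokens = [toy_subword_tokenize(example) for example in examples]
found_with_subwords = oov_token_seen("oov", subword_tokens)

print(found_with_whitespace, found_with_subwords)
```

With sub-word tokens the check reports the OOV token as missing, which is what would trigger the misleading warning.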

Overview of the Solution: Two ideas:

1. Move the OOV token to the tokenizer itself. The tokenizer would tokenize the OOV token and add it to the message object.
2. Use the actual token objects in the check in CountVectorsFeaturizer and take the start and end positions of the tokens into account.
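
One way the position-based check could look is sketched below. This is only an illustration under assumptions: the `Token` dataclass is a hypothetical stand-in with `text`, `start`, and `end` fields rather than Rasa's own token class, and `merge_adjacent` and `oov_token_present` are made-up helper names. The idea is that sub-tokens whose character offsets touch belong to the same original word, so they can be joined before the OOV token is looked up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical token with character offsets into the original text."""
    text: str
    start: int
    end: int


def merge_adjacent(tokens: List[Token]) -> List[str]:
    """Join sub-tokens whose offsets touch (previous end == next start)."""
    words: List[str] = []
    prev_end: Optional[int] = None
    for token in tokens:
        if prev_end is not None and token.start == prev_end:
            words[-1] += token.text  # continuation of the previous word
        else:
            words.append(token.text)  # a gap (e.g. a space) starts a new word
        prev_end = token.end
    return words


def oov_token_present(oov_token: str, tokens: List[Token]) -> bool:
    """Check for the OOV token on the re-joined words, not the raw sub-tokens."""
    return oov_token in merge_adjacent(tokens)


# "it is oov" tokenized into sub-tokens: 'oov' became 'oo' (6-8) and 'v' (8-9).
tokens = [
    Token("it", 0, 2),
    Token("is", 3, 5),
    Token("oo", 6, 8),
    Token("v", 8, 9),
]

merged = merge_adjacent(tokens)
present = oov_token_present("oov", tokens)
print(merged, present)
```

A limitation of this sketch is that any tokens with touching offsets are merged, so a punctuation token directly after oov would be joined to it; a real implementation would need to handle that case.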

Akhil-YS commented 4 years ago

Hi @tabergma. I am trying to extract a person's name from a sentence using the OOV token and CountVectorsFeaturizer, following the Sara bot. It is not working as expected: some names are not extracted even when the sentence is given exactly as it appears in the training data, and I am getting the following logs. Is this because of what you described above?

## intent:inform
- My name is [James](name)
- my name is [Leota](name)
- Ok, it is [Minna](name)
- Its [Donette](name)
- It is [Abel](name)
- My name is oov
- my name is oov
- Ok, it is oov
- Its oov
- It is oov
- oov
- [Louis](name)
- [Josephine](name)
- [Lenna](name)
- [Mitsue](name)
- [Sage](name)
- [Kris](name)
- [Kiley](name)
- [Graciela](name)

My config.yml

language: en
pipeline:
  - name: SpacyNLP
    case_sensitive: False
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    OOV_token: oov
    token_pattern: (?u)\b\w+\b
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper

policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
  - name: FormPolicy

Please let me know if I am doing anything wrong here.

tabergma commented 4 years ago

@Akhil-YS Thanks for raising the issue. Next time, please ask your question in the forum. From the logs it looks like you are not using the latest version of Rasa. I recommend updating to the latest version and trying again; we have already fixed some issues that might cause this on your version.

Akhil-YS commented 4 years ago

Hi @tabergma. Thank you for your reply. I updated Rasa to 1.10.11 as you suggested and tried again, but I got the same output. I will not extend this thread further and will tag you on the forum post instead. Could you please help me out with this? Thank you.

sync-by-unito[bot] commented 1 year ago

➤ Maxime Verger commented:

Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS.

From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue!

More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569.

RasaHQ / rasa

OOV token and special tokenizers #5383