RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

OOV token and special tokenizers #5383

Closed: tabergma closed this issue 1 year ago

tabergma commented 4 years ago

Description of Problem: The CountVectorsFeaturizer checks whether the given OOV_token is present in the training data and logs a warning if it is not. When a special tokenizer such as ConveRTTokenizer is used, tokens get split into sub-tokens, and the OOV token itself might be split as well. For example, ConveRTTokenizer splits oov into oo and v. In that case the check fails even though the OOV token is present in the training data, because the token oov is not in the list [oo, v].
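The failure mode described above can be reproduced with a small, self-contained sketch. Everything here is an assumption made for illustration: `toy_subword_tokenize` is a hypothetical stand-in that only mimics the oov -> oo, v split mentioned in the description, and `oov_token_in_data` is a simplified membership check, not Rasa's actual implementation.

```python
from typing import Callable, Iterable, List

OOV_TOKEN = "oov"


def whitespace_tokenize(text: str) -> List[str]:
    """Plain whitespace tokenizer: every word stays a single token."""
    return text.lower().split()


def toy_subword_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a sub-word tokenizer.

    It only mimics the split described in the issue ("oov" -> "oo", "v");
    it is not the real ConveRTTokenizer.
    """
    tokens: List[str] = []
    for word in text.lower().split():
        if word == OOV_TOKEN:
            tokens.extend(["oo", "v"])
        else:
            tokens.append(word)
    return tokens


def oov_token_in_data(
    examples: Iterable[str], tokenize: Callable[[str], List[str]]
) -> bool:
    """Simplified check: does the OOV token appear as a token in any example?"""
    return any(OOV_TOKEN in tokenize(example) for example in examples)


examples = ["my name is oov", "it is oov"]

# With whole-word tokens the check finds the OOV token.
print(oov_token_in_data(examples, whitespace_tokenize))
# After sub-word splitting only "oo" and "v" remain, so the check fails.
print(oov_token_in_data(examples, toy_subword_tokenize))
```

The training data is identical in both calls; only the tokenizer changes, which is why a warning about a missing OOV token can be misleading in this situation.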

Overview of the Solution: Two ideas:

Akhil-YS commented 4 years ago

Hi @tabergma. I am trying to extract a person's name from a sentence using oov and the CountVectorsFeaturizer, following the Sara bot. It is not working as expected: some names are not extracted even when the sentence is given exactly as it appears in the training data, and I'm getting the following logs. Is this because of what you mentioned above?

image

## intent:inform
- My name is [James](name)
- my name is [Leota](name)
- Ok, it is [Minna](name)
- Its [Donette](name)
- It is [Abel](name)
- My name is oov
- my name is oov
- Ok, it is oov
- Its oov
- It is oov
- oov
- [Louis](name)
- [Josephine](name)
- [Lenna](name)
- [Mitsue](name)
- [Sage](name)
- [Kris](name)
- [Kiley](name)
- [Graciela](name)

My config.yml

language: en
pipeline:
  - name: SpacyNLP
    case_sensitive: False
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    OOV_token: oov
    token_pattern: (?u)\b\w+\b
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper

policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
  - name: FormPolicy
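For readers unfamiliar with the OOV_token and token_pattern options in the config above, here is a minimal sketch of the general idea, assuming a plain bag-of-words vocabulary. It is not Rasa's code: `tokenize`, `build_vocabulary`, and `replace_unseen` are hypothetical helpers, and the training sentences are a small subset chosen for illustration. The only facts taken from elsewhere are the regular expression itself (copied from the config) and scikit-learn's default `token_pattern`, which requires at least two word characters.

```python
import re
from typing import List, Set

# Same pattern as in the config; unlike scikit-learn's default
# (?u)\b\w\w+\b it also keeps single-character tokens.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")
OOV_TOKEN = "oov"


def tokenize(text: str) -> List[str]:
    """Lower-case the text and extract word tokens with the configured pattern."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(training_examples: List[str]) -> Set[str]:
    """Collect every token seen in the training examples."""
    vocabulary: Set[str] = set()
    for example in training_examples:
        vocabulary.update(tokenize(example))
    return vocabulary


def replace_unseen(text: str, vocabulary: Set[str]) -> List[str]:
    """Map tokens that were never seen in training to the OOV token.

    The substitution only happens if the OOV token itself is part of
    the vocabulary, i.e. it occurred in the training data.
    """
    tokens = tokenize(text)
    if OOV_TOKEN not in vocabulary:
        return tokens
    return [token if token in vocabulary else OOV_TOKEN for token in tokens]


vocabulary = build_vocabulary(["My name is James", "My name is oov", "Its oov"])
print(replace_unseen("My name is Zelda", vocabulary))
```

In this sketch an unseen name ends up with the same bag-of-words feature as the literal word oov in the training examples. If the OOV token never reaches the vocabulary, for instance because a tokenizer split it, no substitution takes place.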

image

Please let me know if I am doing anything wrong here.

tabergma commented 4 years ago

@Akhil-YS Thanks for raising the issue. Next time, please ask your question in the forum. From the logs it looks like you are not using the latest version of Rasa. I recommend updating to the latest version and trying again. We have already fixed some issues that might cause this on your version.

Akhil-YS commented 4 years ago

Hi @tabergma. Thank you for your reply. I updated Rasa to 1.10.11 as you suggested and tried again, but I got the same output. I will stop extending this thread here and tag you on the forum post. Could you please help me out with this? Thank you.

sync-by-unito[bot] commented 1 year ago

➤ Maxime Verger commented:

💡 Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS.

From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue!

➡️ More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569.