RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

Handle OOV entity values through data augmentation #8339

Closed aeshky closed 1 year ago

aeshky commented 3 years ago

Problem Description: Entity extraction overfits to the tokens in the training data, instead of learning the linguistic pattern around the tokens.

Example: If the training data contains My name is [Sarah](name), then entity extraction learns that Sarah is a name. However, when presented with the sentence My name is Bob, having never seen Bob in the training data, it fails to extract Bob as a name. Ideally, entity extraction should learn both the linguistic pattern My name is <X> and that Sarah is a name.

Solution Overview: Replace entity tokens with an oov tag before training. Because the tokens themselves are useful, we don't want to discard them. Instead we can do one of the following:

  1. Duplicate each user utterance and replace the entity tokens with oov. At training time, the model will see both My name is [Sarah](name) and My name is [oov](name).
  2. Swap the token with oov with some probability. If we have a lot of data, this might be better.
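The two options can be sketched as below. This is only an illustration, not Rasa's implementation: the function names, the `OOV_TOKEN` constant, and the `entities` argument (the set of entity names to mask) are all hypothetical. It handles only the simple `[value](entity)` annotation form, not the extended `[value]{"entity": ...}` form, and it reads option 2 as replacing the token in place, so the dataset does not grow. Adding a masked copy with some probability would be another valid reading.

```python
import random
import re

# Matches the simple inline annotation "[value](entity)" only.
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

OOV_TOKEN = "oov"  # hypothetical placeholder, named after the issue text


def mask_entities(example, entities, prob=1.0, rng=random):
    """Replace values of the listed entities with OOV_TOKEN, each with probability `prob`."""

    def _swap(match):
        entity = match.group(2)
        if entity in entities and rng.random() < prob:
            return f"[{OOV_TOKEN}]({entity})"
        return match.group(0)  # leave the annotation untouched

    return ENTITY_PATTERN.sub(_swap, example)


def augment_by_duplication(examples, entities):
    """Option 1: keep every example and add a masked copy when masking changes it."""
    augmented = []
    for example in examples:
        augmented.append(example)
        masked = mask_entities(example, entities)
        if masked != example:
            augmented.append(masked)
    return augmented


def augment_by_swapping(examples, entities, prob, seed=0):
    """Option 2 (as read here): mask in place with probability `prob`; size is unchanged."""
    rng = random.Random(seed)
    return [mask_entities(example, entities, prob, rng) for example in examples]


examples = ["My name is [Sarah](name)", "hello there"]
duplicated = augment_by_duplication(examples, {"name"})
swapped = augment_by_swapping(examples, {"name"}, prob=0.5)
```

With duplication, only examples that contain one of the listed entities gain a copy, so the dataset grows by at most a factor of two.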

We need to consider how to present this to users. We could add a new option to the config file, "OOV entities", under which users list all the entities to augment with OOV.

Test the Feature: Create a training set with some linguistic patterns that you want to learn and some tokens that you annotate. Then create a test set with:

  1. the same linguistic patterns as in the training data, but with new tokens.
  2. new linguistic patterns, but using tokens from the training data.

For each case, give the percentage of instances that are correctly handled. Explain when it works and when it doesn't.
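The per-case percentage could be computed along these lines. This is a minimal sketch: `percent_correct` and `memorising_extractor` are hypothetical names, and the toy extractor merely stands in for a trained model to show the failure mode described in the problem statement. An instance counts as correct only if the extracted entities match the expected ones exactly.

```python
def percent_correct(test_cases, extract):
    """Percentage of (text, expected) cases where extract(text) equals expected.

    test_cases: list of (text, set of (entity, value)) pairs.
    extract: callable mapping a text to a set of (entity, value) pairs.
    """
    if not test_cases:
        return 0.0
    correct = sum(1 for text, expected in test_cases if set(extract(text)) == set(expected))
    return 100.0 * correct / len(test_cases)


def memorising_extractor(text):
    """Toy stand-in for an overfitted model: it only knows tokens seen in training."""
    known_names = {"Sarah"}
    return {("name", token) for token in text.split() if token in known_names}


# Case 1: patterns from the training data, new tokens.
seen_patterns_new_tokens = [
    ("My name is Bob", {("name", "Bob")}),
    ("My name is Alice", {("name", "Alice")}),
]
# Case 2: new patterns, tokens from the training data.
new_patterns_seen_tokens = [
    ("People call me Sarah", {("name", "Sarah")}),
    ("Sarah here", {("name", "Sarah")}),
]

case_1 = percent_correct(seen_patterns_new_tokens, memorising_extractor)  # 0.0
case_2 = percent_correct(new_patterns_seen_tokens, memorising_extractor)  # 100.0
```

A model that has learned the pattern rather than the tokens should raise the first number without lowering the second.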

Further Examples to Consider: We might be able to handle this: I want to travel from London to Moscow => Learn that London and Moscow are departure and destination cities, and learn the pattern I want to travel from <departure_city> to <destination_city>. (Useful if users want to define entities as slots).

Another case (although this might interact with forms?): I like sushi but my friend prefers Thai, so let's go with that => Learn that sushi and Thai are cuisines, and that I like <cuisine> but my friend prefers <cuisine>, so let's go with that means that the slot should be filled with the second token.

Related Issues: Entity role and group masking: https://github.com/RasaHQ/rasa/pull/7894

Ghostvv commented 3 years ago

> swap the token with oov with some probability. If we have a lot of data, this might be better.

What is the reasoning for using a probability? To reduce the amount of additional training data?

aeshky commented 3 years ago

> to reduce amount of additional training data

Yes. I think even after "partial dataset loading" is implemented, we still don't want to double the size of the dataset if it's very large (thinking about resources).

sync-by-unito[bot] commented 1 year ago

➤ Maxime Verger commented:

:bulb: Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS.

From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue!

:arrow_right: More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569.