Ghostvv commented:
swap the token with oov with some probability. If we have a lot of data, this might be better.
What is the reasoning for the probability? To reduce the amount of additional training data?
aeshky commented:
> to reduce the amount of additional training data
Yes. I think even after "partial dataset loading" is implemented, we still don't want to double the size of the dataset if it's very large (thinking about resources).
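For concreteness, here is a minimal sketch of the probabilistic in-place swap discussed above (as opposed to appending augmented copies of every example). The `Example` container, the character-offset span format, and the `oov` token constant are illustrative assumptions, not Rasa internals.

```python
import random
from dataclasses import dataclass

OOV_TOKEN = "oov"  # illustrative tag; the actual tag/configuration is an open question in this issue


@dataclass
class Example:
    text: str
    entities: list  # (start, end, entity_type) character spans into `text`


def swap_entities_with_oov(example, probability=0.5, rng=random):
    """Replace each annotated entity value with the OOV token with some probability.

    Swapping in place keeps the dataset size constant, unlike appending an
    augmented copy of every example, which doubles a large dataset.
    """
    text = example.text
    new_entities = []
    shift = 0  # cumulative length change from replacements made earlier in the string
    for start, end, entity_type in sorted(example.entities):
        start, end = start + shift, end + shift
        if rng.random() < probability:
            text = text[:start] + OOV_TOKEN + text[end:]
            shift += len(OOV_TOKEN) - (end - start)
            end = start + len(OOV_TOKEN)
        new_entities.append((start, end, entity_type))
    return Example(text=text, entities=new_entities)


# e.g. swap_entities_with_oov(Example("My name is Sarah", [(11, 16, "name")]))
```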
➤ Maxime Verger commented:
:bulb: Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS.
From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue!
:arrow_right: More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569.
**Problem Description:** Entity extraction overfits to the tokens in the training data, instead of learning the linguistic pattern around the tokens.
**Example:** If the training data contains `My name is [Sarah](name)`, then entity extraction learns that `Sarah` is a name. However, when it is presented with the sentence `My name is Bob` and it hasn't seen `Bob` in the training data, it fails to extract it as a name. Ideally, entity extraction should learn the linguistic pattern `My name is <X>` and that `Sarah` is a name.

**Solution Overview:** Replace entity tokens with an `oov` tag before training. Because the tokens themselves are useful, we don't want to discard them. Instead, we can do one of the following, e.g. keep both `My name is [Sarah](name)` and `My name is [oov](name)` in the training data.
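A minimal sketch of this "keep both" augmentation, operating directly on Rasa markdown-style annotated lines like the ones above; the function name, the `oov` token string, and the `oov_entities` filter (mirroring the "OOV entities" config idea below) are illustrative assumptions.

```python
import re

# Matches Rasa markdown-style entity annotations such as "[Sarah](name)".
# (Simple entity syntax only; role/group annotations are not handled here.)
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\((\w+)\)")


def augment_with_oov(lines, oov_entities=None, oov_token="oov"):
    """Return the original lines plus an OOV-substituted copy of each line.

    `oov_entities` mirrors the proposed "OOV entities" list: only those entity
    types are replaced; None means "replace every annotated entity". Lines
    without matching annotations are not duplicated.
    """
    def substitute(match):
        value, entity = match.group(1), match.group(2)
        if oov_entities is None or entity in oov_entities:
            return f"[{oov_token}]({entity})"
        return match.group(0)

    lines = list(lines)
    augmented = list(lines)
    for line in lines:
        oov_line = ENTITY_PATTERN.sub(substitute, line)
        if oov_line != line:
            augmented.append(oov_line)
    return augmented


print(augment_with_oov(["My name is [Sarah](name)"]))
# ['My name is [Sarah](name)', 'My name is [oov](name)']
```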
We need to consider how to present this to users. We could add to the `config` file a new option, "OOV entities", under which users list all the entities to augment with OOV.
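For illustration only, the proposed option might surface in the component's configuration roughly as the structure below (written as the Python dict a YAML loader would produce); neither key exists in Rasa today, and both names are assumptions made for this sketch.

```python
# Hypothetical shape of the proposed "OOV entities" option, as the dict a YAML
# loader would hand to the entity-extraction component.
proposed_component_config = {
    "OOV_entities": ["name"],   # entity types whose values get augmented with the oov tag
    "OOV_probability": 0.5,     # optional knob for the in-place swap discussed in the comments
}
```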
**Test the Feature:** Create a training set with some linguistic patterns that you want to learn and some tokens that you annotate. Then create a test set with several cases; for each case, give the percentage of instances that are correctly handled, and explain when it works and when it doesn't.
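The scoring step could be as simple as the sketch below, assuming gold and predicted entities for each test sentence are available as sets of (value, entity_type) pairs; that representation is an assumption, not a Rasa output format.

```python
def percent_correct(cases):
    """Percentage of test sentences whose predicted entities exactly match the gold ones.

    `cases` is an iterable of (gold, predicted) pairs, where each side is a set
    of (value, entity_type) tuples for one test sentence.
    """
    cases = list(cases)
    if not cases:
        return 0.0
    correct = sum(1 for gold, predicted in cases if gold == predicted)
    return 100.0 * correct / len(cases)


# e.g. an unseen name handled correctly and one that was missed -> 50.0
print(percent_correct([
    ({("Bob", "name")}, {("Bob", "name")}),
    ({("Bob", "name")}, set()),
]))
```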
**Further Examples to Consider:** We might be able to handle this:

`I want to travel from London to Moscow` => Learn that London and Moscow are departure and destination cities, and learn the pattern `I want to travel from <departure_city> to <destination_city>`. (Useful if users want to define entities as slots.)

Another case (although this might interact with forms?):

`I like sushi but my friend prefers Thai, so let's go with that` => Learn that sushi and Thai are cuisines, and that `I like <cuisine> but my friend prefers <cuisine>, so let's go with that` means that the slot should be filled with the second token.

**Related Issues:** Entity role and group masking: https://github.com/RasaHQ/rasa/pull/7894