SimGus / Chatette

A powerful dataset generator for Rasa NLU, inspired by Chatito
MIT License

Even distribution of strings with imbalanced sub-rules? #20

Closed srinidhigoud closed 5 years ago

srinidhigoud commented 5 years ago

Suppose one of my two sub-rules has only one word while the other has 99 words. When I generate 100 strings, I want to ensure that 50% come from each sub-rule. Is there a way to do this?

SimGus commented 5 years ago

This is the default behavior (and currently the only one available): when generating a unit or a choice for which several rules are available, the program picks one of the rules at random, regardless of its length. For example, for the alias:

~[alias]
   first rule
   second rule with a lot of words

the program will choose "first rule" 50% of the time and "second rule with a lot of words" the rest of the time.
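
To illustrate, here is a minimal sketch of uniform rule selection. It is not Chatette's actual code: the helper name `pick_rule` and the representation of rules as plain strings are assumptions made for this example. Each rule is picked with equal probability, no matter how many words it contains.

```python
import random
from collections import Counter


def pick_rule(rules, rng):
    """Pick one rule uniformly at random, ignoring its length (hypothetical helper)."""
    return rng.choice(rules)


# The two rules of the example alias.
rules = ["first rule", "second rule with a lot of words"]

rng = random.Random(0)  # seeded only so that the run is reproducible
n_draws = 10_000
counts = Counter(pick_rule(rules, rng) for _ in range(n_draws))

# Each rule should account for roughly half of the draws.
for rule in rules:
    print(f"{rule!r}: {counts[rule] / n_draws:.1%}")
```

With enough draws, both shares settle close to 50%, which is the even split asked about in the question.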

I'm thinking of adding other behaviors for choosing which rule to generate (e.g. taking into account the number of examples that the rule can generate, or the length of the rule). Feel free to share ideas if you can think of other useful behaviors.
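
As a rough sketch of what one such alternative could look like, the snippet below weights each rule by the number of examples it can generate. Nothing here exists in Chatette: `pick_rule_weighted` is a hypothetical helper, and the example counts (1 and 99) are assumed values borrowed from the question.

```python
import random
from collections import Counter


def pick_rule_weighted(rules, weights, rng):
    """Pick one rule with probability proportional to its weight (hypothetical helper)."""
    return rng.choices(rules, weights=weights, k=1)[0]


rules = ["first rule", "second rule with a lot of words"]
# Assumed weights: how many distinct examples each rule could generate.
example_counts = [1, 99]

rng = random.Random(0)  # seeded only so that the run is reproducible
n_draws = 10_000
weighted_counts = Counter(
    pick_rule_weighted(rules, example_counts, rng) for _ in range(n_draws)
)

# The second rule should now be chosen about 99% of the time.
for rule in rules:
    print(f"{rule!r}: {weighted_counts[rule] / n_draws:.1%}")
```

Under this weighting every individual example is roughly equally likely, at the cost of the even split between rules that the current behavior provides.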

Cheers