SimGus / Chatette

A powerful dataset generator for Rasa NLU, inspired by Chatito
MIT License

Duplicate strings in output #10

Closed netcarver closed 5 years ago

netcarver commented 5 years ago

Observation

It's possible to generate multiple exact duplicate phrases using the DSL.

For example, this input file...

%[greet]
    ~[&greet] ~[&bot?]

~[greet]
    {hi/hello/howdy/greetings/good morning/good day/good evening}

~[bot]
    hal
    bot

Gives this output (truncated to first two entries)...

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "entities": [],
        "intent": "greet",
        "text": "hi"
      },
      {
        "entities": [],
        "intent": "greet",
        "text": "hi"
      },
      {
         ...

Suggestion

It won't always be obvious where a DSL like this will generate a duplicate phrase, so I'd suggest stripping duplicates before generating the JSON output.
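A minimal sketch of that kind of post-processing, assuming the output has the `rasa_nlu_data` / `common_examples` layout shown above. The function name `strip_duplicate_examples` is hypothetical and is not part of Chatette; this only illustrates the idea on an already generated dataset.

```python
import json


def strip_duplicate_examples(dataset):
    """Return a copy of a Rasa NLU dataset without repeated examples.

    Hypothetical helper for illustration only, not Chatette's own code.
    """
    examples = dataset["rasa_nlu_data"]["common_examples"]
    seen = set()
    unique = []
    for example in examples:
        # Serialize with sorted keys so that equal examples give equal strings.
        key = json.dumps(example, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(example)  # keeps the first occurrence, in order
    return {"rasa_nlu_data": {**dataset["rasa_nlu_data"], "common_examples": unique}}


data = {
    "rasa_nlu_data": {
        "common_examples": [
            {"entities": [], "intent": "greet", "text": "hi"},
            {"entities": [], "intent": "greet", "text": "hi"},
            {"entities": [], "intent": "greet", "text": "hello bot"},
        ]
    }
}
cleaned = strip_duplicate_examples(data)
print(json.dumps(cleaned, indent=2))
```

Two examples count as duplicates here only if intent, text and entities all match, so the same text under two different intents is kept.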

Many thanks for your consideration.

SimGus commented 5 years ago

This can indeed be a problem. Stripping the duplicates before writing the output seems to be the best solution, though it would be problematic when a specific number of examples is requested for a given intent. For example:

%[greet](3)
   whatever

asks for three different examples. If it is possible to generate three different examples, they should be generated.

I'll strip the duplicates as you suggested when no number is given, and I'll work out what to do when a number is given. I don't have much time at the moment, but I'll do it ASAP.
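One possible way to handle the case where a number is given, sketched under assumptions: keep drawing random examples until enough distinct ones have been collected, and give up after a bounded number of attempts. The names `generate_distinct`, `random_greeting` and `max_attempts` are hypothetical, and this is not how Chatette itself is implemented.

```python
import random


def generate_distinct(generate_one, wanted, max_attempts=1000):
    """Collect up to `wanted` distinct examples from a random generator.

    Hypothetical helper: `generate_one` is any zero-argument callable that
    returns one example string per call.
    """
    seen = set()
    results = []
    attempts = 0
    # Stop when enough distinct examples exist or the attempt budget runs out.
    while len(results) < wanted and attempts < max_attempts:
        attempts += 1
        candidate = generate_one()
        if candidate not in seen:
            seen.add(candidate)
            results.append(candidate)
    return results


# Stand-in for a template: a greeting followed by an optional bot name.
GREETINGS = ["hi", "hello", "howdy", "greetings"]
BOTS = ["", " hal", " bot"]


def random_greeting(rng):
    return rng.choice(GREETINGS) + rng.choice(BOTS)


rng = random.Random(0)
examples = generate_distinct(lambda: random_greeting(rng), wanted=3)
print(examples)

# A template with a single possible output can never give 3 distinct examples.
only_one = generate_distinct(lambda: "whatever", wanted=3)
print(only_one)
```

The attempt budget matters for templates like the one with the single `whatever` rule: only one distinct example exists, so the loop has to stop instead of searching forever, and the caller gets fewer examples than requested.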

Thanks for your suggestion :)

SimGus commented 5 years ago

You were actually right: this was a bug. The letter case was changed in some situations, which duplicated some examples. This is now fixed (in all cases) and available on the dev branch. It will be merged into master and released on PyPI shortly.

If this happens again, it is either another bug I have not encountered or a mistake in the template file.
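To check whether it does happen again, a small script can report repeated examples in a generated file. This is a sketch that assumes the `rasa_nlu_data` / `common_examples` layout from the report above; `find_duplicates` is a hypothetical name, and the sample data is inlined rather than read from a real output file.

```python
import json
from collections import Counter


def find_duplicates(dataset):
    """Map each repeated (intent, text) pair to its number of occurrences.

    Hypothetical helper: entities are ignored, only intent and text are compared.
    """
    examples = dataset["rasa_nlu_data"]["common_examples"]
    counts = Counter((ex["intent"], ex["text"]) for ex in examples)
    return {pair: n for pair, n in counts.items() if n > 1}


# Inlined sample standing in for the contents of a generated output file.
sample = json.loads("""
{
  "rasa_nlu_data": {
    "common_examples": [
      {"entities": [], "intent": "greet", "text": "hi"},
      {"entities": [], "intent": "greet", "text": "hi"},
      {"entities": [], "intent": "greet", "text": "hello hal"}
    ]
  }
}
""")

duplicates = find_duplicates(sample)
for (intent, text), n in duplicates.items():
    print(f"{intent!r}: {text!r} appears {n} times")
```

An empty result means no (intent, text) pair is repeated. A non-empty one points at the phrases to look for in the template file before reporting a bug.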

Anyway, thanks for your bug report :)