SimGus / Chatette

A powerful dataset generator for Rasa NLU, inspired by Chatito
MIT License

Handling several slots in the same sentence #34

Closed · Asma-droid closed this issue 4 years ago

Asma-droid commented 4 years ago

Hi all,

Thanks for this great library :-)

I would like to use several slots within the same sentence, but the produced JSON file does not pick up the correct start and end positions of the slots.

Below is a simple example:

*** txt_file ***

```
%[&ask_toilet]
    where the @[toilet#singular] is @[please]?

@[toilet#singular]
    toilet
    loo

@[please]
    please
    plz
```

*** json result ***

```json
{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "entities": [
          { "end": 13, "entity": "toilet", "start": 10, "value": "loo" },
          { "end": 42, "entity": "please", "start": 36, "value": "please" }
        ],
        "intent": "ask_toilet",
        "text": "Where the loo is please?"
      },
      {
        "entities": [
          { "end": 13, "entity": "toilet", "start": 10, "value": "loo" },
          { "end": 39, "entity": "please", "start": 36, "value": "plz" }
        ],
        "intent": "ask_toilet",
        "text": "where the loo is plz?"
      },
      {
        "entities": [
          { "end": 16, "entity": "toilet", "start": 10, "value": "toilet" },
          { "end": 39, "entity": "please", "start": 36, "value": "plz" }
        ],
        "intent": "ask_toilet",
        "text": "where the toilet is plz?"
      }
    ],
    "entity_synonyms": [],
    "lookup_tables": [],
    "regex_features": []
  }
}
```

Any idea, please?

SimGus commented 4 years ago

Hi,

What you report seems related to issue #22 which was fixed in v1.6.1. Using the template you provided, I cannot reproduce the results you get.

Would you mind giving a little more information about the environment you run Chatette in? Namely, I'd need to know:

Thanks in advance :)

Asma-droid commented 4 years ago

Thanks a lot, it works fine now! I have just one other problem: I use the French language, but the current code does not seem to support UTF-8 encoding. Any help, please?

SimGus commented 4 years ago

You're welcome!

Regarding the encoding: I guess you are running Windows, which is likely to encode your files using the Windows-1252 encoding. Try using a file editor that allows you to save your template files in UTF-8; feeding those files to Chatette should then work without a problem. Using a recent version of Python (>= 3.5, I would say) could also help.
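If that is easier, a one-off Python conversion should also do the trick (a sketch with placeholder file names, assuming the files were indeed saved as Windows-1252):

```python
# Re-encode a template file from Windows-1252 (cp1252) to UTF-8
# so Chatette can read it. File names are placeholders.
with open("toilets.chatette", "r", encoding="cp1252") as f:
    content = f.read()

with open("toilets_utf8.chatette", "w", encoding="utf-8") as f:
    f.write(content)
```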

Just so you know, I speak French and I have no problem running Chatette to produce French datasets on Linux (whatever the Python version), with all my files encoded in UTF-8.

Hope this helps! Feel free to ask if you need help again!

Asma-droid commented 4 years ago

Thanks a lot. It works fine for me.

Just another question: how can I avoid redundancy? If I set the augmentation parameter to 100, the algorithm generates redundant sentences.

SimGus commented 4 years ago

What exactly do you mean by "redundancy"? The generated data should never contain duplicates, so if you get the same generated sentence twice (or more times), this is a bug.

If you mean your sentences are too close to each other, this simply depends on your templates. A good way to have variation in the generated sentences is to have a lot of rules in your aliases, slots and templates. You can take a look at different examples on the repo if you want to see how to make good templates.
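For instance, a sketch along these lines (made up for illustration, not taken from the repo's examples) multiplies the number of distinct sentences:

```
// An alias with several rules adds surface variation to every
// sentence that references it (sketch for illustration only).
~[where is]
    where is
    where can I find
    could you tell me where I can find

%[ask_toilet](100)
    ~[where is] the @[toilet#singular] @[please?]?
```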

Cheers!

Asma-droid commented 4 years ago

Hello @SimGus

I am so sorry for my late response.

I tried the simple example from the toilets directory and set the number of generated sentences to 1000 instead of 3. The algorithm generates duplicates. Do you have a solution to avoid that?

[screenshot: two generated sentences that differ only in capitalization]

SimGus commented 4 years ago

Hey @Asma-droid,

The two sentences are actually not duplicates: one of them starts with an uppercase letter, while the other starts with a lowercase letter. If you want to avoid generating sentences with different cases, simply remove the ampersand (&) at the beginning of the unit declaration (so for the toilet example, remove the ampersand in %[&ask_toilet](100)).
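Concretely, using the template from earlier in this thread:

```
// With the leading &, Chatette randomly changes the case of the
// first letter, so both "Where..." and "where..." are generated:
%[&ask_toilet](100)
    where the @[toilet#singular] is @[please]?

// Without it, only the casing written in the template is produced:
%[ask_toilet](100)
    where the @[toilet#singular] is @[please]?
```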

I hope this helps.

Asma-droid commented 4 years ago

Thank you for this response!

SimGus commented 4 years ago

You're welcome! I will close this issue as it seems fixed.