SimGus / Chatette

A powerful dataset generator for Rasa NLU, inspired by Chatito
MIT License

very long generation time #23

Open muhnashX opened 5 years ago

muhnashX commented 5 years ago

Thanks for putting time and effort into such an amazing project. Great work!

Currently I'm using Chatette in a project (generation modifiers are awesome). One problem though: the generation takes too much time!

I'm generating about ~35K sentences.

Statistics:

Parsed files: 29
Declared units: 80 (80 variations)
    Declared intents: 8 (8 variations)
    Declared slots: 17 (17 variations)
    Declared aliases: 55 (55 variations)
Parsed rules: 11031

Generation takes about ~1 hour, while an equivalent project cloned to Chatito takes about ~10 min.

My question is: what's the complexity of Chatette? Or, which aspect of the statistics above is the generation time directly proportional to? (I noticed that Chatette runs on a single core, so I tried splitting the master file into 4 files and used ray to generate each master file on a separate worker; a sketch of that setup follows below. With this I managed to get the generation time down to ~25 min, but that's still quite an overhead.)
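For reference, my splitting approach looked roughly like the sketch below (the file names are made up, and it assumes Chatette can be invoked once per master file from the command line, with -o selecting the output directory):

import subprocess
import ray

ray.init()

@ray.remote
def generate(template_file, out_dir):
    # Run Chatette as a separate OS process on one master file,
    # so each worker gets its own core.
    subprocess.run(
        ["python", "-m", "chatette", template_file, "-o", out_dir],
        check=True,
    )

# Hypothetical names for the 4 pieces of the original master file.
files = ["master_1.chatette", "master_2.chatette",
         "master_3.chatette", "master_4.chatette"]
ray.get([generate.remote(f, "out_" + str(i)) for i, f in enumerate(files)])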

If anyone has a thought or advice, I'd be grateful!

SimGus commented 5 years ago

This is weird: I used to generate ~27k sentences in a few minutes. That being said, I haven't tested performance since v1.4.2, and I made a large refactor since then which might have decreased performance. I'll check again on v1.6.0.

Would you mind answering a few questions so that I can have a better understanding of what is going on here?

To answer your questions: you can see a rule as a tree, where each piece of content is a node, and the branches are the possible choices the generator can make.

For example, the template

%[intent]
    A single rule with an ~[alias] and a [choice|small choice]

~[alias]
    another rule with a [choice|small choice]
    a third rule

could be seen as the following rule tree: [RuleTree diagram]

To generate an example, Chatette traverses such a tree. Therefore, several factors can increase generation time: the depth of the tree (how deeply aliases are nested inside one another), the number of branches at each node (the number of rules and choices per unit), and the number of modifiers that have to be applied along the way.
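To make those factors concrete, here is a rough Python sketch (not Chatette's actual code) that enumerates every sentence the template above can produce; the data structures are just a stand-in for the rule tree:

from itertools import product

# A node is a literal string, a choice (tuple of options),
# or an alias (list of alternative rules, each a list of nodes).
choice = ("choice", "small choice")
alias = [
    ["another rule with a ", choice],
    ["a third rule"],
]
intent_rule = ["A single rule with an ", alias, " and a ", choice]

def expand(node):
    # Yield every string a single node can produce.
    if isinstance(node, str):
        yield node
    elif isinstance(node, tuple):   # choice: one branch per option
        yield from node
    else:                           # alias: union of its rules
        for rule in node:
            for parts in product(*(expand(n) for n in rule)):
                yield "".join(parts)

sentences = ["".join(parts)
             for parts in product(*(expand(n) for n in intent_rule))]
print(len(sentences))  # 6 = 3 alias expansions x 2 choice options

Deeper nesting and more branches multiply these counts together, which is why generation time can blow up quickly.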

There are a few things I would suggest you do to try and decrease the execution time, aside from the obvious one (e.g. closing all resource-intensive applications when running Chatette): flatten long chains of nested aliases where you can, cut down on modifiers, and split large template files into smaller ones (flattening is illustrated just below).
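To illustrate the flattening idea with a made-up template (not taken from your project), a chain of aliases like

~[greeting]
    ~[politeness]

~[politeness]
    good ~[daytime]

~[daytime]
    morning
    evening

produces exactly the same sentences as the single flattened alias

~[greeting]
    good morning
    good evening

at the cost of some modularity.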

Of course, those suggestions are just workarounds; having Chatette run for an hour to produce "only" 35k examples is not acceptable. Improving the performance of Chatette is the real solution here. I will do it, but it will certainly take time.

On a side note, if you are using Chatette for machine learning tasks, be aware that your models can easily overfit with such a high number of examples.

I hope this helps and thank you for your kind message :)

muhnashX commented 5 years ago

Apologies for the late reply; things came up and I couldn't access the machine I'm experimenting on. First of all, thank you for the detailed reply, and yes, it helped!! (Especially the second half.) Regarding your questions:

I assume you are using a fairly normal computer but I prefer to ask, is your computer particularly slow?

Well, it's a ThinkPad with a Core i7-8565U CPU and 8 GB of RAM.

Can you confirm you are using Chatette v1.6.0?

Yes! I'm using version 1.6.0.

Which part of the process is especially slow: parsing, generation, or writing the output file(s)?

The generation, I think, because the parsing is done quickly and then I'm left at the prompt: [DBG] Generating training examples...

Are your template files particularly convoluted? By that I mean, do you often have very long chains of aliases used inside other aliases, used in other aliases, and so on? Also, do you use a lot of modifiers?

Yes, kind of: a typical intent definition is basically built out of aliases, some of them nested up to 4 levels, i.e. [high_level_alias --> sub_alias_lvl1 --> sub_alias_lvl2 --> small_alias that contains slots]. I think there's some work that can be done here (flattening and splitting things a bit).

Could you measure whether it is the memory consumption or the CPU utilization that is the limiting factor? Basically, when you run Chatette, is most of your RAM being used or is the CPU core that it is executing on running at full throttle?

I think it's the CPU. Here's a screenshot from htop taken while generating today: [htop screenshot] That run took about ~50 min.
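For anyone who wants to log this over time instead of watching htop, here is a minimal psutil sketch (the PID is hypothetical; use whatever process is running Chatette):

import psutil

pid = 12345  # hypothetical: PID of the running Chatette process
proc = psutil.Process(pid)

try:
    while proc.is_running():
        cpu = proc.cpu_percent(interval=1.0)        # % of one core
        rss = proc.memory_info().rss / (1024 ** 2)  # resident memory, MiB
        print(f"cpu={cpu:5.1f}%  rss={rss:8.1f} MiB")
except psutil.NoSuchProcess:
    pass  # the process finished between two samples

If the CPU stays pinned near 100% while memory stays flat, the workload is CPU-bound; if memory keeps climbing instead, RAM is the more likely bottleneck.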

muhnashX commented 5 years ago

What I'm going to do is refactor my templates as much as I can, without hurting modularity and readability too much (Chatette's winning point); I'd gladly pay some generation time in exchange for nicer syntax, as it's easier to maintain and update. Regarding performance optimization of v1.6.0: it would be awesome if I could help, but I lack experience in that area. That's why I've been treating Chatette as a black box and trying to multiprocess at a high level with ray.

SimGus commented 5 years ago

Thanks for your replies :)

I think I see what the problem is: since v1.6.0, I've been using caching during generation to improve generation time, but for big templates it eats up all the RAM, which makes everything very slow. I assumed this would only be a problem for much larger files, but apparently I was wrong.

I'll add a way to disable that (or just turn it down) in the next update. It will likely take a while, as I don't have much time at the moment, so I hope you can cope with the current version in the meantime. If you can, trying the previous version (v1.5.2) could help, I think.
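For readers wondering what "turning it down" could mean in practice, one generic option is a size-capped cache. This is an illustration only (expand_unit is a made-up stand-in, not Chatette's internals):

from functools import lru_cache

# A bounded cache evicts least-recently-used entries, so memory
# stays flat instead of growing with the number of cached expansions.
@lru_cache(maxsize=1024)
def expand_unit(name: str) -> tuple:
    # stand-in for an expensive template-unit expansion
    return tuple(f"{name} variant {i}" for i in range(3))

print(expand_unit("greeting"))
print(expand_unit.cache_info())  # hits/misses/currsize, useful for tuning maxsize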

Thanks again for the heads up!

SimGus commented 5 years ago

I just released a patch (v1.6.1) which tackles the issue temporarily: the problem came from the caching strategy, which was way too aggressive. That patch disables caching when there are more than 50 units (aliases, slots and intents) declared. I was able to measure roughly a 4x decrease in execution time for medium-sized templates (you can find my benchmarks here).
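In spirit, the workaround amounts to a guard like the following sketch (a paraphrase for illustration, not the actual patch):

CACHE_THRESHOLD = 50  # declared units: aliases + slots + intents

def should_cache(num_units: int) -> bool:
    # v1.6.1 workaround: skip caching entirely for large template sets,
    # trading repeated computation for bounded memory use.
    return num_units <= CACHE_THRESHOLD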

Of course, this is just a temporary solution; I plan to tackle the issue in a cleaner and more robust way in future releases. I will keep this issue open until I am satisfied with the caching strategy and, more generally, with the execution time.