RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

Handle large NLU markdown files #5628

Closed kamyarghajar closed 3 years ago

kamyarghajar commented 4 years ago

Description of Problem: Hey, first of all, thanks for the great Rasa system you have built. We have a large dataset for training the NLU model, and Rasa does not seem to handle large files well, especially in terms of memory consumption: our generated markdown data is about 3GB, and my machine has 32GB of memory (plus 32GB of swap) and high-end processing power. Training consumes so much memory that we have to kill the process; it appears completely stuck, with no logs at all, from the very beginning of the pipeline. The same thing happens even with a 250MB markdown file on the same machine, though in that case it stalls at the CRF entity extraction step rather than at the start. Also discussed in a forum topic.

Overview of the Solution: We need some sort of batching mechanism to handle data of this size, or the ability to split the training data across multiple markdown files (trained sequentially rather than in parallel, and still producing a single model file), so that memory is used more sensibly; see the directory sketch below. 32GB of RAM (plus 32GB of swap) is a lot, and 3GB (or even 250MB) of markdown data is not particularly large compared to state-of-the-art datasets out there.
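For what it's worth, Rasa can already read NLU data split across several markdown files in a directory, if I'm not mistaken (file names below are hypothetical). That alone does not fix the memory problem, since everything is still loaded into a single training run, but it at least avoids the single monolithic file:

```text
data/nlu/
├── addresses_part1.md
├── addresses_part2.md
└── addresses_part3.md
```

and then training with `rasa train nlu --nlu data/nlu/`.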

Definition of Done:

amn41 commented 4 years ago

Hi @KamyarGhajar! I'm curious where this data comes from. How many intents and entities do you have, and how did you get 3GB of text annotated?

I suspect that you used a script to generate all this training data. I really don't think that generating 3GB of synthetic examples is a good idea.

kamyarghajar commented 4 years ago

Hi @amn41, and thanks for the response. Yes, as I mentioned, the data is generated. It is generated from different address representation types on map data for address geocoding (parsing addresses in conversations for map search), so every place has to appear in the data, since the model should output tags such as restaurant, neighborhood, etc., and intents such as highway search, lodging search, etc. So far we have 6 distinct intents (soon to be 25) and 12 entities in the sentences, with at most about 22 kinds of address sentences generated for each place on the map of a sample country. Since Rasa NLU was a perfect fit as the sequence tagger and intent extractor, we decided to use it in our stack.
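To give a flavor of the data, each generated utterance looks roughly like this in the markdown NLU format (the intent and entity names here are illustrative stand-ins; the real data is in Farsi):

```md
## intent:restaurant_search
- show me [Sun Cafe](restaurant) near [Vanak](neighborhood)
- restaurants around [Azadi Square](neighborhood)

## intent:lodging_search
- hotels close to [Mellat Park](neighborhood)
```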

amn41 commented 4 years ago

OK, got it. I think it's a much better approach to use lookup tables than to generate all possible utterances.
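For reference, in the markdown training-data format a lookup table can point at a plain-text file with one entry per line (the path and entity name below are illustrative); note that the entity still needs a handful of annotated examples in the regular training data for the lookup feature to help:

```md
## lookup:restaurant
data/lookups/restaurants.txt
```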

kamyarghajar commented 4 years ago

@amn41 Is it okay to use lookup tables when, for instance, a particular name in the restaurant table is the same as the name of a square or neighborhood, so that the same name should sometimes be tagged as a restaurant and sometimes as a neighborhood or metro station? Does it work that way? I assume lookup tables take no part in the training pipeline at all.

amn41 commented 4 years ago

Yes, that should work. The lookup tables are used as extra information by the model, so you can have entries in your lookup tables that appear in multiple entities.
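A minimal sketch of that situation, using inline lookup entries (all names made up): the same entry appears under two entities, and the model has to disambiguate from the surrounding context and the annotated examples:

```md
## lookup:restaurant
- Sun Cafe
- Golestan

## lookup:neighborhood
- Golestan
- Vanak
```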

kamyarghajar commented 4 years ago

Hi @amn41, the idea of large lookup tables instead of a large nlu.md file did not work, for two main reasons. First, the precision of entity tagging is dramatically degraded compared to training on the same data expanded into separate utterances in nlu.md. Second, at run time, parsing with a DIET model through the /model/parse HTTP API (rasa-1.10.1-full Docker image with 12 workers + Redis lock store) is very slow, with many timeouts (tested with the wrk2 tool using 64 connections and 16 threads over 1 minute). I guess there should be a way to make this work, as the lookup tables seem inefficient in this case.
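For reference, the load test was along these lines (a sketch, not the exact invocation: the request rate and payload text are assumed values, since wrk2 requires a target rate via -R):

```sh
# parse.lua makes wrk2 POST a JSON payload instead of GET
# (the example text is an illustrative stand-in; the real queries are Farsi addresses)
cat > parse.lua <<'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"text": "restaurant near Vanak Square"}'
EOF

# 16 threads, 64 connections, 60 seconds, as described above;
# -R (target throughput) is mandatory in wrk2 -- 100 req/s is an assumed value
wrk -t16 -c64 -d60s -R100 -s parse.lua http://localhost:5005/model/parse
```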

amn41 commented 4 years ago

Thanks for the feedback, @KamyarGhajar! There is no good reason why the model should perform worse with a lookup table, so that sounds like a possible bug. Are you able to share a minimal example that reproduces this?

cc @dakshvar22 for things to include in regression tests

kamyarghajar commented 4 years ago

Thanks @amn41 for the response. Of course I can share the address dataset with you, but the data is in the Farsi (Persian) language. Is the Rasa team able to understand it and use it for test purposes?

amn41 commented 4 years ago

We will try our best :) You can find my email on my GitHub profile if you want to send it.

wochinge commented 3 years ago

This will be fixed by training in steps (https://github.com/RasaHQ/rasa/issues/6836)

wochinge commented 3 years ago

Closing this as a duplicate of https://github.com/RasaHQ/rasa/issues/6836