One observation: during rasa train the first thing that happens is

TrainingDataImporter.load_from_config(config, domain, training_files)

This ends up checking whether the files match the expected format, i.e. it reads the actual file content and then checks whether that content matches one of the valid Rasa formats. The problem: the function guess_format reads the content of a file multiple times in case it is YAML. This takes forever (> 1 hour for the Advising dataset).
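A naive sketch of one way the repeated reads could be avoided, by reading and caching the file content once before running the format checks. This is not the actual Rasa implementation of guess_format, and guess_format_cached plus the heuristics inside it are purely illustrative:

```python
from pathlib import Path

def guess_format_cached(filename: str) -> str:
    """Illustrative only: detect the training data format from cached content."""
    # Read the file exactly once and reuse the content for every candidate check.
    content = Path(filename).read_text(encoding="utf-8")

    # Placeholder heuristics standing in for the real format detectors.
    if content.lstrip().startswith("{"):
        return "json"
    if "nlu:" in content or "stories:" in content:
        return "yaml"
    return "unknown"
```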
I used the library guppy to print out the memory consumption at different points in the code.
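Something along these lines is enough to reproduce the measurements (this is just the standard guppy usage pattern, not the exact instrumentation I used, and the call sites inside the Rasa code are omitted):

```python
from guppy import hpy

hp = hpy()

# ... after reading the domain and the config file ...
print("after loading domain/config:")
print(hp.heap())  # prints the total heap size plus a breakdown by type

# ... after loading all the training data ...
print("after loading training data:")
print(hp.heap())
```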
Observations from running rasa train with the Advising dataset:
After reading the domain and the config file, ~400 MB are used.
After loading all the training data, ~430 MB are used.
... (I lost the console output as the instance ran out of memory.)
File size of the train.yml file for the different large datasets:
Advising - 16 MB
Ubuntu - xx MB
MultiWOZ - 15 MB
The Advising and MultiWOZ datasets fit into memory before featurization starts. The Google Cloud instance had 4 vCPUs and 15 GB of memory.
Related to https://github.com/RasaHQ/rasa/issues/6836
In order to verify whether our proposed solution will work, we need to investigate the memory consumption. Clarify the following question: how much memory does the TrainingData object take up with tokens plus a portion of the data featurized?
Especially look at the Advising Corpus (can be found here), but other datasets listed in our training data repository can also be used for the investigation.
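A sketch of how that could be measured, assuming a Rasa 2.x environment; the module path, the pympler asizeof dependency, and the data path are assumptions for illustration, not the agreed approach:

```python
from pympler import asizeof  # deep (recursive) object size
from rasa.shared.nlu.training_data.loading import load_data  # Rasa 2.x path, assumed

training_data = load_data("data/advising/train.yml")  # hypothetical local path

def report(stage: str) -> None:
    """Print the deep size of the TrainingData object at a given stage."""
    size_mb = asizeof.asizeof(training_data) / 1e6
    print(f"TrainingData {stage}: {size_mb:.1f} MB")

report("after loading")
# Repeat after running the tokenizer over all examples, and again after
# featurizing a portion of them, to see how much each step adds:
# report("after tokenization")
# report("after partial featurization")
```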