One observation: during rasa train the first thing that happens is

TrainingDataImporter.load_from_config(config, domain, training_files)

This ends up checking whether the files match the expected format, i.e. it reads the actual file content and then checks whether that content matches one of the valid Rasa formats. The problem: the function guess_format reads the content of a file multiple times in case it is YAML. This takes forever (> 1 hour for the Advising dataset).
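A naive sketch of one way the repeated reads could be avoided, by reading and caching the file content once before running the format checks. This is not the actual Rasa implementation of guess_format, and guess_format_cached plus the heuristics inside it are purely illustrative:

```python
from pathlib import Path

def guess_format_cached(filename: str) -> str:
    """Illustrative only: detect the training data format from cached content."""
    # Read the file exactly once and reuse the content for every candidate check.
    content = Path(filename).read_text(encoding="utf-8")

    # Placeholder heuristics standing in for the real format detectors.
    if content.lstrip().startswith("{"):
        return "json"
    if "nlu:" in content or "stories:" in content:
        return "yaml"
    return "unknown"
```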
I used the library guppy to print out the memory consumption at different points in the code.
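Something along these lines is enough to reproduce the measurements (this is just the standard guppy usage pattern, not the exact instrumentation I used, and the call sites inside the Rasa code are omitted):

```python
from guppy import hpy

hp = hpy()

# ... after reading the domain and the config file ...
print("after loading domain/config:")
print(hp.heap())  # prints the total heap size plus a breakdown by type

# ... after loading all the training data ...
print("after loading training data:")
print(hp.heap())
```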
Observations from running rasa train with the Advising dataset:
After reading the domain and the config file, ~400 MB are used.
After loading all the training data, ~430 MB are used.
... (I lost the console output as the instance ran out of memory.)
File size of the train.yml file for the different large datasets:
Advising - 16 MB
Ubuntu - xx MB
MultiWOZ - 15 MB
The Advising and MultiWOZ datasets fit into memory before featurization starts. The Google Cloud instance had 4 vCPUs and 15 GB of memory.
Related to https://github.com/RasaHQ/rasa/issues/6836
In order to verify whether our proposed solution will work, we need to investigate the memory consumption. Clarify the following question: how much memory does the TrainingData object take up with tokens plus a portion of the data featurized?
Especially look at the Advising Corpus (can be found here), but other datasets listed in our training data repository can also be used for the investigation.
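A sketch of how that could be measured, assuming a Rasa 2.x environment; the module path, the pympler asizeof dependency, and the data path are assumptions for illustration, not the agreed approach:

```python
from pympler import asizeof  # deep (recursive) object size
from rasa.shared.nlu.training_data.loading import load_data  # Rasa 2.x path, assumed

training_data = load_data("data/advising/train.yml")  # hypothetical local path

def report(stage: str) -> None:
    """Print the deep size of the TrainingData object at a given stage."""
    size_mb = asizeof.asizeof(training_data) / 1e6
    print(f"TrainingData {stage}: {size_mb:.1f} MB")

report("after loading")
# Repeat after running the tokenizer over all examples, and again after
# featurizing a portion of them, to see how much each step adds:
# report("after tokenization")
# report("after partial featurization")
```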