Hi @sibbsnb, I just sent you an email! (I'll paste it below for reference.)
Now that I see the log, I wonder whether something else is happening here, since the epochs have already started. My best guess now is the following: since you are training on macOS, you are not using GPU training, right? I think that might be the actual issue right now; the training is simply very compute-intensive. However, would you mind trying out the branch batch-ted-labels as described in my email? Although that change was mainly aimed at GPU memory, it also speeds up training.
Mail:
TL;DR: I think e2e training performance will be much improved three weeks from now, but you can already use some of these improvements on dev branches.
Here's how:
The problem you have been describing with the training hanging is related to pre-processing speed. I assume what you are seeing is that after Processed trackers: 100% nothing happens. Is that right? In that case it is not actually hanging but is featurizing the dialogue trackers. This step can take a while, especially when using a slower featurizer such as LanguageModelFeaturizer. You can verify it is not hanging by checking that main memory is steadily increasing. You can gain some speed here by slimming down the featurizers. However, we are currently working on getting rid of this pre-processing as a separate step altogether and running it in parallel during training using multiple data-preparation workers.
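For illustration, here is a minimal config.yml pipeline sketch for Rasa 2.x with a slimmed-down featurizer setup (sparse CountVectorsFeaturizer instead of LanguageModelFeaturizer); treat the exact components and values as placeholders for your own project:

```yaml
# Illustrative Rasa 2.x pipeline with lighter featurizers.
# Dropping LanguageModelFeaturizer in favour of sparse featurizers trades some
# accuracy for much faster featurization of the dialogue trackers.
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer             # word-level bag-of-words features
  - name: CountVectorsFeaturizer
    analyzer: char_wb                        # character n-grams, more robust to typos
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
```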
During the actual training, there is also a GPU memory issue on the main branch that we have already fixed on the branch batch-ted-labels. So if you want to train with many stories right now, you will definitely need this branch. It will prevent out-of-memory errors during training and speed up training too. To use the change on this branch you will also have to set the parameter "label_batch_size". This parameter sets the number of negative label "candidates" during training (sampled from all the possible labels, which tend to number in the thousands in a large e2e setup). It is related to number_of_negative_examples, which sets the number of actual negative examples used in the loss (sampled from the candidates that were drawn beforehand). In essence, label_batch_size should be larger than number_of_negative_examples. A good starting point might be label_batch_size = 64 and number_of_negative_examples = 20; see the sketch below. We haven't done extensive hyperparameter testing for classification performance on large e2e setups, so any findings would be useful for further improvements and for setting sensible defaults.
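For reference, here is a sketch of how the relevant policies section of config.yml could look when using the batch-ted-labels branch (label_batch_size is only available on that branch, and the values below are just the suggested starting point, not tested defaults):

```yaml
# Sketch only: label_batch_size exists on the batch-ted-labels branch, not on main.
policies:
  - name: TEDPolicy
    epochs: 50                        # placeholder value
    label_batch_size: 64              # negative label "candidates" sampled per batch (branch-only parameter)
    number_of_negative_examples: 20   # actual negatives used in the loss, drawn from those candidates
```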
Potentially related to #8734 and #8803.
To me, this looks like an issue that hasn't been updated, but perhaps some progress has actually been made behind the scenes? Until recently, the issue wasn't on anyone's project board. In any case, it doesn't feel like a sprint issue right now, particularly because it's unclear what should be done on it short-term, and even what the current status and problem are. However, someone more familiar with the issue might be able to update it so that these things are clear, and then it can be picked up...
I am closing this issue, as this particular problem was related to training a large dataset on CPU, which made it look like nothing was happening. When the OP switched to GPU, the problem didn't persist. However, other known e2e problems popped up, so I pointed them to the GPU memory fix branch and advised using --augmentation 0 when starting the training. I haven't heard back since then, but sent a follow-up today.
Rasa version: 2.5.0
Rasa SDK version (if used & relevant): 2.5.0
Python version: 3.7.4
Operating system (windows, osx, ...): Darwin-19.6.0-x86_64-i386-64bit
Issue: Trying to train 26,590 examples using end-to-end learning, but training does not start. It works with smaller datasets.
Error (including full traceback):
Command or request that led to error:
Content of configuration file (config.yml) (if relevant):
Content of domain file (domain.yml) (if relevant):