Hi @sibbsnb, I just sent you an email! (I'll paste it below for reference.)
Now that I see the log, I wonder whether something else is happening here, since the epochs have already started. My best guess now is the following: since you are training on macOS, you are not using GPU training, right? I think that might be the actual issue right now; the training is simply very compute-intensive. However, would you mind trying out the branch batch-ted-labels as described in my email? Although that change was mainly aimed at GPU memory, it also speeds up training.
Mail:
TL;DR: I think e2e training performance will be much improved three weeks from now, but you can already use some of these improvements on dev branches.
Here's how:
The problem you have been describing with the training hanging is related to pre-processing speed. I assume what you are seeing is that after Processed trackers: 100% nothing happens. Is that right? In that case it is not actually hanging but is featurizing the dialogue trackers. This step can take a while, especially when using a slower featurizer such as LanguageModelFeaturizer. You can verify it is not hanging by checking that main memory is steadily increasing. You can gain some speed here by slimming down the featurizers. However, we are currently working on getting rid of this pre-processing as a separate step altogether and running it in parallel during training using multiple data-preparation workers.
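For illustration, here is a minimal config.yml pipeline sketch for Rasa 2.x with a slimmed-down featurizer setup (sparse CountVectorsFeaturizer instead of LanguageModelFeaturizer); treat the exact components and values as placeholders for your own project:

```yaml
# Illustrative Rasa 2.x pipeline with lighter featurizers.
# Dropping LanguageModelFeaturizer in favour of sparse featurizers trades some
# accuracy for much faster featurization of the dialogue trackers.
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer             # word-level bag-of-words features
  - name: CountVectorsFeaturizer
    analyzer: char_wb                        # character n-grams, more robust to typos
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
```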
During the actual training, there is also a GPU memory issue on the main branch that we have already fixed on the branch batch-ted-labels. So if you want to train with many stories right now, you will definitely need this branch. It will prevent out-of-memory errors during training and speed up training too. To use the change on this branch you will also have to set the parameter "label_batch_size". This parameter sets the number of negative label "candidates" during training (sampled from all the possible labels, which tend to number in the thousands in a large e2e setup). It is related to number_of_negative_examples, which sets the number of actual negative examples used in the loss (sampled from the candidates that were drawn beforehand). In essence, label_batch_size should be larger than number_of_negative_examples. A good starting point might be label_batch_size = 64 and number_of_negative_examples = 20; see the sketch below. We haven't done extensive hyperparameter testing for classification performance on large e2e setups, so any findings would be useful for further improvements and for setting sensible defaults.
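For reference, here is a sketch of how the relevant policies section of config.yml could look when using the batch-ted-labels branch (label_batch_size is only available on that branch, and the values below are just the suggested starting point, not tested defaults):

```yaml
# Sketch only: label_batch_size exists on the batch-ted-labels branch, not on main.
policies:
  - name: TEDPolicy
    epochs: 50                        # placeholder value
    label_batch_size: 64              # negative label "candidates" sampled per batch (branch-only parameter)
    number_of_negative_examples: 20   # actual negatives used in the loss, drawn from those candidates
```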
Potentially related to #8734 and #8803.
To me, this looks like an issue that hasn't been updated, but perhaps some progress has actually been made behind the scenes? Until recently, the issue wasn't on anyone's project board. In any case, it doesn't feel like a sprint issue right now, particularly because it's unclear what should be done on it short-term, and even what the current status and problem are. However, someone more familiar with the issue might be able to update it so that these things are clear, and then it can be picked up...
I am closing this issue, as this particular problem was related to training a large dataset on CPU, which made it look like nothing was happening. When the OP switched to GPU, the problem didn't persist. However, other known e2e problems popped up, so I pointed them to the GPU memory fix branch and advised using --augmentation 0 when starting the training. I haven't heard back since then, but sent a follow-up today.
Rasa version: 2.5.0
Rasa SDK version (if used & relevant): 2.5.0
Python version: 3.7.4
Operating system (windows, osx, ...): Darwin-19.6.0-x86_64-i386-64bit
Issue: Trying to train 26,590 examples using end-to-end learning, but training does not start. It works with smaller datasets.
Error (including full traceback):
Command or request that led to error:
Content of configuration file (config.yml) (if relevant):
Content of domain file (domain.yml) (if relevant):