RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0
18.92k stars 4.63k forks source link

Scheduled Model Regression Test Failed #8362

Closed github-actions[bot] closed 3 years ago

github-actions[bot] commented 3 years ago

This PR is automatically created by the Scheduled Model Regression Test workflow. Checkout the Github Action Run here.
---
Description of Problem:
Scheduled Model Regression Test failed.
Configuration: BERT + DIET(seq) + ResponseSelector(t2t)
Dataset: Private 2

Definition of done

dakshvar22 commented 3 years ago

@tttthomasssss Just for your reference, I had described here what could be the possible causes.

aeshky commented 3 years ago

Reproducing the error

Error type: GPU running out of memory when loading training data

I reproduced the error by doing the following:

  1. I created a virtual machine on google cloud with the following specs: n1-highmem-2 (2 vCPUs, 13 GB memory) with 1 x NVIDIA Tesla P4 to match our CI set up.

  2. I installed the latest rasa version==2.6.3

  3. I cloned the repository https://github.com/RasaHQ/training-data

  4. I ran the command: rasa train nlu -u <dataset> -c <config>

and got the error: tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[252,223,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Observations while investigating the error

Solution

I took the longest utterance in training.yml and translated it from German to English. I noticed that the utterance was duplicated three times creating one very long utterance that's 1374 characters long (probably too large to load into the GPU memory). I shortened the utterance by deleting the duplication. I then reran the training command on the virtual machine in (1) and it completed with no errors. I also ran the test: rasa test nlu -u <dataset> which also ran with no errors.

Testing the solution

After the test passed on google cloud, I ran it in our CI framework by submitting an empty pull request: https://github.com/RasaHQ/rasa/pull/8849 and the test passed.

tttthomasssss commented 3 years ago

~@samsucik assigned as reviewer.~ Not pull request to review.

aeshky commented 3 years ago

Original estimate: 4 effort point == full sprint. Actual time: one and a half day of work (excluding the time it took me to get setup on google cloud and read the CI framework documentation)