RasaHQ / rasa


Scheduled Model Regression Test Performance Drops #10415

Closed: rasabot closed this issue 2 years ago

rasabot commented 2 years ago
This issue was automatically created by the Scheduled Model Regression Test workflow. Check out the GitHub Action run here.
---
Description of Problem:
Some test performance scores decreased. Please look at the following tables for more details.
Dataset: Carbon Bot, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 1m56s, train: 4m14s, total: 6m9s | 0.7942 (0.00) | 0.7529 (0.00) | 0.5382 (0.00) |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 3m17s, train: 4m51s, total: 8m8s | 0.8078 (0.00) | 0.7787 (0.00) | 0.5298 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 1m58s, train: 5m1s, total: 6m58s | 0.7864 (0.00) | 0.7529 (0.00) | 0.5629 (0.01) |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 3m32s, train: 5m45s, total: 9m16s | 0.7806 (0.00) | 0.7880 (0.00) | 0.5960 (0.01) |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 56s, train: 2m47s, total: 3m42s | 0.7495 (0.01) | 0.7529 (0.00) | 0.5166 (0.01) |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 2m30s, train: 5m20s, total: 7m50s | 0.7398 (0.00) | 0.7022 (0.00) | 0.5166 (-0.02) |
Dataset: Hermit, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 3m15s, train: 21m37s, total: 24m51s | 0.8978 (-0.00) | 0.7504 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 3m22s, train: 26m51s, total: 30m13s | 0.8717 (0.00) | 0.7504 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 1m22s, train: 20m58s, total: 22m19s | 0.8309 (0.01) | 0.7504 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 2m57s, train: 16m58s, total: 19m55s | 0.8346 (0.00) | 0.7582 (-0.00) | no data |
Dataset: Private 1, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 2m28s, train: 4m7s, total: 6m35s | 0.9096 (0.00) | 0.9612 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 4m5s, train: 4m4s, total: 8m8s | 0.9148 (0.00) | 0.9717 (0.00) | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 50s, train: 3m11s, total: 4m1s | 0.8420 (0.00) | 0.9574 (0.00) | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 2m34s, train: 3m44s, total: 6m17s | 0.8524 (-0.00) | 0.9445 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 43s, train: 3m34s, total: 4m17s | 0.8950 (-0.00) | 0.9612 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 2m19s, train: 3m49s, total: 6m7s | 0.9044 (0.00) | 0.9680 (-0.00) | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 57s, train: 4m36s, total: 5m32s | 0.8877 (0.00) | 0.9574 (0.00) | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 2m35s, train: 4m39s, total: 7m13s | 0.8981 (0.00) | 0.9661 (-0.00) | no data |
Dataset: Private 2, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 2m25s, train: 11m12s, total: 13m37s | 0.8745 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 2m52s, train: 6m27s, total: 9m19s | 0.8830 (0.00) | no data | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 53s, train: 5m58s, total: 6m51s | 0.7253 (0.00) | no data | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 1m24s, train: 5m45s, total: 7m8s | 0.7833 (-0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 49s, train: 5m15s, total: 6m3s | 0.8573 (0.01) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 1m14s, train: 5m24s, total: 6m37s | 0.8530 (0.00) | no data | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 1m3s, train: 8m10s, total: 9m12s | 0.8562 (-0.00) | no data | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 1m28s, train: 7m0s, total: 8m28s | 0.8519 (0.00) | no data | no data |
Dataset: Private 3, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 1m5s, train: 1m8s, total: 2m12s | 0.9177 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 1m14s, train: 51s, total: 2m5s | 0.8436 (0.00) | no data | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 41s, train: 57s, total: 1m37s | 0.6173 (0.00) | no data | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 52s, train: 48s, total: 1m39s | 0.6255 (0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 33s, train: 1m2s, total: 1m35s | 0.8683 (0.00) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 45s, train: 47s, total: 1m31s | 0.8642 (0.00) | no data | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 42s, train: 1m19s, total: 2m1s | 0.8477 (0.00) | no data | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 53s, train: 59s, total: 1m51s | 0.8601 (0.00) | no data | no data |
Dataset: Sara, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 5m58s, train: 6m54s, total: 12m51s | 0.7174 (0.00) | 0.7949 (0.00) | 0.7981 (0.00) |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 8m16s, train: 5m52s, total: 14m8s | 0.7136 (0.00) | 0.7925 (0.00) | 0.7798 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 6m29s, train: 10m20s, total: 16m49s | 0.6996 (0.01) | 0.7949 (0.00) | 0.8109 (0.00) |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 8m51s, train: 8m15s, total: 17m6s | 0.7001 (-0.00) | 0.7878 (-0.00) | 0.8031 (0.01) |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 2m35s, train: 6m51s, total: 9m26s | 0.6644 (-0.00) | 0.7949 (0.00) | 0.7798 (0.00) |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 5m1s, train: 5m51s, total: 10m51s | 0.6794 (0.00) | 0.7692 (0.00) | 0.7829 (-0.01) |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.1266 (0.00) | 0.0000 (0.00) | 2m31s | 1m20s |
| Rules + AugMemo | 0.9149 (0.00) | 0.6301 (0.00) | 2m33s | 1m32s |
| Rules + AugMemo + TED | 0.9709 (-0.00) | 0.7432 (-0.01) | 46m23s | 7m29s |
| Rules + Memo | 0.3860 (0.00) | 0.1438 (0.00) | 2m29s | 1m26s |
| Rules + Memo + TED | 0.9476 (0.00) | 0.6370 (0.01) | 44m54s | 7m24s |
| Rules + TED | 0.9389 (-0.01) | 0.5342 (-0.10) | 45m48s | 7m18s |
Dataset: financial-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 52a3ad3eb5292d56542687e23b06703431f15ead, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 28s, train: 51s, total: 1m18s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 54s, train: 1m13s, total: 2m7s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 28s, train: 1m5s, total: 1m32s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 55s, train: 1m19s, total: 2m13s | 1.0000 (0.00) | 0.8800 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 19s, train: 43s, total: 1m2s | 0.9643 (0.00) | 0.8333 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 47s, train: 1m6s, total: 1m53s | 0.9643 (0.00) | 0.8800 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.7218 (0.00) | 0.5417 (0.00) | 19s | 11s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 20s | 12s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 7m4s | 50s |
| Rules + Memo | 0.9807 (0.00) | 0.9167 (0.00) | 21s | 12s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 7m8s | 51s |
| Rules + TED | 0.8856 (0.00) | 0.6875 (0.00) | 6m50s | 51s |
Dataset: helpdesk-assistant, Dataset repository branch: fix-model-regression-tests (external repository), commit: 62d7ae28585a32b2fbfceb4dd7ce233e53920238, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 20s, train: 45s, total: 1m5s | 1.0000 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 48s, train: 1m3s, total: 1m50s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 21s, train: 54s, total: 1m15s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 47s, train: 1m9s, total: 1m56s | 1.0000 (0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 16s, train: 38s, total: 53s | 1.0000 (0.00) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 43s, train: 1m0s, total: 1m42s | 1.0000 (0.00) | no data | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.5714 (0.00) | 0.2500 (0.00) | 10s | 9s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 11s | 9s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m3s | 27s |
| Rules + Memo | 0.9796 (0.00) | 0.9167 (0.00) | 11s | 9s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m0s | 27s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m1s | 26s |
Dataset: insurance-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 71b6ac6f9854f63e1c30b9c1ec744c9d9abf5a1a, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 21s, train: 39s, total: 59s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 45s, train: 59s, total: 1m43s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 22s, train: 44s, total: 1m5s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 49s, train: 1m3s, total: 1m51s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 16s, train: 32s, total: 48s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 43s, train: 54s, total: 1m37s | 1.0000 (0.00) | 1.0000 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.5909 (0.00) | 0.0000 (0.00) | 12s | 9s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 15s | 8s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 12m30s | 29s |
| Rules + Memo | 0.7600 (0.00) | 0.5000 (0.00) | 14s | 9s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 12m27s | 29s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 12m32s | 28s |
Dataset: retail-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 466128878a06638dbf4e53f1892617f86e611ef4, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 26s, train: 44s, total: 1m10s | 0.8387 (0.00) | 0.2857 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 53s, train: 1m1s, total: 1m54s | 0.8750 (0.00) | 0.2857 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 28s, train: 54s, total: 1m21s | 0.9375 (0.00) | 0.2857 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 54s, train: 1m7s, total: 2m1s | 0.8125 (0.00) | 0.2857 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 21s, train: 41s, total: 1m2s | 1.0000 (0.00) | 0.2857 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 49s, train: 1m0s, total: 1m49s | 0.9375 (0.00) | 0.2857 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.9531 (0.00) | 0.7778 (0.00) | 8s | 10s |
| Rules + AugMemo | 0.9692 (0.00) | 0.8889 (0.00) | 9s | 10s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 4m2s | 28s |
| Rules + Memo | 0.9692 (0.00) | 0.8889 (0.00) | 9s | 9s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 3m59s | 28s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 3m56s | 27s |

Definition of Done:

kedz commented 2 years ago

@kedz reviewer

koernerfelicia commented 2 years ago

Hm... on my first attempt to reproduce this, I wasn't able to:

2021-11-29 12:53:59 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-11-29 12:53:59 INFO     rasa.core.test  -  Correct:          187 / 292
2021-11-29 12:53:59 INFO     rasa.core.test  -  Accuracy:         0.640
koernerfelicia commented 2 years ago

Couldn't reproduce this here either, though the accuracy is slightly different from the one reported above (0.6233 (0.00))

Edit: Another re-run came out with 0.6404 (0.2)

koernerfelicia commented 2 years ago

I re-ran this (including re-training) locally twice, with the same results both times. I think we can rule this out for the CPU case -- it's worth noting I don't have a GPU, so maybe this is due to some GPU non-determinism. I may set up a GCP instance to try this theory out.

2021-11-30 13:48:28 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-11-30 13:48:28 INFO     rasa.core.test  -  Correct:          187 / 292
2021-11-30 13:48:28 INFO     rasa.core.test  -  Accuracy:         0.640
...
2021-11-30 14:08:20 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-11-30 14:08:20 INFO     rasa.core.test  -  Correct:          187 / 292
2021-11-30 14:08:20 INFO     rasa.core.test  -  Accuracy:         0.640
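
One way to probe that theory before blaming the hardware is to pin every seed and explicitly ask TensorFlow for deterministic kernels up front. A minimal sketch (not part of the runs above; `enable_op_determinism` only exists in newer TF releases, hence the guard):

```python
import os
import random

import numpy as np
import tensorflow as tf

# Request deterministic cuDNN/TF kernels (honoured by TF 2.1+); newer releases
# replace this env var with tf.config.experimental.enable_op_determinism().
os.environ["TF_DETERMINISTIC_OPS"] = "1"
if hasattr(tf.config.experimental, "enable_op_determinism"):
    tf.config.experimental.enable_op_determinism()

# Pin every seed so any remaining run-to-run variation points at the GPU.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
```

On the Rasa side, TEDPolicy also exposes a `random_seed` option in the policy configuration, which is worth pinning for the same reason.
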
koernerfelicia commented 2 years ago

Re-ran (including training) five times on a GCP instance with a GPU, and got the exact same conversation-level accuracy each time. Not sure what's going on here; I don't really know much about what affects GPU non-determinism.

I'm thinking we close this as "can't reproduce" and keep an eye on it? Not sure what to do about this becoming the new baseline... I guess it's no big deal, because we'd expect the next run to be back to normal. @kedz wdyt?

koernerfelicia commented 2 years ago

This is what I used to re-run -- maybe I'm missing something? It's overkill, but I deliberately put the models and results in separate directories to avoid any confusion about which model is loaded for which results.

for VAR in 1 2 3 4 5
do
    # start each run from a clean slate so nothing is reused from the local cache
    rm -rf .rasa
    rasa train core --force --config rules_ted.yml --stories Sara/train --domain Sara/migrated_domain.yml --out model_$VAR
    rasa test core --stories Sara/test --out results_$VAR --model model_$VAR
done

# power the instance down once all five runs have finished
sudo shutdown -h now
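
For comparing the conversation-level numbers across those runs, a sketch like the one below would do. It assumes each run's console output was also saved to a file, which the loop above doesn't do, so the `run_*.log` names are purely illustrative:

```python
# Sketch: pull the CONVERSATION-level "Correct: X / Y" line out of each saved
# run log (format as shown in the log excerpts above) and compare accuracy.
import re
from pathlib import Path

MARKER = "Evaluation Results on CONVERSATION level:"
PATTERN = re.compile(r"Correct:\s+(\d+)\s*/\s*(\d+)")

for log_file in sorted(Path(".").glob("run_*.log")):
    # only look at the text after the conversation-level marker, so the
    # action-level block earlier in the log is not picked up by mistake
    conv_block = log_file.read_text().split(MARKER, 1)[-1]
    match = PATTERN.search(conv_block)
    if match:
        correct, total = map(int, match.groups())
        print(f"{log_file.name}: {correct}/{total} = {correct / total:.4f}")
    else:
        print(f"{log_file.name}: no conversation-level result found")
```
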
koernerfelicia commented 2 years ago

So when I run the model regression tests (see here), there is some variation, but nothing as significant as this drop:

0.6233
0.6404
0.6267 
0.6301
0.6199

@kedz I'd like to close this and add it to the agenda for the retro meeting, as per the Definition of Done.
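
To put a number on "nothing as significant as this drop", here is the spread of those five runs next to the 0.10 drop from the report (plain Python on the values quoted above):

```python
# Spread of the five conversation-level accuracies quoted above, for comparison
# with the 0.10 drop flagged for Rules + TED on Sara in the report.
from statistics import mean, pstdev

runs = [0.6233, 0.6404, 0.6267, 0.6301, 0.6199]

print(f"mean  = {mean(runs):.4f}")
print(f"stdev = {pstdev(runs):.4f}")
print(f"range = {max(runs) - min(runs):.4f}")  # about 0.02
```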

kedz commented 2 years ago

> Re-ran (including training) five times on a GCP instance with a GPU, and got the exact same conversation-level accuracy each time. Not sure what's going on here; I don't really know much about what affects GPU non-determinism.

Mysterious. Was the action-level accuracy the same as well? It's pretty strange that there is no variation at all. Can you think of any explanation for that?

In either case I think closing this issue and making an announcement is good at this point.

koernerfelicia commented 2 years ago

🤦‍♀️ Turns out I wasn't actually using the GPU on the GCP instances. I should've figured; even with the new TF version you need to do some legwork to get everything working. Re-running now, this time with the GPU.

The good news is that this was a useful reminder to update the Notion page, since things have changed now that we've upgraded. Re-organised and updated here.
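
For future runs, a quick way to confirm the GPU is actually being picked up before starting a multi-hour job (a generic TF 2.x check, not specific to this setup):

```python
# Sanity check before a long run on a fresh instance: if the device list is
# empty, training silently falls back to the CPU.
import tensorflow as tf

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))
```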

koernerfelicia commented 2 years ago

Okay, re-ran these with actual GPU support now. There is slight variation in the conversation-level accuracy, but nothing to write home about; it corresponds to a range of 182-184 correct conversations (out of 292):

0.6232876712328768,
0.6267123287671232,
0.6301369863013698,
0.6232876712328768,
0.6232876712328768
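
As a quick cross-check, those fractions map back exactly onto whole conversations out of the 292 in the test set, which is where the 182-184 range comes from:

```python
# Map the reported accuracies back to correct-conversation counts out of 292.
accuracies = [
    0.6232876712328768,
    0.6267123287671232,
    0.6301369863013698,
    0.6232876712328768,
    0.6232876712328768,
]
for acc in accuracies:
    print(f"{acc:.4f} -> {round(acc * 292)} / 292 correct")  # 182, 183, 184, 182, 182
```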

Closing this and making an announcement. I've also made a follow-up issue for us to investigate reasonable ranges of fluctuation from GPU non-determinism, as I think it will be helpful for ourselves as well as for questions like this or issues like this.