RasaHQ / rasa


Scheduled Model Regression Test Performance Drops #10415

Closed: rasabot closed this issue 2 years ago

rasabot commented 2 years ago
This issue was automatically created by the Scheduled Model Regression Test workflow. Check out the GitHub Action run here.
---
Description of Problem:
Some test performance scores decreased. Please look at the following tables for more details.
Dataset: Carbon Bot, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 1m56s, train: 4m14s, total: 6m9s | 0.7942 (0.00) | 0.7529 (0.00) | 0.5382 (0.00) |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 3m17s, train: 4m51s, total: 8m8s | 0.8078 (0.00) | 0.7787 (0.00) | 0.5298 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 1m58s, train: 5m1s, total: 6m58s | 0.7864 (0.00) | 0.7529 (0.00) | 0.5629 (0.01) |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 3m32s, train: 5m45s, total: 9m16s | 0.7806 (0.00) | 0.7880 (0.00) | 0.5960 (0.01) |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 56s, train: 2m47s, total: 3m42s | 0.7495 (0.01) | 0.7529 (0.00) | 0.5166 (0.01) |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 2m30s, train: 5m20s, total: 7m50s | 0.7398 (0.00) | 0.7022 (0.00) | 0.5166 (-0.02) |
Dataset: Hermit, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 3m15s, train: 21m37s, total: 24m51s | 0.8978 (-0.00) | 0.7504 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 3m22s, train: 26m51s, total: 30m13s | 0.8717 (0.00) | 0.7504 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 1m22s, train: 20m58s, total: 22m19s | 0.8309 (0.01) | 0.7504 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 2m57s, train: 16m58s, total: 19m55s | 0.8346 (0.00) | 0.7582 (-0.00) | no data |
Dataset: Private 1, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 2m28s, train: 4m7s, total: 6m35s | 0.9096 (0.00) | 0.9612 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 4m5s, train: 4m4s, total: 8m8s | 0.9148 (0.00) | 0.9717 (0.00) | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 50s, train: 3m11s, total: 4m1s | 0.8420 (0.00) | 0.9574 (0.00) | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 2m34s, train: 3m44s, total: 6m17s | 0.8524 (-0.00) | 0.9445 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 43s, train: 3m34s, total: 4m17s | 0.8950 (-0.00) | 0.9612 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 2m19s, train: 3m49s, total: 6m7s | 0.9044 (0.00) | 0.9680 (-0.00) | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 57s, train: 4m36s, total: 5m32s | 0.8877 (0.00) | 0.9574 (0.00) | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 2m35s, train: 4m39s, total: 7m13s | 0.8981 (0.00) | 0.9661 (-0.00) | no data |
Dataset: Private 2, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 2m25s, train: 11m12s, total: 13m37s | 0.8745 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 2m52s, train: 6m27s, total: 9m19s | 0.8830 (0.00) | no data | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 53s, train: 5m58s, total: 6m51s | 0.7253 (0.00) | no data | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 1m24s, train: 5m45s, total: 7m8s | 0.7833 (-0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 49s, train: 5m15s, total: 6m3s | 0.8573 (0.01) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 1m14s, train: 5m24s, total: 6m37s | 0.8530 (0.00) | no data | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 1m3s, train: 8m10s, total: 9m12s | 0.8562 (-0.00) | no data | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 1m28s, train: 7m0s, total: 8m28s | 0.8519 (0.00) | no data | no data |
Dataset: Private 3, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 1m5s, train: 1m8s, total: 2m12s | 0.9177 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 1m14s, train: 51s, total: 2m5s | 0.8436 (0.00) | no data | no data |
| Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 41s, train: 57s, total: 1m37s | 0.6173 (0.00) | no data | no data |
| Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 52s, train: 48s, total: 1m39s | 0.6255 (0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 33s, train: 1m2s, total: 1m35s | 0.8683 (0.00) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 45s, train: 47s, total: 1m31s | 0.8642 (0.00) | no data | no data |
| Sparse + Spacy + DIET(bow) + ResponseSelector(bow)<br>test: 42s, train: 1m19s, total: 2m1s | 0.8477 (0.00) | no data | no data |
| Sparse + Spacy + DIET(seq) + ResponseSelector(t2t)<br>test: 53s, train: 59s, total: 1m51s | 0.8601 (0.00) | no data | no data |
Dataset: Sara, Dataset repository branch: main, commit: 819cb7b3cc077753e67178ad022d577f164e99cf

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 5m58s, train: 6m54s, total: 12m51s | 0.7174 (0.00) | 0.7949 (0.00) | 0.7981 (0.00) |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 8m16s, train: 5m52s, total: 14m8s | 0.7136 (0.00) | 0.7925 (0.00) | 0.7798 (0.00) |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 6m29s, train: 10m20s, total: 16m49s | 0.6996 (0.01) | 0.7949 (0.00) | 0.8109 (0.00) |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 8m51s, train: 8m15s, total: 17m6s | 0.7001 (-0.00) | 0.7878 (-0.00) | 0.8031 (0.01) |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 2m35s, train: 6m51s, total: 9m26s | 0.6644 (-0.00) | 0.7949 (0.00) | 0.7798 (0.00) |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 5m1s, train: 5m51s, total: 10m51s | 0.6794 (0.00) | 0.7692 (0.00) | 0.7829 (-0.01) |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.1266 (0.00) | 0.0000 (0.00) | 2m31s | 1m20s |
| Rules + AugMemo | 0.9149 (0.00) | 0.6301 (0.00) | 2m33s | 1m32s |
| Rules + AugMemo + TED | 0.9709 (-0.00) | 0.7432 (-0.01) | 46m23s | 7m29s |
| Rules + Memo | 0.3860 (0.00) | 0.1438 (0.00) | 2m29s | 1m26s |
| Rules + Memo + TED | 0.9476 (0.00) | 0.6370 (0.01) | 44m54s | 7m24s |
| Rules + TED | 0.9389 (-0.01) | 0.5342 (-0.10) | 45m48s | 7m18s |
Dataset: financial-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 52a3ad3eb5292d56542687e23b06703431f15ead, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 28s, train: 51s, total: 1m18s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 54s, train: 1m13s, total: 2m7s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 28s, train: 1m5s, total: 1m32s | 1.0000 (0.00) | 0.8333 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 55s, train: 1m19s, total: 2m13s | 1.0000 (0.00) | 0.8800 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 19s, train: 43s, total: 1m2s | 0.9643 (0.00) | 0.8333 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 47s, train: 1m6s, total: 1m53s | 0.9643 (0.00) | 0.8800 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.7218 (0.00) | 0.5417 (0.00) | 19s | 11s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 20s | 12s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 7m4s | 50s |
| Rules + Memo | 0.9807 (0.00) | 0.9167 (0.00) | 21s | 12s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 7m8s | 51s |
| Rules + TED | 0.8856 (0.00) | 0.6875 (0.00) | 6m50s | 51s |
Dataset: helpdesk-assistant, Dataset repository branch: fix-model-regression-tests (external repository), commit: 62d7ae28585a32b2fbfceb4dd7ce233e53920238, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 20s, train: 45s, total: 1m5s | 1.0000 (0.00) | no data | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 48s, train: 1m3s, total: 1m50s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 21s, train: 54s, total: 1m15s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 47s, train: 1m9s, total: 1m56s | 1.0000 (0.00) | no data | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 16s, train: 38s, total: 53s | 1.0000 (0.00) | no data | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 43s, train: 1m0s, total: 1m42s | 1.0000 (0.00) | no data | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.5714 (0.00) | 0.2500 (0.00) | 10s | 9s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 11s | 9s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m3s | 27s |
| Rules + Memo | 0.9796 (0.00) | 0.9167 (0.00) | 11s | 9s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m0s | 27s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 6m1s | 26s |
Dataset: insurance-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 71b6ac6f9854f63e1c30b9c1ec744c9d9abf5a1a, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 21s, train: 39s, total: 59s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 45s, train: 59s, total: 1m43s | 1.0000 (0.00) | no data | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 22s, train: 44s, total: 1m5s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 49s, train: 1m3s, total: 1m51s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 16s, train: 32s, total: 48s | 1.0000 (0.00) | 1.0000 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 43s, train: 54s, total: 1m37s | 1.0000 (0.00) | 1.0000 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.5909 (0.00) | 0.0000 (0.00) | 12s | 9s |
| Rules + AugMemo | 1.0000 (0.00) | 1.0000 (0.00) | 15s | 8s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 12m30s | 29s |
| Rules + Memo | 0.7600 (0.00) | 0.5000 (0.00) | 14s | 9s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 12m27s | 29s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 12m32s | 28s |
Dataset: retail-demo, Dataset repository branch: fix-model-regression-tests (external repository), commit: 466128878a06638dbf4e53f1892617f86e611ef4, Configuration repository branch: main

| Configuration | Intent Classification Micro F1 | Entity Recognition Micro F1 | Response Selection Micro F1 |
|---|---|---|---|
| BERT + DIET(bow) + ResponseSelector(bow)<br>test: 26s, train: 44s, total: 1m10s | 0.8387 (0.00) | 0.2857 (0.00) | no data |
| BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 53s, train: 1m1s, total: 1m54s | 0.8750 (0.00) | 0.2857 (0.00) | no data |
| Sparse + BERT + DIET(bow) + ResponseSelector(bow)<br>test: 28s, train: 54s, total: 1m21s | 0.9375 (0.00) | 0.2857 (0.00) | no data |
| Sparse + BERT + DIET(seq) + ResponseSelector(t2t)<br>test: 54s, train: 1m7s, total: 2m1s | 0.8125 (0.00) | 0.2857 (0.00) | no data |
| Sparse + DIET(bow) + ResponseSelector(bow)<br>test: 21s, train: 41s, total: 1m2s | 1.0000 (0.00) | 0.2857 (0.00) | no data |
| Sparse + DIET(seq) + ResponseSelector(t2t)<br>test: 49s, train: 1m0s, total: 1m49s | 0.9375 (0.00) | 0.2857 (0.00) | no data |

| Dialog Policy Configuration | Action Level Micro Avg. F1 | Conversation Level Accuracy | Run Time Train | Run Time Test |
|---|---|---|---|---|
| Rules | 0.9531 (0.00) | 0.7778 (0.00) | 8s | 10s |
| Rules + AugMemo | 0.9692 (0.00) | 0.8889 (0.00) | 9s | 10s |
| Rules + AugMemo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 4m2s | 28s |
| Rules + Memo | 0.9692 (0.00) | 0.8889 (0.00) | 9s | 9s |
| Rules + Memo + TED | 1.0000 (0.00) | 1.0000 (0.00) | 3m59s | 28s |
| Rules + TED | 1.0000 (0.00) | 1.0000 (0.00) | 3m56s | 27s |

Definition of Done:

kedz commented 2 years ago

@kedz reviewer

koernerfelicia commented 2 years ago

Hm... on my first attempt to reproduce this, I wasn't able to:

2021-11-29 12:53:59 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-11-29 12:53:59 INFO     rasa.core.test  -  Correct:          187 / 292
2021-11-29 12:53:59 INFO     rasa.core.test  -  Accuracy:         0.640
koernerfelicia commented 2 years ago

Couldn't reproduce this here either, though the accuracy is slightly different from the one reported above (0.6233 (0.00))

Edit: Another re-run came out with 0.6404 (0.2)

koernerfelicia commented 2 years ago

I re-ran this (including re-training) locally twice, with the same results both times. I think we can rule this out for the CPU case -- it's worth noting I don't have a GPU, so maybe this is due to some GPU non-determinism. I may set up a GCP instance to try this theory out.

2021-11-30 13:48:28 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-11-30 13:48:28 INFO     rasa.core.test  -  Correct:          187 / 292
2021-11-30 13:48:28 INFO     rasa.core.test  -  Accuracy:         0.640
...
2021-11-30 14:08:20 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2021-11-30 14:08:20 INFO     rasa.core.test  -  Correct:          187 / 292
2021-11-30 14:08:20 INFO     rasa.core.test  -  Accuracy:         0.640
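
One way to probe that theory before blaming the hardware is to pin every seed and explicitly ask TensorFlow for deterministic kernels up front. A minimal sketch (not part of the runs above; `enable_op_determinism` only exists in newer TF releases, hence the guard):

```python
import os
import random

import numpy as np
import tensorflow as tf

# Request deterministic cuDNN/TF kernels (honoured by TF 2.1+); newer releases
# replace this env var with tf.config.experimental.enable_op_determinism().
os.environ["TF_DETERMINISTIC_OPS"] = "1"
if hasattr(tf.config.experimental, "enable_op_determinism"):
    tf.config.experimental.enable_op_determinism()

# Pin every seed so any remaining run-to-run variation points at the GPU.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
```

On the Rasa side, TEDPolicy also exposes a `random_seed` option in the policy configuration, which is worth pinning for the same reason.
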
koernerfelicia commented 2 years ago

Re-ran (including training) five times on a GCP instance with a GPU, and got the exact same conversation-level accuracy each time. Not sure what's going on here; I don't really know much about what affects GPU non-determinism.

I'm thinking we close this as "can't reproduce" and keep an eye on it? Not sure what to do about this becoming the new baseline... I guess it's no big deal, because we'd expect the next run to be back to normal. @kedz wdyt?

koernerfelicia commented 2 years ago

This is what I used to re-run -- maybe I'm missing something? It's overkill, but I deliberately put the models and results in separate directories to avoid any confusion about which model is loaded for which results.

for VAR in 1 2 3 4 5
do
    # start each run from a clean slate so nothing is reused from the local cache
    rm -rf .rasa
    rasa train core --force --config rules_ted.yml --stories Sara/train --domain Sara/migrated_domain.yml --out model_$VAR
    rasa test core --stories Sara/test --out results_$VAR --model model_$VAR
done

# power the instance down once all five runs have finished
sudo shutdown -h now
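
For comparing the conversation-level numbers across those runs, a sketch like the one below would do. It assumes each run's console output was also saved to a file, which the loop above doesn't do, so the `run_*.log` names are purely illustrative:

```python
# Sketch: pull the CONVERSATION-level "Correct: X / Y" line out of each saved
# run log (format as shown in the log excerpts above) and compare accuracy.
import re
from pathlib import Path

MARKER = "Evaluation Results on CONVERSATION level:"
PATTERN = re.compile(r"Correct:\s+(\d+)\s*/\s*(\d+)")

for log_file in sorted(Path(".").glob("run_*.log")):
    # only look at the text after the conversation-level marker, so the
    # action-level block earlier in the log is not picked up by mistake
    conv_block = log_file.read_text().split(MARKER, 1)[-1]
    match = PATTERN.search(conv_block)
    if match:
        correct, total = map(int, match.groups())
        print(f"{log_file.name}: {correct}/{total} = {correct / total:.4f}")
    else:
        print(f"{log_file.name}: no conversation-level result found")
```
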
koernerfelicia commented 2 years ago

So when I run the model regression tests (see here), there is some variation, but nothing as significant as this drop:

0.6233
0.6404
0.6267 
0.6301
0.6199

@kedz I'd like to close this and add it to the agenda for the retro meeting, as per the Definition of Done.
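
To put a number on "nothing as significant as this drop", here is the spread of those five runs next to the 0.10 drop from the report (plain Python on the values quoted above):

```python
# Spread of the five conversation-level accuracies quoted above, for comparison
# with the 0.10 drop flagged for Rules + TED on Sara in the report.
from statistics import mean, pstdev

runs = [0.6233, 0.6404, 0.6267, 0.6301, 0.6199]

print(f"mean  = {mean(runs):.4f}")
print(f"stdev = {pstdev(runs):.4f}")
print(f"range = {max(runs) - min(runs):.4f}")  # about 0.02
```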

kedz commented 2 years ago

> Re-ran (including training) five times on a GCP instance with a GPU, and got the exact same conversation-level accuracy each time. Not sure what's going on here; I don't really know much about what affects GPU non-determinism.

Mysterious. Was the action-level accuracy the same as well? It's pretty strange that there is no variation at all. Can you think of any explanation for that?

In either case I think closing this issue and making an announcement is good at this point.

koernerfelicia commented 2 years ago

🤦‍♀️ Turns out I wasn't actually using the GPU on the GCP instances. I should've figured; even with the new TF version you need to do some legwork to get everything working. Re-running now, this time with the GPU.

The good news is that this was a useful reminder to update the Notion page, since things have changed now that we've upgraded. Re-organised and updated here.
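
For future runs, a quick way to confirm the GPU is actually being picked up before starting a multi-hour job (a generic TF 2.x check, not specific to this setup):

```python
# Sanity check before a long run on a fresh instance: if the device list is
# empty, training silently falls back to the CPU.
import tensorflow as tf

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))
```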

koernerfelicia commented 2 years ago

Okay, re-ran these with actual GPU support now. There is slight variation in the conversation-level accuracy, but nothing to write home about; it corresponds to a range of 182-184 correct conversations (out of 292):

0.6232876712328768,
0.6267123287671232,
0.6301369863013698,
0.6232876712328768,
0.6232876712328768
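
As a quick cross-check, those fractions map back exactly onto whole conversations out of the 292 in the test set, which is where the 182-184 range comes from:

```python
# Map the reported accuracies back to correct-conversation counts out of 292.
accuracies = [
    0.6232876712328768,
    0.6267123287671232,
    0.6301369863013698,
    0.6232876712328768,
    0.6232876712328768,
]
for acc in accuracies:
    print(f"{acc:.4f} -> {round(acc * 292)} / 292 correct")  # 182, 183, 184, 182, 182
```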

Closing this and making an announcement. I've also made a follow-up issue for us to investigate reasonable ranges of fluctuation from GPU non-determinism, as I think it will be helpful for ourselves as well as for questions like this or issues like this.