Using the example above:
It looks like even for the same configuration we are unable to reproduce the results. The difference is not just floating point precision errors (see, for example, 2.3, 2.5, and 2.7).
I shared the full results on this branch.
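To make "not just floating point precision" concrete, a minimal sketch of the kind of comparison involved; the file names and JSON layout below are assumptions for illustration, not the actual files on the branch:

```python
# Sketch: flag per-example confidence differences that exceed what we would
# attribute to floating point noise. File names and layout are assumed.
import json

import numpy as np

FP_TOLERANCE = 1e-6  # differences below this look like precision noise


def load_confidences(path):
    """Map each example's text to its predicted intent confidence."""
    with open(path) as f:
        results = json.load(f)
    return {r["text"]: r["intent"]["confidence"] for r in results}


run_a = load_confidences("results_run_a.json")
run_b = load_confidences("results_run_b.json")

for text in sorted(run_a.keys() & run_b.keys()):
    if not np.isclose(run_a[text], run_b[text], atol=FP_TOLERANCE):
        print(f"{text!r}: {run_a[text]:.6f} vs {run_b[text]:.6f}")
```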
We currently run regression tests on GPUs and only track the overall f1 score. Because it is difficult to reproduce behaviour exactly on GPUs, we check that the numbers are roughly the same as in previous runs. This is common for stochastic processes (especially where neural networks are involved) and seems like a perfectly reasonable setup.
Behaviour is more easily reproducible on CPUs (up to small floating point errors); however, we don't run regression tests on CPUs (and it's not clear whether there is value in doing so). We also don't check the results of individual samples (e.g., the model's confidence for each intent). It is therefore entirely possible to get variations like the example above for the same config across different Rasa versions.
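For illustration, the "roughly the same" check amounts to something like the sketch below. The tolerance value is an assumption, not our actual threshold; the report keys follow the sklearn-style layout of the intent_report.json that Rasa writes:

```python
# Sketch of a tolerance-based regression check on the overall f1 score.
import json


def check_overall_f1(report_path, previous_f1, tolerance=0.02):
    """Fail if the weighted-average f1 drifted more than `tolerance`."""
    with open(report_path) as f:
        report = json.load(f)
    current_f1 = report["weighted avg"]["f1-score"]
    assert abs(current_f1 - previous_f1) <= tolerance, (
        f"f1 drifted from {previous_f1:.3f} to {current_f1:.3f}"
    )


check_overall_f1("intent_report.json", previous_f1=0.93)  # placeholder value
```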
Some of the changes in regression test results will be due to randomness, and others will be due to changes in third-party libraries or changes that we make. Our regression tests ensure that there is no significant performance drop on benchmark datasets after any change.
Training the two different configs on the same version (Rasa 1.10.0) gives similar, but not identical, results. See 1.10.0 EmbeddingIntentClassifier and 1.10.0 DIETClassifier.
Update:
I noticed that the default values for the arguments below differ between EmbeddingIntentClassifier and DIETClassifier. Setting them as shown below in the v2 config produces results that are more similar:
```yaml
- name: "DIETClassifier"
  scale_loss: True
  use_sparse_input_dropout: False
  use_dense_input_dropout: False
```
See 1.10.0 EmbeddingIntentClassifier and 1.10.0 DIETClassifier. The outputs are still not identical (see here and here), so there may be other default parameters that we need to set manually.
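One way to hunt for the remaining differences is to diff the two components' default parameter dicts; a sketch, assuming a Rasa 1.10.x environment where both classifiers are still importable:

```python
# Diff the default hyperparameters of the two components (Rasa 1.10.x paths).
from rasa.nlu.classifiers.diet_classifier import DIETClassifier
from rasa.nlu.classifiers.embedding_intent_classifier import (
    EmbeddingIntentClassifier,
)

diet_defaults = DIETClassifier.defaults
embedding_defaults = EmbeddingIntentClassifier.defaults

for key in sorted(diet_defaults.keys() | embedding_defaults.keys()):
    if diet_defaults.get(key) != embedding_defaults.get(key):
        print(
            f"{key}: DIET={diet_defaults.get(key)!r} "
            f"Embedding={embedding_defaults.get(key)!r}"
        )
```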
- Communicate the following to the customer: the two configurations above map to the same underlying model and should give you the same overall results, provided you are using the same Rasa version. When comparing the performance of the two configs across two different Rasa versions, you may notice a difference in performance; the reason is explained here.
- Come up with a suitable new configuration for them.
I've asked the customer to retrain their 2.5.0 model with the updated DIETClassifier config.
Problem
The migration docs imply that the following configuration in Rasa 2.x:
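(The snippet itself is missing from this copy of the issue; based on the Rasa 2.x migration guide, it was presumably roughly the following, where all names and values come from the guide rather than this thread:)

```yaml
pipeline:
  # - ... other components
  - name: DIETClassifier
    hidden_layers_sizes:
      text: [256, 128]
    number_of_transformer_layers: 0
    weight_sparsity: 0
    intent_classification: True
    entity_recognition: False
    use_masked_language_model: False
    BILOU_flag: False
```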
should give you the same results as the following configurations in Rasa 1.x:
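(Also missing from this copy; presumably at least a pipeline using EmbeddingIntentClassifier with its default parameters, roughly:)

```yaml
pipeline:
  # - ... other components
  - name: EmbeddingIntentClassifier
```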
However, we have put together a minimal example with moodbot that gives different results for the two configurations.
This is a high-priority issue that impacts a customer who has experienced a performance drop after migrating from Rasa 1.5.1 to 2.5.0 using the above settings.
Definition of Done
- Determine whether the two configurations should produce identical results.
  - If not: a comment on this issue explaining why.
  - If the results should be identical: investigate the difference. If it's a bug, create an issue for the fix.
- Decide whether this needs a regression test.
  - If yes, create an issue for the test.
  - If not: no regression test needed.