RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0
18.71k stars 4.61k forks

EmbeddingIntentClassifier behaviour not reproducible with DIETClassifier #9007

Closed aeshky closed 3 years ago

aeshky commented 3 years ago

Problem

The migration docs imply that the following configuration in Rasa 2.x:

- name: "DIETClassifier"
  epochs: 300 
  hidden_layers_sizes: 
    text: [256, 128]
  number_of_transformer_layers: 0
  weight_sparsity: 0
  intent_classification: True
  entity_recognition: False
  use_masked_language_model: False
  BILOU_flag: False
  random_seed: 1

should give you the same results as the following configuration in Rasa 1.x:

- name: "EmbeddingIntentClassifier"
  epochs: 300
  random_seed: 1

However, we have created a minimal example with moodbot that gives different results for the two configurations.

This is a high-priority issue that impacts a customer who has experienced a performance drop after migrating from Rasa 1.5.1 to 2.5.0 using the settings above.

Definition of Done

Determine whether the two configurations should produce identical results.

If the results should be identical:

If not:

aeshky commented 3 years ago

Using the example above:

It looks like that even for the same configuration we are unable to reproduce the results. The difference is not just floating-point precision error (for example, see versions 2.3, 2.5, and 2.7).

I shared the full results on this branch.
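To make the claim above concrete, here is a minimal sketch (not part of the original report; the confidence values and tolerance are invented) of how one could check whether per-intent confidences from two runs differ only by floating-point noise:

```python
import math

# Hypothetical intent confidences for the same test utterance,
# produced by the same config on two different Rasa versions.
run_a = {"greet": 0.9731, "goodbye": 0.0112, "mood_great": 0.0157}
run_b = {"greet": 0.8412, "goodbye": 0.0903, "mood_great": 0.0685}

def only_float_noise(a, b, tol=1e-6):
    """True if every confidence differs by no more than float-level noise."""
    return all(math.isclose(a[k], b[k], abs_tol=tol) for k in a)

# Differences of ~0.1 are far larger than precision error.
print(only_float_noise(run_a, run_b))  # prints False
```

A check like this (rather than eyeballing scores) is what distinguishes "same model, different numerics" from "actually different model behaviour".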

aeshky commented 3 years ago

Clarification regarding the discrepancy in results

We currently run regression tests on GPUs and only track the overall f1 score. Because it is difficult to reproduce behaviour on GPUs, we check that the numbers are roughly the same as previous runs. This is common for stochastic processes (especially where neural networks are involved) and seems like a perfectly reasonable setup.

Behaviour is more easily reproducible on CPUs (up to small floating-point errors); however, we don't run regression tests on CPUs (and it's not clear whether there is value in doing so). We also don't check the results of individual samples (e.g., model confidence for each intent). Therefore it is entirely possible to get variations like the example above (for the same config using different Rasa versions).

Some of the changes in the results of regression tests will be due to randomness, and others will be due to changes in third party libraries or changes that we make. Our regression tests ensure that there is no significant performance drop on benchmark datasets after any change.
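The tolerance-based regression check described above can be sketched as follows (a hypothetical illustration; the threshold and scores are invented, not the actual regression-test values):

```python
# Regression check: overall f1 must stay within a tolerance band of the
# previous accepted run, since GPU training is stochastic and not
# bit-reproducible across runs or library versions.
PREVIOUS_F1 = 0.952   # hypothetical score from the last accepted run
TOLERANCE = 0.01      # hypothetical allowed drift

def regression_ok(current_f1, previous_f1=PREVIOUS_F1, tol=TOLERANCE):
    """Flag only significant performance drops, not run-to-run noise."""
    return current_f1 >= previous_f1 - tol

print(regression_ok(0.948))  # small stochastic dip -> True
print(regression_ok(0.901))  # significant drop -> False
```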

aeshky commented 3 years ago

Training the two different configs on the same version (Rasa 1.10.0) gives similar, but not identical, results. See 1.10.0 EmbeddingIntentClassifier and 1.10.0 DIETClassifier.

aeshky commented 3 years ago

Update: I noticed that the default values for the arguments below are different for EmbeddingIntentClassifier vs. DIETClassifier. Setting them as shown below in the v2 config produces results that are more similar:

- name: "DIETClassifier"
  scale_loss: True
  use_sparse_input_dropout: False
  use_dense_input_dropout: False

See 1.10.0 EmbeddingIntentClassifier and 1.10.0 DIETClassifier. The outputs are still not identical (see here and here), so maybe there are other default parameters that we need to manually set.
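One way to surface any remaining mismatched defaults would be to diff the two components' default-parameter dictionaries. A minimal sketch, where the dictionaries are illustrative stand-ins built from the parameters named in this thread, not the components' real default sets:

```python
# Illustrative stand-ins for the two components' default parameters.
embedding_defaults = {
    "scale_loss": True,
    "use_sparse_input_dropout": False,
    "use_dense_input_dropout": False,
    "epochs": 300,
}
diet_defaults = {
    "scale_loss": False,
    "use_sparse_input_dropout": True,
    "use_dense_input_dropout": True,
    "epochs": 300,
}

def diff_defaults(a, b):
    """Return {param: (value_in_a, value_in_b)} for every differing default."""
    return {k: (a[k], b.get(k)) for k in a if a[k] != b.get(k)}

for param, (old, new) in diff_defaults(embedding_defaults, diet_defaults).items():
    print(f"{param}: EmbeddingIntentClassifier={old}  DIETClassifier={new}")
```

Running the same diff against the actual default dictionaries of both components would list exactly which parameters still need to be pinned in the v2 config.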

aeshky commented 3 years ago

Next steps

  1. Communicate the following to the customer: The two configurations above map to the same underlying model, and should give you the same overall results provided you are using the same rasa version. When comparing the performance of the two configs on two different rasa versions, you may notice a difference in performance. The reason is explained here.

  2. Come up with a new suitable configuration for them.

hsm207 commented 3 years ago

I've communicated to the customer to retrain their 2.5.0 model with the updated config for DIETClassifier.