RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

`rasa test` to include action server #6796

Closed ArjaanBuijk closed 1 year ago

ArjaanBuijk commented 3 years ago

Description of Problem: How do I test that my bot is working?

Currently, `rasa test` does not call the action server, so the custom action code is never actually executed as part of the test run.

This means that even when `rasa test` shows that all the e2e stories pass, that is no guarantee that the bot is functional. One has to test all the e2e stories manually, which is a lot of work and slows down the CDD iterations.

Doing tests with the action server included is especially important during the CDD Fix step, or when upgrading a bot from Rasa OSS 1 to Rasa OSS 2.

A side benefit of including the action server during `rasa test` is that it would provide a nice method for debugging custom actions and a declarative mechanism to create full-coverage tests of the custom actions. (https://github.com/RasaHQ/rasa/issues/4212#issuecomment-698888514)

Overview of the Solution: rasa test calls the action server.

Blockers (if relevant): It is not a blocker, because it is standard testing practice. But the user must be careful that the action server is running in some kind of local or testing mode, e.g. during testing it should not actually update a production database or call external services.
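For illustration, a minimal sketch of such a guard inside a custom action, using the rasa_sdk `Action` interface; the action name, slot, backend call, and `BOT_ENV` variable are made up for the example and are not part of Rasa:

```python
import os
from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.events import SlotSet
from rasa_sdk.executor import CollectingDispatcher


class ActionSaveOrder(Action):
    """Hypothetical custom action: writes to a real backend in production,
    but stays side-effect free when the action server runs in test mode."""

    def name(self) -> Text:
        return "action_save_order"

    def run(
        self,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> List[Dict[Text, Any]]:
        if os.environ.get("BOT_ENV") == "test":
            # Test mode: skip the real backend entirely.
            order_id = "test-order"
        else:
            order_id = self._write_order_to_backend(tracker)
        # The returned events are what an action-server-aware `rasa test`
        # could compare against the test story.
        return [SlotSet("order_id", order_id)]

    def _write_order_to_backend(self, tracker: Tracker) -> Text:
        raise NotImplementedError("Replace with the real database/API call.")
```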

Slack thread

TyDunn commented 3 years ago

Anything you would add @akelad? We have discussed this in the past

TyDunn commented 3 years ago

Context behind past decision to not execute actions in test stories: https://forum.rasa.com/t/snapshot-based-testing-with-rasa/13318

akelad commented 3 years ago

not much to add to this, but we have had customers bring this up a few times in the past year.

Having a mock action server for testing is, I believe, common in the enterprise anyway; otherwise you would be modifying "real world things" with your manual tests as well. So I feel like allowing `rasa test` to include the action server again could make sense.

TyDunn commented 3 years ago

I really think we should make running the actions a possibility, especially given another data point that Ella added to the Slack thread yesterday. I'm adding this to the Rasa Open Source 2.1 milestone to push this discussion more

twerkmeister commented 3 years ago

So I've looked a bit into this issue. I think this is actually quite a big architectural and dev experience decision and I would like to discuss this a bit more to make sure we are on a good path forward.

If I understand correctly, the proposal is that when using `rasa test` the action server should run alongside, so that the custom action code is actually executed, `slot_was_set` steps can be validated against real events, and the test stories double as test cases for the custom actions. As a result, it would be easier to test that the actions are actually doing what they are supposed to do.

In the following I am comparing the unit-testing approach that we currently advise in the docs (yellow box at the end of this section) to this approach. While we have that hint in the Rasa docs, there is no section on testing actions in the action server docs.

⚠️

Unit Testing

Pros
  1. Extensive test capabilities beyond simple equality for values. slot_was_set would currently just amount to a direct equality check
  2. Simpler to test actions for multiple different values
  3. Fosters configurability and testability of actions from the get-go, as opposed to making it an afterthought
  4. Separating testing of the machine learning model from the manually written code leads to faster turnaround time for developing actions
  5. Testing different environments for actions is straightforward (e.g. working service, failing service, slow service, ...)
Cons
  1. Seems difficult/cumbersome to manually write realistic tracker states as inputs for unit tests

Action server in rasa test

Pros
  1. some test cases are already there and we can use them
Cons
  1. Test capabilities are limited; developing additional capabilities might be quite complex (building our own testing framework). As stated above, slot_was_set would currently just amount to a direct equality check
  2. Tight coupling of model and action testing - I am thinking, if you want to debug an action, you might want to do so for specific test cases - this means we would ideally also have a mechanism to single out specific test stories from the set instead of running all of them. This is already easily doable in unit testing - just pick your test case - but it needs additional deliberation and work in the conversation setting.
  3. I think even with this capability in place, you might still want to unit test your actions for situations when background services fail etc. You could do this here, but then you would need to specify environment configs alongside the stories.

In my opinion, we should go with the unit testing approach and make it easier to understand and use.
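For concreteness, a unit test along those lines might look roughly like this. It reuses the hypothetical `ActionSaveOrder` and `BOT_ENV` flag sketched in the issue description above; the module path and slot names are placeholders, and the hand-built tracker assumes the rasa_sdk 2.x `Tracker` constructor, which is exactly the "cumbersome tracker state" drawback listed above:

```python
from typing import Any, Dict, Text

from rasa_sdk import Tracker
from rasa_sdk.executor import CollectingDispatcher

# Hypothetical module path; adjust to wherever the action lives.
from actions.actions import ActionSaveOrder


def make_tracker(slots: Dict[Text, Any]) -> Tracker:
    """Hand-build a minimal tracker state (rasa_sdk 2.x constructor)."""
    return Tracker(
        sender_id="test_user",
        slots=slots,
        latest_message={"intent": {"name": "place_order"}, "entities": []},
        events=[],
        paused=False,
        followup_action=None,
        active_loop={},
        latest_action_name="action_listen",
    )


def test_action_save_order_sets_order_id(monkeypatch):
    monkeypatch.setenv("BOT_ENV", "test")  # keep the action side-effect free
    dispatcher = CollectingDispatcher()
    tracker = make_tracker({"order_id": None})

    events = ActionSaveOrder().run(dispatcher, tracker, domain={})

    # Plain asserts allow richer checks than a single slot_was_set equality.
    slot_events = [e for e in events if e.get("event") == "slot"]
    assert any(e["name"] == "order_id" for e in slot_events)
```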

What do you think @ArjaanBuijk, @koaning, @TyDunn? Is this an accurate overview of the choices and their merits? Did I miss something? Please discuss.

TyDunn commented 3 years ago

From a high level, this seems like a fair overview at first glance. Curious to also hear what @wochinge thinks, since he spent some time in the code last Friday, as well as @akelad and the others you tagged

wochinge commented 3 years ago

Thanks for writing this detailed overview!

💯

In my opinion we need both. I see conversation tests as integration tests. From that perspective it makes sense to include the action server in the conversation tests, as you want to ensure the behavior of all components combined. It's less about checking certain slot values, in my opinion, and more about whether the conversation flows as expected (e.g. that the custom action isn't causing the bot to say goodbye after the user said hi). This is of course also possible with unit testing, but I think it's fragile.

Considering a team workflow where the conversation designer comes up with the conversations and the developers implement them, I'd expect the conversation designer also to be the one who is acceptance testing the assistant's behavior. Ideally they can define a definition of done ("stories of done" 😆) right at the beginning which the developers can use to develop the bot against. Although this makes me cringe as a developer, I'd argue that unit tests are less of a priority / have less value in this setting.

> Tight coupling of model and action testing

Interesting point. I think actions which set featurized conversation state are somewhat part of the model 🤔

Slightly off topic: your `slot_was_set` example touches a very interesting point. As far as I know our conversation tests do not test slot values at the moment. In my opinion this is less interesting for featurized slots, as they will influence the conversation flow anyway, but it might be interesting to check for unfeaturized slot values (e.g. whether a form extracts slots as expected).

akelad commented 3 years ago

+1 to Tobias' opinion, for me the full conversation tests are more important, i.e. the action server in rasa test option. Every customer we've ever worked with always needs to test the full flow, including the execution of the custom actions. Something we should consider though, is that you might not want to run your "prod custom action" because that might be modifying real world data. So you'd have some sort of mock custom action instead - not actually sure how our customers handle this, though i'd imagine they just have slight variations of custom actions in their dev/qa/prod envs?

twerkmeister commented 3 years ago

Thanks for the food for thought! Lots of new info here for me; I will look into these things.

wochinge commented 3 years ago

> So you'd have some sort of mock custom action instead - not actually sure how our customers handle this, though i'd imagine they just have slight variations of custom actions in their dev/qa/prod envs?

Up to them to create a proper integration test environment, no? If their custom actions need a database, then their integration tests need a test database 🤷🏻

akelad commented 3 years ago

Yeah, I was kind of thinking aloud and came to this conclusion in the end as well.

TyDunn commented 3 years ago

> Something we should consider though, is that you might not want to run your "prod custom action" because that might be modifying real world data. So you'd have some sort of mock custom action instead - not actually sure how our customers handle this, though i'd imagine they just have slight variations of custom actions in their dev/qa/prod envs?

@erohmensing @ArjaanBuijk You mentioned some customers that you work with already run actions as part of whole conversation tests. Do you know how they handle this?

ArjaanBuijk commented 3 years ago

@TyDunn, there were always non-prod environments to run these tests in, and those non-prod environments are complete copies of the prod environment, including the action server.

twerkmeister commented 3 years ago

Wrapping up my current thoughts and findings ...

@wochinge For some more context, have a look at the conversation between @ArjaanBuijk and @TyDunn in Slack from a couple of months back:

> The [...]-demo has a lot of action code for slot validation, and it sets additional slots on the fly. There is no way right now to test if this works, except by doing manual testing.

Arjaan continues:

Seems @ArjaanBuijk first and foremost cares about slot set events. At least in that conversation there was no mention of bot utterances. Maybe you can confirm?

From this context, I focused more on the slot events in my overview.

@wochinge You raise an interesting point, though, about the bot utterances dispatched by the action. If I am not mistaken these dispatched utterances appear neither in training data nor in current test stories, or do they?

For example, the rasa-demo bot has an action for greeting users, which utters a bunch of messages. The training stories, however, do not refer to any of these utterances. Likewise, the test stories do not refer to any of these utterances. Is this demo code done the wrong way? I am not sure whether there is a way to capture these utterances in the training or test story file formats as of now.

wochinge commented 3 years ago

> If I am not mistaken these dispatched utterances appear neither in training data nor in current test stories, or do they?

No, they don't. Besides SlotSet events there might be other "featurized" things to test though, e.g. ActiveLoop or FollowupAction.
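For illustration, here is a custom action whose interesting output consists entirely of such events; since the action code is not executed during `rasa test` today, whether it really emits them is never checked. The action, slot, and form names are invented for the example:

```python
from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.events import ActiveLoop, FollowupAction, SlotSet
from rasa_sdk.executor import CollectingDispatcher


class ActionRouteToPaymentForm(Action):
    """Hypothetical action whose behaviour lives entirely in its events."""

    def name(self) -> Text:
        return "action_route_to_payment_form"

    def run(
        self,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> List[Dict[Text, Any]]:
        # Purely illustrative combination of featurized events.
        return [
            SlotSet("route", "payment"),       # influences action prediction
            ActiveLoop("payment_form"),        # activates a form
            FollowupAction("payment_form"),    # forces the form to run next
        ]
```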

ArjaanBuijk commented 3 years ago

@twerkmeister,

I indeed mentioned the slots being set by actions, but there are also other events that need to be checked, as @wochinge points out.

A problem with not including a live action server in the e2e testing is that the e2e tests tend to go out of sync with the custom actions. It is very easy to forget to update the e2e test when you update a custom action. Even when you write unit tests for the custom actions, you are not actually testing the full bot.

m-vdb commented 3 years ago

For the record, decreasing priority to normal. We'll need time to scope the feature properly and we need to prioritise it together with our other initiatives (see Slack thread in issue description for more info)

akelad commented 3 years ago

This has come up a bunch of times over the years from multiple customers

indam23 commented 3 years ago

One more data point from a user, in favour of testing custom actions:

In summary: The combination of not being able to direct a conversation based on intent x entity value, and not being able to run the custom actions that work around that limitation, is particularly frustrating.

Currently, entity values cannot directly influence action prediction, only entity types. If you autofill a categorical slot of the same name with the entity value, you can direct the next step of the conversation, but now it also gets set at every point in the conversation where that entity is extracted, which is not always desirable.

To work around this limitation, you can create a slot with a different name, and fill it from the entity using a custom action only in those stories where it is needed. Depending on how many instances of this you have, you can end up with many more custom actions than before, and none of them are run during testing.
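For illustration, that workaround might look something like this in a custom action (the action, slot, and entity names are invented); it is precisely the kind of code that `rasa test` currently never executes:

```python
from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.events import SlotSet
from rasa_sdk.executor import CollectingDispatcher


class ActionSetProductRoute(Action):
    """Hypothetical workaround: copy an entity value into a differently named
    categorical slot, but only in the stories where this action is predicted."""

    def name(self) -> Text:
        return "action_set_product_route"

    def run(
        self,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> List[Dict[Text, Any]]:
        product = next(tracker.get_latest_entity_values("product"), None)
        if product is None:
            return []
        return [SlotSet("product_route", product)]
```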

rgstephens commented 2 years ago

Are there any updates on this?

wochinge commented 2 years ago

Enable has done a spike on conversation testing and then benched it for 3.0 after talking to a few customers, as we've realized that it doesn't make sense to do something hacky which doesn't work long term. What I know from @TyDunn is that it's going to be one of the two key things for 3.1.

rgstephens commented 2 years ago

Is this still slated for 3.1? I'm doing a lot of manual testing with rasa shell --debug because of this and #9013

m-vdb commented 2 years ago

It's currently planned for next year and unclear in which release it will go into. This is an issue we want to address holistically with other pain points our customers are seeing. Stay tuned!

Nummulit commented 2 years ago

Hello :) Is it maybe possible to get an estimate of when we could expect a solution to this issue? It just occurred to me that the values of slot_was_set are not being considered in test stories (neither is slot extraction or validation inside forms), and as I understand it, this boils down to being able to test Rasa with a running action server. Knowing the potential time frame could help me decide if writing a custom solution is worth the effort.

rgstephens commented 2 years ago

As a workaround, you'll find a pytest example in the financial-demo here.
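The general shape of such a workaround (this is only a sketch of the idea, not the financial-demo's actual code) is an integration test that posts an action request to a separately started action server (`rasa run actions`, default port 5055) and asserts on the returned events. The action name, slots, and payload below are illustrative and assume the rasa_sdk webhook's request/response format:

```python
import requests

ACTION_SERVER_URL = "http://localhost:5055/webhook"  # started via `rasa run actions`


def test_action_save_order_via_action_server():
    # Minimal tracker state handed to the action server; adjust to your bot.
    payload = {
        "next_action": "action_save_order",
        "sender_id": "test_user",
        "tracker": {
            "sender_id": "test_user",
            "slots": {"order_id": None},
            "latest_message": {"intent": {"name": "place_order"}, "entities": []},
            "events": [],
            "paused": False,
            "followup_action": None,
            "active_loop": {},
            "latest_action_name": "action_listen",
        },
        "domain": {},
    }

    response = requests.post(ACTION_SERVER_URL, json=payload)
    response.raise_for_status()
    events = response.json().get("events", [])

    assert any(
        e.get("event") == "slot" and e.get("name") == "order_id" for e in events
    )
```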

m-vdb commented 1 year ago

Closing this issue as we're planning more discovery on this topic in the next months

lumpidu commented 1 year ago

Believe it or not: this feature is absolutely necessary! I cannot believe that it has been silently closed more than 2 years after it was initially opened, while people are clearly stating that this is an important feature. Testing Rasa is such a pain because of this. At least you could have transferred this issue over to Jira. What would these issues be good for otherwise?

m-vdb commented 1 year ago

@lumpidu thanks for your comment, and sorry for the miscommunication on my end. I had written this above:

> Closing this issue as we're planning more discovery on this topic in the next months

We're continuing internal work on this, and you should hear a few updates in the next months. I'll see about re-creating a Jira ticket in the OSS backlog. Thanks a lot for your interest!

sutgeorge commented 1 month ago

Is this issue still in progress?