Make RulePolicy more efficient with large trackers - Githubissues

RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

https://rasa.com/docs/rasa/

Apache License 2.0

18.92k stars 4.63k forks source link

Make RulePolicy more efficient with large trackers #9089

Closed desmarchris closed 3 years ago

desmarchris commented 3 years ago

Description of Problem: Large trackers can cause a slowdown / memory spike when making a prediction with RulePolicy.

Definition of Done:

[x] Reproduce problem
[ ] Scope possible fixes

desmarchris commented 3 years ago

@TyDunn I wasn't sure what squad this would belong under, lmk if it's not enable :)

TyDunn commented 3 years ago

@desmarchris I'd check out this thread on Slack that discusses some related things

kedz commented 3 years ago

@JEM-Mosig does your bug fix PR possibly also address this issue?

JEM-Mosig commented 3 years ago

@kedz No, this will not affect RulePolicy, since I only touch the AugmentedMemoizationPolicy class and RulePolicy is derived from MemoizationPolicy.

JEM-Mosig commented 3 years ago

I'd rephrase this issue to "make RulePolicy faster with large trackers". Working towards a solution would entail:

Create an example dataset on which RulePolicy is clearly too slow
Identify which operation takes the most time
Figure out how to accelerate this operation
Create a PR

desmarchris commented 3 years ago

changed it slightly bc it's not just about time but also memory/cpu consumption

JEM-Mosig commented 3 years ago

@desmarchris I'm trying to reproduce this right now. To be sure: the problem was with inference only, right? Not with training. Do we have an example tracker?

desmarchris commented 3 years ago

correct, just at inference. I'm sending you an example for a single prediction that's 2500 lines long on Slack.

JEM-Mosig commented 3 years ago

Oh, I think the linear time increase is actually the issue. Rule policy prediction time increases roughly linearly with the tracker length. But MappingPolicy didn't check the tracker at all (except last message), so it'd be constant time for that one. Its this difference between MappingPolicy and RulePolicy that led to this issue. But we're generally talking about 1 s lag times. Can you confirm this, @desmarchris ?

desmarchris commented 3 years ago

the specific issue at hand here was literally printing 2500 lines to make a prediction caused a CPU spike that brought the pod close to its max capacity

JEM-Mosig commented 3 years ago

Printing? So this has nothing to do with the policy, but only with the fact that debug was on?

desmarchris commented 3 years ago

I think so? but debug was needed to help find what went wrong with the bot. I don't know anyone running a bot in prod without debug turned on

JEM-Mosig commented 3 years ago

I don't know anyone running a bot in prod without debug turned on

That is very concerning. It's not meant to be used this way. I suppose we can have RulePolicy just not print the state in the short term, and in the long term... find a way to store debug logs more quickly? @koaning had this idea of a streamlined debug log (--insights?), but I suppose customers would just always go for printing everything. So we'd have to add a "secret debug flag" that is really only to be used for us inside Rasa.

JEM-Mosig commented 3 years ago

I see we actually have a Issue https://github.com/RasaHQ/rasa/issues/7295 for that already.

I'll close the present issue now, noting that the abovementioned issue would fix the slow debug output, which caused the present issue in the first place.

If we wanted to actually accelerate RulePolicy, we could speed up the lookup of rules. What makes this slightly complicated is that in contrast to "MappingPolicy rules", the "RulePolicy rules" can be nested. I.e. you can have a short rule inside a long rule. Even if the short rule applies, you still need to check if you can find a long rule that is more specific. Surely we can create a custom algorithm that is faster than the linear time one that we have right now. Again, given that this doesn't actually cause an issue with the customer, and the present issue is about scoping solutions, I'll close the present issue.