💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Description of Problem:
Currently a lot of duplicate computation is done by featurizing the messages in trackers multiple times. There are two kinds of duplication:
- Among the positions of the sliding window across a single conversation
- Whenever we have identical messages across conversations
With small- to medium-sized datasets this is not an issue. For larger datasets such as MultiWOZ, this duplication adds almost an hour of additional preprocessing time.
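To make the two kinds of duplication concrete, here is a minimal sketch of a naive sliding-window featurizer counting how often each message text would be featurized. The function and data are illustrative only, not the actual Rasa API:

```python
# Hypothetical sketch: count featurization calls under a naive sliding
# window (names and data are illustrative, not the real Rasa code).
from collections import Counter

def naive_featurize_calls(conversations, window_size=3):
    """Count how often each message text would be featurized."""
    calls = Counter()
    for messages in conversations:
        # One window position per prediction point in the conversation.
        for end in range(1, len(messages) + 1):
            window = messages[max(0, end - window_size):end]
            for text in window:
                calls[text] += 1  # each occurrence triggers a featurization
    return calls

conversations = [
    ["hi", "book a table", "for two", "thanks"],
    ["hi", "book a table", "tomorrow", "thanks"],
]
calls = naive_featurize_calls(conversations)
# "hi" and "book a table" recur across window positions *and* across
# conversations, so each is featurized six times here.
print(calls["hi"], calls["book a table"])  # → 6 6
```

Both kinds of duplication compound: the per-conversation window overlap multiplies with every conversation that repeats the same message.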
Overview of the Solution:
Featurize each unique message once and store the result to be used downstream by other components.
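The featurize-once idea can be sketched as a lookup table keyed by message text; the helper names below are illustrative assumptions, not the prototype's actual interface:

```python
# Minimal featurize-once sketch: cache features per unique message text
# (function and cache names are illustrative, not the real Rasa API).
def make_cached_featurizer(featurize):
    cache = {}  # lookup table: message text -> features
    def featurize_cached(text):
        if text not in cache:
            cache[text] = featurize(text)  # computed only once per unique text
        return cache[text]
    return featurize_cached, cache

calls = []
def expensive_featurize(text):
    calls.append(text)          # track how often the real work runs
    return [float(len(text))]   # stand-in for a real feature vector

featurize, cache = make_cached_featurizer(expensive_featurize)
for text in ["hi", "book a table", "hi", "hi", "book a table"]:
    featurize(text)
print(len(calls))  # → 2: one featurization per unique message
```

Downstream components then read from the lookup table instead of re-running the featurizer for every window position and every repeated message.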
I have previously extracted the code from the prototype to run tests on the current architecture; the latest version can also be found in the combined-e2e-fixes branch.
This feature would also unlock batch encoding during training, which would be too computationally expensive without having the features cached in the lookup table beforehand.
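The batch-encoding idea this would unlock can be sketched as follows: collect the unique messages once, encode them in a single batch, and fill the lookup table from the result. `batch_encode` here is a hypothetical stand-in for an expensive model forward pass, not an actual Rasa function:

```python
# Sketch of batch encoding made affordable by the lookup table
# (batch_encode is a hypothetical stand-in for a model forward pass).
def batch_encode(texts):
    # Stand-in: a real implementation would run one batched model call.
    return [[float(len(t))] for t in texts]

def build_lookup_table(conversations):
    unique = sorted({text for messages in conversations for text in messages})
    features = batch_encode(unique)  # one batched call instead of many single ones
    return dict(zip(unique, features))

conversations = [
    ["hi", "book a table", "thanks"],
    ["hi", "cancel it", "thanks"],
]
table = build_lookup_table(conversations)
print(len(table))  # → 4 unique messages encoded in one batch
```

Without the cache, the same batching would have to re-encode every occurrence of every message, which is what makes it too expensive today.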
Open Issues:
How to solve this for inference is still marked as a TODO in the current v3 architecture prototype.
A prototypical implementation of this feature exists inside the v3 architecture prototype, but it depends on a necessary, so far unmerged fix.
Definition of Done: