Closed jpwchang closed 2 years ago
Per our discussion on Slack, I have updated this branch with changes to make certain destructive Corpus
operations (merging, reindexing, and filtering by utterance) static methods. The actual implementation and behavior of these methods remains unchanged. Additionally, add_utterances
has been updated to always operate in-place (before, it was in-place when with_checks=False
and not-in-place otherwise). Making this change required decoupling add_utterances
from merge
since the latter is not-in-place. Tests for add_utterances
have also been updated to more thoroughly verify the integrity of the modified Corpus
, and to verify that the operation is in fact in-place (i.e., returns the same Corpus
it was called from)
@jpwchang Can we merge this?
Description
Introduces a new layer of abstraction between Corpus components (
Utterance
,Speaker
,Conversation
,ConvoKitMeta
) and concrete data storage. Data storage is now handled by aStorageManager
instance variable in the Corpus.StorageManager
defines an abstract class interface that can have concrete implementations with varying storage scheme; this pull request implementsMemStorageManager
which stores all data in in-memory dicts, thus replicating the existing behavior.Motivation and Context
In previous versions of ConvoKit, Corpus components have stored their data and metadata directly as instance variables. This scheme inherently means that the data must live directly in memory. Plans are currently underway to introduce alternative data backings such as database-backed storage. Introducing an abstraction layer for data storage will allow these alternative data models to be implemented without having to reimplement every Corpus component class from scratch.
How has this been tested?
Besides ensuring that all existing unit tests pass, tests have also been expanded to add new checks for
StorageManager
integrity after complex operations such as merges.