CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License
556 stars 129 forks source link

Storage abstraction #169

Closed jpwchang closed 2 years ago

jpwchang commented 2 years ago

Description

Introduces a new layer of abstraction between Corpus components (Utterance, Speaker, Conversation, ConvoKitMeta) and concrete data storage. Data storage is now handled by a StorageManager instance variable in the Corpus. StorageManager defines an abstract class interface that can have concrete implementations with varying storage scheme; this pull request implements MemStorageManager which stores all data in in-memory dicts, thus replicating the existing behavior.

Motivation and Context

In previous versions of ConvoKit, Corpus components have stored their data and metadata directly as instance variables. This scheme inherently means that the data must live directly in memory. Plans are currently underway to introduce alternative data backings such as database-backed storage. Introducing an abstraction layer for data storage will allow these alternative data models to be implemented without having to reimplement every Corpus component class from scratch.

How has this been tested?

Besides ensuring that all existing unit tests pass, tests have also been expanded to add new checks for StorageManager integrity after complex operations such as merges.

jpwchang commented 2 years ago

Per our discussion on Slack, I have updated this branch with changes to make certain destructive Corpus operations (merging, reindexing, and filtering by utterance) static methods. The actual implementation and behavior of these methods remains unchanged. Additionally, add_utterances has been updated to always operate in-place (before, it was in-place when with_checks=False and not-in-place otherwise). Making this change required decoupling add_utterances from merge since the latter is not-in-place. Tests for add_utterances have also been updated to more thoroughly verify the integrity of the modified Corpus, and to verify that the operation is in fact in-place (i.e., returns the same Corpus it was called from)

calebchiam commented 2 years ago

@jpwchang Can we merge this?