CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Db Storage #140

Closed jschluger closed 2 years ago

jschluger commented 3 years ago

Implementing abstract storage for ConvoKit: all Corpus and CorpusComponent objects now have a self.storage instance variable referring to a convokit.StorageManager object, which manages the storage for that object. Within the StorageManager class, two distinct storage options are abstracted away from the end user: in-memory storage (what ConvoKit has always provided) and database storage. I implement the DBCollectionMapping, DBDocumentMapping, MemCollectionMapping, and MemDocumentMapping classes, which extend the MutableMapping interface to store collections of items, and the data for a single item, in the two storage modes.
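A minimal sketch of the mapping idea described above, assuming nothing about the PR's actual classes: `InMemoryDocument` and `InMemoryCollection` are hypothetical names used only for illustration. It shows how subclassing `collections.abc.MutableMapping` lets a collection of items, and the data for a single item, share one dict-like interface that a database-backed class could also implement.

```python
from collections.abc import MutableMapping


class InMemoryDocument(MutableMapping):
    """Hypothetical per-item mapping (illustration only, not ConvoKit's class)."""

    def __init__(self, initial=None):
        self._fields = dict(initial or {})

    def __getitem__(self, key):
        return self._fields[key]

    def __setitem__(self, key, value):
        self._fields[key] = value

    def __delitem__(self, key):
        del self._fields[key]

    def __iter__(self):
        return iter(self._fields)

    def __len__(self):
        return len(self._fields)


class InMemoryCollection(MutableMapping):
    """Hypothetical collection mapping: item id -> InMemoryDocument."""

    def __init__(self):
        self._items = {}

    def __getitem__(self, item_id):
        return self._items[item_id]

    def __setitem__(self, item_id, fields):
        # Wrap plain dicts so every stored item exposes the same interface.
        if not isinstance(fields, InMemoryDocument):
            fields = InMemoryDocument(fields)
        self._items[item_id] = fields

    def __delitem__(self, item_id):
        del self._items[item_id]

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


utterances = InMemoryCollection()
utterances["a"] = {"text": "hey", "speaker": "John"}
utterances["a"]["text"] = "hey there"  # update the data for a single item
```

Because callers only rely on the `MutableMapping` methods, a database-backed pair of classes could be swapped in by implementing the same five abstract methods against a different backend.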

oscarso2000 commented 2 years ago

I agree the list works. :)

oscarso2000 commented 2 years ago

I will take a look at other files tonight and possibly corpus.dump.

calebchiam commented 2 years ago

Fixed the list(convo.iter_speakers()) (described here: https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/pull/140#issuecomment-1139245534) issue with this commit https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/pull/140/commits/66f8dccdbbb2e6afd39e14306a20e3f764c292a8.

owner was not being passed when initializing the Conversation object.

calebchiam commented 2 years ago

vector_demo.ipynb still failing because vectors are not getting pre-loaded in memory mode.

calebchiam commented 2 years ago

Haven't fully diagnosed what's going on with vectors, but I was able to determine that something's wrong with how storage is being referenced throughout the code. There seem to be multiple instances of storage and (since each storage contains a ConvoKitIndex) multiple instances of storage.index. Updates to one index do not update other indices, resulting in very strange behavior where a vector is present in one index and not another.

Overall, there should only be one ConvoKitIndex and one storage per Corpus.

But bizarrely enough, every Corpus component when initialized will initialize its own storage.

Consider this simple example:

from convokit import Utterance, Corpus, Speaker
john = Speaker(id='John')
mary = Speaker(id='Mary')
utt1 = Utterance(id='a', text='hey', speaker=john)
utt2 = Utterance(id='b', text='hey yourself', speaker=mary)
corpus_1 = Corpus(utterances=[utt1, utt2])

If we print the __dict__ for the utt1, utt2, and corpus_1 objects to look at the internals, we get:

utt1.__dict__

{'config_fullpath': '/Users/calebchiam/.convokit/config.yml',
 'storage_type': 'mem',
 'index': {'utterances-index': {}, 'speakers-index': {}, 'conversations-index': {}, 'overall-index': {}, 'version': 0, 'vectors': []},
 'data_directory': '/Users/calebchiam/.convokit/saved-corpora',
 'connection': {'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>,
  'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bd60>,
  'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68b070>,
  'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bca0>},
 'corpus_id': None,
 'CollectionMapping': <function convokit.storage.memory_mappings.MemCollectionMapping.with_storage.<locals>.ret(collection_name, item_type=None)>,
 'ItemMapping': convokit.storage.memory_mappings.MemDocumentMapping,
 'raw_version': '0',
 'version': '0',
 'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>,
 'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>,
 'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>,
 'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>}

utt2.__dict__

{'config_fullpath': '/Users/calebchiam/.convokit/config.yml',
 'storage_type': 'mem',
 'index': {'utterances-index': {}, 'speakers-index': {}, 'conversations-index': {}, 'overall-index': {}, 'version': 0, 'vectors': []},
 'data_directory': '/Users/calebchiam/.convokit/saved-corpora',
 'connection': {'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bdc0>,
  'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bfd0>,
  'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68b130>,
  'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bcd0>},
 'corpus_id': None,
 'CollectionMapping': <function convokit.storage.memory_mappings.MemCollectionMapping.with_storage.<locals>.ret(collection_name, item_type=None)>,
 'ItemMapping': convokit.storage.memory_mappings.MemDocumentMapping,
 'raw_version': '0',
 'version': '0',
 'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bdc0>,
 'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bfd0>,
 'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68b130>,
 'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bcd0>}

corpus_1.__dict__

{'config_fullpath': '/Users/calebchiam/.convokit/config.yml',
 'storage_type': 'mem',
 'index': {'utterances-index': {}, 'speakers-index': {}, 'conversations-index': {}, 'overall-index': {}, 'version': 0, 'vectors': []},
 'data_directory': '/Users/calebchiam/.convokit/saved-corpora',
 'connection': {'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bb50>,
  'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a12e0>,
  'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a10a0>,
  'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a11f0>},
 'corpus_id': None,
 'CollectionMapping': <function convokit.storage.memory_mappings.MemCollectionMapping.with_storage.<locals>.ret(collection_name, item_type=None)>,
 'ItemMapping': convokit.storage.memory_mappings.MemDocumentMapping,
 'raw_version': 0,
 'version': 0,
 'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bb50>,
 'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a12e0>,
 'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a10a0>,
 'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a11f0>}

Each of the 3 objects points to objects at different locations in memory, instead of the same storage object. I'm not sure what the motivation behind this is, but this seems pretty wasteful and surely must be a mistake? I'll rewrite how storage gets initialized for corpus objects soon if my understanding is correct here.
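To make the failure mode concrete, here is a toy reproduction using hypothetical classes (`ToyIndex`, `ToyStorage`, `ToyComponent`), not ConvoKit's own code. It assumes only the behavior described above: each component builds its own storage unless one is handed to it, so an update to one index is invisible to the others.

```python
class ToyIndex:
    """Hypothetical stand-in for an index that tracks vector names."""

    def __init__(self):
        self.vectors = set()


class ToyStorage:
    """Hypothetical storage object; each instance owns its own index."""

    def __init__(self):
        self.index = ToyIndex()


class ToyComponent:
    def __init__(self, storage=None):
        # Problematic pattern: silently create a private storage when none is passed.
        self.storage = storage if storage is not None else ToyStorage()


# Each component builds its own storage, so the indices diverge.
utt = ToyComponent()
corpus = ToyComponent()
corpus.storage.index.vectors.add("bow")
diverged = "bow" not in utt.storage.index.vectors

# Passing one storage object to every component keeps a single index.
shared = ToyStorage()
utt_shared = ToyComponent(storage=shared)
corpus_shared = ToyComponent(storage=shared)
corpus_shared.storage.index.vectors.add("bow")
consistent = "bow" in utt_shared.storage.index.vectors
```

In the toy version the fix is simply to construct the storage once and pass it down; whether the real rewrite takes that exact shape is not stated in this thread.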

jpwchang commented 2 years ago

@calebchiam I actually noticed that multiple-index thing as well when I was trying to debug the vectors notebook - it was in fact part of what made it so hard to debug, since I was frequently confused about which index object I should be thinking about at various points in the code. I don't know for sure how this came about, but my intuition aligns with yours that this must surely be unintended.

jpwchang commented 2 years ago

@calebchiam Could I propose undoing your recent change to make in_place = False the default? Your provided reasoning was "Keeps it consistent with mem -- changes to the Corpus should not affect the underlying storage unless specified otherwise." but my understanding was that we are specifically motivating DB mode as being different from memory mode, in that changes are meant to be immediate and persistent. Having in_place = False actually leads to some counterintuitive behavior where loading a DB-backed Corpus after previously making changes might not load the changed version, due to implicit versioning behind the scenes - @oscarso2000 can weigh in on that whenever he's available (though right now I expect that he's taking a well deserved break!), but I can also answer some questions about that since I was involved in debugging the issue when it first came up.

calebchiam commented 2 years ago

@jpwchang Yeah, happy to discuss this and I might be missing some context. Is there a reason we cannot or wouldn't want to think of DB and memory modes as essentially the same except for how the data is being stored? There are reasons to use DB mode beyond having changes that are immediate and persistent after all, such as memory efficiency.

The way I see it, a user might find that loading a Corpus into memory consumes too much memory and then decide to use DB mode. They might then make changes to the Corpus just for the sake of experimentation without intending to persist these changes (since the user was able to do that in memory mode), but be caught by surprise that, after switching modes, any experimentation they do permanently mutates the data. This IMO is a fair reaction, especially if the user intends to use DB mode just for memory efficiency reasons, and is what my intuitive expectation was, as a developer. If I wanted to make changes in-place, I should have to specify that I want that behavior. What do you think?

jpwchang commented 2 years ago

@calebchiam While memory efficiency is a nice theoretical upshot of DB mode (albeit, I should add, one that has yet to be rigorously tested), the original impetus behind Charlotte's work that led to this pull request was a very different motivation: the use case where a user wants to use ConvoKit for real-time tracking and retrieval of conversational data, e.g., a live feed of Reddit (anecdotally this is a use case I've gotten requests for, so it seems there is a demand beyond Charlotte's immediate circumstance). This is also the use case that is highlighted in Oscar's newly added documentation about picking storage modes. In this situation the user presumably wants and expects changes (i.e., newly added Utterances) to be immediately reflected in storage. This also tracks better with my own intuitions about using a DB in general; database-backed software where changes aren't automatically recorded feels counterintuitive to me (but this of course is much more subjective). Finally, given the overhead of DB-backed storage I don't think DB mode lends itself well to the kind of "experimentally messing around with the data" that you describe, for which memory mode is better suited. In other words, mem and DB to me have different, non-overlapping primary use cases and we shouldn't try to make them behave the same when doing so is not necessarily what's best for those use cases.

But happy to discuss further outside this increasingly crowded thread.

jpwchang commented 2 years ago

I should also note that in_place = False is to some degree a misnomer. In reality, the nature of the MongoDB implementation is such that there is no such thing as "in place" the way there is in memory mode; all changes are immediately and persistently saved to storage somewhere. What in_place = False actually does is stick those changes in an automatically created fork of the original database - something that happens behind the scenes and can lead to the kind of versioning confusion I alluded to previously. We may of course discuss whether this is desirable behavior and ought to be changed, but this would go beyond just changing the default value of a single parameter.
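The versioning confusion can be illustrated with a toy model. Everything here is hypothetical: `ToyDatabase`, `open_corpus`, and the `name.N` fork-naming scheme are assumptions made for the sketch and do not reflect the actual MongoDB implementation. The only behavior taken from the description above is that a non-in-place open writes to an automatically created fork.

```python
import copy


class ToyDatabase:
    """Hypothetical stand-in for a database server: collection name -> data."""

    def __init__(self):
        self.collections = {}


def open_corpus(db, name, in_place):
    """Return the name of the collection that subsequent writes will go to."""
    if in_place:
        return name
    # Pick the first unused fork name (assumed scheme) and copy the data there.
    version = 1
    while f"{name}.{version}" in db.collections:
        version += 1
    fork = f"{name}.{version}"
    db.collections[fork] = copy.deepcopy(db.collections[name])
    return fork


db = ToyDatabase()
db.collections["reddit"] = {"a": "hey"}

target = open_corpus(db, "reddit", in_place=False)
db.collections[target]["b"] = "hey yourself"  # persisted, but only in the fork

# Reloading by the original name does not see the new utterance.
reloaded = db.collections["reddit"]
```

In this model the write is never lost, but a later load of "reddit" returns the unmodified data unless the caller knows which fork to ask for.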

calebchiam commented 2 years ago

@jpwchang Hmm okay that's fair. I've not read the updated docs in detail, but if we make it clear there that they have two different primary purposes / use cases (i.e. DB is for persistence and not for memory efficiency), then I think that's fine.

We should take a second look at in_place post-PR merge, because if it works as intended, then I still think we may as well keep the behaviors for mem and db consistent. Assuming it works, we inconvenience some group of users regardless of which setting we choose. If it's true by default, we surprise / inconvenience users who want memory efficiency. If it's false by default, we surprise / inconvenience users who want persistent changes. I'm in favor of not surprising the first group, since in the worst case the first group loses their original data, while the second group just fails to make intended updates.

tl;dr: We can change it back for now, but if memory efficiency is a real benefit of the DB architecture and in_place actually works, then we should set in_place=False by default.

jpwchang commented 2 years ago

I noticed that the data_dir parameter in convokit.download got renamed to data_directory, is there any particular reason for this? To me it seems like the only thing this accomplishes is making the user type more; "dir" is fairly standard shorthand for "directory" so I don't think the old name was particularly unclear. I imagine this will end up breaking a lot of people's code for (IMO) relatively little gain.

calebchiam commented 2 years ago

@jpwchang It shouldn't break anyone's existing code because data_directory is a new argument introduced in this PR that replaces base_path.

I don't feel strongly about data_dir vs. data_directory but in general, I think we should favor explicit long names over short forms for clarity. (And we can write this down in a Style Guide going forward.) Short forms, in general, mean that we have to hope that the user has some base level of experience to know offhand what the short form stands for.

Since it's a new variable, I'm just favoring the long form. I do agree with you though that it wouldn't be worthwhile to break the existing API just to use the long form. What do you think?

EDIT: Oops, I was thinking of corpus.dump, but you're right it would break convokit.download. I can change it back.

jpwchang commented 2 years ago

@calebchiam in that case how do you want to handle the renamed base_path in corpus.dump? If we are changing the parameter in download back to data_dir, it seems like it will be confusing that download uses the short form while dump uses the long form.

calebchiam commented 2 years ago

@jpwchang Yup, I'd change it back to data_dir for both
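For reference, one common way to rename a keyword such as base_path without breaking callers is a deprecation shim. This is only a sketch: `resolve_data_dir` is a hypothetical helper, not part of ConvoKit, and nothing in this thread says the PR adopts this approach.

```python
import warnings


def resolve_data_dir(data_dir=None, base_path=None):
    """Hypothetical helper: accept the old keyword but steer callers to the new one."""
    if base_path is not None:
        if data_dir is not None:
            raise TypeError("pass either data_dir or base_path, not both")
        warnings.warn(
            "base_path is deprecated; use data_dir instead",
            DeprecationWarning,
            stacklevel=2,
        )
        data_dir = base_path
    return data_dir


# Old-style call still works, but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_data_dir(base_path="/tmp/corpora")

# New-style call is silent.
current = resolve_data_dir(data_dir="/tmp/corpora")
```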

calebchiam commented 2 years ago

@all-contributors please add @jschluger for code

allcontributors[bot] commented 2 years ago

@calebchiam

I've put up a pull request to add @jschluger! :tada:

calebchiam commented 2 years ago

@all-contributors please add @oscarso2000 for code

allcontributors[bot] commented 2 years ago

@calebchiam

I've put up a pull request to add @oscarso2000! :tada:

calebchiam commented 2 years ago

Work on this will be carried out in smaller PRs.