Closed jschluger closed 2 years ago
I agree the list works. :)
I will take a look at other files tonight and possibly corpus.dump.
Fixed the list(convo.iter_speakers()) issue (described here: https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/pull/140#issuecomment-1139245534) with this commit: https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/pull/140/commits/66f8dccdbbb2e6afd39e14306a20e3f764c292a8.
owner was not being passed when initializing the Conversation object.
corpus.dump seems to be fixed with my Conversation init fix.
in_place = False by default for Corpus initialization. Keeps it consistent with mem -- changes to the Corpus should not affect the underlying storage unless specified otherwise.
corpus_id init in corpus.py to avoid messages like: Loading corpus None from disk at ./cornell-no-bow
vector_demo.ipynb still failing because vectors are not getting pre-loaded in memory mode.
Haven't fully diagnosed what's going on with vectors, but I was able to determine that something's wrong with how storage is being referenced throughout the code. There seem to be multiple instances of storage and (since each storage contains a ConvoKitIndex) multiple instances of storage.index. Updates to one index do not update other indices, resulting in very strange behavior where a vector is present in one index and not another.
Overall, there should only be one ConvoKitIndex and one storage per Corpus.
But bizarrely enough, every Corpus component when initialized will initialize its own storage.
Consider this simple example:
from convokit import Utterance, Corpus, Speaker
john = Speaker(id='John')
mary = Speaker(id='Mary')
utt1 = Utterance(id='a', text='hey', speaker=john)
utt2 = Utterance(id='b', text='hey yourself', speaker=mary)
corpus_1 = Corpus(utterances=[utt1, utt2])
If we print the __dict__ for the utt1, utt2, and corpus_1 objects to look at the internals, we get:
utt1.__dict__
{'config_fullpath': '/Users/calebchiam/.convokit/config.yml',
'storage_type': 'mem',
'index': {'utterances-index': {}, 'speakers-index': {}, 'conversations-index': {}, 'overall-index': {}, 'version': 0, 'vectors': []},
'data_directory': '/Users/calebchiam/.convokit/saved-corpora',
'connection': {'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>,
'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bd60>,
'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68b070>,
'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bca0>},
'corpus_id': None,
'CollectionMapping': <function convokit.storage.memory_mappings.MemCollectionMapping.with_storage.<locals>.ret(collection_name, item_type=None)>,
'ItemMapping': convokit.storage.memory_mappings.MemDocumentMapping,
'raw_version': '0',
'version': '0',
'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>,
'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>,
'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>,
'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f698c10>}
utt2.__dict__
{'config_fullpath': '/Users/calebchiam/.convokit/config.yml',
'storage_type': 'mem',
'index': {'utterances-index': {}, 'speakers-index': {}, 'conversations-index': {}, 'overall-index': {}, 'version': 0, 'vectors': []},
'data_directory': '/Users/calebchiam/.convokit/saved-corpora',
'connection': {'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bdc0>,
'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bfd0>,
'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68b130>,
'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bcd0>},
'corpus_id': None,
'CollectionMapping': <function convokit.storage.memory_mappings.MemCollectionMapping.with_storage.<locals>.ret(collection_name, item_type=None)>,
'ItemMapping': convokit.storage.memory_mappings.MemDocumentMapping,
'raw_version': '0',
'version': '0',
'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bdc0>,
'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bfd0>,
'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68b130>,
'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bcd0>}
corpus_1.__dict__
{'config_fullpath': '/Users/calebchiam/.convokit/config.yml',
'storage_type': 'mem',
'index': {'utterances-index': {}, 'speakers-index': {}, 'conversations-index': {}, 'overall-index': {}, 'version': 0, 'vectors': []},
'data_directory': '/Users/calebchiam/.convokit/saved-corpora',
'connection': {'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bb50>,
'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a12e0>,
'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a10a0>,
'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a11f0>},
'corpus_id': None,
'CollectionMapping': <function convokit.storage.memory_mappings.MemCollectionMapping.with_storage.<locals>.ret(collection_name, item_type=None)>,
'ItemMapping': convokit.storage.memory_mappings.MemDocumentMapping,
'raw_version': 0,
'version': 0,
'utterances': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f68bb50>,
'conversations': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a12e0>,
'speakers': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a10a0>,
'metas': <convokit.storage.memory_mappings.MemCollectionMapping at 0x7fea0f6a11f0>}
All three objects point to objects at different locations in memory, instead of to the same storage object. I'm not sure what the motivation behind this is, but it seems pretty wasteful and surely must be a mistake? I'll rewrite how storage gets initialized for corpus objects soon if my understanding is correct here.
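A minimal sketch of the difference between per-component storage and a single shared storage, as described above. The classes here (ToyStorage, OwnStorageComponent, SharedStorageComponent) are hypothetical stand-ins written for illustration and are not ConvoKit's actual classes or the eventual fix:

```python
class ToyStorage:
    """Hypothetical stand-in for a storage manager; not ConvoKit's class."""

    def __init__(self):
        self.index = {"vectors": []}


class OwnStorageComponent:
    """Each instance builds its own storage (the behavior reported above)."""

    def __init__(self, id):
        self.id = id
        self.storage = ToyStorage()


class SharedStorageComponent:
    """Each instance receives the storage of the corpus that owns it."""

    def __init__(self, id, storage):
        self.id = id
        self.storage = storage


# Separate storages: an update through one index is invisible to the other.
a, b = OwnStorageComponent("a"), OwnStorageComponent("b")
a.storage.index["vectors"].append("bow")
separate_out_of_sync = "bow" not in b.storage.index["vectors"]

# Shared storage: both components see the same index object.
corpus_storage = ToyStorage()
c = SharedStorageComponent("c", corpus_storage)
d = SharedStorageComponent("d", corpus_storage)
c.storage.index["vectors"].append("bow")
shared_in_sync = "bow" in d.storage.index["vectors"]

print(separate_out_of_sync, shared_in_sync, c.storage is d.storage)
```

Checking identity with `is`, as in the last line, is a quick way to confirm whether two components really refer to one storage object rather than to equal-looking copies.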
@calebchiam I actually noticed that multiple-index thing as well when I was trying to debug the vectors notebook - it was in fact part of what made it so hard to debug, since I was frequently confused about which index object I should be thinking about at various points in the code. I don't know for sure how this came about, but my intuition aligns with yours that this must surely be unintended.
@calebchiam Could I propose undoing your recent change to make in_place = False the default? Your provided reasoning was "Keeps it consistent with mem -- changes to the Corpus should not affect the underlying storage unless specified otherwise." but my understanding was that we are specifically motivating DB mode as being different from memory mode, in that changes are meant to be immediate and persistent. Having in_place = False actually leads to some counterintuitive behavior where loading a DB-backed Corpus after previously making changes might not load the changed version, due to implicit versioning behind the scenes - @oscarso2000 can weigh in on that whenever he's available (though right now I expect that he's taking a well deserved break!), but I can also answer some questions about that since I was involved in debugging the issue when it first came up.
@jpwchang Yeah, happy to discuss this and I might be missing some context. Is there a reason we cannot or wouldn't want to think of DB and memory modes as essentially the same except for how the data is being stored? There are reasons to use DB mode beyond having changes that are immediate and persistent after all, such as memory efficiency.
The way I see it, a user might find that loading a Corpus into memory consumes too much memory and then decide to use DB mode. They might then make changes to the Corpus just for the sake of experimentation without intending to persist these changes (since the user was able to do that in memory mode), but be caught by surprise that changing the mode results in any experimentation they do permanently mutating the data. This IMO is a fair reaction, especially if the user intends to use DB mode just for memory efficiency reasons, and is what my intuitive expectation was, as a developer. If I wanted to make changes in-place, I should have to specify that I want that behavior. What do you think?
@calebchiam While memory efficiency is a nice theoretical upshot of DB mode (albeit, I should add, one that has yet to be rigorously tested), the original impetus behind Charlotte's work that led to this pull request was a very different motivation: the use case where a user wants to use ConvoKit for real-time tracking and retrieval of conversational data, e.g., a live feed of Reddit (anecdotally this is a use case I've gotten requests for, so it seems there is a demand beyond Charlotte's immediate circumstance). This is also the use case that is highlighted in Oscar's newly added documentation about picking storage modes. In this situation the user presumably wants and expects changes (i.e., newly added Utterances) to be immediately reflected in storage. This also tracks better with my own intuitions about using a DB in general; database-backed software where changes aren't automatically recorded feels counterintuitive to me (but this of course is much more subjective). Finally, given the overhead of DB-backed storage I don't think DB mode lends itself well to the kind of "experimentally messing around with the data" that you describe, for which memory mode is better suited. In other words, mem and DB to me have different, non-overlapping primary use cases and we shouldn't try to make them behave the same when doing so is not necessarily what's best for those use cases.
But happy to discuss further outside this increasingly crowded thread.
I should also note that in_place = False is to some degree a misnomer. In reality, the nature of the MongoDB implementation is such that there is no such thing as "in place" the way there is in memory mode; all changes are immediately and persistently saved to storage somewhere. What in_place = False actually does is stick those changes in an automatically created fork of the original database - something that happens behind the scenes and can lead to the kind of versioning confusion I alluded to previously. We may of course discuss whether this is desirable behavior and ought to be changed, but this would go beyond just changing the default value of a single parameter.
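A toy sketch of the forking behavior described above, to make the versioning confusion concrete. Everything here is an assumption for illustration: ToyDB, open_corpus, and the ".vN" fork naming are made up and do not reflect ConvoKit's or MongoDB's actual implementation:

```python
import copy


class ToyDB:
    """Hypothetical in-process stand-in for a database server; not MongoDB."""

    def __init__(self):
        self.collections = {}  # collection name -> {item id: data}


def open_corpus(db, name, in_place=True):
    """Return the name of the collection that subsequent writes will go to.

    With in_place=False the writes are still persisted, but to a forked
    copy whose name is generated here (a made-up naming scheme).
    """
    db.collections.setdefault(name, {})
    if in_place:
        return name
    version = sum(1 for key in db.collections if key.startswith(name + ".v"))
    fork = f"{name}.v{version + 1}"
    db.collections[fork] = copy.deepcopy(db.collections[name])
    return fork


db = ToyDB()
db.collections["reddit"] = {"a": "hey"}

target = open_corpus(db, "reddit", in_place=False)
db.collections[target]["b"] = "hey yourself"  # persisted, but in the fork

# Reopening by the original name does not show the new utterance.
original_keys = sorted(db.collections["reddit"])
fork_keys = sorted(db.collections[target])
print(target, original_keys, fork_keys)
```

In this toy model nothing is ever discarded: the change lands in the fork, so a later load under the original name sees the old data unless the caller knows which fork to ask for.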
@jpwchang Hmm okay that's fair. I've not read the updated docs in detail, but if we make it clear there that they have two different primary purposes / use cases (i.e. DB is for persistence and not for memory efficiency), then I think that's fine.
We should take a second look at in_place post-PR merge, because if it works as intended, then I still think we may as well keep the behaviors for mem and db consistent. Assuming it works, we inconvenience some group of users regardless of which setting we choose. If it's true by default, we surprise / inconvenience users who want memory efficiency. If it's false by default, we surprise / inconvenience users who want persistent changes. I'm in favor of not surprising the first group, since in the worst case the first group loses their original data, while the second group just fails to make intended updates.
tl;dr: We can change it back for now, but if memory efficiency is a real benefit of the DB architecture and in_place actually works, then we should set in_place=False by default.
I noticed that the data_dir parameter in convokit.download got renamed to data_directory, is there any particular reason for this? To me it seems like the only thing this accomplishes is making the user type more; "dir" is fairly standard shorthand for "directory" so I don't think the old name was particularly unclear. I imagine this will end up breaking a lot of people's code for (IMO) relatively little gain.
@jpwchang It shouldn't break anyone's existing code because data_directory is a new argument introduced in this PR that replaces base_path.
I don't feel strongly about data_dir vs. data_directory but in general, I think we should favor explicit long names over short forms for clarity. (And we can write this down in a Style Guide going forward.) Short forms, in general, mean that we have to hope that the user has some base level of experience to know offhand what the short form stands for.
Since it's a new variable, I'm just favoring the long form. I do agree with you, though, that it wouldn't be worthwhile to break the existing API just to use the long form. What do you think?
EDIT: Oops, I was thinking of corpus.dump, but you're right it would break convokit.download. I can change it back.
@calebchiam in that case how do you want to handle the renamed base_path in corpus.dump? If we are changing the parameter in download back to data_dir, it seems like it will be confusing that download uses the short form while dump uses the long form.
@jpwchang Yup, I'd change it back to data_dir for both.
@all-contributors please add @jschluger for code
@calebchiam I've put up a pull request to add @jschluger! :tada:
@all-contributors please add @oscarso2000 for code
@calebchiam I've put up a pull request to add @oscarso2000! :tada:
Work on this will be carried out in smaller PRs.
Implementing abstract storage for ConvoKit: all Corpus and CorpusComponent objects now have a self.storage instance variable referring to a convokit.StorageManager object which manages the storage for that object. Within the StorageManager class, two distinct storage options are abstracted away from the end user: in-memory storage (what ConvoKit has always provided) and database storage. I implement the DBCollectionMapping, DBDocumentMapping, MemCollectionMapping, and MemDocumentMapping classes, which extend the MutableMapping interface, to store collections of items, and the data for a single item, in the two storage modes.
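As a rough illustration of what extending the MutableMapping interface involves, here is a minimal dict-backed sketch. ToyMemMapping is a hypothetical class written for this example; it is not the PR's MemCollectionMapping and makes no claim about how that class is implemented:

```python
from collections.abc import MutableMapping


class ToyMemMapping(MutableMapping):
    """Hypothetical dict-backed mapping; not ConvoKit's MemCollectionMapping."""

    def __init__(self):
        self._data = {}

    # The five abstract methods that MutableMapping requires.
    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


utterances = ToyMemMapping()
utterances["a"] = {"text": "hey"}
utterances.update({"b": {"text": "hey yourself"}})  # mixin method from the ABC
popped = utterances.pop("a")                         # mixin method from the ABC
print(popped, list(utterances), len(utterances))
```

Because callers only rely on the mapping interface, a database-backed class could in principle implement the same five methods against a different backend without changing the code that uses it.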