CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License
556 stars 129 forks

Metadata deepcopy #195

Closed seanzhangkx8 closed 1 year ago

seanzhangkx8 commented 1 year ago

Description and Motivation

Main Modification: convokit/model/convoKitMeta.py

With the new version of ConvoKit supporting DB mode, corpus metadata behaves differently in DB and MEM modes. Every operation in MongoDB copies data from the MongoDB database to the Python process (or vice versa), so in-place mutations of mutable metadata values are never written back to the DB, which causes data loss. We therefore force all metadata values to be treated as immutable, so that metadata behaves consistently across modes.

In light of this, we deep copy any metadata field that is not a common immutable datatype whenever a user accesses it. Instead of returning a reference to the storage location (in MEM mode), we return a copy of the metadata field, and mutations to the copy are not reflected in the corpus metadata storage.
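The copy-on-read idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in convokit/model/convoKitMeta.py: the class name `CopyOnReadMeta` and the `IMMUTABLE_TYPES` tuple are hypothetical, and the actual set of types treated as immutable may differ.

```python
import copy
from collections.abc import MutableMapping

# Assumption for illustration: which types count as "common immutable datatypes".
IMMUTABLE_TYPES = (int, float, str, bool, bytes, type(None))


class CopyOnReadMeta(MutableMapping):
    """Hypothetical stand-in for a metadata container (not ConvoKit's class)."""

    def __init__(self, initial=None):
        self._store = dict(initial or {})

    def __getitem__(self, key):
        value = self._store[key]
        if isinstance(value, IMMUTABLE_TYPES):
            # Immutable values are safe to hand out directly.
            return value
        # Mutable values are deep copied so callers cannot mutate storage.
        return copy.deepcopy(value)

    def __setitem__(self, key, value):
        # Replacing a whole field always updates storage.
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


meta = CopyOnReadMeta({"foo": ["a"], "count": 1})
first_read = meta["foo"]
second_read = meta["foo"]  # a separate copy, not the same object as first_read
```

Each read of a mutable field yields an independent object, which is what keeps MEM mode in line with the copy semantics that DB mode already has.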

For example, suppose the metadata entry "foo" is a list. After saved_foo = my_utt.meta["foo"], saved_foo is a deep copy of my_utt.meta["foo"]. Calling saved_foo.append("new value") raises no error, but it changes only the copy saved_foo; my_utt.meta["foo"] is not modified.

Note that replacing an entire metadata field is unaffected: if we do my_utt.meta["foo"] = 1, the system works as intended.
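Taken together, the two notes above suggest a read, modify, reassign pattern for updating a mutable field. The sketch below imitates that with a plain dict and two hypothetical helpers (`get_meta`, `set_meta`) standing in for metadata access; it does not call ConvoKit itself, and copying on write is an extra assumption of the sketch.

```python
import copy

# Hypothetical backing store, standing in for corpus metadata storage.
_storage = {"foo": ["old value"]}


def get_meta(key):
    """Return a deep copy of the stored value, mimicking copy-on-read access."""
    return copy.deepcopy(_storage[key])


def set_meta(key, value):
    """Replace the whole field; copying on write is an assumption of this sketch."""
    _storage[key] = copy.deepcopy(value)


saved_foo = get_meta("foo")
saved_foo.append("new value")  # changes only the local copy
unchanged = get_meta("foo")    # storage still holds the original list

set_meta("foo", saved_foo)     # assign the whole field back to persist the change
updated = get_meta("foo")
```

With real utterances, the equivalent final step would presumably be my_utt.meta["foo"] = saved_foo, since whole-field assignment is the operation described as working as intended.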

Also note that the test convokit/tests/phrasing_motifs/test_questionSentences.py relied on the mutability of metadata fields when creating its test corpus. We fixed it accordingly.

How has this been tested?

Passed all unit tests.

Other information