ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
When a Corpus is initialized with utterances, we now check that every utterance has a conversation ID. An utterance that is missing one is either:
assigned an ID of the form __default_conversation__{root_utt_id}, where root_utt_id is the ID of the root utterance in the Conversation, or
assigned an existing conversation ID, if the utterance replies to an utterance that already has one.
Implementation note: this check also constructs reply-to chains in order to find the root utterance, but the overall time taken is still O(n).
Fixes a bug in add_utterances, where the first utterance in a Conversation is assumed to have an Utterance ID == Conversation ID
Fixes test errors for utterances where conversation_id = None
Adds test cases under tests/fill_missing_convo_ids to test this new functionality.
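The conversation-ID fill described above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the function name fill_missing_conversation_ids, the plain-dict utterance representation, and the handling of dangling or cyclic reply-to links are all assumptions made for the example. It walks each reply-to chain upward only until it reaches an utterance that already has a conversation ID (or a root), then labels the whole chain at once, so each utterance is resolved a single time and the total work stays linear.

```python
from typing import Dict, Optional

DEFAULT_PREFIX = "__default_conversation__"


def fill_missing_conversation_ids(utts: Dict[str, dict]) -> None:
    """Fill in missing conversation IDs in place.

    `utts` maps utterance ID -> {"reply_to": Optional[str],
    "conversation_id": Optional[str]}. This layout is a hypothetical
    stand-in for real Utterance objects.
    """
    for utt_id in utts:
        chain = []    # utterances on this walk that still need an ID
        seen = set()  # guards against cyclic reply-to links (assumption)
        current = utt_id
        resolved: Optional[str] = None
        while True:
            utt = utts[current]
            if utt["conversation_id"] is not None:
                # Reached an utterance with an existing conversation ID.
                resolved = utt["conversation_id"]
                break
            chain.append(current)
            seen.add(current)
            parent = utt["reply_to"]
            if parent is None or parent not in utts or parent in seen:
                # `current` is treated as the root of a new conversation.
                resolved = DEFAULT_PREFIX + current
                break
            current = parent
        # Label every utterance visited on this walk, so later walks
        # stop as soon as they reach any of them.
        for visited_id in chain:
            utts[visited_id]["conversation_id"] = resolved


utts = {
    "a": {"reply_to": None, "conversation_id": None},
    "b": {"reply_to": "a", "conversation_id": None},
    "c": {"reply_to": "b", "conversation_id": None},
    "x": {"reply_to": None, "conversation_id": "conv1"},
    "y": {"reply_to": "x", "conversation_id": None},
}
fill_missing_conversation_ids(utts)
print(utts["c"]["conversation_id"])  # __default_conversation__a
print(utts["y"]["conversation_id"])  # conv1
```

Here a, b, and c form one chain with no conversation ID, so all three receive the default ID derived from the root a, while y inherits conv1 from the utterance it replies to.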
Motivation and Context
@jpwchang said:
If I remember correctly, I think that the previous decision was that we will not allow Utterances in a Corpus to literally belong to no Conversation (since the package is, after all, Convokit), and so what currently happens is that those Utterances get assigned to a "dummy" Conversation whose ID is None. However, this turns out to cause problems with dumping to JSON. This is because json.dump() will represent None keys as the string "null". But in utterances.jsonl the Utterances' conversation_ids will still be correctly represented as the JSON null type (which gets interpreted as None in python). This behavior is presumably because the JSON standard allows null values but not null keys. As a result, there will be a mismatch between utterances.jsonl and conversations.json. The former file will have Utterances with conversation IDs that do not exist in the latter file, while conversely, the latter file will have a conversation ID that is not used by any Utterance. Needless to say, if the user had previously assigned metadata to the placeholder conversation, this metadata will not properly get reloaded as a result of this mismatch.
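The mismatch described above can be reproduced with the standard json module alone. The dictionaries below are simplified stand-ins for the contents of conversations.json and one line of utterances.jsonl, not ConvoKit's actual file layout, and the metadata is made up for the example.

```python
import json

# Conversations keyed by ID, with the placeholder conversation keyed by None.
conversations = {None: {"meta": {"topic": "placeholder"}}}
# An utterance whose conversation_id is None.
utterance = {"id": "u1", "conversation_id": None}

conv_dump = json.dumps(conversations)
utt_dump = json.dumps(utterance)

print(conv_dump)  # {"null": {"meta": {"topic": "placeholder"}}}
print(utt_dump)   # {"id": "u1", "conversation_id": null}

# After reloading, the key is the string "null" but the value is None,
# so the utterance no longer points at any stored conversation.
reloaded_convs = json.loads(conv_dump)
reloaded_utt = json.loads(utt_dump)
print(reloaded_utt["conversation_id"] in reloaded_convs)  # False
```

Because the None key comes back as the string "null" while the None value comes back as None, any metadata stored on the placeholder conversation cannot be matched to its utterances after a round trip.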
How has this been tested?
Tested through CI.