CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Initialize default conversation ID for utts missing a conversation ID #178

Closed calebchiam closed 2 years ago

calebchiam commented 2 years ago

Description

Motivation and Context

@jpwchang said:

If I remember correctly, the previous decision was that we would not allow Utterances in a Corpus to literally belong to no Conversation (since the package is, after all, ConvoKit), so what currently happens is that those Utterances get assigned to a "dummy" Conversation whose ID is None. However, this turns out to cause problems when dumping to JSON, because json.dump() serializes a None key as the string "null". In utterances.jsonl, by contrast, the Utterances' conversation_ids are still correctly written as the JSON null type (which is read back as None in Python). This behavior is presumably because the JSON standard allows null values but requires object keys to be strings. As a result, utterances.jsonl and conversations.json end up mismatched: the former has Utterances whose conversation IDs do not exist in the latter, while the latter has a conversation ID ("null") that is not used by any Utterance. Consequently, if the user had previously assigned metadata to the placeholder Conversation, that metadata will not be reloaded properly.
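
The key/value asymmetry described above can be reproduced with the standard json module alone. This is a minimal sketch: the two dictionaries are hypothetical stand-ins for an entry in conversations.json and a line of utterances.jsonl, not ConvoKit's actual file schema.

```python
import json

# Hypothetical stand-ins (assumptions, not ConvoKit's real schema):
# a conversations index keyed by conversation ID, and a single utterance record.
conversations = {None: {"meta": {"topic": "placeholder"}}}
utterance = {"id": "u0", "conversation_id": None}

# Serialize both, as a dump to disk would.
conversations_json = json.dumps(conversations)
utterance_json = json.dumps(utterance)

# Reload both, as a later load from disk would.
reloaded_conversations = json.loads(conversations_json)
reloaded_utterance = json.loads(utterance_json)

# The None key was written as the string "null" and stays a string on reload,
# while the None value was written as JSON null and comes back as None.
conversation_ids = list(reloaded_conversations)
utterance_conversation_id = reloaded_utterance["conversation_id"]

# The utterance now points at a conversation ID that the index does not contain.
id_is_known = utterance_conversation_id in reloaded_conversations

print(conversations_json)
print(conversation_ids, utterance_conversation_id, id_is_known)
```

After the round trip, the index holds only the string key "null" while the utterance still carries None, so a lookup by the utterance's conversation ID fails. Giving such utterances a real default string ID before dumping would avoid the mismatch.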

How has this been tested?

Tested through CI.