CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Why are the prompt types of some utterances empty? #58

Closed sophieball closed 4 years ago

sophieball commented 4 years ago

Hi! I wanted to replicate this paper on GitHub pull request comments data from ghtorrent.org. In my dataset, each conversation is a pull request and each utterance is a comment in the PR (including the initial PR description). For meta info, I have the author login, the author's association with the project (Contributor, Member, Author, etc.), and created_at, along with reply_to and the conversation ID.

What I don't quite understand is that after I fit and transform the pt model on the same dataset, not all utterances get their distance vectors. For example, I have 8017 utterances, but only 2061 of them have distance information.

I tried to use transform_utterance() on each individual utterance, but I got an error:

sklearn/utils/validation.py", line 651, in check_array
    raise ValueError("Found array with %d sample(s) (shape=%s) while a"
ValueError: Found array with 0 sample(s) (shape=(0, 24)) while a minimum of 1 is required.

Below is part of my code (after following this example to create the corpus and parsing the text):

from convokit import PhrasingMotifs, PromptTypes

# phrasing motifs: extract recurring phrasings from the parsed arcs
pm_model = PhrasingMotifs(
    'motifs',
    "root_arcs",
    min_support=3,
    verbosity=1)
pm_model.fit(corpus)

pm_model.transform(corpus)
pm_model.print_top_phrasings(25)

# prompt types: cluster utterances by the motifs they contain
pt = PromptTypes(
    n_types=8,
    prompt_field="motifs",
    ref_field="root_arcs",
    prompt_transform_field="motifs__sink",
    output_field="prompt_types",
    random_state=1000,
    verbosity=1)

pt.fit(corpus)

# corpus = pt.transform(corpus)
for utt_id in corpus.get_utterance_ids():
    utt = corpus.get_utterance(utt_id)
    utt = pt.transform_utterance(utt)

Please let me know if I need to provide more of my code. Thanks in advance!
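To narrow down which utterances never received distance information, a simple check over the utterance metadata can help. A minimal sketch with toy stand-in data; the meta key name here is an assumption based on output_field="prompt_types" and n_types=8 above, and may differ from what PromptTypes actually writes:

```python
# Toy stand-in for corpus utterances: each has an id and a meta dict.
# The key "prompt_types__prompt_dist__8" is a hypothetical name, not
# confirmed against the library.
utterances = [
    {"id": "u1", "meta": {"prompt_types__prompt_dist__8": [0.1] * 8}},
    {"id": "u2", "meta": {}},  # never got a distance vector
    {"id": "u3", "meta": {"prompt_types__prompt_dist__8": None}},
]

def unassigned_ids(utts, key):
    """Return ids of utterances whose meta lacks a usable value for key."""
    return [u["id"] for u in utts if u["meta"].get(key) is None]

missing = unassigned_ids(utterances, "prompt_types__prompt_dist__8")
print(missing)  # these are the utterances to inspect further
```

The same loop over corpus.get_utterance_ids() used above could then be restricted to the missing ids.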

tisjune commented 4 years ago

Since your dataset is a lot smaller than the one used in the demo, I wonder if it's because the utterances which are unassigned contain motifs that are too infrequent. A few things that could help:

  1. Adjust the minimum frequency at which motifs must occur to be considered: see the prompt__tfidf_min_df argument at https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/1d7014ec4e49647b9e6a4c7642a755506c08e131/convokit/prompt_types/promptTypes.py#L62 (and check out the other *_min_df arguments). Another way to diagnose this might be to inspect the motifs in each unassigned utterance and see whether they are indeed very infrequent.

  2. Try out the code using just dependency-parse arcs instead of motifs (motifs may have a better chance of showing up in larger, more linguistically uniform datasets like the parliamentary questions than in conversations that take place in a more informal online setting). Cell #72 in the notebook you linked shows how. The call to PromptTypes in https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/conversations-gone-awry/Conversations_Gone_Awry_Prediction.ipynb might also be informative, since that example deals with internet data too.
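The min_df effect behind point 1 can be illustrated without ConvoKit: if a motif must occur in at least min_df documents to be kept in the vocabulary, an utterance whose motifs are all rare ends up with an empty feature vector, which matches the 0-sample ValueError in the traceback above. A toy sketch in plain Python (not the library's actual vectorizer):

```python
from collections import Counter

# Toy "utterances", each represented by its set of motifs.
docs = {
    "u1": {"do you agree", "what about"},
    "u2": {"what about", "could you"},
    "u3": {"one off motif"},  # its only motif appears nowhere else
}

def kept_features(docs, min_df):
    """Keep only motifs occurring in at least min_df documents,
    mirroring tf-idf's min_df pruning."""
    df = Counter(m for motifs in docs.values() for m in motifs)
    vocab = {m for m, count in df.items() if count >= min_df}
    return {doc_id: motifs & vocab for doc_id, motifs in docs.items()}

feats = kept_features(docs, min_df=2)
empty = [doc_id for doc_id, f in feats.items() if not f]
print(empty)  # utterances left with no features, hence no prompt type
```

In a small corpus like 8017 PR comments, many motifs fall near the pruning threshold, so where the *_min_df arguments are set can decide how many utterances end up assignable.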

sophieball commented 4 years ago

They did help! Thanks!