sophieball closed this issue 4 years ago
Since your dataset is a lot smaller than the one used in the demo, I wonder if it's because the utterances which are unassigned contain motifs that are too infrequent. A few things that could help:
Lower the minimum frequency at which motifs must occur to be considered: see the prompt__tfidf_min_df argument at https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/1d7014ec4e49647b9e6a4c7642a755506c08e131/convokit/prompt_types/promptTypes.py#L62 (and check out the other *_min_df arguments). Another way to diagnose this might be to inspect the motifs in each unassigned utterance and see whether they are indeed super infrequent.
Try out the code using just dependency parse arcs, instead of motifs (which might have a better chance of showing up in larger, more linguistically uniform datasets like the parliamentary questions, than in conversations that take place in a more informal, online setting). Cell #72 in the notebook you linked suggests how. The call to PromptTypes in https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/conversations-gone-awry/Conversations_Gone_Awry_Prediction.ipynb might also be informative, since that deals with internet data too.
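To make the frequency diagnostic concrete, here is a minimal sketch in plain Python. It is not ConvoKit's implementation, just an illustration of how a minimum document frequency cutoff can leave utterances with no usable features. The `utt_motifs` mapping and both helper names are hypothetical; you would need to build the mapping yourself from whatever motifs (or arcs) your corpus stores per utterance.

```python
from collections import Counter


def document_frequencies(utt_motifs):
    """Count, for each motif, how many utterances it appears in."""
    df = Counter()
    for motifs in utt_motifs.values():
        df.update(set(motifs))  # count each motif at most once per utterance
    return df


def unassigned_utterances(utt_motifs, min_df):
    """Return ids of utterances with no motif that occurs in >= min_df utterances."""
    df = document_frequencies(utt_motifs)
    return [
        utt_id
        for utt_id, motifs in utt_motifs.items()
        if not any(df[m] >= min_df for m in motifs)
    ]


# Hypothetical toy data: utterance id -> motifs extracted from that utterance.
toy = {
    "u1": ["what>*", "what>is"],
    "u2": ["what>*"],
    "u3": ["could>you"],
    "u4": [],
}

# Compare how many utterances lose all their features at two cutoffs.
for cutoff in (2, 1):
    print(cutoff, unassigned_utterances(toy, min_df=cutoff))
```

If most of your unassigned utterances only contain motifs that fall below the cutoff, a lower *_min_df value (or switching to plain arcs) is likely to help; utterances with no motifs at all will stay unassigned either way.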
They did help! Thanks!
Hi! I wanted to replicate this paper on GitHub pull request comments data from ghtorrent.org. In my dataset, each conversation is a pull request and each utterance is a comment in the PR (including the initial PR description). For meta info, I have the author login, author's association with the project (Contributor, Member, Author, etc), and created_at, along with reply_to and conversation id.
What I don't quite understand is that after I fit and transform the pt model on the same dataset, not all utterances get their distance vectors. For example, I have 8017 utterances, but only 2061 of them have distance information.
I tried to use transform_utterance() on each individual utterance, but I got an error.
Below is part of my code (after following this example to create the corpus and parsing the text):
Please let me know if I need to provide more of my code. Thanks in advance!