Closed: wise-east closed this issue 2 years ago
@cristiandnm Hello Cristian! Let me know if there is anything I can do to help get this dataset included in ConvoKit :)
Thanks for the dataset and for contributing it to ConvoKit! One minor thing that would be good to change before we include it: I notice that as of now there are only two speakers in the dataset, which can be confusing and might conflict with the intended use of the corpus hierarchy (it implies that just 2 speakers generated all utterances). Could you create a unique pair of speakers for each conversation (e.g., speaker_43245_1 and speaker_43245_2, where 43245 is the conversation id)?
Let me know if that makes sense.
Cristian
Yes, that makes sense. I made the update according to your suggestion (e.g., speaker_43245_1 and speaker_43245_2, where 43245 is the conversation id).
Thanks! We'll add this to the website soon and close this issue at that point.
@wise-east Sorry for the delay, and thanks for the patience, we've been busy developing other features for ConvoKit. We'll be adding the dataset today. Could you update the SPOLIN description you shared re: how speakers are named?
It seems confusing at the moment: some speaker pairs are suffixed with '0' and '1', while others are suffixed with '1' and '2'. Is this deliberate?
@calebchiam Hi Caleb. I've updated the description. Sorry for the confusion. All of them should be 0 and 1 for the first and second turns respectively, but for the validation set there was a bug that made them 1 and 2 instead.
I don't have access to the code base that formatted the dataset for the time being, so could you simply make this fix for the speaker rows, such that valid_x_p_1 becomes valid_x_p_0 and valid_x_r_2 becomes valid_x_r_1? If not, I'll make the fix later when I have access and notify you then.
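If it helps, the suffix shift could be scripted along these lines. This is just a sketch: I'm assuming the validation speaker ids literally end in an underscore followed by 1 or 2 (as in valid_x_p_1 / valid_x_r_2 above), and `fix_valid_speaker_id` is a made-up helper name, not anything in the dataset code.

```python
import re

def fix_valid_speaker_id(speaker_id: str) -> str:
    """Shift validation speaker suffixes 1/2 down to 0/1 to match the train set.

    Assumed id shape: 'valid_<conversation>_<p|r>_<n>', e.g.
    'valid_123_p_1' -> 'valid_123_p_0', 'valid_123_r_2' -> 'valid_123_r_1'.
    """
    m = re.fullmatch(r"(valid_.*_)([12])", speaker_id)
    if m is None:
        # Non-validation ids (already suffixed 0/1) are left untouched.
        return speaker_id
    return m.group(1) + str(int(m.group(2)) - 1)
```

Applying this to every speaker id in the validation split would make the suffixes consistent with the train split.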
@wise-east Added comments on your Google doc, but it looks like there are some other issues, so please address those and I'll come back to this later. Thanks!
Hi @calebchiam I finally got around to making the updates as you suggested. The drive folder contains the updated dataset and the google doc has been updated with your feedback. Please let me know if you need anything else! Thank you.
@wise-east Was on vacation, so had to take a pause on this.
Last bit of info I'll need for this, what are the sizes of the train and test splits?
Hope you had a good vacation!
Here are the splits: https://github.com/wise-east/spolin
data/spolin-train.json:

| | yesands | non-yesands |
|--|---:|---:|
| Spontaneanation | 10,459 | 5,587* |
| Cornell | 16,426 | 18,310 |
| SubTle | 40,303 | 19,512 |
| Total | 67,188 | 43,409 |

data/spolin-valid.json:

| | yesands | non-yesands |
|--|---:|---:|
| Spontaneanation | 500 | 500 |
| Cornell | 500 | 500 |
| Total | 1,000 | 1,000 |
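As a quick sanity check on the counts above, the per-source rows do sum to the stated totals. This is a standalone verification snippet (the dict names are mine, not part of the dataset code):

```python
# Per-source (yesands, non-yesands) counts copied from the split tables.
train = {
    "Spontaneanation": (10_459, 5_587),
    "Cornell": (16_426, 18_310),
    "SubTle": (40_303, 19_512),
}
valid = {
    "Spontaneanation": (500, 500),
    "Cornell": (500, 500),
}

def totals(split):
    """Sum the yesands and non-yesands columns across all sources."""
    yesands = sum(y for y, _ in split.values())
    non_yesands = sum(n for _, n in split.values())
    return yesands, non_yesands

assert totals(train) == (67_188, 43_409)  # matches the train Total row
assert totals(valid) == (1_000, 1_000)    # matches the valid Total row
```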
Thanks for your contribution, @wise-east!
You can now download the SPOLIN corpus through ConvoKit: `Corpus(download('spolin-corpus'))`
Your dataset description can be found here: https://convokit.cornell.edu/documentation/spolin.html
If you want to change the description, you can edit the file directly and make a PR.
@all-contributors add @wise-east for data
@calebchiam
I've put up a pull request to add @wise-east! :tada:
Hello, I would like to have the SPOLIN dataset from Grounding Conversations with Improvised Dialogues (ACL 2020) added to ConvoKit.
All the requested information according to the contribution guidelines can be found here: https://drive.google.com/drive/folders/1XjwgEh38N-9MwNwAy1icpMYFMlvUzl_u?usp=sharing
Let me know if you need any help on my part to get this dataset added. Thank you!