Closed: wise-east closed this issue 2 years ago
@cristiandnm Hello Cristian! Let me know if there is anything I can do to help get this dataset included in ConvoKit :)
Thanks for the dataset and for contributing it to ConvoKit! One minor thing that would be good to change before we include it: I notice that as of now there are only two speakers in the dataset, which can be confusing and might conflict with the intended use of the corpus hierarchy (it implies that just 2 speakers generated all utterances). Could you create a unique pair of speakers for each conversation (e.g., speaker_43245_1 and speaker_43245_2, where 43245 is the conversation id)?
Let me know if that makes sense.
Cristian
Yes, that makes sense. I made the update according to your suggestion (e.g., speaker_43245_1 and speaker_43245_2, where 43245 is the conversation id).
Thanks! We'll add this to the website soon and close this issue at that point.
@wise-east Sorry for the delay, and thanks for the patience, we've been busy developing other features for ConvoKit. We'll be adding the dataset today. Could you update the SPOLIN description you shared re: how speakers are named?
It seems confusing at the moment: some speaker pairs are suffixed with '0' and '1', while others are suffixed with '1' and '2'. Is this deliberate?
@calebchiam Hi Caleb. I've updated the description. Sorry for the confusion. All of them should be 0 and 1 for the first and second turns respectively, but for the validation set there was a bug that made them 1 and 2 instead.
I don't have access to the code base that formatted the dataset for the time being, so could you simply make this fix for the speaker rows, such that valid_x_p_1 becomes valid_x_p_0 and valid_x_r_2 becomes valid_x_r_1? If not, I'll make the fix later when I have access and notify you then.
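If it helps, the suffix shift could be scripted along these lines. This is just a sketch: I'm assuming the validation speaker ids literally end in an underscore followed by 1 or 2 (as in valid_x_p_1 / valid_x_r_2 above), and `fix_valid_speaker_id` is a made-up helper name, not anything in the dataset code.

```python
import re

def fix_valid_speaker_id(speaker_id: str) -> str:
    """Shift validation speaker suffixes 1/2 down to 0/1 to match the train set.

    Assumed id shape: 'valid_<conversation>_<p|r>_<n>', e.g.
    'valid_123_p_1' -> 'valid_123_p_0', 'valid_123_r_2' -> 'valid_123_r_1'.
    """
    m = re.fullmatch(r"(valid_.*_)([12])", speaker_id)
    if m is None:
        # Non-validation ids (already suffixed 0/1) are left untouched.
        return speaker_id
    return m.group(1) + str(int(m.group(2)) - 1)
```

Applying this to every speaker id in the validation split would make the suffixes consistent with the train split.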
@wise-east Added comments on your Google doc, but it looks like there are some other issues, so please address those and I'll come back to this later. Thanks!
Hi @calebchiam I finally got around to making the updates as you suggested. The drive folder contains the updated dataset and the google doc has been updated with your feedback. Please let me know if you need anything else! Thank you.
@wise-east Was on vacation, so had to take a pause on this.
Last bit of info I'll need for this, what are the sizes of the train and test splits?
Hope you had a good vacation!
Here are the splits: https://github.com/wise-east/spolin
data/spolin-train.json:

| | yesands | non-yesands |
|--|---:|---:|
| Spontaneanation | 10,459 | 5,587* |
| Cornell | 16,426 | 18,310 |
| SubTle | 40,303 | 19,512 |
| Total | 67,188 | 43,409 |

data/spolin-valid.json:

| | yesands | non-yesands |
|--|---:|---:|
| Spontaneanation | 500 | 500 |
| Cornell | 500 | 500 |
| Total | 1,000 | 1,000 |
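As a quick sanity check on the counts above, the per-source rows do sum to the stated totals. This is a standalone verification snippet (the dict names are mine, not part of the dataset code):

```python
# Per-source (yesands, non-yesands) counts copied from the split tables.
train = {
    "Spontaneanation": (10_459, 5_587),
    "Cornell": (16_426, 18_310),
    "SubTle": (40_303, 19_512),
}
valid = {
    "Spontaneanation": (500, 500),
    "Cornell": (500, 500),
}

def totals(split):
    """Sum the yesands and non-yesands columns across all sources."""
    yesands = sum(y for y, _ in split.values())
    non_yesands = sum(n for _, n in split.values())
    return yesands, non_yesands

assert totals(train) == (67_188, 43_409)  # matches the train Total row
assert totals(valid) == (1_000, 1_000)    # matches the valid Total row
```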
Thanks for your contribution, @wise-east!
You can now download the SPOLIN corpus through ConvoKit: `Corpus(download('spolin-corpus'))`
Your dataset description can be found here: https://convokit.cornell.edu/documentation/spolin.html
If you want to change the description, you can edit the file directly and make a PR.
@all-contributors add @wise-east for data
@calebchiam
I've put up a pull request to add @wise-east! :tada:
Hello, I would like to have the SPOLIN dataset from Grounding Conversations with Improvised Dialogues (ACL 2020) added to ConvoKit.
All the requested information according to the contribution guidelines can be found here: https://drive.google.com/drive/folders/1XjwgEh38N-9MwNwAy1icpMYFMlvUzl_u?usp=sharing
Let me know if you need any help on my part to get this dataset added. Thank you!