CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Corpus construction from pandas DataFrames #69

calebchiam closed this issue 3 years ago

calebchiam commented 4 years ago

This could be implemented as a static method in Corpus, i.e. Corpus.from_pandas(...), that takes in three arguments: a speakers dataframe, an utterances dataframe, and a conversations dataframe.

The columns of the dataframes should mirror the primary data fields of the respective components exactly. All additional metadata should be specified in columns that are prefixed with 'meta.' For example, an utterance with a subreddit metadata attribute would have a column called 'meta.subreddit' in the utterances dataframe.
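For concreteness, here is a minimal sketch of what the three input dataframes and the proposed call might look like. The `from_pandas` name and argument order follow the proposal above and were not implemented at the time; the utterance column names mirror ConvoKit's primary utterance fields and should be treated as assumptions:

```python
import pandas as pd
from convokit import Corpus

# Primary data fields as columns; extra metadata in 'meta.'-prefixed columns.
utterances_df = pd.DataFrame([
    {"id": "u1", "speaker": "alice", "conversation_id": "u1", "reply_to": None,
     "timestamp": 0, "text": "Hello!", "meta.subreddit": "AskReddit"},
    {"id": "u2", "speaker": "bob", "conversation_id": "u1", "reply_to": "u1",
     "timestamp": 1, "text": "Hi there.", "meta.subreddit": "AskReddit"},
])
speakers_df = pd.DataFrame([
    {"id": "alice", "meta.karma": 100},
    {"id": "bob", "meta.karma": 42},
])
conversations_df = pd.DataFrame([
    {"id": "u1", "meta.title": "Greetings thread"},
])

# Proposed static constructor (hypothetical at this point in the thread),
# with the argument order described above:
corpus = Corpus.from_pandas(speakers_df, utterances_df, conversations_df)
```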

Ap1075 commented 3 years ago

I've picked this up. Will submit a PR soon!

Ap1075 commented 3 years ago

Hi @calebchiam, I was hoping you could help with a question I had regarding this issue. I noticed that the functionality to read JSON files for utterances, speakers, and conversations and create a corpus is already present. Pandas has a built-in "to_json" method which can create those files from the user's dataframes.

Do you think we could internally create JSON files from the 3 dfs and then read them in as a corpus (through a corpus directory) as usual? This process could be wrapped up in a static method as you suggested. Or do you think this intermediate step of creating JSON files from dataframes should be avoided? If it must be avoided, I think this issue would invariably have to solve #78 as well. Let me know what you think. Thanks!
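To make the suggestion concrete, this is roughly the intermediate step being floated, reusing the dataframes sketched earlier in the thread. The file names and orients are assumptions, and as the reply below notes, ConvoKit's real on-disk format has more moving parts (e.g. an index), so this is illustrative only:

```python
import os
from convokit import Corpus

# Assumed directory layout, for illustration only.
os.makedirs("df_corpus", exist_ok=True)
utterances_df.to_json("df_corpus/utterances.jsonl", orient="records", lines=True)
speakers_df.set_index("id").to_json("df_corpus/speakers.json", orient="index")
conversations_df.set_index("id").to_json("df_corpus/conversations.json", orient="index")

# Then load through the existing corpus-directory reader (hypothetical here:
# the real on-disk format requires additional files, such as an index).
corpus = Corpus(filename="df_corpus")
```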

calebchiam commented 3 years ago

Hi @Ap1075, thanks for picking this up. I think we'd want to avoid the intermediate step of creating JSON files from dataframes, because the JSON file-writing process in ConvoKit comes with its own complexities and specificities that would only complicate this from_pandas method.

Instead, we might expect this method to look something like convert_df_to_corpus() in this piece of code -- albeit slightly more complex because it would have to handle conversation and speaker metadata as well. You would not have to solve #78, since that is more about abstracting the metadata update step into its own method, whereas here you can do something much simpler like the metadata initialisation in L41-43 of the linked code. Does that make sense?
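As a hedged illustration of that simpler metadata-initialisation step (`extract_meta` is a hypothetical helper name; the linked code may do this differently), collecting the 'meta.'-prefixed columns of a dataframe row into a plain dict could look like:

```python
def extract_meta(row):
    # Collect 'meta.'-prefixed columns from a dataframe row into a dict,
    # stripping the prefix, per the column convention proposed above.
    return {col[len("meta."):]: row[col]
            for col in row.index if col.startswith("meta.")}

# e.g. extract_meta(utterances_df.iloc[0]) -> {'subreddit': 'AskReddit'}
```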

calebchiam commented 3 years ago

Just to add on, the basic structure of this method would probably look something like:

  1. Initialize Speakers from Speaker dataframe
  2. Initialize Utterances (utt_list) from the utterances dataframe, using the corresponding Speaker object as an initialization argument for each utterance
  3. Initialize the Corpus with Corpus(utterances=utt_list), which builds the Corpus and its Conversation objects for you
  4. Add Conversation metadata from dataframe to Conversation objects

Might be missing some smaller steps, but that's the rough idea (see the sketch below).
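A rough sketch of those four steps, assuming the dataframes and the extract_meta helper sketched earlier in this thread, and the Speaker/Utterance/Corpus constructors of ConvoKit 2.x (argument names have shifted across versions, so treat these as assumptions):

```python
from convokit import Corpus, Speaker, Utterance

# 1. Initialize Speakers from the speakers dataframe
speakers = {row["id"]: Speaker(id=row["id"], meta=extract_meta(row))
            for _, row in speakers_df.iterrows()}

# 2. Initialize Utterances, passing in the corresponding Speaker object
utt_list = [Utterance(id=row["id"], speaker=speakers[row["speaker"]],
                      conversation_id=row["conversation_id"],
                      reply_to=row["reply_to"], timestamp=row["timestamp"],
                      text=row["text"], meta=extract_meta(row))
            for _, row in utterances_df.iterrows()]

# 3. Corpus construction also builds the Conversation objects
corpus = Corpus(utterances=utt_list)

# 4. Attach conversation metadata from the conversations dataframe
for _, row in conversations_df.iterrows():
    corpus.get_conversation(row["id"]).meta.update(extract_meta(row))
```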

Ap1075 commented 3 years ago

Thanks @calebchiam, that clears things up a lot. I think I might have already implemented steps 2 and 3 for a project, using the convert_df_to_corpus function as a reference. I'll tie it all together. I thought linking metadata across speakers, utterances, and conversations might need an update function as suggested in #78, but perhaps that's not necessary. Thanks again!