Closed: calebchiam closed this issue 3 years ago
I've picked this up. Will submit a PR soon!
Hi @calebchiam, I was hoping you could help with a question I had regarding this issue. I noticed that functionality to read JSON files for utterances, speakers, and conversations and create a corpus is already present. Pandas has a built-in `to_json` method that can create those files from the user's dataframes.
Do you think we could internally create JSON files from the 3 dataframes, then read them in as a corpus (through a corpus directory) as usual? This process could be made into a static method, as you suggested. Or do you think this intermediate step of creating JSON files from dataframes should be avoided? If it must be avoided, I think this issue would also invariably have to solve #78. Let me know what you think. Thanks!
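The intermediate step being proposed could be sketched as follows. This is a minimal illustration of pandas' built-in serialization, not ConvoKit's actual JSON layout; the column names here are assumptions for the sketch.

```python
import pandas as pd

# Hypothetical utterances dataframe; the column names are assumptions
# for this sketch, not ConvoKit's exact schema.
utterances_df = pd.DataFrame({
    "id": ["u1", "u2"],
    "speaker": ["alice", "bob"],
    "text": ["hi", "hello"],
})

# pandas' built-in to_json serializes each row as a JSON record --
# the kind of intermediate file being discussed here.
json_str = utterances_df.to_json(orient="records")
print(json_str)
```

In practice ConvoKit's corpus directory expects its own specific file layout, which is exactly the complexity the reply below points out.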
Hi @Ap1075, thanks for picking this up. I think we'd want to avoid the intermediate step of creating JSON files from dataframes, because the JSON file-writing process in ConvoKit comes with its own set of complexities and specificities that would only complicate this `from_pandas` method.
Instead, we might expect this method to look something like `convert_df_to_corpus()` in this piece of code -- albeit slightly more complex, because it would have to handle conversation and speaker metadata as well. You would not have to solve #78, since that is more about abstracting the metadata update step into its own method, whereas here you can do something much simpler, like the metadata initialisation in L41-43 of the linked code. Does that make sense?
Just to add on, the basic structure of this method would probably look something like:

1. Create the Speaker objects from the speakers dataframe.
2. Create the Utterance objects (collected into a list, e.g. `utt_list`) from the utterances dataframe, using the Speaker object as an initialization argument for each utterance.
3. Call `Corpus(utterances=utt_list)`, which initializes the Corpus for you + the Conversation objects.

Might be missing some smaller steps, but that's the rough idea.
Thanks @calebchiam, that clears things up a lot. I think I might've implemented steps 2 and 3 already for a project, using the `convert_df_to_corpus` function as reference. I'll tie it all together. I thought linking metadata across speakers, utterances and conversations might need an update function as suggested in #78, but perhaps that's not necessary. Thanks again!
This could be implemented as a static method in Corpus, i.e. `Corpus.from_pandas(...)`, that takes in three arguments: a speakers dataframe, an utterances dataframe, and a conversations dataframe. The columns of the dataframes should mirror the primary data fields of the respective components exactly. All additional metadata should be specified in columns prefixed with 'meta.'. For example, an utterance with a subreddit metadata attribute would have a column called 'meta.subreddit' in the utterances dataframe.
, that takes in three arguments: a speakers dataframe, an utterances dataframe, and a conversations dataframe.The columns of the dataframes should mirror the primary data fields of the respective components exactly. All additional metadata should be specified in columns that are prefixed with 'meta.' For example, an utterance with a subreddit metadata attribute would have a column called 'meta.subreddit' in the utterances dataframe.