joshkyh closed this issue 3 years ago
@mfmakahiya
Hi Marianne,
I'm sorry for the disruption, but I have received the files for
*MAEC: A Multimodal Aligned Earnings Conference Call Dataset for Financial Risk Prediction*.
Please read their paper too. I like this paper slightly better because of (1) its size and (2) the fact that they included everyone on the management team, not just the CEO as in Lin (though not the equity analysts/questioners, which I would have liked). Nevertheless, I am going to pivot us to work on this new MAEC dataset. For now, just assume that there is one speaker, i.e. the "executive" said everything. Although there are multiple managers in the transcript, I don't think it is easy to identify each one. In any case, I'm less interested in improving the MONAH textual transcript now, and more interested in integrating external information via graphs.
The MAEC dataset, when zipped, is 97 GB (!). However, to help you pivot immediately, here is a tiny version of the folder structure: https://spectdata-my.sharepoint.com/:u:/p/joshua/ES5JNqKeR09LkVmIMdkfGJcBpM8ITdUk__pu2NyqSs26-g?e=jjDgLQ. I have included only two earnings calls in this zip file; please work with these two companies for now. To give you a glimpse of the full dataset's folder structure, see below:
I will upload all folders to Azure Blob storage (details later), unzipped, so that you can iteratively download files and delete them after processing (just as you have done for previous projects).
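The download-process-delete loop could look something like the sketch below. This is only an illustration of the bounded-disk pattern: the `list_remote_files`, `download_file`, and `process_file` helpers are hypothetical placeholders (simulated locally here), to be replaced with the real Azure Blob calls once the storage details arrive.

```python
import os
import tempfile

# Hypothetical stand-ins for the Azure Blob operations (details to follow).
# In practice these would wrap the real list/download calls for the container.
def list_remote_files():
    """Pretend listing of blobs in the container."""
    return ["call_0001/transcript.txt", "call_0002/transcript.txt"]

def download_file(remote_path, local_dir):
    """Simulate downloading one blob into local_dir."""
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    with open(local_path, "w") as f:
        f.write(f"verbatim transcript for {remote_path}\n")
    return local_path

def process_file(local_path):
    """Placeholder processing step (e.g. extract the verbatim text)."""
    with open(local_path) as f:
        return f.read()

results = []
with tempfile.TemporaryDirectory() as workdir:
    # Iterate: download one file, process it, then delete the local copy,
    # so disk usage stays bounded even though the full dataset is ~97 GB zipped.
    for remote_path in list_remote_files():
        local_path = download_file(remote_path, workdir)
        results.append(process_file(local_path))
        os.remove(local_path)  # free space before the next download

print(len(results))  # prints 2: number of files processed
```

The key design point is that only one file is on disk at a time, which is what makes the 97 GB dataset workable on a small machine.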
Also, don't worry too much about delay; I don't think we can extract it easily. My focus for this paper is not on building a better MONAH, but on integrating external information into each talkturn. This is my current thinking: I'm happy with the verbatim transcript annotated with Vokaturi and speech rate, and nothing more (delay excluded).
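For the speech-rate part of that annotation, a minimal sketch is below (Vokaturi's own API is not shown here). The field names `start_sec`/`end_sec` are hypothetical; they stand in for whatever timing information the aligned MAEC data provides per talkturn.

```python
def speech_rate(text, start_sec, end_sec):
    """Words per second for one talkturn.

    Timings are assumed to come from the dataset's alignment;
    the field names used here are hypothetical.
    """
    duration = end_sec - start_sec
    if duration <= 0:
        return 0.0
    return len(text.split()) / duration

# Example: annotate a single talkturn with its speech rate
talkturn = {
    "speaker": "executive",
    "text": "revenue grew ten percent this quarter",
    "start_sec": 12.0,
    "end_sec": 15.0,
}
talkturn["speech_rate_wps"] = speech_rate(
    talkturn["text"], talkturn["start_sec"], talkturn["end_sec"]
)
print(talkturn["speech_rate_wps"])  # 6 words / 3 s = 2.0
```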
Speech to text done.
Download the entire dataset (https://github.com/GeminiLn/EarningsCall_Dataset), extract the verbatim transcripts, and store them in the database.
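The storage step could be sketched as below. This is a minimal illustration using an in-memory SQLite store; the actual database and schema for the project are not specified in this thread, and the example rows are hypothetical.

```python
import sqlite3

# Minimal sketch: one row per talkturn of verbatim text.
# The real project database and schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE verbatim (
                    company  TEXT,
                    talkturn INTEGER,
                    speaker  TEXT,
                    text     TEXT)""")

# Hypothetical parsed talkturns from one earnings call
rows = [
    ("AAPL", 1, "executive", "Good afternoon, and thanks for joining us."),
    ("AAPL", 2, "executive", "Revenue grew ten percent this quarter."),
]
conn.executemany("INSERT INTO verbatim VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM verbatim").fetchone()[0]
print(count)  # prints 2
```

Keying rows by (company, talkturn) keeps the table ready for the later step of attaching external information to each talkturn.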