SpectData / MONAH_Earnings_Call


Google Speech-to-Text Verbatim #1

Closed joshkyh closed 3 years ago

joshkyh commented 3 years ago

Download the entire dataset: https://github.com/GeminiLn/EarningsCall_Dataset

Then extract the verbatim transcripts and store them in the database.
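
A minimal sketch of that pipeline, assuming the `google-cloud-speech` Python client and a local SQLite table as the store; the GCS URI, database file, and table name are illustrative placeholders, not the project's actual setup:

```python
# Sketch: transcribe one earnings-call audio file with Google Speech-to-Text
# and store the verbatim transcript. The GCS URI, database file, and table
# name below are hypothetical.
import sqlite3

from google.cloud import speech


def transcribe_gcs(gcs_uri: str) -> str:
    """Run asynchronous recognition on an audio file stored in GCS."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_word_time_offsets=True,  # word timestamps, useful later for speech rate
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=3600)
    # Each result carries ranked alternatives; take the top one and join the chunks.
    return " ".join(r.alternatives[0].transcript for r in response.results)


def store_verbatim(db_path: str, call_id: str, transcript: str) -> None:
    """Insert the verbatim transcript into a simple SQLite table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS verbatim (call_id TEXT PRIMARY KEY, transcript TEXT)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO verbatim (call_id, transcript) VALUES (?, ?)",
            (call_id, transcript),
        )


if __name__ == "__main__":
    text = transcribe_gcs("gs://example-bucket/earnings_call_0001.wav")  # hypothetical URI
    store_verbatim("earnings_calls.db", "0001", text)
```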

joshkyh commented 3 years ago

@mfmakahiya

Hi Marianne,

I'm sorry for the disruption, but I have received the files for

MAEC: A Multimodal Aligned Earnings Conference Call Dataset for Financial Risk Prediction

https://dl.acm.org/doi/pdf/10.1145/3340531.3412879?casa_token=dnlAKBHQmoIAAAAA:wxY7uCypXnX11tSxnDysfM5OOwAMLxpWn_gpNH7-pCkeE5n_6sf6ltvg13UZDLCTaOAKDut3gL5iaA

Please read their paper too. I like this paper slightly better because of (1) its size and (2) it includes everyone on the management team, not just the CEO as in Lin's dataset (though not the equity analysts/questioners, which I would have liked). Nevertheless, I am going to pivot us to work on this new MAEC dataset. For now, just assume there is one speaker, i.e. the "executive" said "blabla". Although there are multiple managers in each transcript, I don't think it is easy to identify each manager; in any case, I'm less interested in improving the MONAH textual transcript now and more interested in integrating external information via graphs.

The MAEC dataset, when zipped, is 97 GB (!). However, to help you pivot immediately, I have prepared a tiny version of the folder structure with only two earnings calls in the zip file: https://spectdata-my.sharepoint.com/:u:/p/joshua/ES5JNqKeR09LkVmIMdkfGJcBpM8ITdUk__pu2NyqSs26-g?e=jjDgLQ Please work with these two companies for now for this project. To give you a glimpse of the full dataset folder structure, please see below:

[screenshot: full MAEC dataset folder structure]

I will upload all folders to Azure Blob storage (details later), unzipped, so that you can iteratively download and delete files after processing (just as you have done for previous projects).
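
A sketch of that download-process-delete loop, assuming the `azure-storage-blob` Python SDK; the connection string, container name, and `process_file` hook are placeholders until the Azure details arrive:

```python
# Sketch: stream blobs from the (to-be-announced) Azure container one at a
# time, process each locally, then delete the local copy to cap disk usage.
# CONNECTION_STRING, CONTAINER, and process_file are hypothetical.
import os

from azure.storage.blob import ContainerClient

CONNECTION_STRING = "<provided later>"
CONTAINER = "maec"


def process_file(path: str) -> None:
    """Placeholder for the actual per-call processing (e.g. speech-to-text)."""
    ...


def run() -> None:
    container = ContainerClient.from_connection_string(CONNECTION_STRING, CONTAINER)
    for blob in container.list_blobs():
        local_path = os.path.basename(blob.name)
        with open(local_path, "wb") as f:
            f.write(container.download_blob(blob.name).readall())
        process_file(local_path)
        os.remove(local_path)  # free disk space before the next blob
```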

Also, don't worry too much about delay; I don't think we can get it easily. My focus for this paper is not going to be on building a better MONAH, but on integrating external information into each talkturn. This is my current thinking: I'm happy with the verbatim transcript annotated with Vokaturi and speech rate, and nothing more (delay excluded).
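
For the speech-rate annotation, one cheap option is to derive it from the word time offsets Speech-to-Text already returns. A small sketch; the words-per-minute definition and the per-talkturn grouping are my assumptions, not something specified in this thread:

```python
# Sketch: words-per-minute speech rate for one talkturn, computed from the
# (start, end) word timestamps that enable_word_time_offsets=True provides.
# The word list below is illustrative.

def speech_rate_wpm(word_times: list[tuple[float, float]]) -> float:
    """word_times: (start_sec, end_sec) per word, in order of utterance."""
    if not word_times:
        return 0.0
    duration_sec = word_times[-1][1] - word_times[0][0]
    if duration_sec <= 0:
        return 0.0
    return len(word_times) * 60.0 / duration_sec


# Example: 5 words spoken over 2.5 seconds -> 120.0 words per minute.
print(speech_rate_wpm([(0.0, 0.4), (0.4, 0.9), (1.0, 1.5), (1.6, 2.0), (2.1, 2.5)]))
```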

joshkyh commented 3 years ago

Speech to text done.