current12 / Stat-222-Project

Data Load and Exploratory Data Analysis on Earnings Calls #7

Closed ijyliu closed 6 months ago

ijyliu commented 6 months ago

Information about the dataset: https://www.kaggle.com/datasets/v1ctor10/earnings-call-nlp-strategy-v2. Source: ROIC.ai

Test all code for the below on a sample of 1-2 calls from different companies and years before rolling out to all data.
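
A quick way to grab such a sample (a sketch only; the Box folder layout and file naming here are assumptions):

```python
import random
from pathlib import Path

# Hypothetical layout: one .txt transcript per call under the shared Box folder.
CALLS_DIR = Path.home() / "Box" / "STAT 222 Capstone" / "earnings_calls"

# Skip the speaker-names files; keep only transcripts.
all_calls = [p for p in CALLS_DIR.rglob("*.txt") if "speaker" not in p.name.lower()]
random.seed(222)  # reproducible test sample
test_sample = random.sample(all_calls, k=2)  # 1-2 calls for a dry run
print(test_sample)
```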

ijyliu commented 6 months ago

Slack 2024-02-16

Owen Lin 10:25 AM Okay, so for the earnings calls, this is the situation: it took about 30 minutes to extract 1000 files (one of the smaller sectors) on my computer. The developer token only lasts for 1 hour. There are about 1-2 files in each sector that can't be read and stored automatically by my code; so far each has required a different modification. I am thinking about extracting sector by sector before we merge them. Given the time constraints this week, let me upload the automobile part first along with some basic summary stats on this sector. (edited)

Isaac Liu 10:34 AM You probably already did this, but did you check that they are all downloaded and local before running Python? You can do that by setting the entire Box folder to be available offline. Don't spend too much time on this if pip fails to install it cleanly, but here's a replacement for pandas that's possibly faster: https://pypi.org/project/modin/. You could also probably use bash or Linux tools. I don't think we need anything more than the date and the earnings call text, so you can probably read in only the first 2-3 rows as a direct argument to read_csv or whatever you end up using. You can also skip all the speaker-names files in every folder. (edited)

Owen Lin 1 hour ago Thanks for the reference! And yeah, I am actually pulling data directly from Box with the API. (Now, looking at the size of the folder, I should probably just download it and make things easier lol.) I will look more into these, but for today I will do summary stats on the automobile sector.

Isaac Liu 1 hour ago Yeah, the API is probably limited by your internet speed, which can be a real bottleneck with all these files; it took me like a whole day to upload all the unzipped ones. Hopefully once everyone has everything downloaded we can minimize this. (edited)

Owen Lin 1 hour ago That's the reason I didn't think about downloading it first :smiling_face_with_tear:

Isaac Liu 1 hour ago I think downloads will be pretty easy. Uploads are usually trash tho

Owen Lin 1 hour ago Yup, I will save the result dictionary in a pkl file as well, so we don't need to extract every single time.

Owen Lin 1 hour ago True

Isaac Liu 1 hour ago just let it download everything in the background and it probably won't take more than 8 hrs (edited)
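
A minimal sketch of the suggestions in the exchange above (the file and column names are assumptions):

```python
import pandas as pd

# Read only what's needed: a few columns via usecols, or just the first few
# rows via nrows for a quick test. Column names here are assumptions.
calls = pd.read_csv("calls.csv", usecols=["company", "date", "transcript"])
peek = pd.read_csv("calls.csv", nrows=3)  # quick peek at the first rows

# Cache the parsed result so we don't re-extract every single time.
calls.to_pickle("calls_cache.pkl")
# Later sessions: calls = pd.read_pickle("calls_cache.pkl")
```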

ijyliu commented 6 months ago

@OwenLin2001 are you creating the list of unique companies by earnings call dates separately, or just creating the big dataset that also includes the call text?

OwenLin2001 commented 6 months ago

I am creating a data frame of unique companies by earnings call dates. Something like this: [screenshot of the data frame]

ijyliu commented 6 months ago

Right, but we will also need a dataset of just the company and earnings call date variables to use for preparing the credit rating data. If you already have the big dataset loaded into memory, this might be easy to produce as a side product. If not, either you or I can load the whole big dataset including transcripts, select these two columns, and save them.

https://github.com/current12/Stat-222-Project/issues/7#issue-2127554673 item 2

OwenLin2001 commented 6 months ago

By earnings call date, do you need year and quarter columns like 2010, 1, or the precise date (2016-10-28-17:00:00)? I have the big dataset loaded and I can pull the two columns out.

ijyliu commented 6 months ago

date: yyyy-mm-dd

don't think there will be duplicates on date by company but may want to check
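
A quick sketch of that check and the export (reusing the calls DataFrame from the earlier snippet; call_for_merge matches the file committed below):

```python
# Pull just the merge keys, check for duplicate company-date pairs, then save.
keys = calls[["company", "date"]]
print(keys.duplicated().sum(), "duplicate company-date rows")
keys.drop_duplicates().to_csv("call_for_merge.csv", index=False)
```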

OwenLin2001 commented 6 months ago

I made a new commit to the earnings calls folder; there is a CSV file called call_for_merge that has "company" and "date" as its features.

ijyliu commented 6 months ago

Thanks. Can we store all data in "~/Box/STAT 222 Capstone"? This will avoid file size limit problems on GitHub, ensure everyone is using the same data, and keep all code runnable on everyone's machine.

ijyliu commented 6 months ago

@OwenLin2001 can you investigate this issue with earnings call data?

The following is a load of calls.csv. In the bottom two rows, the quarter and year don't line up at all with the call date. From what I understand, year and quarter are the time period the call results are for, and date is the date of the call. So date should always be later than the year and quarter, and by less than 90 days. But that doesn't seem to be the case in at least several observations, and sometimes the earnings call date is before the year and quarter the results are supposed to be for!

[screenshot: rows of calls.csv where the quarter/year disagree with the call date]

Can you explore the extent of this issue - how many calls are affected - and also look through the documentation at the Kaggle link to see if we are interpreting the variables correctly?

It may help to figure out the year and quarter the earnings call is conducted in (the year and quarter of date). You could divide quarter by 4, subtract .25, and add that to year, for both the earnings call date and the statement year and quarter, and then subtract the two so we know what fraction of a year of lead or lag there is. Or, alternatively, you could get the last date in the statement year/quarter and compare that to the earnings call date (difference in days). A sketch of both checks is below.
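
Something like this, assuming columns named year, quarter, and date (the names are guesses):

```python
import pandas as pd

calls["date"] = pd.to_datetime(calls["date"])

# Fractional-year version: year + quarter/4 - 0.25 for both sides.
stmt_frac = calls["year"] + calls["quarter"] / 4 - 0.25
call_frac = calls["date"].dt.year + calls["date"].dt.quarter / 4 - 0.25
calls["gap_years"] = call_frac - stmt_frac  # negative: call dated before the statement quarter

# Day-count version: last day of the statement quarter vs. the call date.
stmt_period = pd.PeriodIndex(calls["year"].astype(str) + "Q" + calls["quarter"].astype(str), freq="Q")
quarter_end = stmt_period.end_time.normalize()
calls["gap_days"] = (calls["date"] - quarter_end).dt.days  # should be roughly 0-90 if our reading is right
```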

In the longer term, if there is a dispute between the variables, we might just pick the earnings call date as ground truth, and then assign year/quarter as the most recent full year/quarter before the call.
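
A possible sketch of that fallback, continuing from the snippet above:

```python
# Trust the call date; derive the period as the most recent completed quarter.
period = calls["date"].dt.to_period("Q") - 1
calls["year_assigned"] = period.dt.year
calls["quarter_assigned"] = period.dt.quarter
```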

We should do this investigation on the full/un-joined earnings call data, because whether or not the dates are correct will affect all of the joins down the line.

ijyliu commented 6 months ago

Owen Lin:

:ok_hand: working on it. I also looked at the original raw data and confirmed that I didn't mismatch the year/quarter with date. The issue seems to lie in the raw data itself. We can either drop bad rows (recommended), drop the date column in the earnings call data (not a good practice), or scrape more data (time-consuming).

ijyliu commented 6 months ago

Of those, only dropping bad earnings calls is of potential interest.

Or we could force the year and quarter of the results to be before the call.

Is the variable we have been using the only variable for earnings call date in the data? Same for year and quarter?

How many observations have the call before the year and quarter, and by how much? Once we have those summary stats, we could either drop/modify all items that have year and quarter before the call, or pick some cutoff in terms of the gap.

I'd also pick a call with a big mismatch in the raw data and skim the transcript to see what quarter it's actually for and what date it's on. Surely they would announce this pretty early.
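
A sketch of those summary stats, reusing the gap_days column from the earlier snippet:

```python
# Negative gap_days = call dated before its statement quarter ended.
bad = calls["gap_days"] < 0
print(f"{bad.sum()} of {len(calls)} calls are dated before their statement quarter ends")
print(calls.loc[bad, "gap_days"].describe())  # how large the gaps are
```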

ijyliu commented 6 months ago

I see you already separated calls and a small version of calls out. But if you're re-running (you could just set this up to run overnight at the end of today), I'd suggest using parquet format for storing the big calls file. This will save time since it's columnar (easier to just load a few columns at a time for tasks like this), save storage space, and decrease loading times (even when you load all the columns).
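
A sketch of the parquet round trip (requires pyarrow or fastparquet; file names are placeholders):

```python
import pandas as pd

# One-time conversion of the big calls file.
calls.to_parquet("calls.parquet", index=False)

# Columnar storage lets later tasks load only the columns they need:
dates_only = pd.read_parquet("calls.parquet", columns=["company", "date"])
```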

OwenLin2001 commented 6 months ago

The calls_short.csv is our target from 2010-2016, containing ~20,000 calls; it takes around 30 seconds to load with read_csv(). I looked into the discrepancy between year/quarter and date, and I found that the transcripts match the year/quarter. So much of the provided date column isn't reliable.

OwenLin2001 commented 6 months ago

Based on what I found, I think the practical approach is to 1) join the earnings calls by quarter and year with the tabular finance data, and 2) construct a uniform date for a given year/quarter.
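
A sketch of step 1 (the tabular-finance DataFrame fin and its key names are assumptions):

```python
# fin: hypothetical tabular-finance DataFrame keyed by company/year/quarter.
merged = calls.merge(fin, on=["company", "year", "quarter"], how="inner")
```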

ijyliu commented 6 months ago

Yes, we are currently matching quarter and year with tabular finance.

If the earnings call date is messed up, that impacts the overall fixed quarter assignment. So we will need to decide whether to drop observations or pick a cutoff based on the size of the gaps between the year/quarter and earnings call date variables.

ijyliu commented 6 months ago

Pending the web scrape, reset earnings_call_date to the last day of the year/quarter plus 45 (or so) days; a sketch is below.
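
Something like this, reusing the column-name assumptions from the earlier snippets:

```python
import pandas as pd

# Last calendar day of the statement quarter, plus 45 days.
stmt_period = pd.PeriodIndex(calls["year"].astype(str) + "Q" + calls["quarter"].astype(str), freq="Q")
calls["earnings_call_date"] = stmt_period.end_time.normalize() + pd.Timedelta(days=45)
```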

ijyliu commented 6 months ago

superseded