SpectData / MONAH_Earnings_Call

Investigate Financial Data Set on sec.gov #4

Open joshkyh opened 3 years ago

joshkyh commented 3 years ago

https://www.sec.gov/dera/data/financial-statement-data-sets.html

The paper "A Graph Mining Approach to Identify Financial Reporting Patterns: An Empirical Examination of Industry Classifications" uses the above dataset to represent financial statements as trees, and trees can be seen as graphs.

joshkyh commented 3 years ago

Hi Marriane,

(1) Find out how to filter the large NUM file, and upload only the data for the company tickers that we care about in the earnings call mp3s. The readme file in the zip has only one reference to a ticker. The only method Joshua knows of, from his 2-minute analysis, is that the ticker is part of the abcd-date.xml value in a column of the SUB file, where abcd is usually the ticker symbol. We will need to do a reconciliation to check that we caught all of the tickers required for the earnings calls.

(2) Do the above for the date range of the MAEC (count by quarters, because the SEC stores its files by YYYY-QQ).
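To make the "count by quarters" step in (2) concrete, here is a minimal sketch that lists every calendar quarter touched by a date range. The function name `quarters_between`, the label format (`2019q4`), and the example dates are assumptions for illustration only; the real MAEC date range and the exact archive naming on sec.gov should be checked before relying on this.

```python
from datetime import date


def quarters_between(start: date, end: date) -> list[str]:
    """Return one label per calendar quarter that overlaps [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    labels = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    last = (end.year, (end.month - 1) // 3 + 1)
    while (year, quarter) <= last:
        # Label format is an assumption; adjust to the archive names actually used.
        labels.append(f"{year}q{quarter}")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return labels


# Hypothetical date range, not the real MAEC range.
print(quarters_between(date(2019, 11, 15), date(2020, 5, 1)))
```

The length of the returned list is the number of quarterly archives to download and filter.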

mfmakahiya commented 3 years ago

Hi Joshua! I think joining the SUB file to the NUM file through adsh solves our problem of filtering for only the company tickers we need. It seems that a unique ADSH is assigned to each filing, per company and per quarter. The challenge, I think, lies in filtering on several ADSH values from our target filings and target company tickers in one go, as doing it manually in a text file is tedious.
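One possible way to filter many ADSH values in one go is to hold the wanted values in a set and stream through NUM once. The sketch below assumes the file is tab-delimited with an `adsh` header column; the function name `filter_num_rows`, the sample rows, and the ADSH values are made up for illustration.

```python
import csv
import io


def filter_num_rows(num_file, wanted_adsh):
    """Return the NUM rows whose adsh is in wanted_adsh (a single pass over the file)."""
    wanted = set(wanted_adsh)
    # QUOTE_NONE: treat quote characters in the data as ordinary text.
    reader = csv.DictReader(num_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row["adsh"] in wanted]


# Tiny made-up stand-in for the real NUM file.
NUM_SAMPLE = (
    "adsh\ttag\tddate\tvalue\n"
    "0000000001-20-000001\tAssets\t20200331\t100\n"
    "0000000002-20-000007\tAssets\t20200331\t250\n"
    "0000000001-20-000001\tLiabilities\t20200331\t60\n"
)

rows = filter_num_rows(io.StringIO(NUM_SAMPLE), {"0000000001-20-000001"})
print([(r["tag"], r["value"]) for r in rows])
```

For the real file, the `io.StringIO` object would be replaced by an opened file handle; since the file is large, writing matches out as they are found (rather than building a list) may be preferable.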

joshkyh commented 3 years ago

Hi Marriane,

Thanks for the update.

After the SUB s joins NUM n on s.adsh = n.adsh, what is the WHERE condition to filter for the company tickers we need?

In that case, do you recommend working on the storage blob ticket (#5) first, so that you have access to all the earnings call data and know the full universe of tickers and dates?

mfmakahiya commented 3 years ago

We need to filter the data through the "instance" column of SUB, since it is the only column that contains the ticker symbol.
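As a rough sketch of that filter, the ticker can be taken as the text before the first hyphen of the "instance" value, and the matching ADSH values collected for the join. This assumes a tab-delimited SUB file with `adsh` and `instance` columns and that the prefix really is the ticker, which will not hold for every filing; the helper names, sample rows, and ticker list are hypothetical. The function also reports which wanted tickers were never seen, which is the reconciliation step.

```python
import csv
import io


def ticker_from_instance(instance: str) -> str:
    """Guess the ticker as the part before the first '-' (e.g. 'aapl-20200926.xml')."""
    return instance.split("-", 1)[0].upper()


def select_submissions(sub_file, tickers):
    """Return ({adsh: ticker} for wanted tickers, set of wanted tickers not found)."""
    wanted = {t.upper() for t in tickers}
    adsh_to_ticker = {}
    reader = csv.DictReader(sub_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        ticker = ticker_from_instance(row["instance"])
        if ticker in wanted:
            adsh_to_ticker[row["adsh"]] = ticker
    missing = wanted - set(adsh_to_ticker.values())
    return adsh_to_ticker, missing


# Made-up rows standing in for the real SUB file.
SUB_SAMPLE = (
    "adsh\tname\tinstance\n"
    "0000000001-20-000001\tEXAMPLE ONE\taapl-20200926.xml\n"
    "0000000002-20-000007\tEXAMPLE TWO\tmsft-20200630.xml\n"
    "0000000003-20-000009\tEXAMPLE THREE\tabc-20200331.xml\n"
)

found, missing = select_submissions(io.StringIO(SUB_SAMPLE), ["AAPL", "MSFT", "TSLA"])
print(found)
print("not found:", missing)
```

Any ticker listed under "not found" would need a manual lookup, for example by company name or CIK.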

I'm also done with the script for uploading the folders to Azure.

joshkyh commented 3 years ago

Very good. The next problem I have on this is: how do we link up all the entities within NUM such that they are presented like a Balance Sheet or Income Statement? Currently all entities (rows) float around with no relationships among them. Two examples of relationships:

Total Assets = Total Liabilities + Total Equity

Total Liabilities = Short-term Liabilities + Long-term Liabilities + Other Liabilities

(1) Please have a look at deci.12345.pdf. They explain that financial statements can be represented as a graph, with the nodes being the entities of the financial statements.

(2) Given that this SEC dataset is available to all, there must be existing work on linking the entities up.

(3) Last resort (this is me not having done any googling): a brute-force for loop that pairs up two entities and checks whether they add up to a third entity. Terrible last resort.
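For reference, the brute-force idea in (3) could look roughly like the sketch below. It takes the tag values of one filing for one date as a dictionary and tests every pair against every other tag. The function name `find_pair_sums`, the tolerance, and the sample numbers are all made up for illustration.

```python
from itertools import combinations


def find_pair_sums(values, tolerance=0.5):
    """Return (a, b, total) triples of tags where values[a] + values[b] is about values[total]."""
    matches = []
    for a, b in combinations(sorted(values), 2):
        pair_sum = values[a] + values[b]
        for total, total_value in values.items():
            if total not in (a, b) and abs(pair_sum - total_value) <= tolerance:
                matches.append((a, b, total))
    return matches


# Made-up values for a single filing and date.
sample = {
    "Assets": 100.0,
    "Liabilities": 60.0,
    "StockholdersEquity": 40.0,
    "Cash": 25.0,
}
print(find_pair_sums(sample))
```

This is cubic in the number of tags, only finds two-term sums, and will report coincidental matches (zero values and duplicated amounts in particular), which is why it remains a last resort.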

joshkyh commented 3 years ago

You've attempted to identify popular entity names to see whether you can match the earnings call to the financial statement items.

We discussed that this problem of linking items can be better solved with existing open source Python libraries. Try Googling "Python GitHub XBRL", "Python GitHub SEC filings", etc.