current12 / Stat-222-Project


Data Load and Exploratory Data Analysis on Tabular Financial Statement Data #8

Closed ijyliu closed 6 months ago

ijyliu commented 6 months ago

Information about the dataset: https://www.kaggle.com/datasets/finnhub/reported-financials/data

Test all code for the below on a sample of 1-2 statements from different companies and years before rolling out to all data.

image

image

ijyliu commented 6 months ago

here's the actual data dictionary

https://finnhub.io/docs/api/filings-sentiment

ijyliu commented 6 months ago

From Slack 2024-02-16

CHENGZHENGXING 43 minutes ago I have loaded the tabular Financial Statement data and performed exploratory data analysis. The main problem is that each data file is JSON and the labels in it vary. I tried to extract five attributes ['Total Liabilities', 'Total Current Liabilities', 'Total Current Assets', 'Retained Earnings', 'Net Sales'], and only 66 files contain all of them without nulls.

Isaac Liu 17 minutes ago can you try the concept field instead of the label field? looking at the data, that seems more consistent. i'd also suggest converting to uppercase, removing spaces, and stripping punctuation to see if that leads to better matches (see the sketch after this excerpt)

Isaac Liu 11 minutes ago also, you don't have to unzip, there are already unzipped versions on Box

CHENGZHENGXING < 1 minute ago I will try the concept field, but I still think there will be many inconsistencies. I'll try.

  1. Because I process the data in Google Colab, a zip file is easier to upload to Colab.
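
A minimal sketch of the concept-field matching suggested in the excerpt above. It assumes each filing's JSON nests line items under a report object with bs/ic/cf sections and concept/value fields; the nesting, the target concept spellings, and the file path are all assumptions to verify against a sample filing.

```python
import json
import string

# Normalized forms of the five target line items (illustrative; verify against the data).
TARGET_CONCEPTS = {"TOTALLIABILITIES", "TOTALCURRENTLIABILITIES", "TOTALCURRENTASSETS",
                   "RETAINEDEARNINGS", "NETSALES"}

def normalize(name: str) -> str:
    """Uppercase and strip whitespace/punctuation so minor label variations still match."""
    drop = set(string.whitespace + string.punctuation)
    return "".join(ch for ch in str(name).upper() if ch not in drop)

def extract_items(path: str) -> dict:
    """Return {normalized concept: value} for one filing's JSON file."""
    with open(path) as f:
        filing = json.load(f)
    report = filing.get("data", {}).get("report", {})  # nesting is an assumption
    items = {}
    for section in ("bs", "ic", "cf"):  # balance sheet, income statement, cash flow
        for item in report.get(section, []):
            items[normalize(item.get("concept", ""))] = item.get("value")
    return items

# Example: check whether one sample filing contains all five targets.
# items = extract_items("sample_filing.json")
# print(TARGET_CONCEPTS <= items.keys())
```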
ijyliu commented 6 months ago

@current12 for the list of companies you can limit to the ones with S&P 500 ratings 2010-2016 in that combined credit rating file

can you check out the variables that go into Altman's Z-score and make sure we have those for whatever data source you are using? It's in the original graph NN paper

https://pubsonline.informs.org/doi/epdf/10.1287/ijds.2022.00018
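
For reference, the classic Altman Z-score is a fixed linear combination of five ratios, so the raw inputs needed are working capital (current assets minus current liabilities), retained earnings, EBIT, market value of equity, sales, total assets, and total liabilities. A quick sketch; the argument names are illustrative, not necessarily the column names in the data.

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Classic (1968) Altman Z-score for public manufacturing firms."""
    return (1.2 * working_capital / total_assets
            + 1.4 * retained_earnings / total_assets
            + 3.3 * ebit / total_assets
            + 0.6 * market_value_equity / total_liabilities
            + 1.0 * sales / total_assets)
```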

current12 commented 6 months ago

> @current12 for the list of companies you can limit to the ones with S&P 500 ratings 2010-2016 in that combined credit rating file
>
> can you check out the variables that go into Altman's Z-score and make sure we have those for whatever data source you are using? It's in the original graph NN paper
>
> https://pubsonline.informs.org/doi/epdf/10.1287/ijds.2022.00018

Thanks, I have retrieved all the raw data (a lot of quantitative attributes) and it comes from the same source as the paper. In the next step, we need to process the data to derive some more advanced statistics (based on the variables in the paper).

ijyliu commented 6 months ago

yeah, you can go ahead and create the variables from the paper, and also post a link to a notebook once you've done EDA. and feel free to check off the other tasks here once they're done

ijyliu commented 6 months ago

@current12 you should investigate these financial statement variables further, either in the raw data or in all_data/the merged data, once you have a chance. there are some really nasty outliers and values in the quadrillions of dollars!

(from all_data)

image

current12 commented 6 months ago

> @current12 you should investigate these financial statement variables further, either in the raw data or in all_data/the merged data, once you have a chance. there are some really nasty outliers and values in the quadrillions of dollars!
>
> (from all_data)
>
> image

Thanks, I have investigated the raw data. It's due to some values that are not reported in thousands being mistakenly multiplied by 1,000, so extremely big numbers appear. I'll try to fix it.

current12 commented 6 months ago
image

I'm trying to fix this by dealing with outliers that fall outside the 2.5% and 97.5% quantiles, like this
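
A minimal sketch of that quantile clipping, assuming a pandas DataFrame df and a list numeric_cols of the financial columns (both names are placeholders):

```python
import pandas as pd

def winsorize_columns(df: pd.DataFrame, cols, lower=0.025, upper=0.975) -> pd.DataFrame:
    """Clip each listed column to its [2.5%, 97.5%] quantile range."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out

# df = winsorize_columns(df, numeric_cols)
```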

ijyliu commented 6 months ago

eventually we will want to move this out of the EDA file and further upstream to where you create the datasets. be very selective about the variables you apply it to within the tabular financial data; we don't want to edit things like years or dates

when you do it, be sure to print out a summary statistics table before and after you apply it so we have documentation of what it changed

but overall I think it's a good approach and i've seen it before in the finance literature
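
One way to produce that before/after documentation, reusing the winsorize_columns sketch above (df and numeric_cols are still placeholders):

```python
# Summarize the financial columns before and after the correction so the change is documented.
summary_before = df[numeric_cols].describe()
df = winsorize_columns(df, numeric_cols)
summary_after = df[numeric_cols].describe()

print(summary_before.round(2).to_string())
print(summary_after.round(2).to_string())
```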

ijyliu commented 6 months ago

moving deal_with_invalid_numbers to the combined all-data file

ijyliu commented 6 months ago

@current12

realized that many of the values are not in USD. for example, below we have some items in Canadian dollars (CAD). we need to be mindful of the reportedCurrency column

image

I'm going to proceed with dropping non-USD rows before checking the extreme values again. if that resolves the stuff with values being in the quadrillions, etc., I'll remove that code

ALWAYS look at your data, I guess, haha
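
A hedged sketch of the USD filter using the reportedCurrency column; the DataFrame name all_data and the identifier columns are assumptions based on the file posted below.

```python
# Keep only USD-denominated statements.
usd_only = all_data[all_data["reportedCurrency"] == "USD"].copy()

# Record the company-year-quarter combinations dropped by the filter, for documentation.
lost = (all_data.loc[all_data["reportedCurrency"] != "USD",
                     ["symbol", "calendarYear", "period"]]
        .drop_duplicates())
```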

ijyliu commented 6 months ago

@current12

here are the company-by-year-by-quarter observations we lost by limiting to USD. I looked at the API and the raw data; I don't think we can do anything to salvage them without finding another data source and scraping.

symbol_calendarYear_period_lost_due_to_USD_limitation_and_in_credit_rating.xlsx

here are new summary statistics before applying the correction, but after limiting to USD only. no more values in the quadrillions, though there are some values around 400 trillion, etc. that look suspicious. these items don't always end in 000s

sum_stats_before_correction - formatted.xlsx

overall, I don't think we should do the correction in its current form anymore, because mis-multiplication by 1,000 isn't clearly the primary explanation for the inaccuracy. I'd instead suggest either no correction, or Winsorizing (what we've been doing, with the quantiles, except not conditioning on values ending in 000.00). What do you think?

current12 commented 6 months ago

I suggest we still need to remove extreme, suspicious data that ends in 000000.0, like the following (maybe Winsorizing is enough to deal with it).

image
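
A quick way to flag that pattern, purely as an illustration; the cutoff and the all_data/numeric_cols names are assumptions.

```python
# Flag suspiciously large values that are exact multiples of one million (assumed cutoff of 1e12).
threshold = 1e12
for col in numeric_cols:
    suspicious = (all_data[col].abs() >= threshold) & (all_data[col] % 1_000_000 == 0)
    print(col, int(suspicious.sum()), "suspicious values")
```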
ijyliu commented 6 months ago

that's in the trillions though, so it's theoretically possible (AAPL market cap and a few others are in the trillions). though actually I'm pretty sure that happened well after 2016, so maybe you're right.

I added a bunch more stuff; if you can pull, I'd suggest looking at these:

https://github.com/current12/Stat-222-Project/blob/main/Code/Data%20Loading%20and%20Cleaning/Tabular%20Financial/Combine%20and%20Clean%20Tabular%20Financial%20Statements%20Data.ipynb

https://github.com/current12/Stat-222-Project/blob/main/Code/Exploratory%20Data%20Analysis/Tabular%20Financial/Tabular%20Financial%20Statements%20Data%20EDA.ipynb

do you have any other suggestions? I'm trying to get market capitalization from the API (it will probably have to run overnight) because using common_plus_preferred_stock in its place is extremely inaccurate. probably as a result, the final Altman Z-scores seem way too low (averaging around 0.72, where in the graph NN paper they were around 2)

ijyliu commented 6 months ago

wow, actually the API was very fast. fixed it up and altman_z makes sense now

of course you can still review if you'd like, but I'd prioritize making the regression