Closed by ijyliu 8 months ago
here's the actual data dictionary
From Slack 2024-02-16
CHENGZHENGXING 43 minutes ago I have loaded the tabular Financial Statement Data and performed exploratory data analysis. The main problem is that each data file is a JSON file and the labels in it vary. I tried to extract five attributes ['Total Liabilities', 'Total Current Liabilities', 'Total Current Assets', 'Retained Earnings', 'Net Sales'] and there are only 66 files that contain all of them without nulls.
Isaac Liu 17 minutes ago can you try with the concept field instead of the label field? looking at the data, that seems more consistent. I'd also suggest converting to uppercase, removing spaces, and stripping punctuation to see if that leads to better matches
Isaac Liu 11 minutes ago also, you don't have to unzip; there are already unzipped versions on Box
CHENGZHENGXING < 1 minute ago I will try the concept field, but I still think there will be many inconsistencies. I'll try.
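A minimal sketch of that concept-based extraction, assuming each statement is a JSON file whose line items expose concept and value keys (the key names, file layout, and target concept spellings are assumptions to verify against the actual files):

```python
import json
import re

# Concepts we want to extract; these spellings are assumptions and should be
# checked against what actually appears in the files.
TARGET_CONCEPTS = {
    "TOTALLIABILITIES",
    "TOTALCURRENTLIABILITIES",
    "TOTALCURRENTASSETS",
    "RETAINEDEARNINGS",
    "NETSALES",
}

def normalize(concept: str) -> str:
    """Uppercase, then drop spaces and punctuation so near-matches line up."""
    return re.sub(r"[^A-Z0-9]", "", concept.upper())

def extract_items(path: str) -> dict:
    """Return {normalized concept: value} for the concepts we care about."""
    with open(path) as f:
        statement = json.load(f)
    out = {}
    # "data", "concept", and "value" reflect one plausible layout of the
    # reported-financials JSON; adjust to the real schema.
    for item in statement.get("data", []):
        key = normalize(item.get("concept", ""))
        if key in TARGET_CONCEPTS and item.get("value") is not None:
            out[key] = item["value"]
    return out
```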
@current12 for the list of companies you can limit to the ones with S&P 500 ratings 2010-2016 in that combined credit rating file
can you check out the variables that go into Altman's Z-score and make sure we have those for whatever data source you are using? It's in the original graph NN paper
https://pubsonline.informs.org/doi/epdf/10.1287/ijds.2022.00018
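For reference, a minimal sketch of the classic Altman Z-score with its standard published coefficients; the function and argument names are placeholders, not actual field names from our data:

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales,
             total_assets, total_liabilities):
    """Classic Altman Z-score for public manufacturing firms.

    The coefficients are the standard published ones; the arguments are the
    raw financial statement items we need from whatever source we end up using.
    """
    return (
        1.2 * (working_capital / total_assets)
        + 1.4 * (retained_earnings / total_assets)
        + 3.3 * (ebit / total_assets)
        + 0.6 * (market_value_equity / total_liabilities)
        + 1.0 * (sales / total_assets)
    )
```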
Thanks, I have retrieved all the raw data (a lot of quantitative attributes) and it comes from the same source as the paper. In the next step, we need to process the data to derive some more advanced statistics (based on the variables in the paper).
yeah, you can go ahead and make those variables in the paper, and also post a link to a notebook once you've done EDA. and feel free to check off the other tasks here once they're done
@current12 you should investigate these financial statement variables further, in either the raw data or all_data/the merged data, once you have a chance. there are some really nasty outliers and values in the quadrillions of dollars!
(from all_data)
Thx, I have investigated the raw data. It's due to some values that are not reported in thousands being mis-multiplied by 1000, so extremely big numbers occur. I'll try to fix it.
I tried to fix this by dealing with outliers that fall outside the 2.5% and 97.5% quantiles, like this
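A minimal sketch of that quantile clipping (Winsorizing), assuming a pandas DataFrame; the column names in the usage comment are illustrative:

```python
import pandas as pd

def winsorize_columns(df: pd.DataFrame, cols, lower=0.025, upper=0.975):
    """Clip each selected column to its [2.5%, 97.5%] quantile range."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out

# Example usage on dollar-valued columns only (names are illustrative):
# clean = winsorize_columns(all_data, ["total_liabilities", "net_sales"])
```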
eventually we will want to move this out of the EDA file and further upstream to where you create the datasets. be very selective about the variables you apply it to within the tabular financial data; we don't want to edit things like years or dates
when you do it, be sure to print out a summary statistics table before and after you apply it so we have documentation of what it changed
but overall I think it's a good approach, and I've seen it before in the finance literature
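A sketch of the before/after documentation step, assuming pandas and a correction function like the Winsorizing sketch above; the object names in the usage comment are placeholders for whatever the dataset-building code actually defines:

```python
import pandas as pd

def document_correction(df: pd.DataFrame, cols, correction_fn):
    """Print summary statistics before and after applying a correction,
    so there is a record of what it changed."""
    before = df[cols].describe()
    corrected = correction_fn(df, cols)
    after = corrected[cols].describe()
    print("Summary statistics before correction:\n", before)
    print("Summary statistics after correction:\n", after)
    return corrected

# e.g. document_correction(all_data, dollar_columns, winsorize_columns),
# where all_data, dollar_columns, and winsorize_columns are placeholders
# for the real objects in the dataset-building script.
```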
moving deal_with_invalid_numbers to combined all data file
@current12
realized that many of the values are not in USD. for example, below we have some items in Canadian dollars (CAD). we need to be mindful of the reportedCurrency column
I'm going to proceed with dropping non-USD before checking the extreme values again. if that resolves the values in the quadrillions, etc., I'll remove that code
ALWAYS look at your data, I guess, haha
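A minimal sketch of the USD filter, assuming the merged data is a pandas DataFrame that carries the reportedCurrency column through from the raw statements:

```python
import pandas as pd

def keep_usd_only(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows not reported in USD and report how many rows were lost."""
    usd = df[df["reportedCurrency"] == "USD"].copy()
    print(f"Dropped {len(df) - len(usd)} non-USD rows out of {len(df)}")
    return usd
```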
@current12
here are the company-by-year-by-quarter observations we lost by limiting to USD. looked at the API and the raw data; don't think we can do anything to salvage them without finding another data source and scraping.
symbol_calendarYear_period_lost_due_to_USD_limitation_and_in_credit_rating.xlsx
here are new summary statistics from before applying the correction, but after limiting to USD only. no more values in the quadrillions, though there are some values around 400 trillion, etc. that look suspicious. these items don't always end in 000s
sum_stats_before_correction - formatted.xlsx
overall, I don't think we should do the correction in its current form anymore, because mis-multiplication by 1000 isn't clearly the primary explanation for the inaccuracy. I'd instead suggest either no correction, or Winsorizing (what we've been doing with the quantiles, except not conditioning on values ending in 000.00). What do you think?
I suggest we still need to remove extremely suspicious data that ends in 000000.0, like the following (maybe Winsorizing is enough to deal with it).
that's in the trillions though, so it's theoretically possible (AAPL market cap and a few others are in the trillions). though actually I'm pretty sure that only happened well after 2016, so maybe you're right.
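If we do want to flag those cases before deciding, a small sketch of marking values that end in a long run of zeros; the six-zero threshold and the column names in the usage comment are assumptions:

```python
def ends_in_zeros(value, n_zeros: int = 6) -> bool:
    """True if value is a nonzero whole number divisible by 10**n_zeros,
    i.e. it prints as ...000000.0."""
    if value is None or value != value:  # skip missing values / NaN
        return False
    value = float(value)
    return value != 0 and value.is_integer() and int(value) % (10 ** n_zeros) == 0

# e.g. all_data["suspicious"] = all_data["total_assets"].apply(ends_in_zeros)
# (column names here are illustrative)
```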
I added a bunch more stuff; if you can pull, I'd suggest looking at these
do you have any other suggestions? I'm trying to get market capitalization from the API (it will probably have to run overnight) because using common_plus_preferred_stock in its place is extremely inaccurate. probably as a result of that, the final Altman Z-scores seem way too low (averaging around 0.72, where in the graph NN paper they were around 2)
wow, actually the API was very fast. fixed it up, and the altman_z values make sense now
of course you can still review if you'd like, but I'd prioritize making the regression
Information about the dataset: https://www.kaggle.com/datasets/finnhub/reported-financials/data
Test all code for the below on a sample of 1-2 statements from different companies and years before rolling out to all data.