anttsou / qmj

get_info or tidyinfo is handling some data badly. Possibly incorrectly inserting anomalous data. #18

Open rynkwn opened 8 years ago

rynkwn commented 8 years ago

Ex: In our financials data frame: capture

However, looking directly at the quantmod data, we have:

CASHFLOWS: image

BALANCESHEETS: image

INCOME STATEMENTS: image

In other words, either get_info or tidyinfo is mishandling the data it receives and storing it incorrectly. Will look into this further.

rynkwn commented 8 years ago

Some of its numbers, for example DP.DPL for Depreciation.Depletion, don't appear to come from the correct source. In fact, I'm unsure where it's getting those numbers from. Checking the adjoining rows doesn't make it clear that some off-by-one error is assigning LE's values elsewhere, but it's possible that the offset has accumulated enough by that point that the "off-by-one" error is actually enormous. That said, if rows were being shifted, the last X rows should logically be lacking data, and that isn't the case.
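A rough way to test the shift hypothesis, sketched in Python rather than the package's R. Everything here is an assumption for illustration: `find_row_shift` and `trailing_rows_empty` are hypothetical helpers, and the numbers are made up. The idea is that if values had been shifted by k rows, the stored column would match the source column at offset k and the last k rows would be empty.

```python
def find_row_shift(stored, source, max_shift=5):
    """Return the smallest k with stored[i] == source[i + k] on the overlap, else None."""
    n = min(len(stored), len(source))
    for k in range(max_shift + 1):
        if k >= n:
            break
        if all(stored[i] == source[i + k] for i in range(n - k)):
            return k
    return None


def trailing_rows_empty(stored, k):
    """True if the last k rows hold no data (None), as a k-row shift would imply."""
    return k > 0 and all(value is None for value in stored[-k:])


# Made-up column values, not real LE data.
source = [10, 20, 30, 40, 50]
shifted = [30, 40, 50, None, None]   # what a 2-row shift would look like
unrelated = [7, 8, 9, 1, 2]          # wrong values, but populated to the end

shift = find_row_shift(shifted, source)
no_shift = find_row_shift(unrelated, source)
```

A column that is wrong but fully populated to the end, like `unrelated`, gives no offset at all, which is the situation described above.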

Will look into this more deeply tomorrow.

rynkwn commented 8 years ago

Playing with it now.

So taking a small sample of ~100 companies around LE still attributes incorrect data to LE. However, the get_info function on LE alone does produce correct data.
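The batch-versus-single comparison can be made mechanical. Below is a minimal Python sketch (not the package's R code): `diff_records` is a hypothetical helper, the rows are plain dicts standing in for data frame rows, and the field values are invented. It lists every field where the row from the batch run disagrees with the row fetched for the ticker alone.

```python
def diff_records(batch_rows, single_rows, key="year"):
    """Return (key value, field, batch value, single value) for each mismatch."""
    single_by_key = {row[key]: row for row in single_rows}
    mismatches = []
    for row in batch_rows:
        other = single_by_key.get(row[key])
        if other is None:
            # The batch has a row that the single-ticker fetch lacks.
            mismatches.append((row[key], None, "present", "missing"))
            continue
        for field in sorted(row.keys() - {key}):
            if row[field] != other.get(field):
                mismatches.append((row[key], field, row[field], other.get(field)))
    return mismatches


# Invented values for illustration only.
batch = [{"year": 2013, "DP.DPL": 99.0, "NI": 5.0}]
single = [{"year": 2013, "DP.DPL": 12.0, "NI": 5.0}]

mismatches = diff_records(batch, single)
```

An empty result would mean the batch run and the single-ticker run agree for that company.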

If I take the last company in my subset, LPI, I find that the raw financials do indeed produce the correct cash flow statement.

Hm. Perhaps quantmod is grabbing the wrong data?

rynkwn commented 8 years ago

Scratch that, it seems as if the correct data is being grabbed by get_info. Unsure why I thought otherwise just above. Will check tidyinfo.

rynkwn commented 8 years ago

Tidyinfo is producing the correct data. Though it's possibly being somewhat harsh in cutting out 2012 data for LE due to the lack of a balance sheet, it doesn't seem pressing.
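To illustrate why a missing balance sheet would drop a whole year, here is a small Python sketch. It assumes tidyinfo keeps only years for which all three statements exist, which is a guess at its behaviour and not confirmed here. `complete_years` is a hypothetical helper, and each statement is modelled as a dict keyed by year.

```python
def complete_years(cashflows, balancesheets, incomestatements):
    """Return the years that appear in all three statements, in ascending order."""
    return sorted(set(cashflows) & set(balancesheets) & set(incomestatements))


# Placeholder contents; only the year keys matter for this sketch.
cashflows = {2012: {}, 2013: {}, 2014: {}, 2015: {}}
balancesheets = {2013: {}, 2014: {}, 2015: {}}   # no 2012 balance sheet
incomestatements = {2012: {}, 2013: {}, 2014: {}, 2015: {}}

kept = complete_years(cashflows, balancesheets, incomestatements)
```

Under that assumption, 2012 is dropped even though two of its three statements are available.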

Will try re-generating our financials data set to see if the inaccuracy persists. Maybe the data set was produced by an older, buggy version?

rynkwn commented 8 years ago

Correction*

Upon closer inspection, my conclusion is that get_info/tidyinfo is correctly grabbing and producing data. My guess is that, as we only have access to the past 4 10-K filings, we were grabbing weird/old data from LE.

To be completely sure, I've also re-built our financials data set. LE now has reasonable data.

image

On top of that, it's also moved forward around 700 rows. Implicitly this means that there was an enormous gain in information in the preceding companies, which is also concerning.
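One way to see where the extra rows came from is to compare per-ticker row counts between the old and rebuilt data sets. This is a Python sketch under assumptions: each data set is reduced to its ticker column as a list, the helper names are hypothetical, and the tickers other than LE are made up.

```python
from collections import Counter


def row_count_changes(old_tickers, new_tickers):
    """Return {ticker: rows gained (negative if lost)} for tickers whose count changed."""
    old_counts, new_counts = Counter(old_tickers), Counter(new_tickers)
    tickers = sorted(set(old_counts) | set(new_counts))
    return {
        t: new_counts[t] - old_counts[t]
        for t in tickers
        if new_counts[t] != old_counts[t]
    }


def first_row(tickers, ticker):
    """Return the 0-based index of the first row belonging to ticker."""
    return tickers.index(ticker)


# Toy ticker columns; "AA" and "AB" are placeholders.
old = ["AA", "AA", "LE", "LE"]
new = ["AA", "AA", "AA", "AB", "LE", "LE", "LE"]

changes = row_count_changes(old, new)
moved = first_row(new, "LE") - first_row(old, "LE")
```

In this toy case LE's first row moves by 2, which equals the rows gained by the tickers sorted ahead of it. The same bookkeeping on the real data would show which companies account for the roughly 700 rows.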