OpenSourceAP / CrossSection

Code to accompany our paper Chen and Zimmermann (2020), "Open source cross-sectional asset pricing"
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3604626
GNU General Public License v2.0

I am witnessing a range of empty values (daily data) #44

Closed firmai closed 3 years ago

firmai commented 3 years ago

The Piotroski Score at the very bottom shows that for each row there is at least one portfolio value missing (null, NaN).

image

image

The PS score is the weirdest of them all:

image

Originally posted by @firmai in https://github.com/OpenSourceAP/CrossSection/issues/43#issuecomment-859581890

chenandrewy commented 3 years ago

It looks like there is a slight error in our computation of the Piotroski Score (PS) portfolios. We categorize this signal as continuous-decile when it should have been discrete (ranging from 0 to 9). We should also be going long stocks with a score of 8 or 9 and shorting stocks with a score of 0 or 1. See the caption for Table 3 of Piotroski (2000):

image

Gonna be honest, not sure this is a high priority thing to fix. But we'll try to remember to fix this when we update the data next year.
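For reference, the discrete assignment described above (long a score of 8 or 9, short a score of 0 or 1) could be sketched roughly as follows. This is only an illustration in Python, not the repo's portfolio code. The function name `ps_leg` and the tickers and scores are made up.

```python
def ps_leg(score):
    """Map a Piotroski score (integer 0-9) to a long-short leg.

    Hypothetical helper, not part of the CrossSection code.
    Returns "long" for 8 or 9, "short" for 0 or 1, None otherwise.
    """
    if score >= 8:
        return "long"
    if score <= 1:
        return "short"
    return None


# Made-up scores for five hypothetical stocks.
scores = {"A": 9, "B": 8, "C": 5, "D": 1, "E": 0}
legs = {ticker: ps_leg(s) for ticker, s in scores.items()}

for ticker, leg in legs.items():
    print(ticker, scores[ticker], leg)
```

Stocks with scores from 2 to 7 fall in neither leg, so no decile breakpoints are involved.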

firmai commented 3 years ago

Hi @chenandrewy, thanks for the reply. I think it's a bit more systematic than that: the screenshots I gave at the beginning also have null values, and there is a larger list of variables that have this problem. I know how hard it is to publish something open source, since it gets gone over with a fine-tooth comb. But there is something quite great in that, as it will ensure that in the long run you have the most reputable (verifiable) study on the topic. So kudos for that.

chenandrewy commented 3 years ago

Ah, sorry, I should have looked more closely. On DivSeason, can you help me see when the NAs stop appearing? If it's just one month, then it's probably because the strategy shorts dividend payers that happen to not be paying, and perhaps there's not enough data back there. DivSeason is a strategy we spent a lot of time on, so I'm not very concerned about missing values.

Overall, there are missing values for many reasons. The simplest one is that some variables are not continuous enough to be sorted into quintiles. Limited data early in the sample interacts with this issue. Since IBES begins around 1985, I think this is the most likely cause for ExclExp.

We'll add this to the FAQ.

chenandrewy commented 3 years ago

It turns out the missing portfolio returns for ExclExp were not due to IBES data availability. Instead, it's because ExclExp has a big mode at 0. Intuitively, ExclExp is "Street" aka "Non-GAAP" aka IBES earnings less GAAP earnings. So a lot of the time there is no difference, no funny exclusions, and you get a ton of zeros. Here is the distribution of Excluded Expenses in June 1996:

image

And then when you try to sort into quintiles, you get annoying edge cases.

In our portfolio code, we use inequality constraints on the extreme quantiles to maintain good behavior in the long-short portfolios. In the interior portfolios, you get these edge cases. This will be a pain for ML folks. But for single sorting it's fine.
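To make the edge case concrete, here is a rough Python sketch of how a big mode at 0 can leave interior quintiles empty. It is not the repo's code: the nearest-rank breakpoint rule, the assignment rule (`<=` for the bottom portfolio, `>` for the top, half-open intervals in between), and the data are all assumptions chosen for illustration.

```python
def quintile_breakpoints(values):
    """Nearest-rank 20/40/60/80 percent breakpoints (an assumed rule)."""
    s = sorted(values)
    n = len(s)
    # (k * n + 4) // 5 is ceil(k * n / 5) in integer arithmetic.
    return [s[(k * n + 4) // 5 - 1] for k in range(1, 5)]


def assign_portfolio(x, bp):
    """Assign x to portfolio 1..5 given four non-decreasing breakpoints."""
    if x <= bp[0]:   # bottom portfolio: inequality on the extreme
        return 1
    if x > bp[3]:    # top portfolio: inequality on the extreme
        return 5
    for k in (1, 2, 3):  # interior portfolios: bp[k-1] < x <= bp[k]
        if x <= bp[k]:
            return k + 1


# Made-up signal: 20 stocks, 14 of them exactly zero.
signal = [-2.0, -1.0] + [0.0] * 14 + [1.0, 2.0, 3.0, 4.0]

bp = quintile_breakpoints(signal)
counts = {p: 0 for p in range(1, 6)}
for x in signal:
    counts[assign_portfolio(x, bp)] += 1

empty = [p for p, c in counts.items() if c == 0]
print("breakpoints:", bp)
print("stocks per portfolio:", counts)
print("empty portfolios:", empty)
```

In this made-up sample all four breakpoints equal 0, so portfolios 1 and 5 are populated while 2, 3, and 4 hold no stocks and would show up as missing returns.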

Nothing to see here, unless you're an accounting nerd, but we'll add this to the FAQ.

chenandrewy commented 3 years ago

We updated the FAQ to explain these missing values:

https://www.openassetpricing.com/faq/#q-missings