Relocating shrcd and exchcd filters from SignalMasterTable.do to SignalDoc.csv

chenandrewy commented 1 year ago

Many thanks to @junkyungauh for pointing this out.

SignalMasterTable.do has:

These filters should not be applied until the portfolio code runs, via the "Filter" column of SignalDoc.csv. As described in the paper, we try to put off filtering until the portfolio generation step so that users of the data have the most flexibility.
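To make the "filter late" idea concrete, here is a minimal Python sketch (not the repo's Stata code). The function names and toy rows are hypothetical, and the "standard" screen is assumed to be shrcd in {10, 11} and exchcd in {1, 2, 3}. The signal table keeps every security, and the screen is applied only when building the portfolio universe.

```python
# Hypothetical sketch: keep all securities in the signal table and apply the
# shrcd/exchcd screen only at the portfolio step.
# Assumption: the "standard" screen is shrcd in {10, 11} and exchcd in {1, 2, 3}.
STANDARD_SHRCD = {10, 11}
STANDARD_EXCHCD = {1, 2, 3}


def passes_standard_filter(row):
    """True if the row carries standard share and exchange codes."""
    return row["shrcd"] in STANDARD_SHRCD and row["exchcd"] in STANDARD_EXCHCD


def portfolio_universe(signal_rows, apply_filter=True):
    """Drop rows with a missing signal, then optionally apply the code screen."""
    rows = [r for r in signal_rows if r["signal"] is not None]
    if apply_filter:
        rows = [r for r in rows if passes_standard_filter(r)]
    return rows


# Toy data, made up for illustration only.
signal_rows = [
    {"permno": 10001, "shrcd": 11, "exchcd": 1, "signal": 0.12},
    {"permno": 10002, "shrcd": 73, "exchcd": 4, "signal": 0.05},  # non-standard codes
    {"permno": 10003, "shrcd": 10, "exchcd": 3, "signal": None},  # missing signal
]

filtered = [r["permno"] for r in portfolio_universe(signal_rows)]
unfiltered = [r["permno"] for r in portfolio_universe(signal_rows, apply_filter=False)]
```

With the screen as an option at the portfolio step, a user who wants the non-standard securities can still get them from the same signal file.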

Should this standard filter be applied everywhere in the portfolio generation step? I'm not sure of the answer to this. We should at least review a few of the original papers before we decide.

Currently, we're inconsistently applying these filters because we sometimes use SignalMasterTable.dta as the "backbone" of the signal (e.g. Mom6m.do), and other times use dailyCRSP.dta (e.g. MaxRet.do) or some other basic dataset. As a result, Mom6m will have more missing values than MaxRet.
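A toy Python illustration of the inconsistency (all values are made up, and the "standard" screen is again assumed to be shrcd in {10, 11} and exchcd in {1, 2, 3}): the same signal built on a pre-screened backbone has more missing values than one built on the raw file.

```python
# Toy universe: permno -> (shrcd, exchcd, return). All values are made up.
crsp = {
    10001: (11, 1, 0.02),
    10002: (73, 4, 0.01),   # non-standard codes
    10003: (10, 3, -0.03),
}


def is_standard(shrcd, exchcd):
    # Assumed standard screen: common shares on NYSE, AMEX, or NASDAQ.
    return shrcd in {10, 11} and exchcd in {1, 2, 3}


# Backbone A: already screened, like a master table with the filter baked in.
backbone_filtered = {p for p, (s, e, _) in crsp.items() if is_standard(s, e)}
# Backbone B: the raw file, with no screen applied.
backbone_raw = set(crsp)


def build_signal(backbone):
    """Signal is defined only for permnos present in the backbone."""
    return {p: (crsp[p][2] if p in backbone else None) for p in crsp}


signal_a = build_signal(backbone_filtered)
signal_b = build_signal(backbone_raw)

missing_a = sum(v is None for v in signal_a.values())
missing_b = sum(v is None for v in signal_b.values())
```

Here `missing_a` exceeds `missing_b` only because of the backbone choice, not because of anything in the signal definition.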

I don't think this change will have a huge effect. Most of the stocks with weird exchcd-shrcd combinations are missing data for everything but historical market prices. For example, about 80% of these weird stocks are missing ceq:

chenandrewy commented 1 year ago

Some more info on the odd exchcds and shrcds (which might make it painfully obvious that I don't study ETFs). Here is the share of CRSP permnos with odd codes over time:

Odd codes were rare until the 90s, but now account for 45% of permnos! The bulk of the odd codes are ETFs. Others are ADRs and SBIs. Thankfully, the "when-issued trading" codes for the standard exchanges (31-33) are rare.

Given that these odd codes do not correspond to "stocks" in the way most asset pricing people think of "stocks," we should probably apply the standard filter by default in the portfolios code. Not sure about keeping when-issued trading, but thankfully that's rare.
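For anyone who wants to reproduce the share of odd codes over time, a rough Python sketch is below. The row layout and function name are hypothetical, the data is made up, and "odd" is assumed to mean anything outside shrcd in {10, 11} and exchcd in {1, 2, 3}, so the when-issued codes 31-33 count as odd here.

```python
from collections import defaultdict

# Assumed standard codes; anything else (including exchcd 31-33) counts as odd.
STANDARD_SHRCD = {10, 11}
STANDARD_EXCHCD = {1, 2, 3}


def odd_code_share(rows):
    """Share of distinct permnos per year whose codes are outside the standard set."""
    all_permnos = defaultdict(set)
    odd_permnos = defaultdict(set)
    for r in rows:
        all_permnos[r["year"]].add(r["permno"])
        is_standard = r["shrcd"] in STANDARD_SHRCD and r["exchcd"] in STANDARD_EXCHCD
        if not is_standard:
            odd_permnos[r["year"]].add(r["permno"])
    return {y: len(odd_permnos[y]) / len(all_permnos[y]) for y in sorted(all_permnos)}


# Toy rows, made up for illustration only.
rows = [
    {"year": 1980, "permno": 1, "shrcd": 10, "exchcd": 1},
    {"year": 1980, "permno": 2, "shrcd": 11, "exchcd": 2},
    {"year": 2020, "permno": 1, "shrcd": 10, "exchcd": 1},
    {"year": 2020, "permno": 3, "shrcd": 73, "exchcd": 4},   # non-standard codes
    {"year": 2020, "permno": 4, "shrcd": 11, "exchcd": 31},  # when-issued code
    {"year": 2020, "permno": 5, "shrcd": 11, "exchcd": 3},
]

shares = odd_code_share(rows)
```

Whether the when-issued codes should be grouped with the odd codes is the open question above; moving 31-33 into `STANDARD_EXCHCD` would show how much that choice matters.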

chenandrewy commented 1 month ago

Here's a review of how Jegadeesh and Titman 1993; Ang, Hodrick, Xing, Zhang 2006; and Hou and Moskowitz 2005 handle it. These are all papers that use only price data and span distinct types of predictors, as well as distinct teams.

Jegadeesh and Titman seem to only mention these codes in passing. "Our analysis of NYSE and AMEX stocks documents significant profits in the 1965 to 1989 sample period." This is consistent with imposing the standard code screens.

AHXZ, for the VIX beta portfolios, say they "run the regression for all stocks on AMEX, NASDAQ, and the NYSE." For the idiovol portfolios they don't say this, but it seems implicit based on their other discussions of excluding AMEX and NASDAQ.

For Hou and Moskowitz: "From 1963 to 1973, the CRSP sample includes NYSE and AMEX firms only, and post-1973 NASDAQ firms are added to the sample."

Bali, Engle, and Murray's textbook also focuses on standard codes: "The sample used in Part II of this book as well as in a large number of empirical asset pricing studies is a monthly sample that contains all U.S.-based common stocks in the CRSP database. Therefore, for each month t, the sample is constructed by taking all U.S.-based common stocks in the CRSP database as of the end of the given month.... ....U.S.-based common stocks are identified as the subset of these securities that have a share code (SHRCD field in the msenames file) value of either 10 or 11. We refer to this sample as the CRSP U.S.-based common stock sample, or simply the CRSP sample."

Long story short, I think imposing standard codes everywhere will have very little effect on replications.

OpenSourceAP / CrossSection

Relocating shrcd and exchcd filters from SignalMasterTable.do to SignalDoc.csv #133