Create factor preprocessing functions

JustinMShea / ExpectedReturns

Create factor preprocessing functions #11

Open vi-to opened 4 years ago

vi-to commented 4 years ago

Many empirical analyses share common preprocessing techniques that a researcher may need to carry out on the data set being analyzed.

Adding a helper function for this would:

reduce the boilerplate in actual fitting functions, making them clearer;
importantly, add transparency to the cleaning procedures the study relies upon.

In particular, this is true for Engle et al. (2016), which motivates the implementation. The authors base most of their analyses on data that has been cleaned (or adjusted) in one way or another. It is therefore important to account for these steps in order to compare actual estimation results.
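As a rough illustration of what such a helper could look like, here is a minimal base R sketch. The function name `preprocess_factors()`, its arguments, and the two cleaning steps (winsorizing and standardizing within each date) are assumptions chosen for the example; they are not taken from Engle et al. (2016) or from the package.

```r
# Hypothetical helper: name, arguments, and cleaning steps are assumptions.
preprocess_factors <- function(data, cols, by = "date", probs = c(0.01, 0.99)) {
  stopifnot(is.data.frame(data), all(c(cols, by) %in% names(data)))

  clean_one <- function(x) {
    # Winsorize: clamp values to the chosen lower and upper quantiles
    q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
    x <- pmin(pmax(x, q[1]), q[2])
    # Standardize to zero mean and unit standard deviation
    (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  }

  for (col in cols) {
    # ave() applies clean_one() within each group and keeps the row order
    data[[col]] <- ave(data[[col]], data[[by]], FUN = clean_one)
  }
  data
}

# Toy panel: 50 assets observed on 2 dates
set.seed(1)
panel <- data.frame(
  date  = rep(as.Date(c("2020-01-31", "2020-02-29")), each = 50),
  asset = rep(sprintf("A%02d", 1:50), times = 2),
  value = rnorm(100)
)
cleaned <- preprocess_factors(panel, cols = "value")
```

Keeping the cleaning steps in one documented function means a fitting function only needs a single call, and the choices made (here the quantile cut-offs) stay visible as arguments.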

vi-to commented 4 years ago

All the models targeted so far, as far as I can see, are aimed at studying the cross-section. Unless we want to introduce recurrent data structure transformations, a data.frame seems a better alternative than xts most of the time, because it can store the whole panel. To my knowledge, xts cannot store multiple data types, as it is a matrix internally. Also, being time-indexed objects, xts objects are not a natural fit for storing the same dates multiple times, as a stacked panel requires.
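The type constraint can be illustrated with base R alone, since the core data of an xts object is a matrix. This is only a sketch with made-up column names and values:

```r
# A matrix holds a single type: mixing numeric and character columns
# coerces everything to character.
m <- cbind(ret = c(0.01, -0.02), sector = c("Tech", "Energy"))
typeof(m)

# A data.frame keeps one type per column and can repeat dates,
# which is what a stacked (long) panel needs.
panel <- data.frame(
  date   = as.Date(c("2020-01-31", "2020-01-31")),
  asset  = c("AAA", "BBB"),
  ret    = c(0.01, -0.02),
  sector = c("Tech", "Energy"),
  stringsAsFactors = FALSE
)
sapply(panel, class)
```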

I would actually prefer to work with xts if feasible, so any suggestion in this direction is welcome. We can certainly switch to xts objects as soon as the cross-sectional analyses are done, or whenever it makes sense. For example, I would really like to be able to plot with it.

cc: @JustinMShea @braverock @jaymon0703

braverock commented 4 years ago

Note that panel data is really inefficient for storing factor data as the number of instruments and factors increases. Also note that panel data needs to be transformed into a matrix-like format before you can run the regression, since the data needs to be in columns, one per descriptive/target variable.
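For reference, the long-to-wide transformation described here can be sketched in base R with `reshape()`. The data and column names below are made up for illustration:

```r
# Long (panel) format: one row per date/asset pair
panel <- data.frame(
  date  = rep(as.Date(c("2020-01-31", "2020-02-29", "2020-03-31")), times = 2),
  asset = rep(c("AAA", "BBB"), each = 3),
  ret   = c(0.01, -0.02, 0.03, 0.00, 0.01, -0.01)
)

# Wide format: one row per date, one column per asset
wide <- reshape(panel, idvar = "date", timevar = "asset", direction = "wide")

# Matrix-like object with dates as row names, ready to use as regression input
ret_matrix <- as.matrix(wide[, -1])
rownames(ret_matrix) <- format(wide$date)
ret_matrix
```

A matrix in this shape could also be passed to `xts::xts()` with `order.by = wide$date` when a time-indexed object is preferred.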

Since most vendor data is organized as panel data, we will deal with a lot of panel data, but don't consider this an optimal format... It is actually pretty bad for almost everything we care about.

Also, note that the cross-section may or may not be what you want in any given case.

Just work on the models now. Please don't try to guess at how the data will be structured 'most of the time', because you don't have enough examples to build good patterns.

As you are already finding out, most of the time the parser or the model is only a few lines of code. The patterns will become obvious, and refactoring will not be terribly painful.