Feature request: handle user-defined time-varying universes (and better error checks with temporary `nan`s in user-provided returns)

cvxgrp / cvxportfolio

Portfolio optimization and back-testing.

https://www.cvxportfolio.com

GNU General Public License v3.0

990 stars 257 forks source link

Feature request: handle user-defined time-varying universes (and better error checks with temporary `nan`s in user-provided returns) #137

Closed pcgm-team closed 2 months ago

pcgm-team commented 8 months ago

I get the error:

ValueError: Parameter value must be real.

with callback to:

in FactorModelCovariance.values_in_time(self, **kwargs)

sigma_sqrt = self.Sigma.current_value \
if self._alreadyfactorized \
else project_on_psd_cone_and_factorize(
self.Sigma.current_value)
# numpy eigendecomposition has largest eigenvalues last
--> self.factor_exposures_parameter.value = sigma_sqrt[:, -self.num_factors:].T
d = np.sum(sigma_sqrt[:, :-self.num_factors]**2, axis=1)

I checked that there are many non-nans in every row (500+), my num_factors is set to 15, the datatype of all values is correct, etc.

I assume this might have something to do with the matrix not being positive semi-definite or some other numerical problem. It only occurs for specific date of data. Not sure how to fix and very unclear error message.

joseortiz3 commented 8 months ago

@pcgm-team The Issues section on github repositories is generally for reporting actual problems/bugs with a project (cvxportfolio) with the intention of getting bugs fixed, improvements made, etc. Questions about linear algebra, tech support, etc are better asked elsewhere (e.g. stackexchange, reddit, etc).

The first step in an issue report is always to make a minimally reproducible example, which means providing enough information that someone else can reproduce your problem. The second is to compare the expected behavior to the actual behavior in detail (i.e. if you think your 500x500 correlation matrix should have a 15-factor decomposition, state why, and what the answer should be).

To address your question:

N=15 factors is quite a lot. Really, nobody has any business decompositing a correlation matrix beyond perhaps N=3 factors for stock prices. You are probably on the edge of violating the linear algebra requirements / numerical requirements and finally hit the issue: the correlation matrix has to be symmetric and positive semidefinite, have rank >= N, and therefore have N eigenvalues that are nonzero (ideally, not anywhere near zero when represented on computer).

For further questions about the linear algebra requirements or numerical stability of factor analysis, seek other sources.

enzbus commented 8 months ago

Thanks @joseortiz3 and yes @pcgm-team can you please provide more context; what were the inputs to FactorModelCovariance (user-provided Sigma, F and d, or F, d and Sigma_factors, or built-in forecaster)? You are right that this breakage should have been caught earlier, there were negative eigenvalues there which weren't filtered.

enzbus commented 8 months ago

I've been looking closer at the code and it appears that np.linalg.eigh was called on your instance, its documentation states that the only way it returns complex outputs is if its inputs are complex. I'm assuming you were giving it a user-provided Sigma as a dataframe, so maybe you can find the issue there? We might add a check in the user-provided data initializer to make sure that no complex values are provided as inputs.

enzbus commented 8 months ago

Mmmh, it's different; this code

universe = ['AAPL', 'GOOG', 'AMZN', 'NVDA', 'TSLA']
Sigma = pd.DataFrame(np.eye(5, dtype=complex), universe, universe)
risk = cvx.FactorModelCovariance(Sigma=Sigma, num_factors=2)
policy = cvx.SinglePeriodOptimization(cvx.ReturnsForecast() - risk)
cvx.StockMarketSimulator(universe).backtest(policy, start_time='2024-02-01')

works, giving just a CVXPY warning (complex are cast to real). Please then also post Cvxportfolio and CVXPY versions you are using...

The CVXPY warning is like this

/home/enzo/repos/cvxportfolio/env/lib/python3.11/site-packages/cvxpy/cvxcore/python/canonInterface.py:63: ComplexWarning: Casting complex values to real discards the imaginary part
  param_vec[col:col + size] = value

pcgm-team commented 8 months ago

I believe the error actually has to do with the way that the backtester and forecaster handles stocks with nan on the period.

I've updated my cvxpy and cvxportfolio to latest versions and now the error message has changed to (on same data/row):

ForecastError: HistoricalFactorizedCovariance is given a dataframe with at least a column that has no values.

I do have reason to believe this is an issue with the project itself, since NaN handling is stated as a core functionality of the project.

The error persists when changing num_factors=2 (my bad on the previous linear algebra).

I'm currently unable to simply recreate (or publicly share the data) - I will continue trying. I've confirmed that the returns data itself has no columns (or rows) with all NaN.

But all my function calls are as follows:

market_data = cvx.UserProvidedMarketData(returns = returns, volumes = volumes, prices = prices, online_usage = False, cash_key='USDOLLAR')
ret_forecast = cvx.ReturnsForecast(r_hat = r_hat)
    policy = cvx.SinglePeriodOptimization(objective = 
                                 ret_forecast - 4 * cvx.FactorModelCovariance(num_factors=2),
                                 constraints= [cvx.LeverageLimit(1.5)],
                                solver='ECOS', ignore_dpp=True, include_cash_return= False)
sim = cvx.MarketSimulator(market_data=market_data)
bt_result = sim.backtest(policy, initial_value = 1500000, start_time = start_time)

I've also confirmed that setting min_history=pd.Timestamp(1, 'D') rather than default 1 year does not remove the error.

It runs fine until it hits a particular date, and prior to that date- there are NaN that are handled, which is what makes it hard for me to recreate.

The full error call is here:

cvxportfolio\\forecast.py:304, in BaseMeanVarForecast._initial_compute(self, t, **kwargs)
    302 self._denominator = self._compute_denominator(df, emw_weights)
    303 if np.min(self._denominator.values) == 0:
--> 304     raise ForecastError(
    305         f'{self.__class__.__name__} is given a dataframe with '
    306             + 'at least a column that has no values.')
    307 self._numerator = self._compute_numerator(df, emw_weights)
    308 self._last_time = t

ForecastError: HistoricalFactorizedCovariance is given a dataframe with at least a column that has no values."

enzbus commented 8 months ago

OK, thanks for providing more context. Here's what's going on at time of failure t

MarketSimulator calls MarketData.serve(t) on the market data object
MarketData looks at the returns dataframe before time t and chooses the investable universe based on its min_history; all names that have a non-nan return for at least min_history are included; since you select online_usage=False it also filters to avoid names that have a nan return for the period (which is not used by the policy, but by the simulator itself)
the market data snapshot at time t is handled by MarketSimulator, historical data is given to the policy that produces the optimal weights for the period
The policy, SinglePeriodOptimization passes the historical data down to all its components, including the forecaster inside FactorModelCovariance, which gives you an error because it can't do its job; it is given historical returns with one or more columns that are all nan (so no covariance can be estimated)

What can you do:

Call market_data.serve(t) with the timestamp of failure; you can get it by enabling INFO level logging; this returns a tuple of read-only dataframes (and series), if I remember correctly past_returns is the first element: that is the dataframe that is given to HistoricalFactorizedCovariance
Change the min_history argument to UserProvidedMarketData (I don't know how an asset with a full nan history of past returns got through, but you might get the filtering you want by changing that parameter)
Tell us about the frequency of the data you're using, I assume it's not daily; if it has very different period lenghts (say, a bunch of 1 minute periods and then a long overnight period) there might be edge cases which we haven't tested, we tested mostly with periods of roughly the same length

What I'll do:

Make that error more informative; add the timestamp and the name(s) that cause it (that error check is new in 1.2, BTW, lots of improvements in low-level objects)
Investigate to make sure that MarketData.serve never returns a past_returns dataframe with a full nan column (I thought it was impossible), please help me find out how it happened

pcgm-team commented 8 months ago

By digging into the code, I've found the source of the issue. It has to do with stocks having a NaN period after being non-nan for awhile. The issue arose in my data specifically whenever AMD was re-added to the SP500. My data has NaN whenever the symbol is not included in the SP500, and is non-nan when the symbol is in the SP500.

When I print the result of:

past_returns, current_returns, past_volumes, current_volumes, \
                 current_prices = market_data.serve('2017-11-06 09:30:00-05:00')

rets_to_input = past_returns.iloc[:, :-1]
hist_forecast = cvx.forecast.HistoricalCovariance()
ewm_weights = None
denom = hist_forecast._compute_denominator(rets_to_input, ewm_weights)
denom['AMD'][denom['AMD']==0.0].index

The output is:

Index(['AAL', 'AAP', 'ALLE', 'AME', 'AMG', 'ATVI', 'AVGO', 'AWK', 'BFH', 'CFG',
       'CHD', 'CPRI', 'CSRA', 'CXO', 'DAL', 'DISCK', 'EQIX', 'ESS', 'EXR',
       'FRT', 'GM', 'GOOG', 'HBI', 'HCA', 'HPE', 'HSIC', 'ILMN', 'JBHT', 'KHC',
       'KSU', 'MAC', 'META', 'MHK', 'MLM', 'NAVI', 'NLSN', 'NWS', 'NWSA', 'O',
       'PYPL', 'QRVO', 'RCL', 'REGN', 'SIG', 'SLG', 'SWKS', 'SYF', 'TFCFA',
       'TSCO', 'UAA', 'UAL', 'UDR', 'UHS', 'URI', 'VRSK', 'VRTX', 'WRK', 'WTW',
       'XEC', 'ZTS'],
      dtype='object', name='symbol')

So specifically, it's the stocks which had no history prior to when AMD goes to NaN in my data. Ie. AMD is

while the other stocks are for example:

enzbus commented 8 months ago

Ok, that's actually very interesting, thanks for sharing your findings. So, it seems MarketData behaved in the correct way, but the covariance matrix can not be computed. A minimal example of what's going on would be this:

import pandas as pd
past_returns = pd.DataFrame([
                            [-0.01, np.nan],
                            [np.nan, 0.01],
                            [np.nan, -0.01],
                            [np.nan, 0.01],
                            [0.01, np.nan],
                            ], columns=['AMD', 'AAPL'])
print(past_returns.cov())

Which prints

         AMD      AAPL
AMD   0.0002       NaN
AAPL     NaN  0.000133

So, I guess this is not an issue with Cvxportfolio (although that is a breaking mode I hadn't anticipated, and should be much easier to diagnose than it was now), but with the data you're providing to it.

It seems to me that you're using nan returns as trick to change the investable universe (you put a nan whenever you don't want a stock to be included), correct? UserProvidedMarketData could be amended to make the investable universe chosen by the user at each time steps as well, that's probably the fix that you need here (so you don't have to inject nans into the returns' dataframe). Thoughts?

pcgm-team commented 8 months ago

It is true that the time-varying investable universe feature should solve the issue for my specific case (and improve the actual performance of the outputs for mine and many uses).

However, presumably you would want to support cases when the data actually goes NaN for whatever reason and you don't have a detailed investable universe for each timestamp- for one example if a stock is delisted and relisted under the same name (and most retail users haven't paid for detailed information on investable universes or may not be restricting to something like sp500). There should be a way to add a check, throw the symbol out of consideration, and print a warning. For example a parameter could be added that is like min_history, but it specifies the min_history of required covariance data with the other investable stocks at each timestamp. If it isn't met (for any other investable stock) the stock is declared uninvestable.

What do you think?

joseortiz3 commented 8 months ago

What should happen to missing values corresponding to lack of data for a stock in a certain time period isn't well-defined. In some cases, the value is missing because the stock was delisted, and its shares are now worthless, or halted, and its shares are frozen for hours or possibly years. In other cases (ie this one), the stock is missing from the index temporarily. In others, the stock identifier (e.g. ticker) changed because of a merger, buyout, etc and generally there is an unrecorded cash payout or payment in shares of another company. Is it not sort of "mission creep" for cvxportfolio to handle one or more of these scenarios?

A minimal "fix" would be that cvxportfolio can set a default friendly behavior which is simply setting missing prices to .ffill(), returns to zero, and volume to zero, and optionally artificially "selling off" that asset as cash. Such a functionality should be optional I think, since some users will prefer that an error is thrown.

The user on the other hand can simply remove that asset from their universe of data being fed to cvxportfolio, or replace the missing values how they see fit (e.g. returns = returns.fillna(0), price = price.ffill(), etc). Prompting the user "your data has missing values for {ticker}. Unclear what you want me to do with these - please fix on your end" is reasonable.

In general, asset data is very sparse in (asset identifier x date) space over long enough time periods since stocks are being listed and delisted all the time. The sort of two possibilities is to use multiple rectangular dataframes (row: date, column: asset identifier) for needed quantities (this is how cvxportfolio does it, which does match the paper nicely), or instead use multi-index dataframes or series (row: [identifier x date], columns: price, volume, etc). The former is more sparse the longer the data interval. The latter is dense, no matter how the universe changes over however long of time. This is to say, cvxportfolio is already built for universes that don't change too much, because it uses multiple rectangular dataframes instead of a single multi-indexed dataframe or multiple multi-indexed series for quantities like returns, price, etc.

pcgm-team commented 8 months ago

I agree, but cvxportfolio already handles it exceptionally well by-default at this point in my opinion (the base handling you proposed is already implemented); and the remaining fix seems not overly complicated. (both the user-provided time-varying universe and the check to avoid covariance error)

It's true my proposed check would need to be done at every timestamp; but it should be very cheap to simply count the number of concurrent values in the past_returns dataframe and make sure they are above threshold.

I think expecting it to run on normal data is a different bar than expecting super-proper behavior in all cases, which I agree would bloat the mission.

enzbus commented 8 months ago

Thanks both. There's a lot to unpack there, and true that we can't have Cvxportfolio handle every scenario, like asset names that are re-used for a different asset. Identifiers should be unique, otherwise all forecasts based on historical data are meaningless. For users with enough resources handling their own naming to ensure uniqueness (e.g., use CUSIPs, ...) should be enough. If one uses the free Yahoo Finance data that is not an issue, delisted stocks disappear from the history, and the new name only has history for itself.

Cvxportfolio uses already a lot of heuristics to clean the data (however, currently only in the Yahoo Finance interface), like forward filling missing prices. MarketData handles a name for which the returns' become nan at a certain point by removing it from the investable universe. A name that is removed by the investable universe is handled both by MarketSimulator and BacktestResult I believe correctly; the default behavior is to move whatever position the portfolio had in it to cash. In both cases that is done by specialized methods which can be easily overridden.

If you don't want the name to be removed, you should clean the returns before providing them (e.g., ffill'ing the prices before computing returns, like it's done in YahooFinance, so any period with unknown market activity basically gets all zero returns. I was thinking of adding to UserProvidedMarketData the capability of computing returns internally given just the prices; it might be easier (it's safer to ffill prices than fillna(0.) returns, look-ahead biases might be introduced if not done correctly.

Regarding checks for co-occurrence of names in order to compute the covariance, I'm afraid that is what you incurred in; that check ForecastError: HistoricalFactorizedCovariance is given a dataframe with at least a column that has no values. complained about no co-occurence; I just need to amend the message in the case of covariance estimations. In fact I might add to it; it could give warnings if any pair of stocks has less than some amount of time in which they co-occurred.

I still believe, however, that it should not happen. Each column of the returns' dataframe should only have nans at the start and/or at the end; there shouldn't be nans in between. Given that condition, there is no issue of lack of co-occurence (the minimum co-occurrence is the minimum history). The condition is true with data coming from YahooFinance and should be true also for user-provided data, i.e., you should clean the data before providing it. In your case the nans were added manually, so I think it's more important to add user-defined and time-changing universe selection to the MarketData servers. It might happen in the next few days, I'm working on adding linear regression capabilities to all forecasters but this feature should also make to 1.3.0.

pcgm-team commented 8 months ago

I was thinking of adding to UserProvidedMarketData the capability of computing returns internally given just the prices; it might be easier (it's safer to ffill prices than fillna(0.) returns, look-ahead biases might be introduced if not done correctly.

I think this is excellent.

I also agree now that you are correct that the only time the error could reasonably come up is the time-varying-universe use-case, since everything else could/should be handled with ffill() price and proper symbol naming; so work-arounding it internally doesn't make sense.

Thanks @enzbus.