cvxgrp / cvxportfolio

Portfolio optimization and back-testing.
https://www.cvxportfolio.com
Apache License 2.0
910 stars 242 forks

Feature Request: Online Data Loading #155

Open GreenlandZZY opened 2 months ago

GreenlandZZY commented 2 months ago

Feature requests are well received but will probably be answered with a suggestion that you develop them and contribute.

Specifications

Description

Currently, cvxportfolio seems to require loading the full pandas DataFrame before optimization and back-testing begin. This can be an issue with a huge universe and a large set of factors, since exploring a long history then requires a huge amount of memory.

If there are already base classes available in cvxportfolio to implement these features, could you provide some suggestions? If not, could you add them? Thank you so much.

enzbus commented 2 months ago

The short answer is yes, it can be done relatively easily. The long answer is that it may create problems down the line and limits the flexibility of the system.

First, you need to implement your data loading mechanism, such as a query to your database table, in a subclass of MarketData. The methods you need to implement are documented here: https://www.cvxportfolio.com/en/stable/data.html#cvxportfolio.data.MarketData . The heavy lifter is the serve method, which takes the current back-test timestamp and returns both a view of the past market data (past open-to-open returns, past market volumes, ...), used by the market simulator and the trading policies, and the current data, used by the simulator. Two other methods are trivial; the only other tricky one might be trading_calendar, which needs to know the future trading times. Your custom MarketData server is then passed to the initializer of MarketSimulator, and that part is done.
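To illustrate the serve-on-demand idea, here is a minimal sketch of a loader that queries only the rows needed at each back-test timestamp from a SQLite table instead of holding the full history in memory. Note this is not the real cvxportfolio API: the class name, the SQL schema, and the two-element tuple returned by serve below are assumptions made for illustration; check the MarketData documentation linked above for the actual signatures and return values.

```python
import sqlite3

import pandas as pd


class SQLMarketData:
    """Illustrative serve-on-demand loader; NOT the real cvxportfolio API.

    Instead of holding the full returns history in memory, each call to
    ``serve`` queries only the rows needed at that back-test timestamp.
    """

    def __init__(self, conn):
        # An open sqlite3 connection with a 'returns' table whose first
        # column is a TEXT date ('YYYY-MM-DD') and the rest are asset returns.
        self.conn = conn

    def serve(self, t):
        """Return (past_returns, current_returns) for back-test time ``t``."""
        day = pd.Timestamp(t).strftime("%Y-%m-%d")
        # Past data: everything strictly before the current timestamp,
        # used by the trading policies (and the simulator).
        past_returns = pd.read_sql_query(
            "SELECT * FROM returns WHERE date < ? ORDER BY date",
            self.conn, params=(day,), index_col="date", parse_dates=["date"])
        # Current data: the single row at timestamp t, used by the simulator.
        current_returns = pd.read_sql_query(
            "SELECT * FROM returns WHERE date = ?",
            self.conn, params=(day,), index_col="date", parse_dates=["date"])
        return past_returns, current_returns
```

The point of the sketch is the access pattern: memory use per call is bounded by the window you query, not by the length of the history, at the cost of one database round-trip per back-test iteration.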

For saving incremental results, you should subclass BacktestResult and implement whatever DB logic you wish. What it does now is essentially incremental table inserts into a few pandas DataFrames (initial positions, target weights, ...) and some Series (realized costs per iteration, a few timers, ...). You could also redirect the Python log stream to some persistent storage; right now it is also kept in memory. It should all be possible; the only issue is that BacktestResult was only recently opened up for extension (it had a private interface before) and I still might need to do some cleaning there.
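The incremental-insert idea can be sketched without the real BacktestResult interface (the class name, the log_step method, and the table schema below are made up for illustration): append each iteration's realized values to a persistent table as soon as they are computed, instead of accumulating them in memory.

```python
import sqlite3


class IncrementalResultStore:
    """Illustrative persistent result logger; NOT cvxportfolio's BacktestResult.

    Each back-test iteration appends one row (timestamp, portfolio value,
    realized cost) to a SQLite table, so memory use stays constant no matter
    how long the back-test runs.
    """

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(date TEXT PRIMARY KEY, value REAL, cost REAL)")

    def log_step(self, t, value, cost):
        # One insert per iteration; committing immediately means a crash
        # loses at most the current row.
        self.conn.execute(
            "INSERT INTO results VALUES (?, ?, ?)", (str(t), value, cost))
        self.conn.commit()
```

A side benefit of committing per step is that a partially completed back-test leaves a usable trail on disk, which is exactly what you want when runs are long.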

Now, the negatives. Some might be debatable, but this is my opinion at least. You lose multiprocessing. You probably lose reproducibility, unless you are very careful to make sure the tables you refer to aren't modified. And it's harder to debug, since potentially faulty operations, like DB queries, happen deep in the internals.