cvxgrp / cvxportfolio

Portfolio optimization and back-testing.
https://www.cvxportfolio.com
Apache License 2.0
910 stars 242 forks

Feature Request: Online Data Loading #155

Open GreenlandZZY opened 2 months ago

GreenlandZZY commented 2 months ago

Feature requests are well received but will probably be answered with a suggestion that you develop them and contribute.

Specifications

Description

Currently, cvxportfolio seems to require loading the full pandas DataFrame before optimization and back-testing begin. This can be an issue with a huge universe and a large set of factors, since exploring a long history then requires a huge amount of memory.

If there are already base classes available in cvxportfolio to implement these features, could you provide some suggestions? If not, could you add them? Thank you so much.

enzbus commented 2 months ago

The short answer is yes, it can be done relatively easily. The long answer is that it may create problems down the line and limits the flexibility of the system.

First, you need to implement your data loading mechanism, such as a query to your database table, in a subclass of MarketData. The methods you need to implement are documented here: https://www.cvxportfolio.com/en/stable/data.html#cvxportfolio.data.MarketData . The heavy lifter is the serve method, which takes the current back-test timestamp and returns both a view of the past market data (past open-to-open returns, past market volumes, ...), used by the market simulator and the trading policies, and the current data, used by the simulator. Two other methods are trivial; the only other tricky one might be trading_calendar, which needs to know the future trading times. Your custom MarketData server is then passed to the initializer of MarketSimulator, and that part is done.
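To illustrate the serve-on-demand idea, here is a minimal sketch of a loader that queries only the rows needed at each back-test timestamp from a SQLite table instead of holding the full history in memory. Note this is not the real cvxportfolio API: the class name, the SQL schema, and the two-element tuple returned by serve below are assumptions made for illustration; check the MarketData documentation linked above for the actual signatures and return values.

```python
import sqlite3

import pandas as pd


class SQLMarketData:
    """Illustrative serve-on-demand loader; NOT the real cvxportfolio API.

    Instead of holding the full returns history in memory, each call to
    ``serve`` queries only the rows needed at that back-test timestamp.
    """

    def __init__(self, conn):
        # An open sqlite3 connection with a 'returns' table whose first
        # column is a TEXT date ('YYYY-MM-DD') and the rest are asset returns.
        self.conn = conn

    def serve(self, t):
        """Return (past_returns, current_returns) for back-test time ``t``."""
        day = pd.Timestamp(t).strftime("%Y-%m-%d")
        # Past data: everything strictly before the current timestamp,
        # used by the trading policies (and the simulator).
        past_returns = pd.read_sql_query(
            "SELECT * FROM returns WHERE date < ? ORDER BY date",
            self.conn, params=(day,), index_col="date", parse_dates=["date"])
        # Current data: the single row at timestamp t, used by the simulator.
        current_returns = pd.read_sql_query(
            "SELECT * FROM returns WHERE date = ?",
            self.conn, params=(day,), index_col="date", parse_dates=["date"])
        return past_returns, current_returns
```

The point of the sketch is the access pattern: memory use per call is bounded by the window you query, not by the length of the history, at the cost of one database round-trip per back-test iteration.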

For saving incremental results, you should subclass BacktestResult and implement whatever DB logic you wish. What it does now is essentially incremental table inserts into a few pandas DataFrames (initial positions, target weights, ...) and some Series (realized costs per iteration, a few timers, ...). You could also redirect the Python log stream to some persistent storage; right now it is also kept in memory. It should all be possible; the only issue is that BacktestResult was only recently opened up for extension (it had a private interface before) and I still might need to do some cleaning there.
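The incremental-insert idea can be sketched without the real BacktestResult interface (the class name, the log_step method, and the table schema below are made up for illustration): append each iteration's realized values to a persistent table as soon as they are computed, instead of accumulating them in memory.

```python
import sqlite3


class IncrementalResultStore:
    """Illustrative persistent result logger; NOT cvxportfolio's BacktestResult.

    Each back-test iteration appends one row (timestamp, portfolio value,
    realized cost) to a SQLite table, so memory use stays constant no matter
    how long the back-test runs.
    """

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(date TEXT PRIMARY KEY, value REAL, cost REAL)")

    def log_step(self, t, value, cost):
        # One insert per iteration; committing immediately means a crash
        # loses at most the current row.
        self.conn.execute(
            "INSERT INTO results VALUES (?, ?, ?)", (str(t), value, cost))
        self.conn.commit()
```

A side benefit of committing per step is that a partially completed back-test leaves a usable trail on disk, which is exactly what you want when runs are long.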

Now, the negatives. Some might be debatable, but this is my opinion at least. You lose multiprocessing. You probably lose reproducibility, unless you are very careful to make sure the tables you refer to aren't modified. And it's harder to debug, since potentially faulty operations, like DB queries, happen deep in the internals.