backtrader2 / backtrader

Python Backtesting library for trading strategies
https://www.backtrader.com
GNU General Public License v3.0

High memory consumption while optimizing using InfluxDB data feed #10

Open vladisld opened 4 years ago

vladisld commented 4 years ago

I'm using InfluxDB to store the historical data. Naturally, the InfluxDB data feed is used for backtesting and optimizing the strategy.

Trying to optimize the strategy over a ~10-year, 5-minute data set with a few parameter ranges (resulting in 90 iterations) ran me out of memory on my dev machine (12 cores + 12 GB RAM).

Here are the Cerebro flags I was using:

    cerebro = bt.Cerebro(maxcpus=args.maxcpus,
                         live=False,
                         runonce=True,
                         exactbars=False,
                         optdatas=True,
                         optreturn=True,
                         stdstats=False,
                         quicknotify=True)

Analysis:

After a bit of debugging, the problem appears to be in the InfluxDB data feed implementation, which lacks proper support for preloading.

In the current implementation, the data from the Influx database is loaded during the InfluxDB.start method and the result-set is kept in memory for the lifetime of the InfluxDB instance. Even if Cerebro preloads all the data, the result-set (which is no longer needed in that case) will still remain in memory.
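To illustrate the pattern (a simplified, paraphrased sketch only, not the actual backtrader source; the measurement name and connection details below are made up):

    # Paraphrased sketch of the pattern described above
    from influxdb import InfluxDBClient

    class InfluxFeedSketch:
        def start(self):
            client = InfluxDBClient(host='localhost', port=8086, database='bars')
            result = client.query('SELECT * FROM "bars_5m"')  # hypothetical measurement
            # The whole result-set is materialized and kept as an attribute,
            # so it lives as long as the feed object itself ...
            self.bars = list(result.get_points())
            self.biter = iter(self.bars)

        def _load(self):
            # ... even though, once Cerebro has preloaded the feed's lines,
            # nothing ever needs self.bars again.
            try:
                bar = next(self.biter)
            except StopIteration:
                return False
            # copy bar fields into the feed's lines here
            return True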

This is problematic when running an optimization, where multiprocessing.Pool and Pool.imap are used to run the strategy with all its parameter permutations concurrently.

The way multiprocessing.Pool works (with the default start method on Linux, at least) is that the main process is simply forked for each worker process, and each worker inherits the main process memory - including the memory allocated for the aforementioned result-set in the InfluxDB data feed. In addition, for each run of the strategy, the Cerebro instance is serialized (pickled) and passed to a worker process; once again this includes the memory of the InfluxDB data feed, since it is directly referenced by the Cerebro instance. This unnecessarily increases the memory pressure during the optimization process.
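A rough, generic illustration of the pickling part (plain Python, nothing backtrader-specific; the class names are stand-ins): any large buffer reachable from the object that Pool.imap sends to the workers is serialized and shipped along with every task.

    import pickle
    from multiprocessing import Pool

    class FeedLike:
        """Stands in for a data feed holding a cached result-set."""
        def __init__(self):
            self.bars = bytes(80 * 1024 * 1024)  # ~80 MB stand-in buffer

    class CerebroLike:
        """Stands in for Cerebro: it references the feed directly."""
        def __init__(self):
            self.feed = FeedLike()

    def run_one(task):
        cerebro, params = task
        return params  # a real run would execute one strategy permutation here

    if __name__ == '__main__':
        cerebro = CerebroLike()
        # Everything reachable from cerebro is serialized for every task:
        print('pickled size per task: %.1f MB' % (len(pickle.dumps(cerebro)) / 1e6))
        with Pool(2) as pool:
            for res in pool.imap(run_one, ((cerebro, p) for p in range(4))):
                pass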

atulhm commented 4 years ago

Can you confirm the issue is on the Python client side and not on the Influx side? I did a little bit of research that has me wondering.

I can spend 30 minutes on Sunday 5/31 to work on this with you if it might be helpful.

vladisld commented 4 years ago

Thanks @atulhm for the interest in this item.

Please see my post on the community forum here

There are actually a few problems, both with the InfluxDB implementation and with the way the Cerebro instance is pickled during the optimization run.

I've had a solution for both of these problems in my fork for quite a long time already - I will try to provide a PR once proper tests are ready.
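To give an idea of the kind of change involved (a simplified sketch only, not the exact code from the fork): keep the feed from dragging its cached result-set through the pickling step, and release the buffer once iteration is done.

    class LeanInfluxFeedSketch:
        # hypothetical feed subclass, illustrative only

        def __getstate__(self):
            # drop the cached result-set when the feed is pickled for a worker
            state = self.__dict__.copy()
            state.pop('bars', None)
            state.pop('biter', None)
            return state

        def stop(self):
            # release the buffer once the feed has finished iterating
            self.bars = None
            self.biter = None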