MiniXC / simple-back

A simple daily python backtester that works out of the box.
Mozilla Public License 2.0

making simple_back faster #9

Closed MiniXC closed 4 years ago

MiniXC commented 4 years ago

making simple-back faster

This issue is intended to keep track of efforts to improve simple-back performance, and will probably remain open for a while.

how slow is it?

At the moment, the quickstart example runs in ~13 seconds with plotting enabled and in ~1 second without (your mileage may vary). The difference is so large because plotting is blocking; async io, or plotting in its own process with multiprocessing, would help. But this is not the whole picture: even without plotting, backtests with many different symbols can take minutes if not hours.
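On the plotting side, here is a minimal sketch of what "plotting in its own process with multiprocessing" could look like, assuming matplotlib and a Queue as the channel between the backtest loop and the plot. This is illustrative only, not how simple-back currently does it:

import multiprocessing as mp


def _plot_worker(queue):
    # the worker owns the matplotlib figure, so the GUI event loop
    # never blocks the backtest process
    import matplotlib.pyplot as plt

    plt.ion()
    _, ax = plt.subplots()
    while True:
        values = queue.get()
        if values is None:  # sentinel: backtest finished
            break
        ax.clear()
        ax.plot(values)
        plt.pause(0.01)


if __name__ == "__main__":
    queue = mp.Queue()
    proc = mp.Process(target=_plot_worker, args=(queue,), daemon=True)
    proc.start()
    queue.put([100.0, 101.5, 99.8])  # inside the backtest loop this call is cheap and does not block
    queue.put(None)  # tell the worker we are done
    proc.join()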

why is it slow?

Retrieving prices will always be the slowest part of any backtester that does not hold the whole universe of stocks and their prices in memory. At the moment, we use disk caching and then cache prices again in memory once they are requested at least once. But this is done at a more abstract level than it should be, as illustrated in the diagram below.

[diagram]

The problem is that prices is often called with different dates, and _get_cached ends up with one cache entry for each date. The place where we should cache in memory is YahooPriceProvider, not DailyPriceProvider, while still making it easy for someone new to the library to write a price provider without having to think too much about caching.
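A rough sketch of caching one level lower, keyed by symbol only. The class names match the ones mentioned above, but get_history, the dict cache, and the use of yfinance are assumptions for illustration, not simple-back's actual implementation:

import pandas as pd
import yfinance as yf  # assumption: yfinance as the Yahoo Finance client


class DailyPriceProvider:
    def __getitem__(self, key):
        symbol, date = key
        # date handling stays generic here; subclasses only decide how histories are fetched
        history = self.get_history(symbol)
        return history.loc[:date].iloc[-1]

    def get_history(self, symbol: str) -> pd.DataFrame:
        raise NotImplementedError


class YahooPriceProvider(DailyPriceProvider):
    def __init__(self):
        self._history_cache = {}

    def get_history(self, symbol: str) -> pd.DataFrame:
        # one in-memory entry per symbol; repeated calls with different dates all hit it
        if symbol not in self._history_cache:
            self._history_cache[symbol] = yf.download(symbol, progress=False)
        return self._history_cache[symbol]

In practice the per-symbol cache could also live in a small mixin or decorator, so someone writing a new provider still does not have to think about caching at all.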

tasks

side-note on async io

The reason I think async io would be best in the long run is that we currently perform many tasks that have to wait for input (prices). When buying e.g. all SP500 securities, a significant amount of time is spent on each order just waiting for its price, while in theory all the orders could be placed at the same time and wait for their prices at the same time. Async io could make syntax like this the new norm for buying multiple securities:

b.order_many(['ticker1', 'ticker2', 'ticker3', ...], [weight1, weight2, weight3, ...])

If we use async io under the hood, we can request all prices at once and don't have to wait for each one to arrive before ordering the next ticker.
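For illustration, a minimal standalone sketch of how such an order_many could be built on asyncio.gather; fetch_price is a stand-in coroutine and the print is a placeholder for the real order logic, neither is part of simple-back:

import asyncio
import datetime


async def fetch_price(symbol: str, date: datetime.date) -> float:
    # stand-in for an async request to the data source
    await asyncio.sleep(0.1)
    return 100.0


async def order_many(symbols, weights, date):
    # start every price request at once and await them together,
    # instead of waiting for each price before ordering the next ticker
    prices = await asyncio.gather(*(fetch_price(s, date) for s in symbols))
    for symbol, weight, price in zip(symbols, weights, prices):
        print(f"order {weight:+.2f} of {symbol} at {price}")


asyncio.run(order_many(["ticker1", "ticker2", "ticker3"], [0.5, 0.3, 0.2], datetime.date(2020, 1, 2)))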

1D0BE commented 4 years ago

It is important to note here that even with async io in place, attempting to pull all prices at once could result in a 429 TooManyRequests error from the Yahoo Finance data API. An intelligent solution could be to let the user specify, before the backtest, which tickers they might want to use, and to preload them into RAM.
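A small sketch of what such preloading could look like, assuming yfinance for the download; the prefetch helper and the pause between requests are illustrative, not part of simple-back:

import time

import yfinance as yf  # assumption: yfinance as the Yahoo Finance client


def prefetch(symbols, pause=1.0):
    # download the full history for each declared ticker once, before the backtest starts
    cache = {}
    for symbol in symbols:
        cache[symbol] = yf.download(symbol, progress=False)
        time.sleep(pause)  # crude spacing between requests to stay under the rate limit
    return cache


prices = prefetch(["AAPL", "MSFT", "GOOG"])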

MiniXC commented 4 years ago

This is a good point I hadn't considered. The fundamental question is whether we always know which tickers we will use.

If we want as little overhead as possible for the end user, and they want to e.g. implement a portfolio-weighting algorithm for SP500 tickers, they would have to write an additional loop at the beginning that steps through time, collects all the tickers that are ever in the SP500, and marks them to be loaded. Also, loading all tickers at the beginning might feel faster, but it cannot actually be faster than requesting them at runtime, as long as those requests are non-blocking and prices are loaded into memory the first time they are requested.
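As a sketch, that pre-pass could look like the following; spx_constituents is a hypothetical helper for historical index membership and is stubbed out here:

import datetime


def spx_constituents(day):
    # stub: in practice this would come from a historical constituents dataset
    return ["AAPL", "MSFT", "XOM"]


def collect_tickers(start, end):
    # step through the backtest period and collect every ticker that is
    # ever part of the index, so it can be marked for loading up front
    tickers = set()
    day = start
    while day <= end:
        tickers |= set(spx_constituents(day))
        day += datetime.timedelta(days=1)
    return tickers


tickers = collect_tickers(datetime.date(2015, 1, 1), datetime.date(2020, 1, 1))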

Regarding TooManyRequests: another feature could be a method (or maybe even a class with a queue of requests to send) that retries failed requests. This is exactly the kind of thing I want the end user not to worry about, but there is always the danger of introducing weird bugs if it's not clear to the user that requests are being retried.
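A minimal sketch of such a retrying helper with exponential backoff on 429 responses; the function name is illustrative, and in practice this would live behind the price provider rather than in strategy code:

import time

import requests


def get_with_retry(url, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # back off exponentially before retrying a rate-limited request
        time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate-limited after {max_retries} attempts")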

If someone is just implementing a crawler for a new data source, retrying requests if they fail is a very bad idea. I see two options here:

MiniXC commented 4 years ago

Fixed caching and moved plotting to its own thread. Plotting was actually slowing things down a lot more than expected, around 10x for me.

MiniXC commented 4 years ago

Closing this for now - will open a separate issue for async io.