Is a feature toggle for subtracting the warmup period possible?

Pirat83 commented 12 months ago

Hello @edtechre,

I have studied https://github.com/edtechre/pybroker/issues/48 a little more closely. From what I understand, there are 2 different start_date / end_date combinations.

Strategy(start_date, end_date) -> Defines the interval of the data that is fetched
Strategy#backtest(start_date, end_date) -> Defines the backtest interval

The Strategy#backtest(..., warmup) parameter is added to the start_date, so the Strategy starts trading at start_date + warmup and runs until end_date.

Here is an example with a warmup period of 5 daily bars: https://github.com/Pirat83/pybroker-experiments/blob/master/main.py

                         cash    equity    margin  market_value     pnl  unrealized_pnl  fees
date                                                                                         
2023-01-03 05:00:00  10000.00  10000.00      0.00      10000.00    0.00             0.0   0.0
2023-01-04 05:00:00  10000.00  10000.00      0.00      10000.00    0.00             0.0   0.0
2023-01-05 05:00:00  10000.00  10000.00      0.00      10000.00    0.00             0.0   0.0
2023-01-06 05:00:00  10000.00  10000.00      0.00      10000.00    0.00             0.0   0.0
2023-01-09 05:00:00  10000.00  10000.00      0.00      10000.00    0.00             0.0   0.0
2023-01-10 05:00:00  10000.00  10000.00      0.00      10000.00    0.00             0.0   0.0
2023-01-11 05:00:00     30.70  10082.50      0.00      10082.50   82.50             0.0   0.0
2023-01-12 05:00:00     30.70  10251.35      0.00      10251.35  251.35             0.0   0.0
2023-01-13 05:00:00     30.70  10318.45      0.00      10318.45  318.45             0.0   0.0
2023-01-17 05:00:00     30.70  10305.80      0.00      10305.80  305.80             0.0   0.0
2023-01-18 05:00:00     30.70  10139.70      0.00      10139.70  139.70             0.0   0.0
2023-01-19 05:00:00     30.70  10042.90      0.00      10042.90   42.90             0.0   0.0
2023-01-20 05:00:00  10107.25  10107.25   9942.50       9961.75  107.25          -145.5   0.0

My challenge is to make multiple strategies comparable. They should all start trading on 2023-01-01, regardless of their warmup period.

So it is a bit complicated for me to calculate the concrete start_date - warmup so that the first trades happen exactly on 2023-01-01. On the daily timeframe we only have to deal with holidays, weekends and other days on which the stock exchanges are closed. But things get very messy if I want to multiply the indicator values by a timeframe multiplier to apply the daily logic on a lower timeframe (e.g. 390 when trading 1-minute bars). In that scenario things get very complicated.

So is there a way to solve my issue without requiring a business day calendar and a list of the days on which the stock exchanges were open?

I think once the data is read, the warmup period can be subtracted and all indicators can be calculated. That way everything would be warmed up before Strategy.backtest(start_date, ...) and the first trades could happen on that date (if the stock exchanges are open on that day; otherwise the next candle could be used as the start). In the example above, 2023-01-03 would be the date to start trading.
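A minimal sketch of the "start on the requested date, or on the next candle" rule, assuming the bar dates are already loaded as a sorted list. The helper name `first_trading_index` is hypothetical and not part of PyBroker; indicators would be computed over the whole list, while trading would only begin at the returned index.

```python
from bisect import bisect_left
from datetime import date


def first_trading_index(bar_dates: list[date], start: date) -> int:
    """Return the index of the first bar on or after `start`.

    `bar_dates` must be sorted ascending. If `start` falls on a day without
    a bar (weekend, holiday), the next available bar is used.
    """
    idx = bisect_left(bar_dates, start)
    if idx == len(bar_dates):
        raise ValueError("no bar on or after the requested start date")
    return idx


bars = [date(2022, 12, 29), date(2022, 12, 30), date(2023, 1, 3), date(2023, 1, 4)]
idx = first_trading_index(bars, date(2023, 1, 1))
print(bars[idx])  # 2023-01-03
```

No exchange calendar is needed here, because the dates in the fetched data already encode which days were trading days.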

Ideally we could add an additional parameter to StrategyConfig to keep the APIs and the behavior backward compatible (if desired). I hope that calculating the warmup would then be much easier (for me) and backtesting results would be more comparable.

What do you think about this idea? Or maybe there is an easy solution for this that I have not found yet?

Thank you for your time and your effort. I really like your work.

edtechre commented 12 months ago

Hi @Pirat83,

Inside of your execution, you can check the bar's current date using ctx.date to see if your strategy should start trading. Does that help?
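A rough sketch of such a date check, written as a wrapper around an execution function. The names `trade_from` and `get_dt` are hypothetical, and the default accessor assumes the context exposes the current bar's timestamp as a `dt` attribute; pass a different `get_dt` if it lives elsewhere. A `SimpleNamespace` stands in for the real context so the snippet runs on its own.

```python
from datetime import datetime
from types import SimpleNamespace
from typing import Callable


def trade_from(start: datetime, fn: Callable, get_dt: Callable = lambda ctx: ctx.dt) -> Callable:
    """Wrap `fn` so that it is skipped for every bar before `start`."""
    def gated(ctx):
        if get_dt(ctx) < start:
            return None  # still warming up: do not trade on this bar
        return fn(ctx)
    return gated


# Stand-in contexts, for illustration only.
calls = []
gated = trade_from(datetime(2023, 1, 3), lambda ctx: calls.append(ctx.dt))
for day in (datetime(2022, 12, 30), datetime(2023, 1, 3), datetime(2023, 1, 4)):
    gated(SimpleNamespace(dt=day))
print(len(calls))  # 2
```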

Pirat83 commented 12 months ago

Well I already implemented this check. Two additional issues arise in this case.

1) Not all indicators are initialized. Sometimes an ExecutionContext is missing when trading, and so far I don't know why the bar data is missing at some points in time. The Strategy then starts trading at some later point, let's say in April. This makes Strategies that rely on multiple instruments hard to implement. Let's say: if RSI(20) of QQQ <= 30, buy SQQQ, else buy TQQQ. In this situation shorting is an option, but in most of my other Strategies I don't have this option.

2) Some of the metrics require the correct start / end date. This is nice to have for me because we will use Quant Stats anyway. The metrics also require the risk-free rate to be correct. So this issue is on my to-do list and is optional for me.

I will change the example in the Git repository to reflect the changes.

edtechre commented 12 months ago

It sounds like the problem you're having has to do with bars not being shared with all of your instruments. For instance, if you are using Strategy#set_before_exec or Strategy#set_after_exec, the ExecContexts will only be passed for instruments that have data for that bar.

Assuming that is true, there is not much PyBroker can do for you in that case since data is missing.

Pirat83 commented 12 months ago

Yeah I have experimented with a "forward fill DataSource adapter design pattern".

This is very complicated, since an RSI(20) of QQQ is not comparable to an RSI(20) of SPY if one of those instruments is missing data within the last 20 candles. In this case we would need to use an RSI(19) for the other instrument. This gets very complicated when you have several hundred instruments in your strategy, and it is also not very intuitive for the consumer, because missing data on QQQ has an impact on the RSI length of SPY.

edtechre commented 12 months ago

Hi @Pirat83,

Have you considered modifying your data in Pandas first to fix the missing data issues you have? You can then use the Pandas DataFrame as a DataSource.

Pirat83 commented 12 months ago

Hi @edtechre,

this is exactly what I have done (don't be confused by the name of this adapter; in the long term it should store and read data from Timescale DB):

import os
from datetime import datetime
from typing import Optional

import pandas as pd
from pandas import DataFrame
from pybroker import DataCol
from pybroker.data import DataSource, Alpaca


class TimeScaleDBDataSource(DataSource):
    def __init__(self, delegate: Optional[Alpaca] = None):
        super().__init__()
        # Fall back to an Alpaca data source configured from environment variables.
        self.delegate = delegate if delegate is not None else Alpaca(os.getenv('ALPACA_KEY_ID'), os.getenv('ALPACA_SECRET'))

    @staticmethod
    def _fill_and_reset_index(group: DataFrame, start_date: datetime, end_date: datetime, timeframe: str):
        # NOTE: hard-coded to 15-minute bars; the timeframe argument is not used here.
        interval = pd.Timedelta(minutes=15)

        # Synthetic, gap-free index covering the whole requested range.
        index = pd.date_range(start=start_date, end=end_date, freq=interval, tz='US/Eastern')

        # Align this symbol's bars to the synthetic index, then fill the gaps.
        group = group.reindex(index)
        group = group.ffill()
        group = group.bfill()  # fills leading gaps with later values (look-ahead)

        group['date'] = group.index
        return group

    def _fetch_data(self, symbols: frozenset[str], start_date: datetime, end_date: datetime, timeframe: Optional[str], adjust: Optional[str]) -> pd.DataFrame:
        # noinspection PyProtectedMember
        result: DataFrame = self.delegate._fetch_data(symbols, start_date, end_date, timeframe, adjust)

        # Index by date so each symbol's bars can be reindexed onto the synthetic index.
        result = result.set_index(DataCol.DATE.value, drop=False)
        result = result.groupby(DataCol.SYMBOL.value).apply(self._fill_and_reset_index, start_date, end_date, timeframe)
        result = result.reset_index(drop=True)

        return result

I use the delegate (the original Alpaca#_fetch_data(...) method) to get the data from Alpaca.

Then I regroup the DataFrame by symbol and create a synthetic index.

Forward filling and backward filling the DataFrame is not the most accurate way to handle this in a production environment, but it is IMHO good enough for a backtesting framework, compared to fiddling with different indicator lengths or other more error-prone "solutions". Backfilling the DataFrame introduces a look-ahead bias, but since this data is only used to warm up indicators that are not really used (in my case the warmup period is subtracted from the start_date), it is okay for me.
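To make the effect of the two fills concrete, here is a small standalone illustration with made-up prices on a 15-minute grid (the values and timestamps are invented for the example). The first grid slot has no earlier bar, so only bfill can populate it, and it does so with a later price; that is the look-ahead mentioned above.

```python
import pandas as pd

# Bars exist only at 09:45 and 10:00; the 09:30 and 10:15 slots are missing.
bars = pd.DataFrame(
    {"close": [101.0, 102.0]},
    index=pd.to_datetime(["2023-01-03 09:45", "2023-01-03 10:00"]),
)
grid = pd.date_range("2023-01-03 09:30", "2023-01-03 10:15", freq="15min")

# ffill carries the last known bar forward; bfill fills what is left at the start.
filled = bars.reindex(grid).ffill().bfill()
print(filled["close"].tolist())  # [101.0, 101.0, 102.0, 102.0]
```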

Pirat83 commented 12 months ago

If anyone has similar issues, I can share that code. And I am still convinced that solving those two challenges in the PyBroker framework would help many people.

Thank you for your help. I appreciate your work very much. I have investigated many backtesting frameworks, and PyBroker is at the top in terms of quality, architecture and codebase.

edtechre commented 12 months ago

Hi @Pirat83,

Can you clarify what the two challenges are and your proposed solution?

Pirat83 commented 11 months ago

Hi @edtechre, yes of course:

1) Aligning start_date and end_date to make it easier to compare multiple strategies with different indicator lengths:

[Image: Pybroker drawio]

See e.g. https://github.com/edtechre/pybroker/blob/master/src/pybroker/strategy.py#L226. Maybe this property should have a different name than warmup. With this change, each Strategy would start trading on exactly the same day (which makes it easier to compare multiple Strategies), and indicators would be warmed up before the backtesting start_date. Backtesting would start exactly at start_date with all indicators warmed up.

Please keep in mind the choice between >=, >, < and <=. I am not sure yet which one to choose to be consistent with the rest of PyBroker's architecture.

2) Forward filling and backfilling the Pandas DataFrame to make indicator usage easier / more consistent when data is not present

Simply add an ffill and bfill at https://github.com/edtechre/pybroker/blob/master/src/pybroker/data.py#L389:

    @staticmethod
    def _fill_and_reset_index(group: DataFrame, start_date: datetime, end_date: datetime, timeframe: str):
        # Bar spacing derived from the requested timeframe.
        interval = pd.Timedelta(timeframe)

        # Synthetic, gap-free index covering the whole requested range.
        index = pd.date_range(start_date, end_date, freq=interval, tz='US/Eastern')

        # Align this symbol's bars to the synthetic index, then fill the gaps.
        group = group.reindex(index)
        group = group.ffill()
        group = group.bfill()  # fills leading gaps with later values (look-ahead)

        group['date'] = group.index
        return group

    def _fetch_data(self, symbols: frozenset[str], start_date: datetime, end_date: datetime, timeframe: Optional[str], adjust: Optional[str]) -> pd.DataFrame:
        # noinspection PyProtectedMember
        result: DataFrame = self.delegate._fetch_data(symbols, start_date, end_date, timeframe, adjust)

        # Index by date so each symbol's bars can be reindexed onto the synthetic index.
        result = result.set_index(DataCol.DATE.value)
        result = result.groupby(DataCol.SYMBOL.value).apply(self._fill_and_reset_index, start_date, end_date, timeframe)
        result = result.reset_index(drop=True)

        return result

I would strongly suggest adding a feature toggle to StrategyConfig, so people can decide whether they want to use this. E.g. bfill adds data that never existed in reality in that form, and ffill needs to be handled in a live trading environment anyway. This change could also be made in the other DataSources if they have the same challenge as Alpaca.

Pirat83 commented 11 months ago

I have found an issue in my code:

        group = group.reindex(index)

This needs better validation. It sets the whole group to NaN when start_date / end_date does not start at 0:00h on a daily timeframe.
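The failure mode can be reproduced in isolation, and one possible guard is to refuse a grid that shares no timestamps with the data. This is only a sketch: `reindex_checked` is a hypothetical helper, and the example assumes daily bars stamped at midnight.

```python
import pandas as pd


def reindex_checked(group: pd.DataFrame, grid: pd.DatetimeIndex) -> pd.DataFrame:
    """Reindex `group` onto `grid`, failing loudly if they share no timestamps."""
    if len(group) > 0 and group.index.intersection(grid).empty:
        raise ValueError("grid shares no timestamps with the data; check start_date alignment")
    return group.reindex(grid)


daily = pd.DataFrame(
    {"close": [10.0, 11.0]},
    index=pd.to_datetime(["2023-01-03", "2023-01-04"]),
)

# A daily grid starting at 05:00 never hits the midnight timestamps,
# so a plain reindex returns NaN everywhere.
shifted = pd.date_range("2023-01-03 05:00", periods=2, freq="D")
print(daily.reindex(shifted)["close"].isna().all())  # True

# A grid aligned with the data passes the check.
aligned = pd.date_range("2023-01-03", periods=2, freq="D")
print(reindex_checked(daily, aligned)["close"].tolist())  # [10.0, 11.0]
```

Raising early keeps the subsequent ffill / bfill from silently operating on an all-NaN frame.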

edtechre commented 7 months ago

Thank you for your thoughtful input, @Pirat83.

> Aligning start_date and end_date to improve comparing multiple strategies with different indicator length

Can you explain what you mean by different indicator length here? Also, what do you mean by start_date_trade and end_date_trade?

Pirat83 commented 7 months ago

Hi @edtechre, thank you for your time:

I need to select one Strategy that fits the market conditions. Therefore I need to make all my strategies comparable with each other. I achieved this by skipping candles and then starting the backtest from exactly the same candle.

Let's compare 2 Strategies: one uses the RSI 14 and one the RSI 21. I am interested in the results of each Strategy per candle. We will start with the first trading day of the year, which was 2023-01-03.

In reality the Strategies would be much more complicated, but let's keep it simple:

1) RSI 14: The RSI 14 Strategy requires 14 candles of warmup to calculate the RSI(14). So it needs data from 2022-12-09 until 2022-12-30 to warm up. Then we can use the RSI value on 2023-01-03, which is the first trading day of the new year.

2) RSI 21: The RSI 21 Strategy would need 21 candles, so it needs the data from 2022-11-30 to 2022-12-30 to complete the warmup and start trading on 2023-01-03.

After the trading day 2023-01-03 I want to store the metrics and the Portfolio in my TimeScaleDB, and then I need to decide which Strategy should be used next. That's what we are talking about in the other tickets, but it is not part of this ticket.

PyBroker currently adds the warmup period to the start_date instead of subtracting it. So for the 2 Strategies above, our first trading day would be 2023-01-25 if we use RSI(14) and 2023-02-02 if we use RSI(21). This is highly counterintuitive and misleading, and it makes comparing Strategies much harder.

Pirat83 commented 7 months ago

So I have created a workaround:

I have extended the Strategy class and added start_date_trade and end_date_trade.

So I can specify which candle should be the start. Then I take all my indicators, determine their lengths, and use the maximum (plus a 30% buffer) as PyBroker's warmup period. The warmup period is subtracted from my start_date_trade to calculate PyBroker's start_date.

This approach also requires filtering inside the method provided to 'Strategy#add_execution'. It is not very elegant, but for the moment it does what it should. If you are interested, I can also provide some code snippets. They will not be executable code, but they should be enough to find a better solution.

One solution would be to change PyBroker's behavior, e.g. with a feature toggle (to ensure backward compatibility), to subtract the warmup period instead of adding it.
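As a rough illustration of that calculation (the function name and parameters are hypothetical, not taken from the repository), the warmup can be derived from the longest indicator plus the buffer, rounded up. The optional `bars_per_day` factor reflects the timeframe multiplier mentioned earlier in the thread, e.g. 390 for 1-minute bars.

```python
def warmup_bars(indicator_lengths, buffer_pct=30, bars_per_day=1):
    """Warmup in bars: longest indicator plus a percentage buffer, rounded up,
    then scaled to the bar timeframe."""
    longest = max(indicator_lengths)
    # Integer ceiling of longest * (1 + buffer_pct / 100), avoiding float rounding.
    daily_bars = (longest * (100 + buffer_pct) + 99) // 100
    return daily_bars * bars_per_day


print(warmup_bars([14, 21]))                    # 28
print(warmup_bars([14, 21], bars_per_day=390))  # 10920
```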

edtechre commented 7 months ago

> If you are interested, I can also provide some code snippets. They will not be executable code, but they should be enough to find a better solution.

Yes, please share it!

> One solution would be to change PyBroker's behavior, e.g. with a feature toggle (to ensure backward compatibility), to subtract the warmup period instead of adding it.

This makes sense to me, I can support this via an additional config option.

Pirat83 commented 7 months ago

Hi @edtechre - I have added some example code to https://github.com/Pirat83/pybroker-experiments/tree/issue-69-and-51

Pirat83 commented 7 months ago

Well, in theory what I am doing is easy. In practice it is a little bit complicated.

I have indicators on a daily chart, e.g. an SMA 200, SMA 50, SMA 20, etc. I take the maximum length over all my indicators, in this use case 200. This value is stored in the warmup period variable. Then I subtract the warmup from the start_date_time to fetch the data.

The Strategy#backtest(start_date=xxx, end_date=xxx, ...) method should then filter the candles where backtesting should actually take place, but that is not so easy and I needed to implement it myself. See: https://github.com/edtechre/pybroker/issues/69#issuecomment-1868544466

Please correct me if I understood something wrong.

Thank you very much for your time and work.

edtechre commented 4 months ago

Hi @Pirat83,

I had time to think about this more. Subtracting the warmup period will not work when querying data from a DataSource. The start_time is needed to fetch data from a remote data source (e.g. Yahoo Finance, Alpaca), but it won't be (easily) possible to know the new start time, offset by the warmup period, before the data has been fetched. Yet that offset start time needs to be known in order to query the data source in the first place.

What I would suggest doing is querying the DataFrame from a DataSource, and then subtracting the warmup period from your intended start date to find the offset start_date in the DataFrame to use for your backtest.
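One way this could look, as a sketch only: `offset_start_date` is a hypothetical helper, and it assumes the bar dates of the queried DataFrame have been collected into a sorted, de-duplicated DatetimeIndex.

```python
import pandas as pd


def offset_start_date(dates: pd.DatetimeIndex, trade_start, warmup: int) -> pd.Timestamp:
    """Return the bar date lying `warmup` bars before the first bar on or after `trade_start`."""
    first = dates.searchsorted(pd.Timestamp(trade_start))  # first bar >= trade_start
    if first == len(dates):
        raise ValueError("no bars on or after trade_start")
    if first < warmup:
        raise ValueError("not enough history for the requested warmup")
    return dates[first - warmup]


dates = pd.DatetimeIndex(
    ["2022-12-28", "2022-12-29", "2022-12-30", "2023-01-03", "2023-01-04"]
)
start = offset_start_date(dates, "2023-01-01", warmup=2)
print(start.date())  # 2022-12-29
```

The returned date would then serve as the backtest start_date together with the same warmup, so that trading begins on the intended bar, assuming warmup counts bars as described earlier in the thread.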

edtechre / pybroker

Is a feature toggle for subtracting the warmup period possible? #51