edtechre / pybroker

Algorithmic Trading in Python with Machine Learning
https://www.pybroker.com

Indicators and dataframes #5

Closed rokups closed 1 year ago

rokups commented 1 year ago

I figured I'd share some thoughts I have after trying PyBroker a bit.

Writing indicators could be more convenient, I think. For example, in freqtrade it works like this: the bot calls a populate_indicators() function that we implement and passes the entire dataframe to it. There we can do things like this:

df['spread'] = df['high'] - df['low']
df['spread_sma20'] = ta.SMA(df['spread'], 20)
df['spread_sma40'] = ta.SMA(df['spread'], 40)

This looks trivial on the surface, and of course it is nothing PyBroker cannot do, but it is actually very powerful.

To achieve something like this in PyBroker, we would have to create custom indicator functions for spread_sma20 and spread_sma40. But then we waste the calculation of the spread column, as it is now done twice.
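To make the duplication concrete, here is a minimal sketch of what those two per-indicator functions might look like, assuming each one receives the bar data independently (a pandas rolling mean stands in for ta.SMA; the function names and signatures are illustrative, not PyBroker's actual API):

```python
import pandas as pd

# Hypothetical standalone indicator functions: because each one only
# returns a single series, the intermediate spread column has to be
# recomputed inside both.

def spread_sma20(df: pd.DataFrame) -> pd.Series:
    spread = df["high"] - df["low"]   # computed here...
    return spread.rolling(20).mean()

def spread_sma40(df: pd.DataFrame) -> pd.Series:
    spread = df["high"] - df["low"]   # ...and recomputed here
    return spread.rolling(40).mean()
```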

It is also rather cumbersome to use indicator libraries like TA-Lib or pandas_ta. These libraries already provide one-function-call indicators that we now must wrap in yet another function to make them known to PyBroker.

Normally I would just say "whatever, I'll do it on the symbol dataframe", but datasource.query() merges all symbols into one dataframe, and that is the only place where it seems to make sense to insert custom indicators for backtesting.

What would be convenient

First of all, it seems to me it would make more sense if data_source.query() returned a list of dataframes instead of one dataframe with all symbols. This dataframe needs to be split anyway. Besides, merging dataframes of different symbols puts a burden on the user to make sure that the dataframes of all queried symbols are of equal length, and the user must properly merge them in case there are missing candles. If everyone has to do it, we might as well do it in the library.
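The split-and-align work described above might look something like this sketch, assuming the merged frame has `symbol` and `date` columns (the column names and data are illustrative, not a guaranteed query() layout):

```python
import pandas as pd

# A merged multi-symbol frame in which MSFT is missing a candle:
merged = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-02"]),
    "close": [125.0, 126.0, 239.0],
})

# Split per symbol, then reindex each piece onto the union of all
# dates so symbols with missing candles still line up (gaps -> NaN).
all_dates = pd.Index(sorted(merged["date"].unique()), name="date")
frames = {
    sym: grp.set_index("date").reindex(all_dates)
    for sym, grp in merged.groupby("symbol")
}
```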

Then, if the dataframes were separate, we could also have a user-implemented indicators_fn(df) in the same spirit as exec_fn, which would allow massaging the dataframe in any way we see necessary and utilizing the full power of pandas.

This approach should be future-proof as well, since support for multiple timeframes could be implemented by specifying an indicator_fn per timeframe. It should also play well with live trading, since indicator_fn could be called once every time a new bar comes in.

Multi-symbol indicators

There is one special case where my proposed approach is not good enough: pairs trading. We need the price data of two symbols in order to calculate the necessary metrics. Maybe a way to get a raw (just OHLC data, no indicators) symbol dataframe inside indicator_fn could be an option. On the same note, order entry for pairs trading is also a bit unintuitive, as the entire process is split over two execute_fn iterations, but that's another topic.

Anyhow, this is by no means a request, just some food for thought and discussion. My proposal may have shortcomings that are not obvious to me.

edtechre commented 1 year ago

First off, thank you for your thoughtful input @rokups, it is much appreciated. My thoughts are below:

df['spread'] = df['high'] - df['low']
df['spread_sma20'] = ta.SMA(df['spread'], 20)
df['spread_sma40'] = ta.SMA(df['spread'], 40)

This looks trivial on the surface, and of course it is nothing PyBroker cannot do, but it is actually very powerful.

To achieve something like this in PyBroker, we would have to create custom indicator functions for spread_sma20 and spread_sma40. But then we waste the calculation of the spread column, as it is now done twice.

PyBroker computes indicators in parallel using a process pool. To keep this simple, each ticker and indicator function pair is distributed to the pool as an independent task. This means there are no dependencies between indicators, making their computation easily parallelizable.

If you need to share custom data between indicators, you can register a custom data column with PyBroker and then create your own DataSource class or pass your own DataFrame to PyBroker. The Creating a Custom DataSource notebook shows how to do this. In your example, you would calculate the spread column in your DataFrame and then register it using pybroker.register_columns. The custom column will then be made available on the BarData instance passed to your indicator function.
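Following the flow described above, the spread would be computed once in the DataFrame before it is handed to PyBroker. A minimal sketch (the pybroker calls from the comment above are shown commented out so the snippet runs with pandas alone; the exact attribute access `data.spread` on BarData is an assumption based on the description above):

```python
import pandas as pd

# Compute the shared 'spread' column once, up front, in the DataFrame
# you pass to PyBroker (e.g. via a custom DataSource):
df = pd.DataFrame({"high": [2.0, 3.0, 4.0], "low": [1.0, 1.5, 2.0]})
df["spread"] = df["high"] - df["low"]   # done once, not per indicator

# With pybroker installed, you would then register the column so it is
# available on the BarData passed to each indicator function:
#
#   import pybroker
#   pybroker.register_columns('spread')
#   spread_sma20 = pybroker.indicator(
#       'spread_sma20',
#       lambda data: pd.Series(data.spread).rolling(20).mean())
```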

It is also rather cumbersome to use indicator libraries like TA-Lib or pandas_ta. These libraries already provide one-function-call indicators that we now must wrap in yet another function to make them known to PyBroker.

I am considering creating a wrapper around TA-Lib. You should already be able to use pandas_ta with a custom data source and registered custom columns, as explained above. Perhaps I can add a pandas_ta example to the custom DataSources notebook.

This dataframe needs to be split anyway. Besides, merging dataframes of different symbols puts a burden on the user to make sure that the dataframes of all queried symbols are of equal length, and the user must properly merge them in case there are missing candles. If everyone has to do it, we might as well do it in the library.

Then, if the dataframes were separate, we could also have a user-implemented indicators_fn(df) in the same spirit as exec_fn, which would allow massaging the dataframe in any way we see necessary and utilizing the full power of pandas.

Creating multiple DataFrames would introduce extra overhead and complexity. External APIs for historical data are designed to return a single DataFrame to maintain simplicity and performance. However, a bigger concern is that having multiple DataFrames may not parallelize efficiently across multiple processes due to memory limitations and would also severely slow down serialization given PyBroker's current implementation. On the other hand, NumPy arrays can be mem-mapped across processes with ease and can be accelerated using Numba.
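A tiny illustration of the mem-mapping point above: a NumPy array written as a memory map can be reopened read-only (as a pool worker would) without copying or pickling the data, which is not the case for a collection of per-symbol DataFrames. The file path here is just a temp-file example:

```python
import os
import tempfile

import numpy as np

# Write a small array of closes as a memory-mapped file.
path = os.path.join(tempfile.mkdtemp(), "close.dat")
src = np.memmap(path, dtype="float64", mode="w+", shape=(4,))
src[:] = [100.0, 101.0, 102.0, 103.0]
src.flush()

# Reopen read-only, as a worker process would: no copy, no pickling.
view = np.memmap(path, dtype="float64", mode="r", shape=(4,))
```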

There is one special case where my proposed approach is not good enough: pairs trading. We need the price data of two symbols in order to calculate the necessary metrics.

You can retrieve the indicator of another symbol using ExecContext#indicator(), as well as OHLCV + custom column data with ExecContext#foreign().

I agree that support for multi-symbol indicators would make sense. It is something that I considered during the design phase, but I limited the implementation to single-symbol indicators for the sake of simplicity in the initial release (V1). I need to give this more thought, but my plan would be to add support for multi-symbol indicators as a configuration option that groups data for all symbols per indicator. If you have any suggestions, please let me know. In the meantime, you can calculate the multi-symbol indicator outside of PyBroker, save it to a DataFrame column, and then register the custom column with PyBroker.
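The interim workaround suggested above, computing a multi-symbol indicator outside of PyBroker and storing it as a column, might look like this sketch for a pairs trade. The z-scored price spread is one common pairs metric, used here purely as an example; the resulting column is what you would then register via register_columns:

```python
import pandas as pd

# Closes for the two legs of the pair (illustrative data).
close_a = pd.Series([10.0, 10.5, 11.0, 10.8, 11.2])
close_b = pd.Series([10.0, 10.2, 10.4, 10.9, 10.7])

# Z-score of the price spread, computed outside PyBroker; save this as
# a DataFrame column and register it as a custom column afterwards.
spread = close_a - close_b
pair_zscore = (spread - spread.mean()) / spread.std()
```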

rokups commented 1 year ago

Hmm, what you say does make sense...

I am considering creating a wrapper around ta-lib

Here is a little help with that: talibgen.py.txt

This is an updated and fixed version of the script from https://github.com/TA-Lib/ta-lib-python/pull/212/; it should simplify the process.

edtechre commented 1 year ago

Great, thank you!

edtechre commented 1 year ago

After reviewing TA-Lib again, I am unsure whether creating a wrapper for it adds significant value. It's already fairly straightforward to integrate TA-Lib with PyBroker using lambdas, as shown in the following example:

import pybroker
import talib

# 20-period RSI, wrapped directly in a lambda
rsi_20 = pybroker.indicator('rsi_20', lambda data: talib.RSI(data.close, timeperiod=20))
rsi_20(df)

I added this example to the Writing Indicators notebook.

rokups commented 1 year ago

Hmm, I did not think about using lambdas. Thank you for the example. I suppose this is solved then?

JevonYang commented 2 months ago

I need to use a lot of indicators in PyBroker, which are taken directly from the datasource. Is there a way to quickly register these indicators? Or can I get the whole dataframe from the datasource directly in the context without registering them?

edtechre commented 2 months ago

Hi @JevonYang,

You can register the indicator columns in your dataframe using pybroker.register_columns.
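For a large number of datasource columns, one way to avoid registering them one by one is to collect every non-OHLCV column name and pass them all at once. A sketch (the pybroker call is commented out so the snippet runs with pandas alone, and register_columns accepting multiple names this way is an assumption; the extra column names are illustrative):

```python
import pandas as pd

# A queried frame whose trailing columns are datasource indicators:
df = pd.DataFrame(columns=[
    "symbol", "date", "open", "high", "low", "close", "volume",
    "rsi", "macd", "atr",   # extra indicator columns from the datasource
])

# Everything that is not a standard OHLCV/identifier column:
base = {"symbol", "date", "open", "high", "low", "close", "volume"}
extra = [c for c in df.columns if c not in base]

# With pybroker installed:
#
#   import pybroker
#   pybroker.register_columns(*extra)
```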