TA-Lib / ta-lib-python

Python wrapper for TA-Lib (http://ta-lib.org/).
http://ta-lib.github.io/ta-lib-python

Add support for Polars Dataframe #471

Closed teneon closed 2 years ago

teneon commented 2 years ago

Hi there,

thank you for this great wrapper lib :-)

Would it be possible to add support for Polars the same way you have added support for Pandas? Polars is a rising-star DataFrame library that is much faster than Pandas, so we are switching everything over to Polars.

If you are using Pandas, you can simply pass the DataFrame as an argument, like this:

import talib.abstract as ta
# TEMA - Triple Exponential Moving Average
dataframe['tema'] = ta.TEMA(dataframe, timeperiod=9)

Currently, it is not possible to pass a Polars DataFrame directly the same way you can pass a Pandas one. Could you perhaps add support for Polars as well?

best regards, Neon

mrjbq7 commented 2 years ago

It's a cool idea.

Is there a way to get direct access to the column data in memory, so I can pass a pointer directly to the TA-Lib C functions?

Or would I need to copy the data out, into a numpy array, or something...?

ritchie46 commented 2 years ago

hi @mrjbq7 author of Polars here.

Polars' columnar data is in the Apache Arrow format. For numerical data this means we store numerical values similar to a numpy array, but missing values are not represented by NaN as in pandas; they are represented by a separate validity-bit buffer.

As I don't think ta-lib works on Arrow data, it's best to call pl.Series(..).to_numpy(). This converts to numpy and, depending on the null count, may be zero-copy.

E.g. if there is no null data, numpy just takes a pointer to the Arrow data; if there is null data, the data is copied to a numpy buffer and the missing values are set to np.nan. Hope this helps.
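
To make that concrete, a minimal sketch of the conversion described above (the exact zero-copy behavior depends on the Polars version):

import polars as pl

s_clean = pl.Series("close", [1.0, 2.0, 3.0])
s_nulls = pl.Series("close", [1.0, None, 3.0])

# No nulls: to_numpy() can hand back a view over the Arrow buffer (zero copy).
print(s_clean.null_count(), s_clean.to_numpy())   # 0 [1. 2. 3.]

# One null: the data is copied and the missing value becomes NaN.
print(s_nulls.null_count(), s_nulls.to_numpy())   # 1 [ 1. nan  3.]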

mrjbq7 commented 2 years ago

Hi @ritchie46, thank you for that information, very helpful. And polars looks really neat!

Hi @teneon, want to grab the latest git master and see if the polars support works for you?

I added initial support in 943c99e6657b5c4f99e9f9fb296ed09eb9eb0a68 and some test cases showing it works in test_polars.py.

Now that we are supporting more types, I might want to change how __init__.py does the pandas/polars wrapper function before releasing.
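
Not the actual __init__.py code, but the general shape of such a wrapper could be a sketch like this (the helper name is hypothetical):

import numpy as np

def _call_on_any_series(func, values, **kwargs):
    # Hypothetical helper: accept a numpy array or a pandas/polars Series,
    # run the TA-Lib function on float64 numpy data, and wrap the result
    # back into the same container type as the input.
    to_numpy = getattr(values, "to_numpy", None)
    arr = np.asarray(values if to_numpy is None else to_numpy(), dtype=np.float64)
    result = func(arr, **kwargs)
    if to_numpy is None:
        return result
    return type(values)(result)  # e.g. pl.Series(result) or pd.Series(result)

For instance, _call_on_any_series(talib.TEMA, df['close'], timeperiod=9) would return the result in the same container type as df['close'].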

mrjbq7 commented 2 years ago

I just merged the wrappers, it was doing the wrong thing. Now all the tests pass.

ritchie46 commented 2 years ago

I just merged the wrappers, it was doing the wrong thing. Now all the tests pass.

Thanks for your quick support on this. :)

teneon commented 2 years ago

Hi @mrjbq7, thank you so much for your quick integration, awesome. Sorry for my late response, I missed your reply somehow. Anyway, I started testing right away, but there seems to be a problem.

I have installed the latest version like this, so installation should be all good, right?

pip install git+https://github.com/mrjbq7/ta-lib.git
Collecting git+https://github.com/mrjbq7/ta-lib.git
  Cloning https://github.com/mrjbq7/ta-lib.git to /tmp/pip-req-build-dzj4oe0t
  Running command git clone --filter=blob:none -q https://github.com/mrjbq7/ta-lib.git /tmp/pip-req-build-dzj4oe0t
  Resolved https://github.com/mrjbq7/ta-lib.git to commit 4bae024eed765d0279d22937dda58d7c177499ed
  Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in ./venv/lib/python3.9/site-packages (from TA-Lib==0.4.22) (1.21.2)
Using legacy 'setup.py install' for TA-Lib, since package 'wheel' is not installed.
Installing collected packages: TA-Lib
    Running setup.py install for TA-Lib ... done
Successfully installed TA-Lib-0.4.22

In the code below I have created a Polars DataFrame and filled it with some random OHLCV data.

import numpy as np
import polars as pl
import talib.abstract as ta
size = 50
df = pl.DataFrame(
    {
        "open": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "high": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "low": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "close": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "volume": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32")
    }
)

print(df)
df['tema'] = ta.TEMA(df, timeperiod=9)

When I execute this code, I get the following error (basically the same error as in version 0.4.21). It seems like it still doesn't accept a Polars DataFrame as an argument, while it accepts a Pandas DataFrame properly.

df['tema'] = ta.TEMA(df, timeperiod=9)
  File "talib/_abstract.pxi", line 398, in talib._ta_lib.Function.__call__
  File "talib/_abstract.pxi", line 277, in talib._ta_lib.Function.set_function_args
  File "talib/_abstract.pxi", line 462, in talib._ta_lib.Function.__check_opt_input_value
TypeError: Invalid parameter value for timeperiod (expected int, got DataFrame)

Please let me know what you think and whether I should test anything else.

Best regards, Neon

mrjbq7 commented 2 years ago

So, two problems!

1) I didn't regenerate the Cython C code. I'll do that, but...

2) I assumed this would work in Polars:

>>> 'close' in df
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.10/site-packages/polars/eager/series.py", line 309, in __eq__
    other = _maybe_cast(other, self.dtype)
  File "/opt/homebrew/lib/python3.10/site-packages/polars/eager/series.py", line 120, in _maybe_cast
    el = _DTYPE_TO_PY_TYPE[dtype](el)
ValueError: could not convert string to float: 'close'

Instead, I need to make sure to do "close" in df.columns... let me confirm this is all that's required and then I might ask you to check again. Thanks!

mrjbq7 commented 2 years ago

Please try again!

Because I got your error initially too, I suggest making sure you've removed the old TA-Lib Python module before installing the new one from git master.

You can confirm with:

>>> import talib as ta
>>> ta.__version__
'0.4.22'

Then you should get a result similar to:

>>> type(df)
<class 'polars.eager.frame.DataFrame'>

>>> ta.TEMA(df, timeperiod=9)
shape: (50,)
Series: '' [f64]
[
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    ...
    62.26438604056194
    50.626628295484245
    48.611178954077026
    29.72494193839168
    12.95542628898194
    9.21775304343901
    29.509385408017167
    45.367104334273634
    36.00893422271
    55.292373328126864
    45.365085926466364
    58.50434660710566
]

mrjbq7 commented 2 years ago

@ritchie46 you might want to consider changing the behavior of "close" in df.

@teneon I have the polars test cases that are run successfully on Travis CI here:

https://github.com/mrjbq7/ta-lib/blob/master/talib/test_polars.py

ritchie46 commented 2 years ago

@ritchie46 you might want to consider changing the behavior of "close" in df.

@teneon I have the polars test cases that are run successfully on Travis CI here:

https://github.com/mrjbq7/ta-lib/blob/master/talib/test_polars.py

Sorry, I don't fully understand. What were you trying to do and what kind of error did you get regarding "close"? Have you got a small example?

mrjbq7 commented 2 years ago

Trying to see if the polars.DataFrame contains a column named "close"... that code would work for pandas.DataFrame.

>>> import polars as pl
>>> import numpy as np

>>> size = 50

>>> df = pl.DataFrame(
    {
        "open": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "high": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "low": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "close": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "volume": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32")
    }
)

>>> "close" in df.columns
True

>>> "close" in df
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.10/site-packages/polars/eager/series.py", line 309, in __eq__
    other = _maybe_cast(other, self.dtype)
  File "/opt/homebrew/lib/python3.10/site-packages/polars/eager/series.py", line 120, in _maybe_cast
    el = _DTYPE_TO_PY_TYPE[dtype](el)
ValueError: could not convert string to float: 'close'

ritchie46 commented 2 years ago

Check. Yes, the first snippet is valid: "close" in df.columns.

This snippet does not make any sense in polars: "close" in df. Polars syntax is quite different from pandas.

mrjbq7 commented 2 years ago

Sounds good!

Thanks for the great product, I just wanted to point out that one difference. Perhaps a "how to migrate from pandas" type document might someday include it.

ritchie46 commented 2 years ago

Sounds good! Thanks for the great product, I just wanted to point out that one difference. Perhaps a "how to migrate from pandas" type document might someday include it.

And thank you for supporting Polars as well. ;)

teneon commented 2 years ago

@mrjbq7 Just to update you: we tested your latest update yesterday with the abstract API and the direct functions as well. It works much better now, thanks!

Anyway, it looks like there is one problem left, but we need to make sure first that the problem is not in our own code (we are not 100% sure yet). We will try to figure it out, simplify our code, prepare a simple example, and I will get back to you.

best regards, Neon

teneon commented 2 years ago

@mrjbq7 Hi there ;) We've prepared a simple example to demonstrate the problem. The issue occurs when you use two or more DataFrames and call a talib function on each of them. Take a look at this:

import numpy as np
import polars as pl
import talib.abstract as ta

size = 5
df1 = pl.DataFrame(
    {
        "open": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "high": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "low": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "close": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "volume": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32")
    }
)
df2 = pl.DataFrame(
    {
        "open": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "high": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "low": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "close": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
        "volume": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32")
    }
)

ta.TEMA(df1, timeperiod=9)
ta.TEMA(df2, timeperiod=9)

As soon as you call ta.TEMA the second time, on the second DataFrame, it crashes.

UPDATE: I just figured out that the problem always occurs when you call the same "ta" function twice, even on the same dataframe, e.g.

ta.TEMA(df1, timeperiod=9)
ta.TEMA(df1, timeperiod=9)

best regards, Neon

mrjbq7 commented 2 years ago

This should be fixed in 302e0e65d0ed0bdd6471681a27adb0172f5a7db4, my apologies!

Added the related test case to confirm.
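
The real test lives in talib/test_polars.py; a sketch of that kind of regression check (the test name and data here are illustrative) looks like:

import numpy as np
import polars as pl
import talib.abstract as ta

def test_tema_can_be_called_twice():
    # Calling the same abstract function twice on the same DataFrame
    # should not crash and should return identical values.
    df = pl.DataFrame({
        "open": np.random.uniform(0.0, 100.0, 50),
        "high": np.random.uniform(0.0, 100.0, 50),
        "low": np.random.uniform(0.0, 100.0, 50),
        "close": np.random.uniform(0.0, 100.0, 50),
        "volume": np.random.uniform(0.0, 100.0, 50),
    })
    first = ta.TEMA(df, timeperiod=9)
    second = ta.TEMA(df, timeperiod=9)
    np.testing.assert_allclose(first.to_numpy(), second.to_numpy())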

mrjbq7 commented 2 years ago

@teneon do you think this is ready to release?

teneon commented 2 years ago

Hi @mrjbq7, sorry for the slow reply, we've been coding. It looks quite good, I think. We've also benchmarked it (quickly) and it calculates the same indicators about 50% faster than Pandas, which is also great (probably because of zero copy numpy?). I did find one more issue, but it has nothing to do with the Polars implementation. I will prepare an example and show it to you so you can give your input on it. I will try to prepare it today! Thanks for adding Polars support to talib, it's amazing :D

ritchie46 commented 2 years ago

indicators about 50% faster than Pandas, which is also great (probably because of zero copy numpy?)

Pandas is definitely zero copy. So I don't think that's the case. Maybe because polars parallelizes expression execution?

mrjbq7 commented 2 years ago

Pandas used to provide numpy-compatible arrays of column data, and then changed their API, so the cost of representing the data as numpy is now slowing it down. I should find out if there is a more efficient way to get a pointer to the start of the data.
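
As a rough way to probe what pandas hands back (behavior varies by pandas version and dtype; this is just an inspection sketch, not the wrapper code):

import numpy as np
import pandas as pd

s = pd.Series(np.arange(5, dtype=np.float64))

# For a plain float64 column this is usually a view, not a copy.
arr = s.to_numpy()
print(arr.flags["OWNDATA"])   # False when it is a view over pandas' internal block
print(hex(arr.ctypes.data))   # address of the first element, i.e. the pointer a C call needs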

teneon commented 2 years ago

@mrjbq7 Hi there :) Sorry for the late reply, we've been very busy coding, but we have been using your TA-Lib Polars implementation the whole time in the meanwhile. We have been comparing the Polars results with the Pandas results side by side all this time and everything seems to work fine.

In case you haven't released it yet, I believe it can be released. If we find any bugs, we will report them. You are the best, thank you!

mrjbq7 commented 2 years ago

Awesome, thanks!

Released it as version 0.4.22!

collinsethans commented 1 year ago

@mrjbq7

First, thank you for the talib-python library (and polars too!). I am 2 days new to polars and talib is the first library that I am using :)

The test_polars.py file is currently failing. Here is a snippet from the file:

>>> import talib
>>> values = pl.Series([90.0,88.0,89.0])
>>> talib.MOM(values, timeperiod=1)
Traceback (most recent call last):
  ... snip ...
  talib/__init__.py", line 27, in wrapper
    return func(*args, **kwargs)
TypeError: Argument 'real' has incorrect type (expected numpy.ndarray, got Series)

The usage of talib APIs with .map() also fails. I am not sure if these failures are due to changes in talib-python or in polars. Hence, posting it here.

Float64: I notice that talib-python forces input of dtype Float64. Can this be relaxed to Float32? For financial data, we have tens of thousands of DataFrames (~5K stocks x EOD/intraday x several timeframes), and saving space on both SSD and RAM (especially during intraday) is important to us. (We have actually been storing in Float32 and haven't faced any inconsistencies.)

mrjbq7 commented 1 year ago

It is working for me:

➜  ~ python3
Python 3.11.1 (main, Dec 23 2022, 09:28:24) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import talib
>>> import polars as pl
>>> values = pl.Series([90.0,88.0,89.0])
>>> talib.MOM(values, timeperiod=1)
shape: (3,)
Series: '' [f64]
[
    NaN
    -2.0
    1.0
]
>>> 

I'm happy to make adjustments to the API. Regarding float32 versus float64: the TA-Lib library provides C functions with double arguments and with float arguments, and I am currently only wrapping the 64-bit versions. It would probably be easy to wrap both with minimal effort, and then I guess I would want something that dispatches to the 64-bit or 32-bit version based on the Polars datatype?

>>> import polars as pl
>>> pl.__version__
'0.16.2'
>>> import talib as ta
>>> ta.__version__
'0.4.25'

collinsethans commented 1 year ago

@mrjbq7

Sigh! My mistake, it was the version. Yesterday I created a new conda env with ta-lib and polars, and I didn't anticipate that conda-forge is stuck on old versions of both. Thanks, I did a pip install for both of them and they are working as expected.

Float32: It would be a good option to auto-switch based on the input dtype. Will look forward to it when you release it.

Thank you.

collinsethans commented 1 year ago

@mrjbq7 (and others)

I had one more query for you regarding TA-Lib usage with polars.

The current method of using an indicator is to apply it to the whole input column for a full computation. However, for financial timeframes with tens of thousands of rows, this will likely be costly when the need is only to compute a value for the latest row of new OHLCV data. To compute the indicator in a continuous form, I need to provide additional immediately preceding rows, e.g. say 50 backfill rows for EMA (assuming I don't use Polars' own EMA), using when/then/otherwise, but I only want the last row's value inserted into the respective indicator column.

Currently, the only option I find is to create a temp column with the 1+50 backfill rows in the above way, and then in a second step run another when/then/otherwise to filter out only the last row's computed value. This will be costly if I have to do it for 4-5K stocks with multiple timeframes, especially if the realtime update frequency is 1 minute.

Do you have any suggestions for alternative methods for this in polars?
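
For concreteness, the pattern described above, keeping only a small tail of history and recomputing just the latest value, might look roughly like the sketch below (the tail size and names are illustrative, and for indicators with memory such as EMA the tail must be long enough that the truncation error is acceptable):

import numpy as np
import talib

TAIL = 200           # assumed to be comfortably larger than the indicator's lookback
closes = np.empty(0)

def on_new_close(price: float) -> float:
    # Append the newest close, keep only the tail, and return the latest EMA value.
    global closes
    closes = np.append(closes, price)[-TAIL:]
    return talib.EMA(closes, timeperiod=30)[-1]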

mrjbq7 commented 1 year ago

We have a stream version of the indicators that generate a "latest value":

https://github.com/TA-Lib/ta-lib-python#streaming-api

It works great, but you should know that some indicators have memory, and the "latest value" generated by looking at only the minimal number of past observations might differ slightly from the value generated from a larger dataset.

mrjbq7 commented 1 year ago

Regarding the f32 data type, is your expectation that the output would be generated by doing 32-bit calculations without converting the input data or the output to 64-bit?

collinsethans commented 1 year ago

We have a stream version of the indicators that generate a "latest value":

I actually had a glance at streaming before, but didn't dig in further as it's marked as experimental (which might hinder taking our software to production). I will have another look. A couple of queries:

Thanks much for your quality support :)

https://github.com/TA-Lib/ta-lib-python#streaming-api

It works great, but you should know that some indicators have memory, and the "latest value" generated by looking at only the minimal number of past observations might differ slightly from the value generated from a larger dataset.

mrjbq7 commented 1 year ago

The streaming API isn't really experimental per se, except insofar as I wanted to see if people found it useful.

It has no memory leaks, and is in fact just the same as the function API, except we tell TA-Lib to look only at the necessary amount of lookback to generate a new value.

mrjbq7 commented 1 year ago

Regarding TA-Lib support, I have been working with the original author to transition the project to github under the TA-Lib organization (https://github.com/ta-lib/ta-lib) and hope to make a new release at some point with some of the community improvements.

collinsethans commented 1 year ago

Regarding the f32 data type, is your expectation that the output would be generated by doing 32-bit calculations without converting the input data or the output to 64-bit?

@mrjbq7 We will be using Float32 as the input data and expect the return in Float32, which gets saved in a Float32 column. Our objective is to have most of the columns in Float32 where possible. Internally, TA-Lib can cast the data to Float64 for calculations, preferably in the C function calls. (Do you expect any issue there?)

Let me know if I haven't answered you.
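
While only the 64-bit wrappers exist, one caller-side workaround (a sketch, not library behavior) is to upcast on the way in and downcast the result for storage:

import numpy as np
import polars as pl
import talib

close_f32 = pl.Series("close", np.random.uniform(0.0, 100.0, 50).astype("float32"))

# Upcast to Float64 for the TA-Lib call, then downcast the output back to Float32.
tema_f64 = talib.TEMA(close_f32.cast(pl.Float64), timeperiod=9)
tema_f32 = tema_f64.cast(pl.Float32)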

collinsethans commented 1 year ago

The streaming API isn't really experimental per se, except insofar as I wanted to see if people found it useful.

Glad to get your confidence on it. Will definitely try it.

It has no memory leaks, and is in fact just the same as the function API, except we tell TA-Lib to look only at the necessary amount of lookback to generate a new value.

lookback - I hope we can give the lookback period for the respective TA-Lib indicator call. We maintain tested lookback periods in a config table, deduced from our testing, so being able to specify it helps.

collinsethans commented 1 year ago

Regarding TA-Lib support, I have been working with the original author to transition the project to github under the TA-Lib organization (https://github.com/ta-lib/ta-lib) and hope to make a new release at some point with some of the community improvements.

Very happy to know that the TA-Lib C library is getting your support! Can I take this opportunity to request two things:

mrjbq7 commented 1 year ago

lookback - I hope we can give the lookback period for the respective TA-Lib indicator call. We maintain tested lookback periods in a config table, deduced from our testing, so being able to specify it helps.

You shouldn't need a config table like that -- TA-Lib provides a function to calculate it for each indicator, and then you just pass your entire array:

>>> import numpy as np
>>> import talib as ta

>>> # random prices
>>> c = np.random.randn(100)

>>> # the last data point calculated using the Function API
>>> ta.MOM(c)[-1]
0.17279821427623404

>>> # the last data point calculated using the Streaming API
>>> ta.stream.MOM(c)
0.17279821427623404

>>> # those previous two are identical, but indicators with memory
>>> # like EMA can have different values
>>> ta.EMA(c)[-1]
0.02687581489039577

>>> # versus streaming, which only looks at the lookback period
>>> ta.stream.EMA(c)
0.0026940228851210944

>>> # see, you can replicate by using the fewest possible observations
>>> ta.EMA(c[-30:])[-1]
0.0026940228851210944
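
On the "TA-Lib provides a function to calculate it" point: the abstract API exposes a lookback property on each Function object, so a hand-maintained lookback table shouldn't be needed. A small sketch (the parameters setter is my assumption about the abstract API; verify against the docs):

import talib.abstract as ta

ema = ta.Function('EMA')
ema.parameters = {'timeperiod': 30}   # assumed setter; 30 is also the default
# lookback is the number of leading observations consumed before the first
# output value appears (29 for a 30-period EMA).
print(ema.lookback)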

collinsethans commented 1 year ago

@mrjbq7

Sorry for the delayed reply and thank you much for samples of stream APIs with explanation. Much appreciated.

For the ones with memory, like the EMA, we find that we will not be able to use them. We need the values to be the same as with the Function API, and with stream there is a significant deviation. So what we currently do with the Function API, to provide minimal data for continuity, is use a lookback period (as I mentioned earlier), like:

# lookback: 170 (just a sample val)
ta.EMA(c[-200:])[-1]

With polars, using expressions with .map(), we are facing the problem that it requires us to create a temporary column, which poses a time-overhead risk with realtime intraday data.

Another thing missing with the stream API is that it returns only the last value. There can be scenarios where the last one, two, or a few data points were not received, say due to a network issue or an app restart, which is a practical possibility. In such cases the app identifies the missing rows, gets backfill data, and now needs the last two, three, or a few computations (which with the stream API would require multiple calls).

Essentially, our need is something like:

# lookback: L
# backfill count: n
# timeperiod: 30
ta.EMA(c[-30 - L + 1:])[-n]

Thanks again for your help!

tlk3 commented 8 months ago

Getting the same error as mentioned above, but pretty sure I have the correct version installed. Thoughts?

df = pl.DataFrame(...)

high = df['high']
low = df['low']
close = df['close']
atr = talib.ATR(high, low, close, timeperiod=14)

Traceback (most recent call last):
  File "/Users/TLK3/PycharmProjects/stratbot2/venv/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 7, in
    atr = talib.ATR(high, low, close, timeperiod=14)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/TLK3/PycharmProjects/stratbot2/venv/lib/python3.12/site-packages/talib/__init__.py", line 64, in wrapper
    result = func(*_args, **_kwds)
             ^^^^^^^^^^^^^^^^^^^^^
TypeError: Argument 'high' has incorrect type (expected numpy.ndarray, got Series)

mrjbq7 commented 8 months ago

What version of ta-lib do you have? and what version of polars?

That works for me:

>>> import numpy as np
>>> import polars as pl
>>> import talib as ta

>>> size = 50
>>> df = pl.DataFrame(
...         {
...             "open": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
...             "high": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
...             "low": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
...             "close": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32"),
...             "volume": np.random.uniform(low=0.0, high=100.0, size=size).astype("float32")
...         }
...     )

>>> high = df['high']

>>> low = df['low']

>>> close = df['close']

>>> ta.ATR(high, low, close, timeperiod=14)
shape: (50,)
Series: '' [f64]
[
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    NaN
    …
    40.476675
    41.916938
    43.717916
    44.007206
    46.32909
    47.186069
    48.298291
    48.115971
    51.015277
    50.433759
    50.355243
    51.364692
    49.708971
]

>>> ta.__version__
'0.4.28'

>>> np.__version__
'1.23.2'

>>> pl.__version__
'0.20.3'