Closed MichalRIcar closed 1 year ago
The NaN will cause all further data points to be NaN:
In [7]: c = np.random.randn(100)
In [8]: ta.EMA(c)
Out[8]:
array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, -0.20010195,
-0.14629603, -0.19322205, -0.11284334, -0.08979806, -0.09922255,
-0.00886839, 0.02769315, 0.01621726, 0.09114756, 0.0997015 ,
0.06605619, 0.00544552, 0.00368205, 0.12548891, 0.25605504,
0.29726329, 0.30825576, 0.23318661, 0.21047578, 0.0786684 ,
0.13126033, 0.05632908, 0.09003275, 0.12241143, 0.1368664 ,
0.18871725, 0.21911695, 0.25844934, 0.25344702, 0.27039838,
0.22954116, 0.26607002, 0.24966657, 0.1446552 , 0.16729002,
0.18874245, 0.09147541, 0.08174418, 0.19589613, 0.31269955,
0.2625159 , 0.35677455, 0.30980691, 0.30356763, 0.27470739,
0.24884736, 0.24725236, 0.23586413, 0.3471254 , 0.28572394,
0.36162036, 0.40297618, 0.30590914, 0.30621277, 0.25766893,
0.13943467, 0.24240124, 0.27100324, 0.10478242, 0.13936646,
0.15219173, 0.2476177 , 0.24939736, 0.24692743, 0.26384937,
0.21859485, 0.32376657, 0.35259057, 0.2076861 , 0.22139453])
In [9]: c[50] = np.nan
In [10]: ta.EMA(c)
Out[10]:
array([ nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, -0.20010195,
-0.14629603, -0.19322205, -0.11284334, -0.08979806, -0.09922255,
-0.00886839, 0.02769315, 0.01621726, 0.09114756, 0.0997015 ,
0.06605619, 0.00544552, 0.00368205, 0.12548891, 0.25605504,
0.29726329, 0.30825576, 0.23318661, 0.21047578, 0.0786684 ,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan])
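To make the propagation concrete, here is a minimal sketch of a recursive EMA. It is an illustration only, not TA-Lib's actual implementation: the function name `ema_sketch`, the seeding with the mean of the first `period` points, and the smoothing factor `2 / (period + 1)` are assumptions chosen to mirror the 29 leading NaNs seen above. Because each output is computed from the previous output, a single NaN input contaminates every later value.

```python
import numpy as np

def ema_sketch(values, period=30):
    """Illustrative EMA (hypothetical helper, not TA-Lib code)."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    if len(values) < period:
        return out
    alpha = 2.0 / (period + 1)
    prev = values[:period].mean()   # seed with a simple average
    out[period - 1] = prev
    for i in range(period, len(values)):
        # Each step reuses the previous output, so once `prev` is NaN
        # every later output is NaN as well.
        prev = alpha * values[i] + (1.0 - alpha) * prev
        out[i] = prev
    return out

rng = np.random.default_rng(0)
c = rng.standard_normal(100)
clean = ema_sketch(c)

c[50] = np.nan
dirty = ema_sketch(c)

print(np.isnan(clean).sum())  # 29 (warm-up only)
print(np.isnan(dirty).sum())  # 79 (warm-up plus everything from index 50 on)
```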
I do think TA-Lib does the "more correct" thing, rather than ignoring NaNs, which would seem to me to be wrong in most cases -- if you need the pandas behavior, replace the NaNs before calling TA-Lib, by either filtering them, propagating the previous data point, interpolating the data points, or whatever data-specific approach you need to deal with them.
You could also use pandas.DataFrame.fillna, if your data is in pandas before calling into TA-Lib...
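As a rough sketch of the options mentioned above (filtering, propagating the previous data point, interpolating), the following uses standard pandas calls on a toy series. Which one is appropriate depends on the data; the commented-out TA-Lib call at the end assumes `import talib as ta` and is only there to show where the cleaned array would go.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

dropped = s.dropna()            # filter the NaN out: 1, 2, 4, 5
filled = s.ffill()              # propagate the previous point: 1, 2, 2, 4, 5
interpolated = s.interpolate()  # linear interpolation: 1, 2, 3, 4, 5

# Any of these can then be passed to TA-Lib as a float64 array, e.g.:
# ta.EMA(filled.to_numpy(), timeperiod=3)
```

Note that dropping rows shortens the series, so the result no longer lines up with the original index unless it is reindexed afterwards.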
I do use this workaround (pandas.DataFrame.fillna) to fill the NaNs. However, this behavior is a bug and can lead to serious issues in ML, so it is definitely worth sharing, especially since the pandas result shows the expected behavior. One missing observation can't produce a 100% missing rate across thousands of observations; that goes against statistical basics, ML practice, and, from my point of view, common sense.
TA-Lib "ignores" NaN the same way pandas does when the NaNs are at the beginning of the data, so it should behave the same way when a NaN appears somewhere inside the data.
What does Pandas do by default when replacing the NaN in the rolling mean calculation?
And would you expect that to work for all users of TA-Lib? There's a reason fillna() takes arguments.
In any event, this is a wrapper of the underlying C library, whose behavior I'm not going to be able to change -- the only thing we could do is have some kind of automatic NaN handling, and I don't think that's a good idea.
In [12]: s = pd.Series([1,2,3,np.nan,4,5,6]); s
Out[12]:
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 6.0
dtype: float64
In [15]: s.rolling(3).mean()
Out[15]:
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 NaN
6 5.0
dtype: float64
So, it resets the rolling window and then waits for non-NaN values before generating new observations -- that's pretty nice. Too bad the underlying TA-Lib doesn't do that.
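One way to approximate that reset behavior on top of a function that does not handle NaN is to split the input into runs of consecutive non-NaN values and apply the function to each run separately. The sketch below is an assumption-laden illustration: `apply_per_segment` and `sma` are hypothetical helpers, and `sma` is a plain NumPy stand-in so the example runs without TA-Lib. In practice `func` could be something like `lambda x: ta.EMA(x, timeperiod=30)`.

```python
import numpy as np

def apply_per_segment(values, func):
    """Apply `func` to each run of consecutive non-NaN values
    (hypothetical helper), leaving NaN where the input was NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    valid = ~np.isnan(values)
    # Pad with False on both sides so every run has a start and a stop edge.
    padded = np.concatenate(([False], valid, [False])).astype(int)
    edges = np.flatnonzero(np.diff(padded))
    for start, stop in zip(edges[::2], edges[1::2]):
        out[start:stop] = func(values[start:stop])
    return out

def sma(x, period=3):
    """Stand-in moving average with NaN warm-up, for demonstration only."""
    out = np.full(x.shape, np.nan)
    if len(x) >= period:
        out[period - 1:] = np.convolve(x, np.ones(period), mode="valid") / period
    return out

data = np.array([1, 2, 3, np.nan, 4, 5, 6], dtype=float)
result = apply_per_segment(data, sma)
print(result)  # [nan nan  2. nan nan nan  5.]
```

The result matches the pandas rolling mean shown above: the indicator restarts its warm-up after each gap instead of staying NaN for the rest of the series.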
OK, thanks for the quick response, I believe worthy of sharing.
I agree, it’s something I’m used to now but is surprising and clearly it could be better.
I’ll make a note in the README about it.
Added a note about NaN handling in commit b0439e6e0565647e21b1ceee245bad4fac55aad6
Hello,
I have come across an issue where (E/S)MA produces an extremely high missing rate when just 1 observation is missing among many (e.g. 2K). I have cross-checked against pd.rolling and believe this is a bug.
Reproduction is done in 2 steps
Bug reproduction:
STEP ZERO → Function Definition
Function
S1 → TALIB == PANDAS
Simulation(0) Output: TaLib Missing OBS: 11 Panda Missing OBS: 11
S2 → TALIB <> PANDAS
Simulation(1) Output: TaLib Missing OBS: 2000 Panda Missing OBS: 12
So a single missing observation at the second position of the data caused a 100% missing rate in the TA-Lib case...