Closed jsadler2 closed 3 years ago
For reference:
using .notnull()
In [1]: df
Out[1]:
0
0 -0.501162
1 -0.870652
2 NaN
In [2]: df.notnull()
Out[2]:
0
0 True
1 True
2 False
In [3]: df[df.notnull()]
Out[3]:
0
0 -0.501162
1 -0.870652
2 NaN
I thought that would work because it works with pandas Series and NumPy arrays, so I assumed it worked for pandas DataFrames too. 🤦
What I should be doing:
In [4]: df.dropna()
Out[4]:
0
0 -0.501162
1 -0.870652
.dropna() seems like the best approach, but I think the issue with your .notnull() code was that you needed to extract a simple array from the DataFrame returned by .notnull():
>>> df
vals
0 0.0
1 3.0
2 5.0
3 NaN
>>> df.notnull()
vals
0 True
1 True
2 True
3 False
>>> df[df.notnull()]
vals
0 0.0
1 3.0
2 5.0
3 NaN
>>> df.notnull().values.flatten()
array([ True, True, True, False])
>>> df[df.notnull().values.flatten()]
vals
0 0.0
1 3.0
2 5.0
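A note on the flattened mask: it lines up with the rows only because this DataFrame has a single column. With more columns, the flattened array no longer has one entry per row. The sketch below shows one way to build a per-row mask instead; the two-column frame `df2` and its column names are made up for illustration and are not from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, used only for illustration.
df2 = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Collapse the boolean DataFrame to one flag per row:
# True only where every column in that row is non-null.
row_mask = df2.notnull().all(axis=1)

# Indexing with a boolean Series selects rows, so rows containing NaN are dropped.
kept = df2[row_mask]

# For this case the result matches what .dropna() returns.
same_as_dropna = kept.equals(df2.dropna())
```

Here `kept` retains only the first row, since the other two rows each contain a NaN.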
Right. The masking approach drops the masked values when the mask is either a NumPy array (what you get when you call the .values attribute) or a pandas Series. When the mask is a pandas DataFrame (what I was using), the shape of the data is maintained and the masked values are simply changed to NaN (in this case they were already NaN, so we don't see a change).
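To make the difference concrete, here is a minimal sketch using the same values as the example above. Indexing a DataFrame with a boolean DataFrame behaves like `DataFrame.where`: the shape is kept and entries where the mask is False become NaN. A boolean Series selects rows instead. The `df > 1` mask is added only so that the replacement is visible; it is not part of the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"vals": [0.0, 3.0, 5.0, np.nan]})

# Boolean DataFrame mask: same shape as df, so nothing is dropped.
frame_mask = df.notnull()
masked_frame = df[frame_mask]
keeps_shape = masked_frame.shape == df.shape
matches_where = masked_frame.equals(df.where(frame_mask))

# A mask that is False for non-NaN values makes the replacement visible:
# 0.0 fails the comparison and is turned into NaN, but the row stays.
visible = df[df > 1]
nan_positions = visible["vals"].isna().tolist()

# Boolean Series mask: one flag per row, so rows are actually removed.
masked_rows = df[df["vals"].notnull()]
matches_dropna = masked_rows.equals(df.dropna())
```

With the DataFrame mask all four rows survive; with the Series mask only the three non-null rows remain, which is the same result as `.dropna()`.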
https://github.com/jsadler2/river-dl/blob/452d1ce5fef5d73504a76a37c2a54f2d3f715b4c/river_dl/preproc_utils.py#L265
I thought I was dropping NaN values before reducing observations, but it appears that I wasn't. I should have been using .dropna() and not masking using .notnull()
- Still not sure why that is though