USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems
Creative Commons Zero v1.0 Universal

dropping nulls with masking not working like i thought it would #75

Closed: jsadler2 closed this issue 3 years ago

jsadler2 commented 3 years ago

https://github.com/jsadler2/river-dl/blob/452d1ce5fef5d73504a76a37c2a54f2d3f715b4c/river_dl/preproc_utils.py#L265

I thought I was dropping nan values before reducing observations, but it appears that I wasn't.

I should have been using .dropna() instead of masking with .notnull(). I'm still not sure why the mask doesn't drop the rows, though.

jsadler2 commented 3 years ago

For reference: using .notnull()

In [1]: df
Out[1]:
          0
0 -0.501162
1 -0.870652
2       NaN

In [2]: df.notnull()
Out[2]:
       0
0   True
1   True
2  False

In [3]: df[df.notnull()]
Out[3]:
          0
0 -0.501162
1 -0.870652
2       NaN

I thought that would work because boolean masking does drop values when the mask is a pandas Series or a NumPy array. I just assumed it worked the same way for a pandas DataFrame, too. 🤦

What I should be doing:

In [4]: df.dropna()
Out[4]:
          0
0 -0.501162
1 -0.870652

aappling-usgs commented 3 years ago

.dropna() seems like the best approach, but I think the issue with your .notnull() code was that you needed to extract a simple array from the DataFrame returned by .notnull():

>>> df
   vals
0   0.0
1   3.0
2   5.0
3   NaN
>>> df.notnull()
    vals
0   True
1   True
2   True
3  False
>>> df[df.notnull()]
   vals
0   0.0
1   3.0
2   5.0
3   NaN
>>> df.notnull().values.flatten()
array([ True,  True,  True, False])
>>> df[df.notnull().values.flatten()]
   vals
0   0.0
1   3.0
2   5.0
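
One caveat on the flatten approach above: it lines up with the rows only when the frame has a single column. For a frame with several columns, a row-wise mask can be built by reducing across columns first. The sketch below is a minimal illustration, not code from river-dl; the column names `temp` and `flow` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; the column names are illustrative only.
df = pd.DataFrame({
    "temp": [0.0, 3.0, np.nan, 5.0],
    "flow": [1.0, np.nan, 2.0, 4.0],
})

# Collapse the 2-D boolean frame to one boolean per row:
# True only where every column in that row is non-null.
row_mask = df.notnull().all(axis=1)

# A 1-D boolean Series selects rows, so rows containing NaN are dropped.
kept = df[row_mask]
print(kept)
```

With the default arguments, `df.dropna()` gives the same result in one call, which is another reason to prefer it.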

jsadler2 commented 3 years ago

Right. Masking drops the masked rows when the mask is a one-dimensional NumPy array (what you get from the .values attribute, once flattened) or a pandas Series. When the mask is a pandas DataFrame (what I was using), the shape of the data is preserved and the masked values are set to NaN instead. In this case they were already NaN, so nothing visibly changed.
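
To make the difference concrete, here is a small sketch that applies the three mask types to the same single-column frame used earlier in the thread. The variable names are illustrative. The comparison with `DataFrame.where` reflects how pandas treats a boolean DataFrame passed to `[]`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"vals": [0.0, 3.0, 5.0, np.nan]})

frame_mask = df.notnull()            # 2-D boolean DataFrame
series_mask = df["vals"].notnull()   # 1-D boolean Series
array_mask = series_mask.to_numpy()  # 1-D boolean NumPy array

# A DataFrame mask keeps the shape and puts NaN where the mask is False.
by_frame = df[frame_mask]

# A 1-D mask (Series or array) selects rows, so the NaN row is dropped.
by_series = df[series_mask]
by_array = df[array_mask]

print(by_frame.shape)   # (4, 1)
print(by_series.shape)  # (3, 1)
print(by_array.shape)   # (3, 1)

# The DataFrame-mask result matches df.where(frame_mask).
print(by_frame.equals(df.where(frame_mask)))  # True
```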