Closed PyMap closed 3 years ago
Changing the way _drop_zeros is applied, as follows:
def _drop_zeros(df):
    """
    Drop zeros from a DataFrame, returning an iterator over the columns
    in the DataFrame.

    Yields tuples of (column name, non-zero column values, non-zero indexes).

    Parameters
    ----------
    df : pandas.DataFrame

    """
    def for_each_col(col):
        nz = col.to_numpy().nonzero()[0]
        return col[nz], nz

    for (col_idx, (col, nz)) in df.apply(lambda row: for_each_col(row), axis=1).items():
        yield (col_idx, col, nz)
The Series nonzero method has been deprecated in pandas, so I convert the row values with to_numpy and then apply nonzero. This apparently returns the desired result:
a) The same number of results (as many tuples as there are rows in the dataframe used as input of the non-zero function):
b) Each returned tuple has:
b1: column name (0 in the example)
b2: non-zero value (1 in the example)
b3: non-zero idx (row 0 in the example)
Comparing this result with the dataframe above, we can see that the values match correctly in column 0.
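As a small illustration of the to_numpy plus nonzero step, here is a minimal sketch on a made-up Series. The values and the column name 0 are assumptions for the example, not data from synthpop.

```python
import pandas as pd

# Hypothetical column of a frequency table; the name 0 mimics a column label.
col = pd.Series([1.0, 0.0, 3.0, 0.0], name=0)

# ndarray.nonzero() returns a tuple with one array per dimension,
# so [0] picks the positions along the only axis of a 1-D array.
nz = col.to_numpy().nonzero()[0]

# Positional selection of the non-zero entries.
non_zero_values = col.iloc[nz]
```

Here nz holds integer positions rather than index labels, which is why the sketch selects with iloc.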
To be completely sure I will make a new branch and test it. (@msoltadeo @janowicz )
Thanks @PyMap for all the detective work! I'll test this now.
Great @cvanoli !
This is what I finally did:
def _drop_zeros(df):
    """
    Drop zeros from a DataFrame, returning an iterator over the columns
    in the DataFrame.

    Yields tuples of (column name, non-zero column values, non-zero indexes).

    Parameters
    ----------
    df : pandas.DataFrame

    """
    def for_each_col(col):
        nz = col.to_numpy().nonzero()[0]
        return col[nz], nz

    for (col_idx, (col, nz)) in df.apply(lambda row: for_each_col(row), axis=0).items():
        yield (col_idx, col, nz)
I have been running into exactly the same issue and PyMap's change has fixed it. Thank you PyMap. I would suggest merging this solution into the develop (default) branch.
I've been doing further verification to be sure that the changes proposed in PR65 correctly solve the pandas incompatibilities for versions >=0.15.0.
While the proportional updating process runs, the _FrequencyAndConstraints class is instantiated using the frequency and constraints household tables. Here, the _drop_zeros function is applied to the frequencies dataframe and returns the indexes of the non-zero values, which are also used to filter the constraints Series (both share the household idx).
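To make the shared-index filtering concrete, here is a minimal sketch. The two tables below are invented for illustration (the hh_id values, column labels and numbers are assumptions); only the idea of reusing the non-zero positions on the constraints Series comes from the description above.

```python
import pandas as pd

# Hypothetical tables sharing a household index (hh_id); not synthpop data.
hh_index = pd.Index([100, 101, 102, 103], name="hh_id")
frequencies = pd.DataFrame(
    {0: [1.0, 0.0, 1.0, 0.0], 1: [0.0, 2.0, 0.0, 0.0]},
    index=hh_index,
)
constraints = pd.Series([10.0, 20.0, 30.0, 40.0], index=hh_index)

# Positions of the non-zero frequencies in column 0.
nz = frequencies[0].to_numpy().nonzero()[0]

# The same positions select the matching households in the constraints Series.
filtered_constraints = constraints.iloc[nz]
```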
To achieve this, _drop_zeros uses the nested function for_each_col, whose main objective is to return each column of the frequencies dataframe, filtered. The main reason for the crash is that, in recent pandas, the apply method tries to store arrays of different lengths when for_each_col receives the Series of the frequencies table.
When for_each_col is applied, the raw parameter can be set to True or False, depending on how nonzero is applied to each Series. Since a Series no longer supports that operation directly, it has to be cast to a numpy array first (which can be achieved with raw=True, with to_numpy as suggested in the official documentation, or directly with the .values attribute).
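A quick way to see what raw changes is to record the type that the applied function receives. This is only a sketch on a toy dataframe: make_recorder and count_non_zero are hypothetical helpers, and the function returns a scalar on purpose so that the ragged-result problem does not come into play.

```python
import numpy as np
import pandas as pd

# Toy frequency-like table; the values are made up for the example.
df = pd.DataFrame({0: [1.0, 0.0, 1.0], 1: [0.0, 2.0, 0.0]})


def make_recorder(seen):
    """Build a function that logs the type of its input and counts non-zeros."""
    def count_non_zero(col):
        seen.append(type(col).__name__)
        return int(np.count_nonzero(np.asarray(col)))
    return count_non_zero


seen_raw, seen_series = [], []

# raw=True hands each column to the function as a numpy array.
counts_raw = df.apply(make_recorder(seen_raw), axis=0, raw=True)

# raw=False (the default) hands each column over as a pandas Series.
counts_series = df.apply(make_recorder(seen_series), axis=0, raw=False)
```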
The expected result for the _drop_zeros function is a generator object containing tuples with:
(col_idx, --> the number of the column in the frequencies table, which is also its name
 (col, --> array of non-zero values
  nz) --> row indexes of non-zero values
)
With older pandas versions, raw=True was able to return that expected structure, as follows:
result = df.apply(for_each_col, axis=0, raw=True).items()
>>> result[21984]
(21984, --> col_idx
(array([1., 1., 1., 1., 1., 1.]), --> col
array([ 0, 271, 1181, 1601, 2002, 2531]) --> nz))
Applying for_each_col with axis=0 always returns a result of shape (2, n) per column. The first dimension corresponds to the col values and the row indexes taken from the frequencies dataframe, and n is a variable length that depends on the nonzero filtering. Storing arrays of different shapes is, apparently, what is no longer supported in recent pandas.
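For comparison, the same tuples can be produced without asking apply to store results of different lengths. The sketch below is an alternative I am assuming for illustration, not the change proposed in the PR: drop_zeros_by_column is a hypothetical name, and it iterates the columns with DataFrame.items() instead of apply.

```python
import pandas as pd


def drop_zeros_by_column(df):
    """Yield (column name, non-zero values, non-zero positions) per column."""
    for col_name, col in df.items():
        # Cast to a numpy array first, since Series has no nonzero method.
        nz = col.to_numpy().nonzero()[0]
        yield col_name, col.iloc[nz], nz


# Toy table whose columns have a different number of non-zero entries.
df = pd.DataFrame({0: [1.0, 0.0, 1.0], 1: [0.0, 0.0, 2.0]})
results = list(drop_zeros_by_column(df))
```

Because each column is handled on its own, nothing has to hold the ragged results in a single rectangular structure.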
So, instead of using raw=True in df.apply(for_each_col) to receive a numpy array directly, the workaround I tried is to cast the pandas Series to an array before applying nonzero, and then filter the Series, as follows:
raw 'False' - using ".values" instead of to_numpy for Travis environment compatibility
-----------------------------------------------------------------------------------------------------------------
def for_each_test(col):
    nz = col.values.nonzero()[0]
    return col.iloc[nz], nz

result = df.apply(for_each_test, axis=0, raw=False).items()
>>> result[21984]
(21984, --> col_idx
(hh_id --> col
0 1.0
271 1.0
1181 1.0
1601 1.0
2002 1.0
2531 1.0
Name: 21984, dtype: float64,
array([ 0, 271, 1181, 1601, 2002, 2531]) --> nz
)
)
This way, we avoid the shape error I mentioned before and make sure the expected results are returned.
Setting up synthpop in a virtualenv working with python 3.7 was returning the following error:
When the hh or p cols created here and here are unpacked here, the result of the drop_zeros function raises the error above. I solved it by modifying the pandas version in setup.py, downgrading from pandas==1.1.0rc to 1.0.5. Since the drop_zeros function is applied to a dataframe and the results do not all have the same length, unpacking the yield wrapper raises the posted error.