UDST / synthpop

Synthetic populations from census data
BSD 3-Clause "New" or "Revised" License

Failure with last pandas version in `ipu.py` #63

Closed PyMap closed 3 years ago

PyMap commented 3 years ago

Setting up synthpop in a virtualenv with Python 3.7 returned the following error:

File "/home/fedec/urbansim/spop/synthpop/synthpop/ipu/ipu.py", line 28, in _drop_zeros
    for (col_idx, (col, nz)) in df.apply(for_each_col, axis=0, raw=True).items():
  File "/home/fedec/urbansim/spop/synthpop/venv/lib/python3.7/site-packages/pandas-1.1.0rc0-py3.7-linux-x86_64.egg/pandas/core/frame.py", line 7541, in apply
    return op.get_result()
  File "/home/fedec/urbansim/spop/synthpop/venv/lib/python3.7/site-packages/pandas-1.1.0rc0-py3.7-linux-x86_64.egg/pandas/core/apply.py", line 178, in get_result
    return self.apply_raw()
  File "/home/fedec/urbansim/spop/synthpop/venv/lib/python3.7/site-packages/pandas-1.1.0rc0-py3.7-linux-x86_64.egg/pandas/core/apply.py", line 219, in apply_raw
    result = np.apply_along_axis(self.f, self.axis, self.values)
  File "<__array_function__ internals>", line 6, in apply_along_axis
  File "/home/fedec/urbansim/spop/synthpop/venv/lib/python3.7/site-packages/numpy-1.19.1-py3.7-linux-x86_64.egg/numpy/lib/shape_base.py", line 402, in apply_along_axis
    buff[ind] = asanyarray(func1d(inarr_view[ind], *args, **kwargs))
ValueError: could not broadcast input array from shape (2,2) into shape (2,0)

When the hh or p cols created here and here are unpacked here, the result of the _drop_zeros function raises the error above.

I solved it by modifying the pandas version in setup.py, downgrading from pandas==1.1.0rc to 1.0.5.

Since the _drop_zeros function is applied to a dataframe and its results don't all have the same length, unpacking the yield wrapper raises the error posted above.

PyMap commented 3 years ago

Changing the way _drop_zeros is applied as follows:

def _drop_zeros(df):
    """
    Drop zeros from a DataFrame, returning an iterator over the columns
    in the DataFrame.

    Yields tuples of (column name, non-zero column values, non-zero indexes).

    Parameters
    ----------
    df : pandas.DataFrame

    """
    def for_each_col(col):
        nz = col.to_numpy().nonzero()[0]
        return col[nz], nz

    for (col_idx, (col, nz)) in df.apply(lambda row : for_each_col(row), axis=1).items():
        yield (col_idx, col, nz)
  1. The Series nonzero method has been deprecated in pandas, so I convert the row values with to_numpy and then apply nonzero.
  2. Use apply with an anonymous function and change the axis orientation.

This returns, apparently, the desired result:

a) The same number of results (as many tuples as there are rows in the dataframe passed to the non-zero function): image

b) Each returned tuple has: image

b1: column name (0 in the example)
b2: non-zero value (1 in the example)
b3: non-zero idx (row 0 in the example)

Comparing this result with the dataframe above, we can see that the values match correctly in column 0.

To be completely sure I will make a new branch and test it. (@msoltadeo @janowicz )

cvanoli commented 3 years ago

Thanks @PyMap for all the detective work! I'll test this now.

PyMap commented 3 years ago

Great @cvanoli !

This is what I finally did:

def _drop_zeros(df):
    """
    Drop zeros from a DataFrame, returning an iterator over the columns
    in the DataFrame.
    Yields tuples of (column name, non-zero column values, non-zero indexes).
    Parameters
    ----------
    df : pandas.DataFrame
    """
    def for_each_col(col):
        nz = col.to_numpy().nonzero()[0]
        return col[nz], nz

    for (col_idx, (col, nz)) in df.apply(lambda row: for_each_col(row), axis=0).items():
        yield (col_idx, col, nz)
MGrunnill commented 3 years ago

I have been running into exactly the same issue and PyMap's change has fixed it. Thank you PyMap. I would suggest that this solution is merged into the develop (default) branch.

PyMap commented 3 years ago

I've been doing further checks to be sure that the changes proposed in PR65 correctly solve the pandas incompatibilities for versions >=0.15.0.

While the proportional updating process runs, the _FrequencyAndConstraints class is instantiated using the household frequency and constraints tables. Here, the _drop_zeros function is applied to the frequencies dataframe and returns the indexes of the non-zero values, which are also used to filter the constraints Series (both share the household idx).

To achieve this, _drop_zeros uses the nested function for_each_col, whose main purpose is to return each column of the frequencies dataframe, filtered. The main reason for the crash is that, in recent pandas, the apply method tries to store arrays of different lengths when for_each_col receives the Series of the frequencies table.

When for_each_col is applied, the raw parameter can be set to True or False depending on how nonzero is applied to each Series. Since a Series no longer supports calling nonzero directly, it has to be converted to a numpy array first (which can be achieved with raw=True, with to_numpy as suggested in the official documentation, or directly with the .values attribute).
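As a minimal sketch of the conversion described above, the snippet below takes one made-up frequency column (the household ids and values are invented for illustration) and finds its non-zero entries through the underlying numpy array. It selects by position with .iloc so the result does not depend on the index labels.

```python
import pandas as pd

# Hypothetical column of a frequencies table, indexed by household id.
col = pd.Series([1.0, 0.0, 0.0, 1.0], index=[101, 102, 103, 104], name=21984)

# Positions (not labels) of the non-zero entries, taken from the numpy array.
nz = col.values.nonzero()[0]

# to_numpy() exposes the same data on pandas versions that provide it.
nz_alt = col.to_numpy().nonzero()[0]

# Select by position, so integer index labels cannot be confused with positions.
non_zero = col.iloc[nz]

print(nz.tolist())              # [0, 3]
print(non_zero.index.tolist())  # [101, 104]
```

Both spellings give the same positions; .values is the one that also works on pandas versions that predate to_numpy.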

The expected result for the _drop_zeros function is a generator object containing tuples with:

(col_idx, --> this is the number of the column in the frequencies table, which is also its name
 (col, --> array of non zero values,
  nz) --> row indexes of non zero values
)

With older pandas versions, raw=True was able to return the expected structure, as follows:

result = df.apply(for_each_col, axis=0, raw=True).items()
>>> result[21984]
(21984, --> col_idx 
 (array([1., 1., 1., 1., 1., 1.]), --> col
  array([   0,  271, 1181, 1601, 2002, 2531]) --> nz))

Applying for_each_col with axis=0 orientation always returns a (2, n) shape. The first dimension corresponds to the row/col indexes in the frequencies dataframe, and n is variable: its length depends on the nonzero filtering of each column. Storing arrays of different shapes is, apparently, what is no longer supported in recent pandas.

So, instead of passing raw=True to df.apply(for_each_col) to receive a numpy array directly, the workaround I tried is to cast the pandas Series to an array before applying nonzero and then filter the Series, as follows:
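The shape mismatch can be reproduced with numpy alone, since the traceback above ends in np.apply_along_axis. The sketch below uses a small invented table in which the first column has no non-zero entries and the second has two. The helper is a numpy-only stand-in for for_each_col, not the synthpop code itself.

```python
import numpy as np

def for_each_col(col):
    # Keep the non-zero values and their positions. Both arrays have the
    # same length k, so the pair converts to an array of shape (2, k).
    nz = col.nonzero()[0]
    return col[nz], nz

# Hypothetical frequency table: column 0 is all zeros, column 1 has two
# non-zero entries.
freq = np.array([[0.0, 1.0],
                 [0.0, 2.0],
                 [0.0, 0.0]])

# Shape of the per-column result, computed independently for each column.
shapes = [np.asarray(for_each_col(freq[:, j])).shape
          for j in range(freq.shape[1])]
print(shapes)  # [(2, 0), (2, 2)]

# apply_along_axis sizes its output buffer from the first column's result,
# so a later column with a different shape cannot be stored.
error = None
try:
    np.apply_along_axis(for_each_col, 0, freq)
except ValueError as exc:
    error = exc
print(type(error).__name__)  # ValueError
```

The buffer is allocated for (2, 0) results, so the (2, 2) result of the second column cannot be broadcast into it.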

raw 'False' - using ".values" instead of to_numpy for Travis environment compatibility
-----------------------------------------------------------------------------------------------------------------
def for_each_test(col):
    nz = col.values.nonzero()[0]
    return col.iloc[nz], nz

result = df.apply(for_each_test, axis=0, raw=False).items()
>>> result[21984]
(21984,      --> col_idx 
(hh_id       --> col
 0       1.0
 271     1.0
 1181    1.0
 1601    1.0
 2002    1.0
 2531    1.0
 Name: 21984, dtype: float64, 
 array([   0,  271, 1181, 1601, 2002, 2531]) --> nz 
   )
 )

This way, we avoid the shape error mentioned above and make sure the expected results are returned.
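For comparison, here is a sketch of the same generator contract written without df.apply. This is not the change proposed in this thread: it iterates over the columns with DataFrame.items() so that nothing has to be packed into a single result object. The function name drop_zeros_sketch and the sample table are invented for illustration.

```python
import pandas as pd

def drop_zeros_sketch(df):
    """Yield (column name, non-zero values, non-zero positions) per column."""
    for col_idx, col in df.items():
        # Positions of the non-zero entries in this column.
        nz = col.values.nonzero()[0]
        # Filter by position so the hh_id labels are preserved in the result.
        yield col_idx, col.iloc[nz], nz

# Hypothetical frequencies table indexed by household id.
freq = pd.DataFrame(
    {0: [1.0, 0.0, 0.0], 1: [0.0, 2.0, 3.0]},
    index=pd.Index([10, 11, 12], name="hh_id"),
)

results = list(drop_zeros_sketch(freq))
for col_idx, col, nz in results:
    print(col_idx, col.tolist(), nz.tolist())
# 0 [1.0] [0]
# 1 [2.0, 3.0] [1, 2]
```

Each column is handled on its own, so columns with different numbers of non-zero entries never need to share one array shape.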