fairnessforensics / wiggum

simpson's paradox inspired fairness forensics
https://fairnessforensics.github.io/wiggum/
MIT License

modin issue #202

Open Shine226 opened 3 years ago

Shine226 commented 3 years ago

1) Run Wiggum in a regular terminal rather than in VS Studio, since the error occurs in VS Studio.
2) We still need to import pandas for pandas.core.
3) New error:

  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum_app/controller.py", line 273, in main
    labeled_df_setup.get_subgroup_trends_1lev(trend_list)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/detectors.py", line 239, in get_subgroup_trends_1lev
    groupby_vars = self.get_vars_per_role('splitby')
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/labeled_dataframe.py", line 490, in get_vars_per_role
    return list(all_vars[is_target_role & drop_ignore])
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4307, in __getitem__
    result = getitem(key)
IndexError: too many indices for array
cegme commented 3 years ago

@Shine226 Can you try to reproduce the error without using the all_vars index? Just use all_vars.

Shine226 commented 3 years ago
First error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum_app/controller.py", line 273, in main
    labeled_df_setup.get_subgroup_trends_1lev(trend_list)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/detectors.py", line 259, in get_subgroup_trends_1lev
    cur_trend.get_trend_vars(self)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/base_getvars.py", line 181, in get_trend_vars
    ['ordinal','continuous'],['ordinal','continuous'])
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/base_getvars.py", line 102, in set_weights_regression
    indep_vars = labeled_df.get_vars_per_roletype('independent', i_type)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/labeled_dataframe.py", line 528, in get_vars_per_roletype
    drop_ignore)]
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/labeled_dataframe.py", line 527, in <listcomp>
    target_rows = [r & t & d for r,t,d in zip(is_target_role,is_target_type,
TypeError: unsupported operand type(s) for &: 'str' and 'str'
brownsarahm commented 3 years ago

That error says that the wrong type got passed into the variables in that zip. Those should all be boolean so that the & works.

Shine226 commented 3 years ago

Before changing to modin:

variable
sepal length     True
sepal width      True
petal length    False
petal width     False
class           False
dtype: bool
variable
sepal length     True
sepal width      True
petal length     True
petal width      True
class           False
dtype: bool
variable
sepal length    True
sepal width     True
petal length    True
petal width     True
class           True
dtype: bool
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

After applying Modin:

              __reduced__
variable                 
sepal length         True
sepal width         False
petal length         True
petal width          True
class               False
              __reduced__
variable                 
sepal length         True
sepal width          True
petal length         True
petal width          True
class               False
              __reduced__
variable                 
sepal length         True
sepal width          True
petal length         True
petal width          True
class                True
<class 'modin.pandas.dataframe.DataFrame'>
<class 'modin.pandas.dataframe.DataFrame'>
<class 'modin.pandas.dataframe.DataFrame'>
Shine226 commented 3 years ago

We can use .squeeze() to convert these one-column DataFrames back to Series.
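
The TypeError above is consistent with how iteration works: iterating over a DataFrame yields its column labels rather than its values, so zip hands strings such as '__reduced__' to &. Below is a minimal sketch of that behaviour and of the squeeze fix. It uses plain pandas to mimic the one-column frames printed above; the values are illustrative and not taken from Wiggum, and Modin is assumed (not verified here) to behave the same way.

```python
import pandas as pd

# One-column boolean frames shaped like the Modin output above (illustrative values).
index = pd.Index(["sepal length", "sepal width", "class"], name="variable")
is_target_role = pd.DataFrame({"__reduced__": [True, False, False]}, index=index)
is_target_type = pd.DataFrame({"__reduced__": [True, True, False]}, index=index)

# Iterating over a DataFrame yields column labels, so zip pairs up strings.
labels = list(zip(is_target_role, is_target_type))

# Squeezing the column axis collapses the single column into a boolean Series.
# Naming the axis keeps the result a Series even if the frame has only one row.
role = is_target_role.squeeze(axis="columns")
kind = is_target_type.squeeze(axis="columns")

# Iterating over a Series yields values, so & works element by element.
target_rows = [bool(r & t) for r, t in zip(role, kind)]
```

With Series in place, the list comprehension from labeled_dataframe.py receives booleans again instead of column names.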

Shine226 commented 3 years ago

New error:

UserWarning: `Series.align` defaulting to pandas implementation.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'list'> object. This may take some time.
127.0.0.1 - - [03/Jun/2021 15:47:44] "POST / HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum_app/controller.py", line 274, in main
    labeled_df_setup.get_subgroup_trends_1lev(trend_list)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/detectors.py", line 274, in get_subgroup_trends_1lev
    agg_trends = cur_trend.get_trends(self.df,'agg_trend')
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/statistical.py", line 171, in get_trends
    groupby_name = groupby_name_by_type[type(data_df)](data_df)
KeyError: <class 'modin.pandas.dataframe.DataFrame'>
Shine226 commented 3 years ago

groupby_name_by_type is keyed on the pandas classes (pandas.core.frame.DataFrame), so looking up a Modin DataFrame raises the KeyError:

groupby_name_by_type = {pandas.core.groupby.DataFrameGroupBy:lambda df: df.keys,
                                pandas.core.frame.DataFrame:lambda df: None}
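
The KeyError comes from the exact-type lookup: a dictionary keyed on classes only matches the classes that were registered, however similar another class looks. The sketch below reproduces that with two stand-in classes; PandasLikeFrame, ModinLikeFrame, handlers, and lookup are hypothetical names for illustration and are not part of pandas, Modin, or Wiggum.

```python
class PandasLikeFrame:
    """Stand-in for pandas.core.frame.DataFrame (hypothetical)."""


class ModinLikeFrame:
    """Stand-in for modin.pandas.dataframe.DataFrame (hypothetical)."""


# Table keyed on one exact class, like groupby_name_by_type above.
handlers = {PandasLikeFrame: lambda df: None}


def lookup(table, obj):
    """Exact-type lookup; raises KeyError for any class not in the table."""
    return table[type(obj)](obj)


try:
    lookup(handlers, ModinLikeFrame())
    missing = None
except KeyError as err:
    missing = err.args[0]  # the class that was never registered

# Registering the second class makes the same lookup succeed.
handlers[ModinLikeFrame] = lambda df: None
result = lookup(handlers, ModinLikeFrame())
```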
Shine226 commented 3 years ago

Change code to:

import modin.pandas as pd
groupby_name_by_type = {pd.groupby.DataFrameGroupBy:lambda df: df.keys,
                                pd.dataframe.DataFrame:lambda df: None}

New error:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum_app/controller.py", line 274, in main
    labeled_df_setup.get_subgroup_trends_1lev(trend_list)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/detectors.py", line 284, in get_subgroup_trends_1lev
    curgroup_trend_df = cur_trend.get_trends(cur_grouping,'subgroup_trend')
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/statistical.py", line 173, in get_trends
    groupby_name = groupby_name_by_type[type(data_df)](data_df)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/statistical.py", line 6, in <lambda>
    groupby_name_by_type = {pd.groupby.DataFrameGroupBy:lambda df: df.keys,
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/groupby.py", line 125, in __getattr__
    raise e
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/groupby.py", line 121, in __getattr__
    return object.__getattribute__(self, key)
AttributeError: 'DataFrameGroupBy' object has no attribute 'keys'
Shine226 commented 3 years ago

Changing keys to _idx_name resolves the "no attribute 'keys'" error, but it is not good for extracting a compound key such as ['sex', 'race']:

groupby_name_by_type = {pd.groupby.DataFrameGroupBy:lambda df: df._idx_name,
                                pd.dataframe.DataFrame:lambda df: None}
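
One possible way to avoid reading the keys back from the grouped object at all is to keep the group-by variable names next to it. This is a sketch only, not what Wiggum does: group_with_keys is a hypothetical helper, the data is made up, and the example runs on plain pandas under the assumption (not verified here) that Modin's groupby accepts the same list of column names.

```python
import pandas as pd


def group_with_keys(df, groupby_vars):
    """Group df and return (grouped, keys), with keys always a list of names."""
    keys = [groupby_vars] if isinstance(groupby_vars, str) else list(groupby_vars)
    return df.groupby(keys), keys


# Illustrative data with a compound key.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "race": ["a", "b", "a", "b"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

grouped, keys = group_with_keys(df, ["sex", "race"])
single_grouped, single_keys = group_with_keys(df, "sex")
```

Because the caller already knows the key names, the compound key ['sex', 'race'] is available without touching any private attribute of the grouped object.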
Shine226 commented 3 years ago

New error for corr function:

127.0.0.1 - - [09/Jun/2021 11:29:20] "POST / HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum_app/controller.py", line 273, in main
    labeled_df_setup.get_subgroup_trends_1lev(trend_list)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/detectors.py", line 284, in get_subgroup_trends_1lev
    curgroup_trend_df = cur_trend.get_trends(cur_grouping,'subgroup_trend')
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/statistical.py", line 179, in get_trends
    corr_data = self.compute_correlation_table(data_df,trend_col_name)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/statistical.py", line 88, in compute_correlation_table
    corr_mat = data_df[corr_var_list].corr(method=self.corrtype)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/groupby.py", line 622, in corr
    return self._default_to_pandas(lambda df: df.corr(**kwargs))
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/groupby.py", line 980, in _default_to_pandas
    return self._df._default_to_pandas(groupby_on_multiple_columns, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/base.py", line 400, in _default_to_pandas
    result = op(pandas_obj, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/groupby.py", line 974, in groupby_on_multiple_columns
    by=by, axis=self._axis, squeeze=self._squeeze, **self._kwargs
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 6727, in groupby
    dropna=dropna,
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 568, in __init__
    dropna=self.dropna,
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 811, in get_grouper
    raise KeyError(gpr)
KeyError: 'class'
Shine226 commented 3 years ago

Workaround: set the DataFrame index to the groupby column before calling groupby:

  # Modin issue: set index before groupby for corr() in get_trends
  self.df.index = self.df[groupbyAttr]

  #condition the data
  cur_grouping = self.df.groupby(groupbyAttr)
Shine226 commented 3 years ago

Invalid index error:

  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum_app/controller.py", line 273, in main
    labeled_df_setup.get_subgroup_trends_1lev(trend_list)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/detectors.py", line 286, in get_subgroup_trends_1lev
    curgroup_trend_df = cur_trend.get_trends(cur_grouping,'subgroup_trend')
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/statistical.py", line 179, in get_trends
    corr_data = self.compute_correlation_table(data_df,trend_col_name)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/statistical.py", line 98, in compute_correlation_table
    itertools.product(self.regression_vars,groupby_vars)]
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/trend_components/statistical.py", line 97, in <listcomp>
    corr_data = [(i,d, corr_mat[i][g][d],g) for (i,d),g in
IndexError: invalid index to scalar variable.
Shine226 commented 3 years ago

Change

corr_data = [(i,d, corr_mat[i][g][d],g) for (i,d),g in
                     itertools.product(self.regression_vars,groupby_vars)]

to

corr_data = [(i,d, corr_mat.loc[g,i][d],g) for (i,d),g in
                    itertools.product(self.regression_vars,groupby_vars)]
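
For reference, the layout of a grouped correlation table can be checked on a small example. The sketch below uses plain pandas with made-up data; whether Modin returns exactly the same layout is an assumption. It spells the row key out as a (group, variable) tuple, which selects the same cell as the .loc[g,i][d] form in the change above.

```python
import pandas as pd

# Illustrative data: one grouping column and two numeric columns.
df = pd.DataFrame({
    "class": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 7.0, 3.0, 2.0, 0.0],
})

# Rows carry a (group, variable) MultiIndex; columns are the variables.
corr_mat = df.groupby("class")[["x", "y"]].corr()

# An explicit row tuple plus a column label selects a single correlation.
r_a = corr_mat.loc[("a", "x"), "y"]
r_b = corr_mat.loc[("b", "x"), "y"]

# The same values computed directly per group, for comparison.
sub_a = df[df["class"] == "a"]
sub_b = df[df["class"] == "b"]
direct_a = sub_a["x"].corr(sub_a["y"])
direct_b = sub_b["x"].corr(sub_b["y"])
```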
Shine226 commented 3 years ago

In the Wiggum app, after loading the data folder, clicking the 'visualize' button raises a new error in add_distance():

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/worker.py", line 2403, in _maybe_deserialize_task
    function, args, kwargs = _deserialize(*self.tasks[key])
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/worker.py", line 3238, in _deserialize
    args = pickle.loads(args)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/dataframe.py", line 2424, in _inflate_light
    return cls(query_compiler=query_compiler)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/dataframe.py", line 90, in __init__
    Engine.subscribe(_update_engine)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/config/pubsub.py", line 107, in subscribe
    callback(cls)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/__init__.py", line 122, in _update_engine
    initialize_dask()
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/engines/dask/utils.py", line 37, in initialize_dask
    num_cpus = len(client.ncores())
TypeError: object of type 'coroutine' has no len()
127.0.0.1 - - [09/Jun/2021 15:38:25] "POST / HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/anaconda3/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum_app/controller.py", line 279, in main
    labeled_df_setup.add_distance()
  File "/Users/chenguangxu/Documents/GitHub/detect_simpsons_paradox_dev/wiggum/ranking_processing.py", line 496, in add_distance
    self.result_df['distance'] = self.result_df.apply(dist_helper,axis=1)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/dataframe.py", line 292, in apply
    func, axis=axis, raw=raw, result_type=result_type, args=args, **kwds
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/base.py", line 751, in apply
    **kwds,
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 2348, in apply
    return self._callable_func(func, axis, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 2446, in _callable_func
    axis, lambda df: df.apply(func, axis=axis, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/engines/base/frame/data.py", line 1308, in _apply_full_axis
    other=None,
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/engines/base/frame/data.py", line 1698, in broadcast_apply_full_axis
    for i, new_axis in enumerate([new_index, new_columns])
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/engines/base/frame/data.py", line 1698, in <listcomp>
    for i, new_axis in enumerate([new_index, new_columns])
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/engines/base/frame/data.py", line 260, in _compute_axis_labels
    axis, partitions, lambda df: df.axes[axis]
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/engines/dask/pandas_on_dask/frame/partition_manager.py", line 102, in get_indices
    new_idx = client.gather(new_idx)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 1893, in gather
    asynchronous=asynchronous,
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 780, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py", line 348, in sync
    raise exc.with_traceback(tb)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py", line 332, in f
    result[0] = yield future
  File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 1752, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/anaconda3/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/dataframe.py", line 2424, in _inflate_light
    return cls(query_compiler=query_compiler)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/dataframe.py", line 90, in __init__
    Engine.subscribe(_update_engine)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/config/pubsub.py", line 107, in subscribe
    callback(cls)
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/pandas/__init__.py", line 122, in _update_engine
    initialize_dask()
  File "/opt/anaconda3/lib/python3.7/site-packages/modin/engines/dask/utils.py", line 37, in initialize_dask
    num_cpus = len(client.ncores())
TypeError: object of type 'coroutine' has no len()
Shine226 commented 3 years ago

After rerunning 'pip install modin[all]', the error is gone.
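
The thread does not say why reinstalling helped. One plausible but unconfirmed explanation is that the installed modin, dask, and distributed versions did not match. A minimal sketch for recording which versions are installed before and after such a reinstall follows; installed_versions is a hypothetical helper, and importlib.metadata needs Python 3.8 or later, which is newer than the 3.7 shown in the tracebacks.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["modin", "dask", "distributed"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```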