ThomasBury / arfs

All Relevant Feature Selection
MIT License

Issue with Custom Callable Implementation in CollinearityThreshold Class #35

Closed Pacman1984 closed 11 months ago

Pacman1984 commented 11 months ago

Title: Custom callable/function for CollinearityThreshold Class (nom_nom_assoc | num_num_assoc | nom_num_assoc)

Body:

Description of the Issue: I encountered an error while trying to pass a custom callable to the CollinearityThreshold class, specifically when integrating the Predictive Power Score (PPS). The code documents the callable as follows: "If callable, a function which receives two pd.Series (and optionally a weight array) and returns a single number."

Code Sample: I've implemented the PPS as follows:

def ppscore_arfs(x, y, **kwargs):
    """
    Calculate the Predictive Power Score (PPS) for series x with respect to series y.

    Parameters:
        x (pandas.Series): A series representing a feature.
        y (pandas.Series): A series representing a feature.
        **kwargs: Additional keyword arguments for the ppscore function.

    Returns:
        float: A score representing the PPS between x and y.
    """
    import pandas as pd
    import ppscore as pps

    # Merging x and y into a single DataFrame
    df = pd.concat([x, y], axis=1)

    # Calculating the PPS and extracting the score
    score = float(pps.score(df, df.columns[0], df.columns[1])['ppscore'])

    return score

I then applied this function in the CollinearityThreshold class as follows:

selector = CollinearityThreshold(
    method="association",
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs,
    threshold=0.85,
).fit(X, y)

Error Encountered: Upon executing the above, I received the following error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
i:\Coding\00_Playground\arfs\test_arfs.ipynb Cell 20 line 7
      1 selector = CollinearityThreshold(
      2     method="association",
      3     nom_nom_assoc=ppscore_arfs,
      4     num_num_assoc=ppscore_arfs,
      5     nom_num_assoc=ppscore_arfs,
      6     threshold=0.85,
----> 7 ).fit(X, y)

File i:\Coding\00_Playground\arfs\.venv\lib\site-packages\arfs\feature_selection\unsupervised.py:349, in CollinearityThreshold.fit(self, X, y, sample_weight)
    346     X = encoder.fit_transform(X)
    347     del encoder
--> 349 assoc_matrix = association_matrix(
    350     X=X,
    351     sample_weight=sample_weight,
    352     n_jobs=self.n_jobs,
    353     nom_nom_assoc=self.nom_nom_assoc,
    354     num_num_assoc=self.num_num_assoc,
    355     nom_num_assoc=self.nom_num_assoc,
    356 )
    357 self.assoc_matrix_ = xy_to_matrix(assoc_matrix)
    359 to_drop = _recursive_collinear_elimination(self.assoc_matrix_, self.threshold)

File i:\Coding\00_Playground\arfs\.venv\lib\site-packages\arfs\association.py:1227, in association_matrix(X, sample_weight, nom_nom_assoc, num_num_assoc, nom_num_assoc, n_jobs, handle_na, nom_nom_comb, num_num_comb, nom_num_comb)
   1225 if n_num_cols >= 2:
   1226     if callable(num_num_assoc):
-> 1227         w_num_num = _callable_association_matrix_fn(
   1228             assoc_fn=num_num_assoc,
   1229             cols_comb=num_num_comb,
   1230             kind="num-num",
   1231             X=X,
   1232             sample_weight=sample_weight,
   1233             n_jobs=n_jobs,
   1234         )
   1235     else:
   1236         w_num_num = wcorr_matrix(
   1237             X, sample_weight, n_jobs, handle_na=None, method=num_num_assoc
   1238         )

File i:\Coding\00_Playground\arfs\.venv\lib\site-packages\arfs\association.py:1426, in _callable_association_matrix_fn(assoc_fn, X, sample_weight, n_jobs, kind, cols_comb)
   1424     cols_comb = [comb for comb in combinations(selected_cols, 2)]
   1425     _assoc_fn = partial(_compute_matrix_entries, func_xyw=assoc_fn)
-> 1426     assoc = parallel_matrix_entries(
   1427         func=_assoc_fn,
   1428         df=X,
   1429         comb_list=cols_comb,
   1430         sample_weight=sample_weight,
   1431         n_jobs=n_jobs,
   1432     )
   1434 else:
   1435     assoc = None

File i:\Coding\00_Playground\arfs\.venv\lib\site-packages\arfs\parallel.py:55, in parallel_matrix_entries(func, df, comb_list, sample_weight, n_jobs)
     50 lst = Parallel(n_jobs=n_jobs)(
     51     delayed(func)(X=df, sample_weight=sample_weight, comb_list=comb_chunk)
     52     for comb_chunk in comb_chunks
     53 )
     54 # return flatten list of pandas DF
---> 55 return pd.concat(list(chain(*lst)), ignore_index=True)

File i:\Coding\00_Playground\arfs\.venv\lib\site-packages\pandas\util\_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File i:\Coding\00_Playground\arfs\.venv\lib\site-packages\pandas\core\reshape\concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    146 @deprecate_nonkeyword_arguments(version=None, allowed_args=["objs"])
    147 def concat(
    148     objs: Iterable[NDFrame] | Mapping[HashableT, NDFrame],
   (...)
    157     copy: bool = True,
    158 ) -> DataFrame | Series:
    159     """
    160     Concatenate pandas objects along a particular axis.
    161
   (...)
    366     1   3   4
    367     """
--> 368     op = _Concatenator(
    369         objs,
    370         axis=axis,
    371         ignore_index=ignore_index,
    372         join=join,
    373         keys=keys,
    374         levels=levels,
    375         names=names,
    376         verify_integrity=verify_integrity,
    377         copy=copy,
    378         sort=sort,
    379     )
    381     return op.get_result()

File i:\Coding\00_Playground\arfs\.venv\lib\site-packages\pandas\core\reshape\concat.py:458, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    453     if not isinstance(obj, (ABCSeries, ABCDataFrame)):
    454         msg = (
    455             f"cannot concatenate object of type '{type(obj)}'; "
    456             "only Series and DataFrame objs are valid"
    457         )
--> 458         raise TypeError(msg)
    460     ndims.add(obj.ndim)
    462 # get the sample
    463 # want the highest ndim that we have, and must be non-empty
    464 # unless all objs are empty

TypeError: cannot concatenate object of type '<class 'float'>'; only Series and DataFrame objs are valid
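
The last frames of the traceback suggest where things go wrong: `parallel_matrix_entries` passes the per-pair results to `pd.concat`, which accepts only Series and DataFrame objects, while the callable above returns a bare float. The minimal sketch below reproduces that behaviour with pandas alone. The sample values and column names are made up for illustration, and it does not call arfs.

```python
import pandas as pd

# Per-pair results shaped like the callable's output: bare floats.
float_results = [0.31, 0.78]

try:
    pd.concat(float_results, ignore_index=True)
    concat_error = None
except TypeError as exc:
    # pandas rejects anything that is not a Series or DataFrame
    concat_error = str(exc)

# The same results wrapped as one-row DataFrames concatenate without error.
frame_results = [
    pd.DataFrame({"row": "a", "col": "b", "val": 0.31}, index=[0]),
    pd.DataFrame({"row": "a", "col": "c", "val": 0.78}, index=[0]),
]
long_table = pd.concat(frame_results, ignore_index=True)

print(concat_error)
print(long_table)
```

If this reading is right, a callable that returns a one-row DataFrame per pair would get past the `pd.concat` step.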

Request for Assistance: I am seeking guidance on how to resolve this error. It seems to be related to the way the ppscore_arfs function is implemented or how it's integrated with the CollinearityThreshold class. Any insights or suggestions on how to correctly implement this custom callable would be greatly appreciated.

Thank you in advance for your assistance!

ThomasBury commented 11 months ago

Hi @Pacman1984, I've made some updates to the association module. The CollinearityThreshold selector uses the association matrix to eliminate redundant features, and the association functions are now more efficient and easier to use. They have been enhanced to accept callable inputs directly and to use vectorized operations.

Please note some important aspects:

You can follow the steps illustrated in the new tutorial, summarized below.

Lastly, these changes are implemented in version 2.2.0, which will soon be released on PyPI.

import pandas as pd
import ppscore as pps
from arfs.association import asymmetric_function

@asymmetric_function
def ppscore_arfs(x, y, sample_weight=None, as_frame=True):
    """
    Calculate the Predictive Power Score (PPS) for series x with respect to series y.

    The PPS is a score that shows the predictive relationship between two variables. 
    This function calculates the PPS of x predicting y. If the series have the same name, 
    the function assumes they are identical and returns a score of 1. 

    Parameters
    ----------
    x : pandas.Series
        A pandas Series representing a feature.
    y : pandas.Series
        Another pandas Series representing a feature.
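    sample_weight : array-like, optional
        Accepted for signature compatibility; not used by this function.
    as_frame : bool, optional
        If True, the function returns the result as a pandas DataFrame;
        otherwise, it returns a float value. The default is True.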

    Returns
    -------
    Union[float, pandas.DataFrame]
        A score representing the PPS between x and y. 
        If `as_frame` is True, returns a DataFrame with the columns "row", "col", and "val",
        where "row" and "col" represent the names of the series x and y, respectively, 
        and "val" is the PPS score. If `as_frame` is False, returns the PPS score as a float.
    """

    # If x and y arrive as single-column DataFrames, unwrap them into Series
    if (isinstance(x, pd.DataFrame) and isinstance(y, pd.DataFrame) and x.shape[1] == 1 and y.shape[1] == 1):
        # Extracting the series from the DataFrames
        x = x.iloc[:, 0]
        y = y.iloc[:, 0]

    if x.name == y.name:
        score = 1
    else: 
        df = pd.DataFrame({"x": x.values, "y": y.values})
        # Calculating the PPS and extracting the score
        score = pps.score(df, df.columns[0], df.columns[1])['ppscore']

    if as_frame:
        return pd.DataFrame({"row": x.name, "col": y.name, "val":score}, index=[0])
    else:
        return score
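
The return convention above (a one-row frame with "row", "col" and "val") can be tried without ppscore or arfs. The sketch below is only an illustration: `abs_spearman_assoc` and the toy frame `X` are hypothetical names, the score is an absolute rank correlation rather than PPS, and the arfs decorator is left out so the block runs with pandas alone.

```python
import pandas as pd


def abs_spearman_assoc(x, y, sample_weight=None, as_frame=True):
    """Hypothetical association callable: absolute Spearman correlation."""
    # sample_weight is accepted to mirror the signature above but is ignored.
    if x.name == y.name:
        score = 1.0
    else:
        # Spearman correlation computed as the Pearson correlation of ranks
        score = abs(x.rank().corr(y.rank()))
    if as_frame:
        return pd.DataFrame({"row": x.name, "col": y.name, "val": score}, index=[0])
    return score


# Toy data: "b" is a monotone function of "a", "c" is only weakly related.
X = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]})

# One-row frames for every ordered pair, stacked into a long table ...
pairs = [(r, c) for r in X.columns for c in X.columns]
long_table = pd.concat(
    [abs_spearman_assoc(X[r], X[c]) for r, c in pairs], ignore_index=True
)

# ... then pivoted into a square association matrix.
matrix = long_table.pivot(index="row", columns="col", values="val")
print(matrix)
```

With a threshold of 0.5, as in the usage snippet in this thread, only the pair ("a", "b") in this toy matrix would count as collinear.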

You can use it as:

selector = CollinearityThreshold(
    method="association",
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs,
    threshold=0.5,
).fit(X)

print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")

Pacman1984 commented 11 months ago

Thanks @ThomasBury for the update. Great to see this.