Closed Pacman1984 closed 11 months ago
Hi @Pacman1984, I've made some updates to the association module. The CollinearityThreshold selector uses the association matrix to eliminate redundant features, and the association functions are now more efficient and easier to use: they accept callable inputs directly and use vectorized operations.
Please note some important aspects:
CollinearityThreshold is unsupervised and only requires the predictor matrix X.
You can follow the steps illustrated in the new tutorial, summarized below.
Lastly, these changes are implemented in version 2.2.0, soon to be released on PyPI.
import pandas as pd
import ppscore as pps
# Assumed import path for the decorator used below
from arfs.association import asymmetric_function


@asymmetric_function
def ppscore_arfs(x, y, sample_weight=None, as_frame=True):
    """
    Calculate the Predictive Power Score (PPS) for series x with respect to series y.

    The PPS is a score that shows the predictive relationship between two variables.
    This function calculates the PPS of x predicting y. If the series have the same name,
    the function assumes they are identical and returns a score of 1.

    Parameters
    ----------
    x : pandas.Series
        A pandas Series representing a feature.
    y : pandas.Series
        Another pandas Series representing a feature.
    sample_weight : array-like, optional
        Accepted for interface compatibility; not used by this function.
    as_frame : bool, optional
        If True, the function returns the result as a pandas DataFrame;
        otherwise, it returns a float value. The default is True.

    Returns
    -------
    Union[float, pandas.DataFrame]
        A score representing the PPS between x and y.
        If `as_frame` is True, returns a DataFrame with the columns "row", "col", and "val",
        where "row" and "col" represent the names of the series x and y, respectively,
        and "val" is the PPS score. If `as_frame` is False, returns the PPS score as a float.
    """
    # If x and y are single-column DataFrames, extract the column as a Series
    if (isinstance(x, pd.DataFrame) and isinstance(y, pd.DataFrame) and x.shape[1] == 1 and y.shape[1] == 1):
        x = x.iloc[:, 0]
        y = y.iloc[:, 0]

    if x.name == y.name:
        score = 1
    else:
        # Merge x and y into a single DataFrame, as pps.score expects
        df = pd.DataFrame({"x": x.values, "y": y.values})
        # Calculate the PPS of x predicting y and extract the score
        score = pps.score(df, df.columns[0], df.columns[1])["ppscore"]

    if as_frame:
        return pd.DataFrame({"row": x.name, "col": y.name, "val": score}, index=[0])
    else:
        return score
You can use it as:
from arfs.feature_selection import CollinearityThreshold

selector = CollinearityThreshold(
    method="association",
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs,
    threshold=0.5,
).fit(X)

print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")
Thanks @ThomasBury for the update. Great to see this!
Title: Custom callable/function for CollinearityThreshold class (nom_nom_assoc | num_num_assoc | nom_num_assoc)
Body:
Description of the Issue: I encountered an error while trying to implement a custom callable for the CollinearityThreshold class, specifically when integrating the Predictive Power Score (PPS). The code describes the implementation as follows: "If callable, a function which receives two pd.Series (and optionally a weight array) and returns a single number."
Code Sample: I've implemented the PPS as follows:
I then applied this function in the CollinearityThreshold class as follows:
Error Encountered: Upon executing the above, I received the following error message:
Request for Assistance: I am seeking guidance on how to resolve this error. It seems to be related to the way the ppscore_arfs function is implemented or how it's integrated with the CollinearityThreshold class. Any insights or suggestions on how to correctly implement this custom callable would be greatly appreciated.
Thank you in advance for your assistance!