8080labs / ppscore

Predictive Power Score (PPS) in Python

Performance of PPS in non-linear situations from the blog #28

Closed Ijustwantyouhappy closed 4 years ago

Ijustwantyouhappy commented 4 years ago

[images: comparison panels of pearson, spearman, kendall, and pps across linear, rotated, and non-linear scenarios]

FlorianWetschoreck commented 4 years ago

Thank you for sharing this analysis @Ijustwantyouhappy. Can you maybe share the code for the analysis and translate the Chinese (?) characters? :)

Ijustwantyouhappy commented 4 years ago
from scipy.stats import pearsonr, spearmanr, kendalltau
import ppscore
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

def plot_data(xy, corrs, ax=None, xlim=None, ylim=None):
    """Scatter-plot xy and put each metric's scores in the title.

    :param xy: np.ndarray of shape (n, 2)
    :param corrs: list of (corr_name, corr_func) pairs, corr_func(x, y) -> float
    :param ax: matplotlib Axes to draw on (a new one is created if None)
    :param xlim: optional (min, max) limits for the x-axis
    :param ylim: optional (min, max) limits for the y-axis
    """
    if ax is None:
        _, ax = plt.subplots()
    if xlim is not None:
        ax.set_xlim(xlim)
    if ylim is not None:
        ax.set_ylim(ylim)
    ax.set_frame_on(False)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    ax.plot(*xy.T, ',')  # pixel markers
    # compute each metric in both directions (x -> y and y -> x)
    title = []
    for name, func in corrs:
        corr_xy = func(xy[:, 0], xy[:, 1])
        corr_yx = func(xy[:, 1], xy[:, 0])
        title.append(f"{name}: {np.round(corr_xy, 3)}, {np.round(corr_yx, 3)}")
    ax.set_title('\n'.join(title))

def pearson(x, y):
    return pearsonr(x, y)[0]

def spearman(x, y):
    return spearmanr(x, y)[0]

def kendall(x, y):
    return kendalltau(x, y)[0]

def pps(x, y):
    df = pd.DataFrame({"x": x, "y": y})
    return ppscore.score(df, 'x', 'y')['ppscore']

def rotation(xy, t):
    """Rotate the (n, 2) point cloud xy by angle t (radians)."""
    return np.dot(xy, [[np.cos(t), -np.sin(t)],
                       [np.sin(t), np.cos(t)]])

corr_config = [('pearson', pearson), ('spearman', spearman), ('kendall', kendall), ('pps', pps)]

# image1
corrs = [1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0]
n = 800
_, axes = plt.subplots(1, len(corrs), figsize=(2 * len(corrs), 2.5))
for i, corr in enumerate(corrs):
    cov = [[1, corr], [corr, 1]]  # covariance matrix
    xy = np.random.multivariate_normal([0, 0], cov, n)  # multivariate normal distribution
    plot_data(xy, corr_config, axes[i])
plt.tight_layout();

# image2 rotation normal
ts = np.array([0, 1/12, 1/6, 1/4, 1/3, 5/12, 1/2]) * np.pi
n = 1000
xy = np.random.multivariate_normal([0, 0], [[1, 1], [1, 1]], n)
_, axes = plt.subplots(1, len(ts), figsize=(2 * len(ts), 2.5))
for i, t in enumerate(ts):
    xy_rot = rotation(xy, t)
    plot_data(xy_rot, corr_config,
              axes[i], xlim=(-4, 4), ylim=(-4, 4))
plt.tight_layout();

# image3 non-linear situations
_, axes = plt.subplots(1, len(ts), figsize=(2 * len(ts), 2.5))
n = 1000

# fig1
x = np.random.uniform(-1, 1, n)
y = 4 * (x**2 - 0.5)**2 + np.random.uniform(-1, 1, n) / 3
plot_data(np.array([x, y]).T, corr_config, ax=axes[0])

# fig2
y = np.random.uniform(-1, 1, n)
xy = np.array([x, y]).T
xy = rotation(xy, -np.pi / 8)
plot_data(xy, corr_config, ax=axes[1])

# fig3
xy = rotation(xy, -np.pi / 8)
plot_data(xy, corr_config, ax=axes[2])

# fig4
y = 2 * x**2 + np.random.uniform(-1, 1, n)
plot_data(np.array([x, y]).T, corr_config, ax=axes[3])

# fig5
y = (x**2 + np.random.uniform(0, 0.5, n)) * \
        np.array([-1, 1])[np.random.randint(0, 2, size=n)]
plot_data(np.array([x, y]).T, corr_config, ax=axes[4])

# fig6: noisy circle (y uses the original x before x is overwritten)
y = np.cos(x * np.pi) + np.random.uniform(0, 1/8, n)
x = np.sin(x * np.pi) + np.random.uniform(0, 1/8, n)
plot_data(np.array([x, y]).T, corr_config, ax=axes[5])

# fig7
xy1 = np.random.multivariate_normal([3, 3], [[1, 0], [0, 1]], int(n/4))
xy2 = np.random.multivariate_normal([-3, 3], [[1, 0], [0, 1]], int(n/4))
xy3 = np.random.multivariate_normal([-3, -3], [[1, 0], [0, 1]], int(n/4))
xy4 = np.random.multivariate_normal([3, -3], [[1, 0], [0, 1]], int(n/4))
xy = np.concatenate((xy1, xy2, xy3, xy4), axis=0)
plot_data(xy, corr_config, ax=axes[6])

plt.tight_layout();

FlorianWetschoreck commented 4 years ago

Great, thank you. :) Can you please also translate the Chinese characters that you added between some of the graphs in the picture? I am interested in what they say.

It is also fine if you just copy the characters here so that I can run them through Google Translate, but it is a little inconvenient to extract them from the picture.

Ijustwantyouhappy commented 4 years ago

Uh... those are just my feelings about these metrics. I'm used to writing some notes after trying out a brand-new tool.

In my opinion, PPS is without doubt a creative and informative metric, but when the noise is heavy, or the relationship from x to y is potentially one-to-many, PPS performs poorly, even worse than the correlations, and the blog didn't seem to mention this.
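
A minimal sketch of the noisy case (using the same ppscore.score API as in the code above): a clear linear trend whose noise is comparable to the signal keeps Pearson around 0.5, while PPS often collapses to roughly 0:

import numpy as np
import pandas as pd
import ppscore
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x + rng.normal(0, 1.0, 1000)  # noise sd comparable to the signal's range

df = pd.DataFrame({"x": x, "y": y})
print("pearson:", round(pearsonr(x, y)[0], 3))                   # ~0.5
print("pps:", round(ppscore.score(df, "x", "y")["ppscore"], 3))  # often ~0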

FlorianWetschoreck commented 4 years ago

Sure, thank you for sharing your honest thoughts. Which scenario (in the graph or outside the graph) are you referring to where PPS performs worse than the correlations?

Ijustwantyouhappy commented 4 years ago
corrs = [1.0, 0.8, 0.4, 0.0, -0.4, -0.8, -1.0]
n = 800
_, axes = plt.subplots(1, len(corrs), figsize=(2 * len(corrs), 2.5))
for i, corr in enumerate(corrs):
    cov = [[1, corr], [corr, 1]]  # covariance matrix
    xy = np.random.multivariate_normal([0, 0], cov, n)  # multivariate normal distribution
    plot_data(xy, corr_config, axes[i])
plt.tight_layout();

[image: the seven panels produced by the snippet above, corr = 1.0 down to -1.0 from left to right]

Still using synthetic data: PPS and the correlations both perform well on perfect linear relations (fig1 and fig7), but as the noise gets heavier, the range of y for a single feature value x becomes wider; the correlations still perform well while PPS drops almost to 0 (fig2 and fig6), even though we can detect this strong linear relationship by eye. Honestly, this situation is fairly common in practical problems. So, as in my last comment, I think we can't just ignore correlations and visualizations of the data.

I do like the concept of PPS; it actually provides a novel perspective in EDA, so I tried to use it in my recent work for feature selection. It's an online sales forecasting problem for a global cosmetics giant. We transform this time series forecasting problem to... uh, in short, we extract more than one hundred features and build tree-based ensemble models, but unfortunately we only have a little historical data, less than 2 years. Uh, I seem to be getting too wordy... Because of the privacy policy, I can't show specific graphs or scores here.

The design of PPS only considers a single feature's predictive power for the target value, so many features' PPS will be exactly 0, yet they can reach high feature importance when used together with others, for example a feature like Sex used in regression problems.
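
A minimal sketch of this interaction effect, using a toy XOR-style target (each feature alone has PPS ~ 0, while a tree using both features predicts y almost perfectly):

import numpy as np
import pandas as pd
import ppscore
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 2000)  # e.g. a binary feature like Sex
b = rng.integers(0, 2, 2000)
y = (a ^ b).astype(str)       # target depends only on the interaction

df = pd.DataFrame({"a": a, "b": b, "y": y})
print("pps(a -> y):", ppscore.score(df, "a", "y")["ppscore"])  # ~0
print("pps(b -> y):", ppscore.score(df, "b", "y")["ppscore"])  # ~0

clf = RandomForestClassifier(n_estimators=50, random_state=0)
print("joint CV accuracy:",
      cross_val_score(clf, df[["a", "b"]], df["y"], cv=4).mean())  # ~1.0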

FlorianWetschoreck commented 4 years ago

Thank you for sharing this comparison and the code.

And I agree that the PPS is worse in those linear cases. We have also thought about maybe reporting the max of the PPS and the correlation in order not to lose the insights from the correlation, but I think we need some more testing to see if this makes sense. Maybe it also makes sense to merge the PPS with MIC or another score. Also, the PPS currently has some problems when there are numeric outliers because they distort the total sum of errors. We hope to find a workaround there, too.
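
As a rough sketch of that max-of-both idea (purely hypothetical, not an implemented ppscore feature):

import pandas as pd
import ppscore

def combined_score(df, x, y):
    """Larger of the PPS and the absolute Pearson correlation.

    Hypothetical helper, not part of ppscore; assumes x and y are
    numeric columns of df.
    """
    pps = ppscore.score(df, x, y)["ppscore"]
    corr = abs(df[x].corr(df[y]))
    return max(pps, corr)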

And of course I agree that PPS alone is not sufficient for feature selection because we also need to assess the feature importance when using multiple variables in the model.