ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

On the newly added confidence interval section in chapter 2 #9

Closed · nicholashub closed this 4 years ago

nicholashub commented 5 years ago

In chapter 2, the confidence interval of the generalization error (RMSE) is calculated by taking the square root of a t-interval. The code is as follows:

import numpy as np
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2  # y_test: true values, final_predictions: predicted values
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(), scale=stats.sem(squared_errors)))

It's unintuitive to me, though (doesn't the sum of squared errors follow something like a $\chi^2$ distribution?). I wonder under what assumptions (e.g., what distribution the variables follow) this result is derived. Thanks!

ageron commented 5 years ago

Hi @nicholashub, great question!

A confidence interval is computed for the mean of a sample from a random distribution. We don't need to know what this actual distribution is, thanks to the central limit theorem, which states that the sampling distribution of the sample mean of any random distribution (provided its mean is finite and its variance is finite and non-zero) will tend towards a Gaussian distribution as the sample size increases. So if we have enough instances in the test set (which is basically always the case, since even just 100 instances is sufficient), then the confidence interval should be fairly meaningful (regardless of the actual error distribution).
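As a quick sanity check (a minimal simulation sketch, not from the book; the exponential distribution here is an arbitrary choice), the sampling distribution of the mean gets close to a Gaussian even for a very skewed population:

import numpy as np
from scipy import stats

rng = np.random.RandomState(42)
population = rng.exponential(scale=2.0, size=10**6)  # very skewed, clearly non-Gaussian

# Draw 10,000 test sets of 100 instances each and keep only their means
sample_means = population[rng.randint(10**6, size=(10000, 100))].mean(axis=1)

print("Population skewness:", stats.skew(population))      # ~2 for an exponential
print("Skewness of the means:", stats.skew(sample_means))  # much closer to 0
print("Std of the means:", sample_means.std(),
      "~ population std / 10:", population.std() / 10)     # sigma/sqrt(n), as the CLT predicts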

In the code I used in the book, the random distribution I'm considering is the distribution of the squared errors. I'm not making any assumption about it (I don't need to, as explained above). The stats.t.interval() code will give a confidence interval for the mean of the squared errors, i.e., the MSE. As you probably know given your question (but I'm writing this for other people who will read this thread), the interpretation of this interval is not trivial: it does not mean that there is a 95% chance that the true MSE (i.e., the generalization error) is within this interval. It means that if we were to sample many test sets and compute the confidence interval using this same procedure, then 95% of the time the true MSE would lie within the computed interval. It's a subtle but important difference (to learn more about this, research frequentism vs. Bayesianism).

Now, if the true MSE lies within an interval [A, B], then the true RMSE must lie within the interval [sqrt(A), sqrt(B)], since the RMSE is just the square root of the MSE. However, although the confidence interval [A, B] is centered on the true MSE on average (i.e., if we were to sample more and more test sets and compute the mean of the confidence interval centers, it would approach the true MSE), the interval [sqrt(A), sqrt(B)] is not centered on the true RMSE on average.

Below is a script that creates a population of 1 million values from an arbitrary distribution, as well as some predictions, also from an arbitrary (but different) distribution. Then it samples a test set of 100 instances from that population and computes the confidence interval using the code from the book and checks whether the true RMSE is inside this interval. Then it samples a new test set and does the same thing. And it keeps doing this 10,000 times, and counts the number of times that the confidence interval contained the true RMSE. It ends up very close to 95%, as expected. I also tried the same thing with the MSE, the MAE and just the mean error, and I always get ~95%.

import numpy as np
from scipy import stats

N = 10**6
y_true_pop = 5 * np.random.randn(N) ** 2 + 3 * np.log(100 * np.random.rand(N))
y_pred_pop = y_true_pop + 5 * np.cos(10 * np.random.randn(N)) + 5 * np.random.randn(N) - 0.5

# True metrics, computed over the whole population
true_mean_error = np.mean(y_pred_pop - y_true_pop)
true_mae = np.mean(np.abs(y_pred_pop - y_true_pop))
true_mse = np.mean((y_pred_pop - y_true_pop) ** 2)
true_rmse = np.sqrt(true_mse)

def sample(n_samples):
    # Sample a test set of n_samples instances from the population
    indices = np.random.randint(N, size=n_samples)
    y_true = y_true_pop[indices]
    y_pred = y_pred_pop[indices]
    return y_true, y_pred

def compute_interval(samples, confidence=0.95):
    return stats.t.interval(confidence,
                            len(samples) - 1,
                            loc=samples.mean(),
                            scale=stats.sem(samples))

def true_mse_in_interval(confidence=0.95, n_samples=100):
    y_true, y_pred = sample(n_samples)
    squared_errors = (y_pred - y_true) ** 2
    interval = compute_interval(squared_errors, confidence)
    return interval[0] < true_mse < interval[1]

def true_rmse_in_interval(confidence=0.95, n_samples=100):
    y_true, y_pred = sample(n_samples)
    squared_errors = (y_pred - y_true) ** 2
    # Square root of the MSE interval endpoints, as in the book
    interval = np.sqrt(compute_interval(squared_errors, confidence))
    return interval[0] < true_rmse < interval[1]

def true_mae_in_interval(confidence=0.95, n_samples=100):
    y_true, y_pred = sample(n_samples)
    absolute_errors = np.abs(y_pred - y_true)
    interval = compute_interval(absolute_errors, confidence)
    return interval[0] < true_mae < interval[1]

def true_mean_error_in_interval(confidence=0.95, n_samples=100):
    y_true, y_pred = sample(n_samples)
    errors = y_pred - y_true
    interval = compute_interval(errors, confidence)
    return interval[0] < true_mean_error < interval[1]

def estimate_proba_in_interval(func, confidence=0.95, n_samples=100,
                               n_experiments=10000):
    n_in_interval = 0
    for _ in range(n_experiments):
        if func(confidence=confidence, n_samples=n_samples):
            n_in_interval += 1
    return n_in_interval / n_experiments

print("Estimated proba that the MSE is in interval:",
      estimate_proba_in_interval(true_mse_in_interval))
print("Estimated proba that the RMSE is in interval:",
      estimate_proba_in_interval(true_rmse_in_interval))
print("Estimated proba that the MAE is in interval:",
      estimate_proba_in_interval(true_mae_in_interval))
print("Estimated proba that the Mean Error is in interval:",
      estimate_proba_in_interval(true_mean_error_in_interval))

The output is:

Estimated proba that the MSE is in interval: 0.9412
Estimated proba that the RMSE is in interval: 0.9369
Estimated proba that the MAE is in interval: 0.9506
Estimated proba that the Mean Error is in interval: 0.9435

This is the fraction of test sets for which the procedure produced an interval that contained the true metric (computed over the whole population).

I hope this is all clear?

nicholashub commented 5 years ago

Thanks! This use of the CLT is really a smart workaround to avoid having to find a conventional confidence interval for the RMSE! Also, I appreciate the code block for the empirical check!

Lawrenauh commented 5 years ago

Though I majored in statistics, I had been confused about this for a long time, and your explanation is really awesome! Thanks for the clear explanation and the great book~

Lawrenauh commented 5 years ago

Hi @ageron, I think there might still be a problem with the MSE's CI being computed by using scipy.stats.t.interval() directly, or at least it needs to be explained further.

As you explained clearly above, the sample MSE (the mean of the squared errors) will tend towards a Gaussian distribution according to the CLT. That's exactly right and really brilliant. But the next step, computing the CI, is doubtful, given the following fact:

when a r.v. X (not just the mean of X) follows a normal distribution with unknown population mean (denoted u) and standard deviation (std), we can use scipy.stats.t.interval() to compute a confidence interval for u, since sqrt(n)*(mean(X)-u)/S ~ t-distribution with df = n-1 (here S is the sample standard deviation), and we can even get a CI for std.
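For reference, stats.t.interval(confidence, df, loc=m, scale=sem) is just m ± t_{df, (1+confidence)/2} * sem, i.e., the classical t-interval built from the formula above. A quick check, with an arbitrary sample:

import numpy as np
from scipy import stats

x = np.random.randn(100) ** 2                      # an arbitrary sample
confidence, n = 0.95, len(x)
t_crit = stats.t.ppf((1 + confidence) / 2, n - 1)  # critical value, df = n - 1
manual = (x.mean() - t_crit * stats.sem(x),
          x.mean() + t_crit * stats.sem(x))
scipy_interval = stats.t.interval(confidence, n - 1, loc=x.mean(), scale=stats.sem(x))
print(manual)
print(scipy_interval)                              # identical up to floating-point error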

But here it is the mean of X that follows a normal distribution, not X itself, so we cannot use scipy.stats.t.interval() directly, right?

Maybe we can split the instance set into subsets (as long as the instances are sufficient), say n = m*k: k instances per subset and m subsets in total; if n is big enough, we can assume both m and k are big enough. But then the degrees of freedom change to m - 1, and the other parameters of scipy.stats.t.interval(), especially the scale, vary significantly. So I cannot be sure that using scipy.stats.t.interval() is right; it needs a more rigorous proof.
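A minimal sketch of that splitting idea (a hypothetical helper, just to make the proposal concrete): treat each subset's mean as one observation, then apply the t-interval with df = m - 1:

import numpy as np
from scipy import stats

def batch_means_interval(squared_errors, m, confidence=0.95):
    # Split the squared errors into m subsets, compute one MSE per subset,
    # and build the t-interval over those m batch means (df = m - 1).
    batch_mses = np.array([batch.mean()
                           for batch in np.array_split(squared_errors, m)])
    return stats.t.interval(confidence, m - 1,
                            loc=batch_mses.mean(),
                            scale=stats.sem(batch_mses))

Each batch mean is closer to Gaussian than an individual squared error, but only m observations remain, which is exactly the trade-off described above.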

Lawrenauh commented 5 years ago

@nicholashub It's true that the CLT doesn't require normality of the individual X, under some conditions. But that has nothing to do with my question here. What I mean is that the MSE being normal is, by itself, insufficient to justify using scipy.stats.t.interval() to compute a CI for the MSE, because @ageron uses the squared errors (not the MSE) as the sample instances for the loc and scale parameters of scipy.stats.t.interval(confidence, df, loc, scale). As per the fact I stated above, sqrt(n)*(mean(X)-u)/S ~ t-distribution (with df = n-1) holds when X itself follows a normal distribution (not just the mean of X). Maybe it's approximately OK to do that, but it needs a more rigorous proof of some kind.

Sorry for my poor English; it's a little roundabout.

nicholashub commented 5 years ago

@Lawrenauh Take a look at this. My understanding is that as long as you can show the sample mean is normal, the sample variance is chi-squared, and they are independent, the assumption of a normal population can be relaxed.

Lawrenauh commented 5 years ago

@Lawrenauh Take a look at this. My understanding is that as long as you can show the sample mean is normal, the sample variance is chi-squared, and they are independent, the assumption of a normal population can be relaxed.

You're right! But in this situation, how can you infer that the sample variance of X is chi-squared, given only that the mean of X is normal?

nicholashub commented 5 years ago

@Lawrenauh Slutsky's theorem.

Lawrenauh commented 5 years ago

@Lawrenauh Slutsky's theorem.

Whatever X's distribution... Maybe a rigorous proof? Thanks~

To prove: the sample variance of X is chi-squared, given only that the mean of X is normal.

@nicholashub It seems like you would need to prove that X(i) - mean(X) converges in distribution to a normal, whatever X's distribution. Intuitively, that is false.

nicholashub commented 5 years ago

@Lawrenauh

how can you infer that the sample variance of X is chi-squared

Precisely, you don't. To quote from Wikipedia,

Slutsky's theorem implies that the distribution of the sample variance has little effect on the distribution of the test statistic

My understanding is that, for the t-statistic, you can divide both the numerator and the denominator by sigma/sqrt(n). The numerator then converges in distribution to N(0, 1) by the CLT. The denominator becomes sqrt(S^2/sigma^2), and we know S^2 converges to sigma^2 in probability by the WLLN. By the continuous mapping theorem, the whole denominator therefore converges to the constant 1 in probability. Finally, the 3rd property in Slutsky's theorem lets you conclude that the t-statistic converges in distribution to N(0, 1). That is, even if the population is not normally distributed, the t-statistic can be used as if it were normal. Then why don't we use Z instead? You can, but I feel it's because the t-statistic and the normal variable are so close when df is large, and also, maybe for historical reasons, people are comfortable interchanging the t-statistic with the normal statistic when df is very large.
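In symbols (a sketch of the argument above, writing $\bar{X}_n$ for the sample mean, $S_n$ for the sample standard deviation, and $\mu$, $\sigma$ for the population mean and standard deviation):

$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} = \frac{(\bar{X}_n - \mu)\,/\,(\sigma/\sqrt{n})}{S_n/\sigma}, \qquad \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) \ \text{(CLT)}, \qquad \frac{S_n}{\sigma} = \sqrt{S_n^2/\sigma^2} \xrightarrow{p} 1 \ \text{(WLLN + continuous mapping)},$$

so by Slutsky's theorem the whole statistic converges in distribution to $N(0, 1)$.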

But of course, if your data is not large enough, neither Slutsky's theorem nor the CLT will help. Again, this is mentioned on Wikipedia:

If the data are substantially non-normal and the sample size is small, the t-test can give misleading results

Now, I think the real issue your question raises is actually the data size. @ageron said that for the test set

even just 100 instances is sufficient

But maybe more study (both empirical and theoretical) should focus on this, given all the approximations and convergence arguments we rely on.

Lawrenauh commented 5 years ago

My understanding is that, for the t-statistic, you can divide both the numerator and the denominator by sigma/sqrt(n). The numerator then converges in distribution to N(0, 1) by the CLT. The denominator becomes sqrt(S^2/sigma^2), and we know S^2 converges to sigma^2 in probability by the WLLN. By the continuous mapping theorem, the whole denominator therefore converges to the constant 1 in probability. Finally, the 3rd property in Slutsky's theorem lets you conclude that the t-statistic converges in distribution to N(0, 1). That is, even if the population is not normally distributed, the t-statistic can be used as if it were normal.

Your inference is convincing! A rigorous one, haha~~ The final conclusion is that the t-statistic converges in distribution to N(0,1), no matter what the population distribution is, and that's really useful~ That's all. There is no need to explain why people interchange the t-statistic and the normal statistic. People intuitively think the t-statistic should follow a t-distribution, so they use it to compute the CI without further thought. But here a normal distribution would be more convincing, @ageron ~ Thanks for your work, @nicholashub ~

ageron commented 5 years ago

Thanks to both of you, @nicholashub and @Lawrenauh, this is a very interesting thread! 👍 In conclusion, I think I'll add a comment in the book to say that the method to compute the confidence interval requires the test set to be large enough. In practice, I've found that ~100 instances was sufficient, but this probably depends on the task. The farther the distribution is from a Normal distribution, the larger the test set should be. Sounds good?
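That threshold can be checked empirically with the script earlier in this thread, simply by varying n_samples (reusing estimate_proba_in_interval() and true_rmse_in_interval() defined above):

# Reusing the functions defined in the script above
for n in (10, 30, 100, 300, 1000):
    print(n, "instances:",
          estimate_proba_in_interval(true_rmse_in_interval, n_samples=n))

Coverage drifting away from 95% for small n would indicate the test set is too small for this method.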

Lawrenauh commented 5 years ago

When there are enough instances, it makes little practical difference whether the t or the normal distribution is used to estimate the CI of the MSE. But I'm very curious about the relationship between the t-statistic and the t-distribution for arbitrary population distributions. @nicholashub has shown that the t-statistic converges in distribution to N(0,1), but few theoretical results go further in relating the t-statistic to the t-distribution. There should be more; it would be very meaningful.

Lawrenauh commented 5 years ago

More details about this topic: Confidence Intervals for the Mean of Non-normal Data

In this article, normality was the preferred assumption, and simulations at the end show how large the samples should be for it to be appropriate.

vasili111 commented 4 years ago

@ageron

I think it would be good to add this to the notebook:

I think I'll add a comment in the book to say that the method to compute the confidence interval requires the test set to be large enough. In practice, I've found that ~100 instances was sufficient, but this probably depends on the task. The farther the distribution is from a Normal distribution, the larger the test set should be.

I think it is important to know that. In a project that I am doing in parallel with the book, I have a small sample size and the CI is wide. Now I know the reason.

vasili111 commented 3 years ago

@ageron

In conclusion, I think I'll add a comment in the book to say that the method to compute the confidence interval requires the test set to be large enough. In practice, I've found that ~100 instances was sufficient, but this probably depends on the task. The farther the distribution is from a Normal distribution, the larger the test set should be.

How well does bootstrapping the data work here? For example, I have 30 instances in my test set. If I bootstrap that data to get more than 100 instances, will the CI calculated from that data be accurate? If yes, should I report both the estimate and the CI from the bootstrapped data, or should the estimate be computed on the actual data and the CI on the bootstrapped one?
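For later readers: below is a minimal sketch of the standard percentile bootstrap for the RMSE (a hypothetical helper, not from the book). Note that resampling does not create new information: with only 30 instances the interval will still be wide. In the usual formulation, the point estimate is computed once on the actual test set, and only the interval comes from the resamples.

import numpy as np

def bootstrap_rmse_interval(y_true, y_pred, confidence=0.95,
                            n_boot=10000, seed=42):
    # Resample (true, predicted) pairs with replacement, compute the RMSE
    # of each resample, and take percentile bounds as the interval.
    rng = np.random.RandomState(seed)
    n = len(y_true)
    rmses = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.randint(n, size=n)  # same size as the original test set
        rmses[i] = np.sqrt(np.mean((y_pred[idx] - y_true[idx]) ** 2))
    lower = np.percentile(rmses, 100 * (1 - confidence) / 2)
    upper = np.percentile(rmses, 100 * (1 + confidence) / 2)
    return lower, upper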