A few thoughts/questions:
1) Since the methods (3) and (4) are not asymptotically correct in the limit of large B, I don't think they should be considered for the rectification strategy. They may solve the numerical issue, but clearly introduce a significant bias in the value of rho.
2) Is the reason that (1) and (2) converge to the same asymptotic value because the total rho never falls out of bounds when B is very large, so we never actually shift the value during rectification?
3) Does examining the distribution of scores give us any useful information on a rectification strategy? For instance, if the mean score is 1.1 but the std of all the scores is 0.3, then it's reasonable the mean converges to something <1.0 as B increases. But if the mean is something like 10.0 with a std of 0.1, then snapping to 0.0 seems correct since something is clearly wrong with those values.
^^ All that being said, my feeling is that method (1) seems most appropriate in most cases, since it converges to a correct value in the large B limit, and doesn't throw away all the correlation information when the estimate is noisy.
I would be interested to see, though: does the estimate of rho generally exceed -1/+1 when the "true" rho is near the boundary? Or is the estimate just very noisy in general?
- Since the methods (3) and (4) are not asymptotically correct in the limit of large B, I don't think they should be considered for the rectification strategy. They may solve the numerical issue, but clearly introduce a significant bias in the value of rho.
Agreed. They were based on a guess about how the calculation worked, which turns out to be incorrect. Methods 3 and 4 are out.
- Is the reason that (1) and (2) converge to the same asymptotic value because the total rho never falls out of bounds when B is very large, so we never actually shift the value during rectification?
Yes. The jackknife-based calculations of variance and covariance never give unphysical results (variance < 0 or |correlation coefficient| > 1) on their own. Including the finite-B bias correction term makes the result more correct on average, but it also introduces noise and can yield unphysical results on some predictions. The bias correction term has a factor of 1/B, so it goes to 0 as the number of bags increases, and methods (1) and (2) become equivalent.
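To illustrate the scaling argument, here's a toy sketch in Python. The numbers and the form of the correction are made up for illustration and are not the actual estimator; the point is only that a physical raw estimate plus a noisy correction that shrinks like 1/B falls out of bounds often at small B and essentially never at large B:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration only: a physical raw estimate plus a noisy correction
# whose typical magnitude shrinks like 1/B. At small B the corrected rho
# often leaves [-1, 1]; at large B it essentially never does.
true_rho = 0.95
for B in (64, 128, 512, 2048):
    raw = np.full(10_000, true_rho)                      # always physical on its own
    correction = rng.normal(0.0, 20.0 / B, size=10_000)  # noise that vanishes as 1/B
    frac_unphysical = np.mean(np.abs(raw + correction) > 1.0)
    print(f"B = {B:5d}: fraction of unphysical rho = {frac_unphysical:.3f}")
```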
- Does examining the distribution of scores give us any useful information on a rectification strategy? For instance, if the mean score is 1.1 but the std of all the scores is 0.3, then it's reasonable the mean converges to something <1.0 as B increases. But if the mean is something like 10.0 with a std of 0.1, then snapping to 0.0 seems correct since something is clearly wrong with those values.
That's a fair question, so I did a little probing. I ran a trial with 128 training points, 128 test points, the Friedman-Silverman function, and quadratically correlated data. I did this for 128 bags and for 2048 bags (which is basically converged). The figure below plots rho with 2048 bags (converged) vs rho with 128 bags (noisy). The dashed line is y = x.
When rho_128 has a large absolute value, does this mean that the more correct value, rho_2048, is also likely to have a larger absolute value? I took the mean of |rho_2048| over all test points and over just those for which |rho_128| > 1. The results are 0.73 and 0.74; almost no difference. This means that if we estimate rho with 128 bags and get 1.7, that doesn't necessarily mean that we can confidently say the true value of rho is close to 1.
What about the spread of individual scores? Can we use that information to form an estimate of rho that is more knowledgeable than setting rho = 0? I think not, because the scores from each training datum vary widely. They're individual terms in a sum that are expected to cancel each other. They're not draws from an underlying distribution centered about the mean, so the spread of individual values doesn't tell us how close we likely are to the true value. To take an arbitrary prediction with 2048 bags as an example, it yields rho = 0.85. The mean of the terms in the sum is 0.85 / 128 = 0.0066. The standard deviation of these terms is 0.2, which is 3 times the mean. With 128 bags, it's typical for the standard deviation of the terms to be 20 times the mean.
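A minimal sketch of these two diagnostics in Python (the function names and array arguments are hypothetical stand-ins for the study outputs, not the harness's API):

```python
import numpy as np

def conditional_mean_abs(rho_converged: np.ndarray, rho_noisy: np.ndarray):
    """Mean |rho_converged| over all test points vs. over only the points
    where the noisy estimate fell out of bounds (|rho_noisy| > 1)."""
    out_of_bounds = np.abs(rho_noisy) > 1.0
    return (float(np.abs(rho_converged).mean()),
            float(np.abs(rho_converged[out_of_bounds]).mean()))

def term_spread(terms: np.ndarray):
    """Mean, standard deviation, and std/|mean| ratio of the per-training-point
    terms whose sum gives rho for a single prediction."""
    mean, std = float(terms.mean()), float(terms.std())
    return mean, std, std / abs(mean)
```

These correspond to the numbers quoted above: 0.73 vs 0.74 for the conditional means, and std/mean ratios of roughly 3 at 2048 bags and roughly 20 at 128 bags.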
- ^^ All that being said, my feeling is that method (1) seems most appropriate in most cases, since it converges to a correct value in the large B limit, and doesn't throw away all the correlation information when the estimate is noisy.
I disagree. Method 1 may converge eventually, but it's dramatically wrong if B is not large. Setting rho = +/- 0.999 is just a bad idea. It concentrates the probability density along a narrow band, and hence is a statement of high certainty that is not warranted. Setting rho = 0 may throw away information, but it also converges to the correct value and is a safer, more accurate bet in general.
If the estimate of rho is frequently out of bounds, it just means that more bags are needed. There might be a better heuristic that we can come up with, but given the above analysis it won't be obvious. And since the jackknife method seems to underperform anyway, I don't think it's worth spending more time on.
As part of using the jackknife approach to calculate covariance, it is possible that the noisy bias correction term leads to an unphysical result. Specifically, it can predict a correlation coefficient (rho) that is < -1 or > 1, both of which are nonsensical and cause numerical issues (we cannot draw from the resulting distribution). This is similar to how the estimates of variance can yield a negative number because of sampling noise, which we solve through a heuristic rectification procedure.
I studied 4 ways to resolve this issue.
These four approaches can be seen here. I used the correlations study harness to evaluate the effect of these four methods. The figures below can be reproduced by running `main`.
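For reference, here is a minimal sketch of what the two surviving strategies amount to, as I read them from the discussion above: method (1) clips an out-of-bounds estimate to +/- 0.999 and method (2) snaps it to 0. The function name and signature are illustrative only, not the implementation linked above.

```python
def rectify_rho(rho: float, method: int, bound: float = 0.999) -> float:
    """Rectify an estimated correlation coefficient that falls outside [-1, 1].

    Method 1: clip the estimate to [-bound, +bound] (keeps the sign of the correlation).
    Method 2: snap the estimate to 0.0 (discards the correlation information).
    Methods 3 and 4, which appear to bound individual terms of the sum
    (see the footnote below), are ruled out above.
    """
    if abs(rho) <= 1.0:
        return rho  # already physical; leave it alone
    if method == 1:
        return bound if rho > 0 else -bound
    if method == 2:
        return 0.0
    raise ValueError(f"unsupported rectification method: {method}")
```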
For each figure below, the underlying data are generated from the Friedman-Silverman function. In addition to the output of the function, a second set of data is generated that has a linear correlation coefficient of 0.9 with the outputs, and a third set that has a quadratic relationship with the function output. Finally, normally distributed noise with scale 1.0 is added to all data. There are 128 test and 128 train points. The number of bags is varied, and we plot either the negative log probability density (NLPD) or the standard confidence, calculated over the test data.
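A sketch of this data-generation step (Python; the Friedman-Silverman form is written from memory, and the construction of the correlated columns is one plausible reading of the description rather than the harness's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def friedman_silverman(x: np.ndarray) -> np.ndarray:
    """Friedman-Silverman test function on points in [0, 1]^d, d >= 5 (extra dims inert)."""
    return (0.1 * np.exp(4.0 * x[:, 0])
            + 4.0 / (1.0 + np.exp(-20.0 * (x[:, 1] - 0.5)))
            + 3.0 * x[:, 2] + 2.0 * x[:, 3] + x[:, 4])

def make_data(n: int, d: int = 5, noise: float = 1.0):
    """Inputs plus three outputs: the function value, a linearly correlated
    companion (corr ~0.9 before noise), and a quadratically related companion."""
    x = rng.uniform(size=(n, d))
    y = friedman_silverman(x)
    z = (y - y.mean()) / y.std()
    linear = 0.9 * z + np.sqrt(1.0 - 0.9 ** 2) * rng.standard_normal(n)
    quadratic = y ** 2

    def add_noise(v):
        # normally distributed noise with scale 1.0 added to all outputs
        return v + rng.normal(0.0, noise, n)

    return x, add_noise(y), add_noise(linear), add_noise(quadratic)

x_train, *y_train = make_data(128)
x_test, *y_test = make_data(128)
```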
As the number of bags increases, the bias correction term vanishes and both methods 1 and 2 should approach the same, "correct" answer. The ideal method will approach this limit as quickly as possible. We find that method 2 performs best, and that's what we already have implemented, so we keep it.
Here is the NLPD when estimating the correlation coefficient for predictions made on the output and the linearly correlated data. As expected, methods 1 and 2 approach the same limiting value. But method 2 performs well even when the number of bags is small. Methods 3 and 4 are less accurate.
The result is similar when estimating the correlation coefficient between the output and the quadratically correlated data.
If we plot standard confidence, which should be 68%, we see that method 2 underestimates and methods 3 and 4 overestimate. Method 1 approaches the same value as method 2, but again requires too many bags.
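To be explicit about the two metrics, here's how I'd compute them (Python sketch; this assumes NLPD is the mean negative log density of each observed pair under the predicted bivariate normal, and standard confidence is the fraction of observations within one predicted standard deviation — the harness's exact definitions may differ):

```python
import numpy as np
from scipy.stats import multivariate_normal

def nlpd_bivariate(y_pairs, means, sigmas, rhos) -> float:
    """Mean negative log probability density of observed (y1, y2) pairs under the
    predicted bivariate normals with per-prediction correlation coefficient rho."""
    total = 0.0
    for (y1, y2), (m1, m2), (s1, s2), r in zip(y_pairs, means, sigmas, rhos):
        cov = [[s1 * s1, r * s1 * s2], [r * s1 * s2, s2 * s2]]
        total -= multivariate_normal(mean=[m1, m2], cov=cov).logpdf([y1, y2])
    return total / len(y_pairs)

def standard_confidence(y_true, means, sigmas) -> float:
    """Fraction of observations within one predicted standard deviation
    (about 0.68 for well-calibrated uncertainties)."""
    y_true, means, sigmas = map(np.asarray, (y_true, means, sigmas))
    return float(np.mean(np.abs(y_true - means) <= sigmas))
```

This also makes the objection to method (1) concrete: with rho = 0.999 the bivariate density is concentrated along a narrow band, so an observed pair sitting even slightly off that band incurs an enormous NLPD, whereas rho = 0 penalizes it only mildly.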
The above results were all on prediction intervals, which incorporate noise. If we consider confidence intervals, which attempt to capture the true value, the jackknife variance estimate doesn't perform as well, and this shows through in the analysis of the estimated correlation coefficient. Methods 3 and 4 have a lower NLPD value, but I believe that is due to cancellation of errors. Method 2 is still better than method 1 (its NLPD is lower and it's closer to the limiting value).
*I set the allowed range to [-0.999, 0.999] instead of [-1, 1]. Although correlation values of +/- 1 are legitimate, they lead to degeneracies and numerical issues. And for all practical purposes, 0.999 is the same as 1.0 in this case.

*If the two outputs have variance `V_x` and `V_y`, then the covariance cannot exceed `sqrt(V_x * V_y)` in magnitude. Since `N` terms are summed together, we might expect to bound the magnitude of each term by `sqrt(V_x * V_y)/N`. But since methods 3 and 4 don't approach the correct limiting values, I conclude that it is expected and correct for some individual terms to be larger.
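To spell out the reasoning in that last footnote: the Cauchy-Schwarz bound constrains the sum, not its individual terms, so clamping term by term changes the answer.

```latex
\bigl|\operatorname{Cov}(X,Y)\bigr| = \Bigl|\sum_{i=1}^{N} c_i\Bigr| \le \sqrt{V_x V_y}
\quad\not\Rightarrow\quad
|c_i| \le \frac{\sqrt{V_x V_y}}{N} \quad \text{for each } i
```

Terms of opposite sign cancel, so individual terms can legitimately exceed `sqrt(V_x * V_y)/N`, which is why bounding them (methods 3 and 4) biases the total.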