Closed JellePiepenbrock closed 6 years ago
This answer is by Guy Wolf:
First, to see why this blog post is misleading, let's consider for a moment a uniformly sampled n-dimensional ball of radius 2R. I assume we can all visualize it in our mind, and clearly it has constant density everywhere in its interior, so there are no "concentrations" of density. Given many, many points sampled from this ball, the number of points in a given subregion is simply the volume of that region multiplied by the constant density, which we will take to be 1 for convenience. Now consider two regions: the central ball of radius R, with volume V_inner, and the surrounding shell (the full ball minus the central one), with volume V_outer. It is not difficult to verify that V_outer / V_inner = 2^n - 1, so very quickly as we increase n, only a negligible fraction of the points will lie in the inner region.

The same argument can be made with radius (1 + epsilon) R, where the outer shell has "width" epsilon*R for small epsilon and the inner ball has radius R; the ratio is then (1 + epsilon)^n - 1, which still grows exponentially with the dimension n. So by that argument we could claim that uniform balls also look like hollow soap bubbles, which hopefully we can all agree is nonsense.

The main failure of the argument stems from the histogram in the blog (and all related arguments) being phrased as a function of a norm - i.e., the distance from the center - rather than as a density. Ten points right at the center would be significantly denser than 100 points spread over the circumference of a unit circle, 1000 over the surface of a sphere, etc., even though the average distance would clearly be pulled more and more towards the larger number of points on the surface.
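The uniform-ball argument is easy to check numerically. A minimal NumPy sketch (the sampling trick - Gaussian direction times a radius drawn as U^(1/n) - is a standard way to sample a ball uniformly; the dimensions and sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ball(num, dim, radius=2.0):
    """Uniformly sample points inside an n-dimensional ball of given radius."""
    directions = rng.standard_normal((num, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # U^(1/n) gives the radial CDF of a uniform ball.
    radii = radius * rng.uniform(size=(num, 1)) ** (1.0 / dim)
    return directions * radii

fractions = {}
for dim in (2, 10, 50):
    pts = sample_ball(100_000, dim)
    # Fraction of points falling in the inner ball of radius R = 1.
    fractions[dim] = np.mean(np.linalg.norm(pts, axis=1) < 1.0)
    # Theory: V_inner / V_total = (R / 2R)^n = 2^-n, i.e. almost nothing
    # lies in the inner ball for large n - yet the density is constant.
    print(dim, fractions[dim], 2.0 ** -dim)
```

The norms concentrate near the outer radius exactly as in the blog's histogram, even though the density is perfectly flat - which is the point of the counterexample above.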
Now, the argument about interpolation is not really motivated well in the blog, but essentially the interpolation there is more about maintaining the variance in the data than anything else. This could indeed be useful for generative purposes where, say, we want to double the amount of samples we have via a VAE. So, we learn an encoder that maps N images to N samples from a standard zero-mean Gaussian in a good way that can be decoded (with added noise) to give reasonable images. Then, just linearly interpolating N additional points between them would estimate a distribution with zero mean but a smaller variance, which would introduce some bias in the decoded distribution. Doing so in polar coordinates would maintain the same variance and get something closer to the learned internal Gaussian distribution, and therefore provide a better decoded distribution. Notice, however, that this is very far from what MAGIC does, as we explain next.
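The variance effect is easy to see numerically. A small sketch (the `slerp` helper is a standard spherical interpolation formula, not anything from the blog; the dimension 512 is an arbitrary stand-in for a VAE latent space):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

# Two independent latent samples from a standard Gaussian.
a = rng.standard_normal(dim)
b = rng.standard_normal(dim)

# Linear midpoint: for independent Gaussians its norm shrinks by ~1/sqrt(2),
# pulling interpolants toward the (low-density!) center.
lerp_mid = 0.5 * (a + b)

def slerp(p, q, t):
    """Spherical ('polar') interpolation: rotate between p and q,
    keeping the interpolant's norm near the typical sample norm."""
    cos_omega = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    return (np.sin((1 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

slerp_mid = slerp(a, b, 0.5)

# Typical Gaussian sample norm is ~sqrt(dim); lerp undershoots it, slerp does not.
print(np.linalg.norm(a), np.linalg.norm(lerp_mid), np.linalg.norm(slerp_mid))
```

The linear midpoint lands at norm roughly sqrt(dim/2) instead of sqrt(dim), which is exactly the variance shrinkage described above.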
As for any supposed connection with MAGIC - MAGIC does not assume the data is sampled from a Gaussian distribution, does not generate or interpolate new data points (rather, it corrects sampled values in the input data points), and does not try to preserve a Gaussian variance of the data. The manifold assumption in MAGIC means that we assume that within local regions the data is concentrated on low-dimensional patches, and high-dimensional deviations from these patches are noise that we want to remove. Therefore, we build a data-driven low-pass filter that removes the high-dimensional noise and pushes the data toward a manifold of such locally low-dimensional patches. Further, we do not assume the noise is Gaussian (or any parametric noise model), but rather that there is sufficient density of data points within local regions to (implicitly) identify and characterize tangent spaces of said manifold. The use of normalized Gaussian neighborhoods is convenient for various mathematical reasons (e.g., like Gaussian filters used in signal processing), but they do not serve as distributions for random sampling or as a model for the noise or the distribution (even locally) of the data.
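To make the "data-driven low-pass filter" idea concrete, here is a minimal sketch of diffusion-based smoothing in the spirit of MAGIC - not MAGIC's actual implementation (which uses adaptive kernels, a decay parameter, and more); the kernel width `sigma`, diffusion time `t`, and the toy noisy-line data are made-up illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_smooth(X, sigma=0.5, t=3):
    """Sketch of a data-driven low-pass filter: a row-normalized Gaussian
    affinity matrix raised to power t averages each point's values over its
    data-graph neighborhood. No new points are generated; existing values
    are corrected toward the local low-dimensional structure."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
    K = np.exp(-d2 / (2 * sigma ** 2))                    # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)                  # Markov normalization
    return np.linalg.matrix_power(P, t) @ X               # diffuse measured values

# Toy example: a 1-D manifold (a line) embedded in 2-D, plus isotropic noise.
line = np.linspace(0, 10, 200)
X = np.stack([line, 0.5 * line], axis=1) + 0.1 * rng.standard_normal((200, 2))
X_smooth = diffusion_smooth(X)
```

After smoothing, the points sit much closer to the underlying line: the perpendicular (off-manifold) deviations shrink, while the along-line coordinates are largely preserved - which is the sense in which the filter "pushes the data to the manifold."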
How does MAGIC take into account the fact that most of the density of a high-dimensional volume is concentrated near the surface? Say that for log-transformed scRNA-seq data, you have a ~20,000-dimensional Gaussian (assuming for now that the log-transform is applicable).
I'm asking after reading this blog post:
http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
In which especially the 'Interpolation' part is relevant. If I understand MAGIC correctly, every cell essentially becomes the Euclidean mean of its neighbors (a distance-weighted average). I'm wondering whether the properties of high-dimensional volumes have any bearing on how MAGIC does this. See the example with the polar interpolation in the link above.
Thanks in advance,
Jelle