distillpub-reviewers opened this issue 6 years ago
Thank you very much for your detailed review! We are sorry for our late response - we agree with your feedback and will make changes to the article to reflect this. Currently, we are waiting for the other reviews before making major changes. Many of your specific comments should be straightforward to fix.
[major] To speak more broadly on this point, it is worth mentioning that GP regression is typically done in conjunction with a noise model. This allows the posterior some flexibility such that it does not need to pass through every observation (indeed, this constraint makes GPs nearly unusable). This addition requires very little extension to the GP framework and should be discussed explicitly.
[minor] I do think this is what the authors seem to be getting at in the conclusion, at the line "If we need to deal with real-world data, we will often find measurements that are afflicted with uncertainty and errors." The cited article, however, seems to be more sophisticated than what is needed to address the basic problem.
As you have mentioned, the noise model for the input is a very important aspect of Gaussian processes, since it is what makes them practical. We agree that this is not covered adequately in the current version, and we plan to add a paragraph on it, together with a corresponding figure.
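For readers following the thread: the standard extension the reviewer describes is to add a diagonal noise term to the training covariance, which lets the posterior mean pass near, rather than exactly through, the observations. A minimal NumPy sketch of this idea (not the article's JavaScript implementation; the kernel and function names here are illustrative):

```python
import numpy as np

def rbf(x1, x2, length_scale=1.0):
    # Squared-exponential kernel between two 1-D point sets.
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    # Observation noise enters only as noise_var * I on the training block.
    K = rbf(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf(X_train, X_test)
    K_ss = rbf(X_test, X_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.5])
mean, cov = gp_posterior(X, y, X, noise_var=0.1)
# With noise_var > 0 the posterior mean no longer interpolates the data exactly.
print(np.max(np.abs(mean - y)) > 1e-6)  # True
```

Setting `noise_var` to zero recovers the noiseless, interpolating posterior discussed in the article.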
I think the article could benefit from a hero figure that is a true “GP sandbox”. It would be nice to be able to combine kernels, change the parameters of the kernels, and add/remove data points in real time, all in one place.
This sounds like a really interesting idea! Ideally, we would want to add an additional example that is more tangible (e.g. temperature data) and also has a larger number of training points. In that regard, we need to see what is possible with our current JavaScript implementation.
PR #2 addresses some of the comments from R1:
[minor] based on prior knowledge -> incorporating prior knowledge
[minor] The statement [Gaussian processes assigns a probability to each of those functions] needs a citation.
[typo] symmetric -> symmetric
[minor] In the section “kernels”, L2-norm should be L2-distance
I will make additional changes in the next couple of days.
Commit 48aa6441904f6e82150924f970ff999296048960 fixes the following:
[nit] I am not a fan of the notation P(X,Y) used to denote the joint distribution of X and Y. This appears to be interchangeable with p_{X,Y}, used elsewhere in the article to refer to the joint density. These two notations refer to essentially the same thing.
Commit 5752d96144c8535c7d564367389abb59de0f07de fixes the following:
[nit] Though it is clear from context what the article means, the article needs to be careful about distinguishing between probabilities and probability densities. e.g. in statements like “we are interested in the probability of X=x”
[major] To speak more broadly on this point, it is worth mentioning that GP regression is typically done in conjunction with a noise model. This allows the posterior some flexibility such that it does not need to pass through every observation (indeed, this constraint makes GPs nearly unusable). This addition requires very little extension to the GP framework and should be discussed explicitly.
[minor] I do think this is what the authors seem to be getting at in the conclusion, at the line "If we need to deal with real-world data, we will often find measurements that are afflicted with uncertainty and errors." The cited article, however, seems to be more sophisticated than what is needed to address the basic problem.
PR #18 adds more information about extending Gaussian processes to input noise.
[minor] Periodic kernels are not invariant to translations but invariant to translations equal to the period of the function.
[major] Kernels can be combined by addition, multiplication and concatenation (see Chapter 5.4 of Mackay, Information Theory, Inference, and Learning Algorithms). The omission of addition as a way of combining kernels is surprising.
[minor] Furthermore, it would be nice to see an explanation of what it means when two kernels are combined. The resulting kernels inherit the qualitative properties of its constituents, but how, and why?
PR #11 addresses the above issues, adds additional information on kernels, and includes a new static figure.
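As a rough illustration of why addition and multiplication are both valid ways to combine kernels: each operation preserves the symmetry and positive semi-definiteness of the Gram matrix, so the result is again a kernel whose samples inherit structure from both constituents. A small NumPy sketch, with illustrative kernel definitions rather than the article's code:

```python
import numpy as np

def rbf(x1, x2, length_scale=1.0):
    # Squared-exponential kernel: smooth local variation.
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def periodic(x1, x2, period=1.0, length_scale=1.0):
    # Exponential-sine-squared kernel: repeating structure.
    d = x1[:, None] - x2[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

x = np.linspace(0.0, 4.0, 50)
K_sum = rbf(x, x) + periodic(x, x)    # addition: either structure may appear
K_prod = rbf(x, x) * periodic(x, x)   # elementwise product: both structures combine

# Both combinations stay symmetric and positive semi-definite (up to
# floating-point error), so we can still sample functions from the prior.
for K in (K_sum, K_prod):
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-8)
```

A product of an RBF and a periodic kernel, for instance, yields locally periodic functions whose repeating pattern can drift, which is one way to make the "inheritance" of qualitative properties concrete.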
The following peer review was solicited as part of the Distill review process.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to review this article.
This article was a delight to read, and hits many of the right notes in what I think makes a good Distill article:
What is unclear to me, however, is whether this article breaks new ground in exposition. A lot of space, for example, is devoted to exposition of basic properties of the multivariate Gaussian distribution. Many of the figures correspond to visualizations that I already have in my head (though this is probably a good thing!), and the depth of the material on GPs themselves is not beyond what is covered in introductory courses. What I’m saying is that a reader already familiar with GPs is unlikely to take away anything from this. A new reader, however, who has not developed this visual “muscle”, would likely take away a lot. I will thus leave it to the editors to decide if Distill is an appropriate venue for such an article.
Personally, however, I think it is a solid piece of writing and thus I recommend it for publication.
Specific comments
[minor] based on prior knowledge -> incorporating prior knowledge
[minor] The statement [Gaussian processes assigns a probability to each of those functions] needs a citation.
[typo] symmetric -> symmetric
[nit] I am not a fan of the notation P(X,Y) used to denote the joint distribution of X and Y. This appears to be interchangeable with p_{X,Y}, used elsewhere in the article to refer to the joint density. These two notations refer to essentially the same thing.
[nit] Though it is clear from context what the article means, the article needs to be careful about distinguishing between probabilities and probability densities. e.g. in statements like “we are interested in the probability of X=x”
[minor] In the section “kernels”, L2-norm should be L2-distance
[minor] Periodic kernels are not invariant to translations but invariant to translations equal to the period of the function.
[major] Kernels can be combined by addition, multiplication and concatenation (see Chapter 5.4 of Mackay, Information Theory, Inference, and Learning Algorithms). The omission of addition as a way of combining kernels is surprising.
[minor] Furthermore, it would be nice to see an explanation of what it means when two kernels are combined. The resulting kernels inherit the qualitative properties of its constituents, but how, and why?
[major] There seems to be a problem in the figure right above “Conclusion”, when the linear kernel is used alone. The linear kernel has rank 2, and thus the distribution after conditioning is not well defined if there are more than 2 points observed. The widget, however, does not seem to complain when more than two points are added. What is going on? I suspect that the regression has been done with a noise model, i.e. with a small multiple of the identity added to fix the singularity. This is completely valid, but it should be discussed explicitly.
[major] To speak more broadly on this point, it is worth mentioning that GP regression is typically done in conjunction with a noise model. This allows the posterior some flexibility such that it does not need to pass through every observation (indeed, this constraint makes GPs nearly unusable). This addition requires very little extension to the GP framework and should be discussed explicitly.
[minor] I do think this is what the authors seem to be getting at in the conclusion, at the line "If we need to deal with real-world data, we will often find measurements that are afflicted with uncertainty and errors." The cited article, however, seems to be more sophisticated than what is needed to address the basic problem.
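The linear-kernel singularity described above is easy to reproduce directly: the Gram matrix of a linear kernel with a bias term has rank at most 2, so exact conditioning on more than two observations is undefined, and a small noise/jitter term is what restores invertibility. A brief NumPy check (my own sketch; I have not inspected the widget's code):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # four observed input locations
K_lin = 1.0 + np.outer(x, x)         # linear kernel with bias: k(a, b) = 1 + a*b

# Each entry is a dot product of 2-D features [1, x_i], so rank <= 2.
print(np.linalg.matrix_rank(K_lin))  # 2

# Adding a small multiple of the identity (a noise model) fixes the
# singularity, which is presumably what the widget does implicitly.
K_noisy = K_lin + 1e-6 * np.eye(len(x))
print(np.linalg.matrix_rank(K_noisy))  # 4
```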
A final thought
I think the article could benefit from a hero figure that is a true “GP sandbox”. It would be nice to be able to combine kernels, change the parameters of the kernels, and add/remove data points in real time, all in one place.
Distill employs a reviewer worksheet as a help for reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanation of existing results