distillpub / post--understanding-gnns

Official repository for our Distill.pub submission: 'Understanding Convolutions on Graphs'.
https://ameya98.github.io/exploring-graph-nns/

Peer Review #2

Open · distillpub-reviewers opened 3 years ago

distillpub-reviewers commented 3 years ago

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Nick Moran for taking the time to review this article.


General Comments

Graph networks are a growing subfield, and definitely a topic that needs a clear introduction. Many existing introductions are overly technical and difficult for a newcomer to parse. While I do have some complaints about this article as it stands, I also think it has a great deal of potential, and can be an extremely valuable contribution to Distill and the community at large with some adjustments. In my review form, I’ve marked a few areas as a 3, but I really feel that there is a clear path to improvement in these areas and with a bit of revision, I think this article will be excellent.

Overall, I think this is a good introduction to graph neural networks, but I do feel there are some areas where the article could be made more approachable for readers. In particular, drawing stronger parallels to convolutional networks early in the article would greatly improve approachability for readers who are familiar with CNNs but new to graph networks, perhaps accompanied by visualizations showing how a pixel grid can be interpreted as a graph and how common graph network operations behave when applied to that familiar pixel grid. Even readers who are not as familiar with CNNs would likely benefit from seeing examples on a familiar grid structure with a concrete application in mind.

I think the section on spectral convolution is a big jump in difficulty from the rest of the article, and will probably cause a fair number of readers to tune out. Certainly this is the most mathematically advanced section, and so is going to be the most difficult to explain in an approachable way, but I think there is at least some room for improvement. Drawing comparisons to the image Laplacian may help readers who are new to the concept, as the image Laplacian is much easier to visualize.

I think the Game of Life experiments are interesting, but more work is needed to clearly communicate to the reader what conclusions should be drawn. There is also an opportunity to drill a bit deeper into the learned weights and look at how similar or different the resulting CNNs and GCNs are.

The following are specific comments to sections of the article. Some are low-level suggestions for sentence re-wording, and others are higher level suggestions about article structure.

“However, neural networks have been traditionally used to operate on fixed-size and/or regular-structured inputs (such as sentences, images and video). This makes them unable to elegantly process graph-structured data.”

It may be worth being more explicit about what properties of traditional network architectures make them unsuited for arbitrary graph structures. This is answered in the next section, but this sentence sort of leaves the reader hanging and implies that the reasons are obvious.

“GNNs can make more informed predictions about individual entities in these interactions.”

More informed than what?

“With this lack of inherent order, we make an adhoc choice of ordering. For this reason, we must ensure our algorithms are invariant to our choice of ordering of the nodes. Otherwise, choosing a different ordering of nodes would lead to different predictions.”

It’s not totally clear that this should be a deal-breaker - after all, even image classification networks are not invariant to different views of the same object. It may be helpful to the reader to expound a bit more on the problems that come from not having an inherent ordering.

Another question a reader might be left with is why, even if there is no inherent ordering, we cannot easily construct a canonical ordering. I’m not sure how easy this difficulty is to explain, though. It is probably not worth getting into a graph isomorphism rabbit hole.

“Such methods will have their number of parameters independent of the size of the graph.”

This sentence is a little stilted and could be rephrased.

“See Learning GNN Parameters for how the learnt embeddings can be used for these tasks!”

The link in this sentence seems to suggest to the reader that they should jump ahead to that section, but they will likely be lost if they do so. It might be worth rephrasing to something like “In a later section, we will see how these embeddings can be used to perform higher level tasks”.

“Nodes can have individual features”

It may be worth giving an example of what a node feature is. Are these features part of the input graph, or are they something we compute, or both? Someone coming from CNNs may expect the word “feature” to indicate a computed descriptor rather than a raw input.

“In case some nodes lack features, x_v is set as some predefined constant.”

I think this sentence adds more confusion than clarity - we don’t even know what a node feature is yet, and so it’s not clear why a node would lack features. It may be better to just explain this convention later if it becomes important for a particular example.

“For an arbitrary matrix (or vector) M, M_v denotes the row corresponding to vertex v wherever applicable.”

This sort of comes out of left field. It may be better to say something like, “Sometimes we will wish to describe a property of a graph using a matrix, where each row represents a particular vertex.” so that the reader knows why we have this notation.

“Extending Convolutions to Graphs”

I think it may be helpful to begin this section by drawing a pixel grid as a graph, with RGB values as node features. This would give a reader who is familiar with CNNs but unfamiliar with graph networks a clear bridge between the two. It would also help illustrate why regular convolution does not trivially carry over. (Something like, “How can we construct convolution filters when each ‘pixel’ can have a different number of neighbors in a different arrangement?”)
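To make that concrete, something like the following toy construction could accompany the figure (a minimal NumPy sketch; the function name and the 4-neighbour connectivity are my own choices, not anything from the article):

```python
import numpy as np

# Hypothetical sketch: an H x W RGB image viewed as a graph.
# Each pixel becomes a node whose feature is its RGB value, and
# edges connect 4-neighbouring pixels.
def grid_to_graph(image):
    """image: (H, W, 3) array -> node features (H*W, 3), adjacency (H*W, H*W)."""
    H, W, _ = image.shape
    features = image.reshape(H * W, 3)
    adjacency = np.zeros((H * W, H * W))
    for r in range(H):
        for c in range(W):
            v = r * W + c
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                rr, cc = r + dr, c + dc
                if 0 <= rr < H and 0 <= cc < W:
                    adjacency[v, rr * W + cc] = 1.0
    return features, adjacency

features, adjacency = grid_to_graph(np.random.rand(4, 4, 3))
print(features.shape, adjacency.shape)   # (16, 3) (16, 16)
```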

“The Graph Laplacian”

I think this section is going to hit the reader like a ton of bricks. So far everything has been very gentle, and this section just starts constructing a weird matrix and doing all sorts of weird multiplications with it. I think many readers will be lost here.

One approach to soften this transition may be to begin with a review of the image laplacian. Describe how it detects smoothness vs edges in an image, and explain that we would like to do the same for a graph. This would make it clear why we’re constructing L the way we are, and why this is likely to be a useful operator. As it is now, the reader has to slog through a bunch of intimidating math before they get the payoff of seeing why.

You could even move the section about Fourier analysis up a bit, and show a low frequency basis image and note that it is very smooth, as compared to a high frequency basis image that is not smooth. This would help give the reader a visual understanding of the sort of property we’re looking for.
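For what it's worth, the construction itself can also be made concrete numerically. Here is a minimal NumPy sketch of the unnormalized Laplacian L = D - A on a toy path graph (my own example; the article may use a normalized variant), showing why x^T L x behaves as a smoothness measure:

```python
import numpy as np

# Toy path graph 0 - 1 - 2 - 3, unnormalised Laplacian L = D - A.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# x^T L x sums (x_u - x_v)^2 over edges, so it is small for smooth signals
# (neighbouring nodes have similar values) and large for non-smooth ones.
x_smooth = np.array([1.0, 1.1, 1.2, 1.3])
x_rough = np.array([1.0, -1.0, 1.0, -1.0])
print(x_smooth @ L @ x_smooth)   # ~0.03
print(x_rough @ L @ x_rough)     # 12.0
```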

“It turns out that these eigenvectors of L are successively less smooth, as R_L indicates:”

I think this notion of smoothness may be confusing - we previously associated smoothness with nodes, and it was intuitively clear what that meant - nodes that have similar feature descriptors to their neighbors. I think readers who bought into this intuitive understanding may have trouble understanding what it means for an eigenvector to be smooth. A bit of hand-holding to explain what these eigenvectors represent and what it means for them to be smooth may be helpful.
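Concretely, each eigenvector is itself a signal with one value per node, and (assuming R_L is the usual Rayleigh quotient x^T L x / x^T x) its R_L value equals its eigenvalue, which is exactly the sense in which later eigenvectors are less smooth. A small sketch on the same toy path graph:

```python
import numpy as np

# Same toy path graph 0 - 1 - 2 - 3 as before.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(L)   # columns of eigvecs are eigenvectors

def rayleigh(L, x):
    return (x @ L @ x) / (x @ x)

# Each eigenvector assigns one value per node; its Rayleigh quotient equals its
# eigenvalue, so eigenvectors with larger eigenvalues are successively less smooth.
for lam, u in zip(eigvals, eigvecs.T):
    print(round(lam, 2), round(rayleigh(L, u), 2))
```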

(Eigenvector interactive)

When hitting ‘reset’, the sliders for the x-hat values do not return to the zero point. It may be useful to draw the image for x-hat_i = 1 (all others = 0) above each slider so that the reader can see the different components that are being intermixed. It might also be useful to show their R_L values so we can see that the earlier ones are smoother.

“Convert current filter w(k) to its spectral representation w^(k).”

I think this is the first we’re hearing about filters. There should probably be some explanation that says that just like we try to learn a set of filters in a conv net, we’re going to be learning a set of filters for our graph net, and all this work we did with the laplacian and spectral decomposition is going to show us how to perform a convolution with those filters. It may also be worth describing what a filter looks like for a graph. Image convolution filters are easy to think about visually - I just have a 3x3 (or whatever size) little block, but I think many readers will have trouble internalizing the concept of a filter for an arbitrary graph.
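To illustrate the kind of bridge I mean, a bare-bones sketch of a spectral convolution on a toy graph might help (my own example; the article's exact formulation and normalization may differ): the "filter" is just a vector of coefficients applied elementwise in the Laplacian's eigenbasis.

```python
import numpy as np

# Bare-bones spectral convolution on the toy path graph: transform the node
# signal into the Laplacian's eigenbasis, scale each spectral component by a
# (learnable) filter coefficient, and transform back to the node domain.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)                 # columns of U form the spectral basis

x = np.array([0.2, 1.0, 0.5, -0.3])      # one scalar feature per node
w_hat = np.array([1.0, 0.5, 0.1, 0.0])   # the "filter": one weight per spectral component

x_hat = U.T @ x                          # node signal -> spectral representation
y = U @ (w_hat * x_hat)                  # filter in the spectral domain, transform back
print(y)
```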

“ChebNet takes this idea even further, by instead looking at polynomials of the form:”

I’m not sure whether this section on ChebNet is necessary. There’s already a lot to take in, and the description of ChebNet is pretty high level. I think it’s likely to leave readers confused. It might be worth cutting it down significantly, and just saying something like “Further improvements on this basic approach are made by techniques such as ChebNets which offer the following benefits…”

“Embedding Computation”

The color coding in the figure describing the four different architectures is really difficult for a colorblind reader.

It may be unclear to the reader that the text after the diagram also changes when you flip to different architectures. Depending on screen size, the text may not be visible when viewing the diagram, and so the reader may not see that they should re-read the following paragraph for each different architecture. Maybe having some sort of visual cue like a box or different background color that indicates that the whole section is controlled by the buttons would help with this.

For a newcomer, I think these approaches are much easier to understand than the approach using spectral representations - the building blocks here are much more familiar, and their combination feels very natural. I wonder if there is a way to re-order the presentation to begin with these friendlier techniques? I know that the spectral approach and ChebNets came first, but I think there’s some value in giving the reader a monotonic increase in complexity over the course of the article.

“Interactive Graph Neural Networks”

This is really the highlight of the whole article. Really great work.

“once a suitable loss function L is defined:”

I think most readers will not be confused here, but we should be careful about re-using L to mean loss instead of laplacian. Maybe a script L or something could be used to distinguish them?

“In this spirit, [20] defines a minimal 3-layer CNN model that can solve the Game of Life Problem perfectly. However, they find that only CNN models with many more parameters can actually find the optimal solution consistently, when trained with stochastic gradient descent.”

These sentences are a little confusing. It might be better to say something like: the paper defines an architecture with sufficient capacity to model the Game of Life problem with hand-crafted weights, but SGD struggles to discover that solution from random initialization.

“The Game of Life”

In general, I’m not terribly impressed by this section of the article. I don’t feel like I’m learning a lot about how GCNs work from these results. I think the underlying message should be something like: on a pixel grid, a GCN can represent only a subset of what a CNN can (I believe the GCN essentially enforces that the weights on all neighboring pixels share the same value?). It might be instructive to drill down into the weights learned by the GCN and show what they look like as a CNN, compared to a similarly-behaving CNN example.
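As a rough sketch of what I mean by "subset" (assuming an 8-neighbour pixel adjacency and a plain shared-weight aggregation, which may not match the article's exact setup): a one-hop GCN-style layer has only a centre weight and a single shared neighbour weight, so on a pixel grid it corresponds to a constrained 3x3 CNN kernel.

```python
import numpy as np

# A one-hop GCN-style layer on a pixel grid with 8-neighbour adjacency has
# one weight for the centre pixel and one shared weight for all neighbours,
# i.e. it is equivalent to a 3x3 CNN kernel constrained to this form:
w_center, w_neighbor = 0.7, 0.1          # illustrative values
gcn_like_kernel = np.full((3, 3), w_neighbor)
gcn_like_kernel[1, 1] = w_center
print(gcn_like_kernel)
# A general CNN kernel has 9 independent weights, so the GCN's
# hypothesis space is a strict subset of the CNN's.
```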

The best seed/worst seed visualization feels a little odd. I guess the point is that the two architectures have similar bad local optima that they can get trapped in, but it feels like a sort of hand-wavy comparison to say that the behavior on the worst-performing seeds is qualitatively similar.

I’m not sure exactly how I would improve this section, but I think the biggest thing it needs is a clear thesis about what lesson the reader should learn from this experiment; the section could then be restructured a bit to more directly serve that thesis.

“Practical Techniques for GNNs”

It may be helpful to add a sentence here noting that CNNs can take advantage of efficient vectorized convolution on GPUs, and that we’d like to do the same for graph networks, just to give a little more clarity as to what is meant by “practical techniques”.
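Purely as illustration of what "practical techniques" usually means here (the names and normalization below are mine, not the article's): a whole message-passing step can be written as one sparse matrix product rather than a loop over nodes and neighbours.

```python
import numpy as np
import scipy.sparse as sp

# One GCN-style layer written as a single (sparse) matrix product,
# H_next = relu(A_hat @ H @ W), instead of a loop over nodes and neighbours.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                              # add self-loops
A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)   # simple row-normalisation
A_hat = sp.csr_matrix(A_hat)                       # sparse, so large graphs stay cheap

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                        # node features
W = rng.normal(size=(8, 3))                        # learnable layer weights

H_next = np.maximum(A_hat @ H @ W, 0.0)            # the whole layer, fully vectorised
print(H_next.shape)                                # (4, 3)
```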


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Exposition on an emerging research direction

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 3/5
Writing Style 3/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 4/5
Readability 3/5

Comments on Readability

I feel that the section on Spectral Convolutions is likely to be a bottleneck for many readers. Most of the article does a good job of giving approachable explanations, but I think the spectral convolutions section is going to be difficult for people who are not experts. There's a lot of dense mathematical notation, and not a lot of intuition building. On the other hand, it is just fundamentally a challenging concept that requires a certain amount of advanced mathematical background. I'm not sure how feasible it is to reduce the complexity of this section, but I think it's at least worth another look.

Scientific Correctness & Integrity Score
Are claims in the article well supported? 4/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 4/5
How easy would it be to replicate (or falsify) the results? 5/5
Does the article cite relevant work? 4/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 4/5

Comments on Scientific Integrity

I feel that the section comparing CNNs and GCNs on a game of life task could be clearer about the conclusions the reader should draw, and what the basis for those conclusions is. As it stands right now, the comparison feels somewhat loose, and relies on qualitative visual comparison of outputs.

ameya98 commented 3 years ago

Thank you for the very comprehensive review! We are planning to rewrite and add more visualizations to the Spectral Networks section, which should help readability significantly. We will be addressing your other comments in the next few weeks.