
Peer Review #1


distillpub-reviewers commented 3 years ago

GNN Intro Review

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to Chaitanya K. Joshi for taking the time to review this article.


General Comments

Introduction

Researchers have developed neural networks that operate on graph data (called graph neural networks, or GNNs) for over a decade, and many recent developments have increased their capabilities, and have now been used in many practical applications.

So many and's...maybe reframe it into 3 parts: GNNs have been developed for over a decade. Recent developments have focused on increased capability (and maybe expressive power/representational capacity?). In the past years, we are starting to see practical applications (and maybe some examples, like Halicin-MIT-Broad Institute, Fake News-Twitter, RecSys-Pinterest).

To start, let’s establish what a graph is. A graph represents the relations (edges) between a collection of entities (nodes).

The associated diagram may be better presented if the global attributes row is the last one...

To further describe each node, edge or the entire graph, we can store information in each of these pieces of the graph to further describe them. We can further specialize graphs by associating directionality to edges (directed, undirected).

This sentence needs editing to make less use of the word 'further'!

Graphs and where to find them

You’re probably already familiar with some types of graph data, such as social networks. However, graphs are an extremely powerful and general representation of data, and you can even think of images and text as a graph.

Images as graphs

It can be confusing to start with images and text... maybe it's better to start with real-world graphs and then later mention images/text as add-ons. Essentially, images/text are Euclidean data while most graph data is irregular and non-Euclidean. So GNNs would definitely not be the top choice for anyone working with them. Thus, speaking about images/text as graph structures, while interesting, may diverge from the main point of the article, which is to be a gentle introduction to GNNs. (P.S. the visualizations for image/text graphs are super good, though!)

This refers to the way text is represented in RNNs; other models, such as Transformers, where text can be viewed as a fully connected graph. See more in Graph Attention Networks.

This sentence needs editing for clarity.

In the graph grid/adjacency matrix, the color scheme can be better as both the blues are very similar.

Summary statistics on graphs found in the real world. Numbers are dependent on featurization decisions. More useful statistics and graphs can be found in KONECT

Maybe, instead of just giving the title of the dataset, you can talk about the data domain...e.g. QM9 - molecular graphs, Cora - academic citation network, etc. As this table is being referred to in later parts of the article and is also allowing the reader to really grapple with the complexity of graph datasets out there, it would be great to present this one better.

One example of edge-level inference is in image classification. Often deep learning models will identify objects in images, but besides just identifying objects, we also care about the relationship between them.

Could it be better to use as an example something more common like predicting possible friendships on a social network?

The challenges of using graphs in machine learning

Two adjacency matrices representing the same graph.

If the authors are planning to add interaction to this figure, it would be interesting if, e.g. I highlight one row on the left adjacency matrix and the corresponding row on the right adjacency matrix is also activated.
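To make the caption's point concrete, here is a minimal numpy sketch (not from the article; all values illustrative) of how relabeling the nodes changes the adjacency matrix but not the graph:

```python
import numpy as np

# Adjacency matrix of a 3-node "star": node 0 connects to nodes 1 and 2.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])

# Permutation matrix swapping the labels of nodes 0 and 1.
P = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])

# Same graph, different node ordering, different matrix.
A_relabeled = P @ A @ P.T
print(A_relabeled)  # [[0 1 0], [1 0 1], [0 1 0]]
```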

One elegant and memory-efficient way of representing sparse matrices is as adjacency lists. These describe the connectivity of edge e_k between nodes n_i and n_j as a tuple (i,j) in the k-th entry of an adjacency list. They are O(n_edges), rather than O(n_nodes^2), and since they are used to directly index the node information, they are permutation invariant.

As a practitioner, I can quickly make the link between why we want input formats to be O(n_edges) rather than O(n_nodes^2). However, it may be better to frame this in simple english as opposed to Big-O notation. Alternatively, it may be worth introducing the idea that adjacency matrices on their own are O(n_nodes^2) earlier.
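As an illustration of the storage argument, here is a minimal sketch (names and sizes illustrative, not from the article) contrasting the two formats:

```python
import numpy as np

# The same 4-node graph stored two ways.
n_nodes = 4
edge_list = [(0, 1), (1, 2), (1, 3)]   # O(n_edges) entries

A = np.zeros((n_nodes, n_nodes))       # O(n_nodes^2) entries,
for i, j in edge_list:                 # mostly zeros for sparse graphs
    A[i, j] = A[j, i] = 1

# The tuples index directly into per-node information.
node_features = np.random.rand(n_nodes, 8)
first_edge_endpoints = node_features[[0, 1]]  # endpoint features of edge (0, 1)
```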

Graph Neural Networks

We’re going to build GNNs using the “message passing neural network” framework proposed by Gilmer et al. using the architecture Graph Nets schematics introduced by Battaglia et al.

Isn't the graph nets framework already encompassing MPNNs? E.g. If I say I'll build GNNs based on Graph Nets from Battaglia et al., it may be sufficient already?

Also, ""architecture Graph Nets schematics"" --> ""Graph Nets architecture schematics""?

With the numerical representation of graphs that we’ve constructed above, we are now ready to build a GNN. We will start with the simplest GNN architecture, one where we learn new embeddings for all graph attributes (nodes, edges, global), but where we do not yet use the connectivity of the graph.

Maybe I am being nitpicky/the authors have made the choice for pedagogical reasons, but, at this point in the article, they are introducing the concept of vectors/embeddings as features per node/edge/global. Previously, all these features had been scalar values, so I wonder if the sudden change will confuse readers? E.g. the diagram with the caption 'Hover and click on the edges, nodes, and global graph marker to view and change attribute representations. On one side we have a small graph and on the other the information of the graph in a tensor representation.'

I would suggest (preferably) using feature vectors from the start across all diagrams, or making a note about this to explain to readers.
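For readers who want to see the quoted "simplest GNN" layer in code, here is a minimal numpy sketch; the names (`mlp`, `V`, `E`, `u`) are illustrative stand-ins, not the article's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim):
    """A one-layer 'MLP' stand-in: a fixed random linear map plus ReLU."""
    W = rng.normal(size=(in_dim, out_dim))
    return lambda x: np.maximum(x @ W, 0.0)

d_in, d_out = 8, 16
f_nodes, f_edges, f_global = mlp(d_in, d_out), mlp(d_in, d_out), mlp(d_in, d_out)

V = rng.normal(size=(4, d_in))   # node embeddings
E = rng.normal(size=(3, d_in))   # edge embeddings
u = rng.normal(size=(1, d_in))   # global embedding

# The "simplest GNN" layer: each attribute type is updated by its own
# learned function; the graph's connectivity is not consulted at all.
V_new, E_new, u_new = f_nodes(V), f_edges(E), f_global(u)
```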

However, it’s not always so simple. For instance, you might have information in the graph stored in edges, but no information in nodes, but still need to make predictions on nodes.

Consider giving an example? I had a hard time thinking of one, but maybe biological interaction networks exhibit this particular scenario.

If we only have node-level features, and are trying to predict binary edge-level information, the model looks like this.

Examples would be nice to help readers.
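A minimal sketch of both quoted scenarios, with illustrative names; a real model would use learned pooling and classifiers rather than these stand-ins:

```python
import numpy as np

edge_list = np.array([(0, 1), (1, 2), (1, 3)])
n_nodes = 4

# Scenario 1: features live on edges only; pool them onto nodes.
edge_feats = np.random.rand(len(edge_list), 8)
pooled_to_nodes = np.zeros((n_nodes, 8))
for (i, j), e in zip(edge_list, edge_feats):
    pooled_to_nodes[i] += e   # each edge contributes its feature
    pooled_to_nodes[j] += e   # to both of its endpoint nodes

# Scenario 2: features live on nodes only; score each edge from its
# two endpoints (e.g. a logit for a binary edge-level prediction).
node_feats = np.random.rand(n_nodes, 8)
w = np.random.rand(16)        # stand-in for a learned classifier
edge_logits = np.array([
    np.concatenate([node_feats[i], node_feats[j]]) @ w
    for i, j in edge_list
])
```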

One solution would be to have all nodes be able to pass information to each other, but for large graphs, this quickly becomes computationally expensive (although this approach, called ‘virtual edges’ has been used for small graphs, like molecules).

Sentence can be broken into two for clarity.

I also have a broader comment on this section: in the previous section, the reader spends a lot of time understanding what an edge list is and its advantage over the adjacency matrix format. This is great, because this is how many graph libraries process graphs, e.g. NetworkX, PyTorch Geometric. However, how does this edge list format link to the current section? You have described message passing, but how is the edge list actually used for message passing?

I think the reader would be interested to connect the two sections of this article together, e.g. you could consider describing how one could do a simple round of message passing with the edge list format. (On a tangential note, it may also be useful to show how a matrix multiplication of the adjacency and feature matrix also implements message passing with a summation aggregation.)
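A minimal sketch of the simple round of message passing the reviewer suggests, driven directly by the edge list (illustrative names; sum aggregation assumed):

```python
import numpy as np

edge_list = np.array([(0, 1), (1, 2), (1, 3)])
node_feats = np.random.rand(4, 8)

# One round of message passing: every edge (i, j) sends node i's
# feature to node j and vice versa, and each node sums the messages
# it receives.
messages = np.zeros_like(node_feats)
for i, j in edge_list:
    messages[j] += node_feats[i]
    messages[i] += node_feats[j]

updated = node_feats + messages   # a simple residual-style update
```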

GNN Playground

Scatterplot of a hyperparameter sweep of several GNN architectures. Hover over a point to see the GNN architecture parameters.

What did we learn about GNN design through this exercise? Are there any global insights about GNN architecture design choices that one can draw from this experiment, e.g. does global node help? And do these intuitions line up with some recent works on benchmarking and comparing GNN architectural paradigms, e.g. Dwivedi et al., 2020; You et al., 2020?

In the Final Thoughts section, the authors say "We've walked through some of the important design choices that must be made when using these architectures, and hopefully the GNN playground can give an intuition on what the empirical results of these design choices are."

The playground is very welcome, but it may be nice to concretely state some of these intuitions.

Or even just highlight what the top architectural elements were for this particular dataset.

And then discuss whether they align well/are opposed to conventional ideas in the literature.

Into the weeds

In general, I would have liked to see more citations to recent work and new ideas in GNN literature in this section. Figures would also be nice.

Other types of graphs (multigraphs, hypergraphs, hypernodes)

There are several recent and interesting works generalizing GNNs for hypergraphs and multigraphs that could be mentioned here. One recent work I am aware of is Yadati et al., 2019.

Batching in GNNs

It may be worth talking about/citing prevalent GNN sampling algorithms in literature, e.g. GraphSaint, ClusterGCN.

Inductive biases

It may be interesting to speak about the link between inductive biases and generalization/extrapolation beyond training distribution, e.g. recent work on GNNs for neural execution of graph algorithms by groups from DeepMind (Petar Velickovic's work) as well as MIT (Keyulu Xu's work).

Since this operation is one of the most important building blocks of these models, let’s dig deeper into what sort of properties we want in aggregation operations, and which types of operations have these sorts of properties.

Missing text after this?
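For context on the quoted passage, a minimal sketch of the key property such aggregation operations need, permutation invariance, using sum/mean/max as examples (illustrative data, not from the article):

```python
import numpy as np

# Messages arriving at one node from its three neighbors.
msgs = np.array([[1.0, 2.0],
                 [3.0, 0.0],
                 [0.5, 4.0]])

# Candidate aggregations: each reduces a variable-size, unordered set
# of messages to one fixed-size vector.
agg_sum, agg_mean, agg_max = msgs.sum(0), msgs.mean(0), msgs.max(0)

# Permutation invariance: reordering the neighbors leaves each
# aggregate unchanged.
shuffled = msgs[[2, 0, 1]]
assert np.allclose(shuffled.sum(0), agg_sum)
assert np.allclose(shuffled.max(0), agg_max)
```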


Structured Review

Distill employs a reviewer worksheet to help reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: General (Introduction to an emerging research topic)

Advancing the Dialogue Score
How significant are these contributions? 4/5
Outstanding Communication Score
Article Structure 3/5
Writing Style 3/5
Diagram & Interface Style 3/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 3/5

Comments on Communication:

I think there are a few places where the writing may be polished (and I've mentioned these in my long-form comments).

The article structure is coherent overall, but there are places where I feel the various sections lack a sense of harmony/continuity with each other.

The diagrams are well designed and useful for understanding the concepts.

Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 3/5
How easy would it be to replicate (or falsify) the results? 4/5
Does the article cite relevant work? 3/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 3/5

Comments on Scientific Correctness & Integrity:

On ""Does the article critically evaluate its limitations? How easily would a lay person understand them?"" --> I am not sure this is relevant for this particular article.

The GNN playground interactive diagram in this article is really worth commending and would fit right in with my understanding of what good Distill articles should do. However, I would have liked to see it accompanied by the authors discussing their findings via their playground tool. I have emphasized this in my long-form review.

beangoben commented 3 years ago

First we respond to several points our reviewer made:

Introduction

So many and's...maybe reframe it into 3 parts: GNNs have been developed for over a decade. Recent developments have focused on increased capability (and maybe expressive power/representational capacity?). In the past years, we are starting to see practical applications (and maybe some examples, like Halicin-MIT-Broad Institute, Fake News-Twitter, RecSys-Pinterest).

We took this into account and rewrote the paragraph.

The associated diagram may be better presented if the global attributes row is the last one…

Agreed, we have reordered this part of the visualization.

This sentence needs editing to make less use of the word 'further'!

We have edited the text to avoid redundant words.

Graphs and where to find them

It can be confusing to start with images and text... maybe it's better to start with real-world graphs and then later mention images/text as add-ons. Essentially, images/text are Euclidean data while most graph data is irregular and non-Euclidean. So GNNs would definitely not be the top choice for anyone working with them. Thus, speaking about images/text as graph structures, while interesting, may diverge from the main point of the article, which is to be a gentle introduction to GNNs. (P.S. the visualizations for image/text graphs are super good, though!)

Our intention was to start the reader with data they are more likely to be familiar with (images/text) and show how these can be viewed as graphs. Part of the point is that graph representations in these domains are not optimal. We have clarified our intent at the beginning of the section "Graphs and where to find them".

This sentence needs editing for clarity.

Edited the caption to clarify.

In the graph grid/adjacency matrix, the color scheme can be better as both the blues are very similar.

Agreed, we have increased the contrast between shades of blue to better differentiate the graphical elements.

Maybe, instead of just giving the title of the dataset, you can talk about the data domain...e.g. QM9 - molecular graphs, Cora - academic citation network, etc. As this table is being referred to in later parts of the article and is also allowing the reader to really grapple with the complexity of graph datasets out there, it would be great to present this one better.

We have expanded this table to also include the data domain.

Could it be better to use as an example something more common like predicting possible friendships on a social network?

We thought about it when picking tasks, and decided on an image-related example since it might be less expected.

The challenges of using graphs in machine learning

If the authors are planning to add interaction to this figure, it would be interesting if, e.g. I highlight one row on the left adjacency matrix and the corresponding row on the right adjacency matrix is also activated.

This is an interesting idea, but we did not plan to add interactivity to this figure.

As a practitioner, I can quickly make the link between why we want input formats to be O(n_edges) rather than O(n_nodes^2). However, it may be better to frame this in simple english as opposed to Big-O notation. Alternatively, it may be worth introducing the idea that adjacency matrices on their own are O(n_nodes^2) earlier.

We took note of the reviewer's point on simplifying the language for this paragraph.

Graph Neural Networks

Isn't the graph nets framework already encompassing MPNNs? E.g. If I say I'll build GNNs based on Graph Nets from Battaglia et al., it may be sufficient already?

We agree Graph Nets already encompasses MPNNs. We would like to keep the redundancy and mention both frameworks due to their importance (they brought new ways of looking at GNNs), and also because the naming (Message Passing) might help communicate one lens through which GNNs can be viewed.

Maybe I am being nitpicky/the authors have made the choice for pedagogical reasons, but, at this point in the article, they are introducing the concept of vectors/embeddings as features per node/edge/global. Previously, all these features had been scalar values, so I wonder if the sudden change will confuse readers? E.g. the diagram with the caption 'Hover and click on the edges, nodes, and global graph marker to view and change attribute representations. On one side we have a small graph and on the other the information of the graph in a tensor representation.' I would suggest (preferably) using feature vectors from the start across all diagrams, or making a note about this to explain to readers.

We added text to clarify our figure. We had used scalars for pedagogical reasons; using vectors would make the figures very information-dense. We added a paragraph for clarification. Additionally, vectors are introduced in the introduction section to better clarify this point.

Consider giving an example? I had a hard time thinking of one, but maybe biological interaction networks exhibit this particular scenario.

We added an example as an "aside" to each task, to make reference to a particular example for each scenario.

Sentence can be broken into two for clarity.

Done

I also have a broader comment on this section: in the previous section, the reader spends a lot of time understanding what an edge list is and its advantage over the adjacency matrix format. This is great, because this is how many graph libraries process graphs, e.g. NetworkX, PyTorch Geometric. However, how does this edge list format link to the current section? You have described message passing, but how is the edge list actually used for message passing? I think the reader would be interested to connect the two sections of this article together, e.g. you could consider describing how one could do a simple round of message passing with the edge list format. (On a tangential note, it may also be useful to show how a matrix multiplication of the adjacency and feature matrix also implements message passing with a summation aggregation.)

This is a great point, we have added a section on matrix multiplication as message passing in the "into the weeds" to connect these ideas.
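A minimal sketch of that equivalence, assuming sum aggregation (illustrative names; not the article's actual code):

```python
import numpy as np

A = np.array([[0, 1, 0, 0],   # dense adjacency matrix of a 4-node graph
              [1, 0, 1, 1],   # with edges (0,1), (1,2), (1,3)
              [0, 1, 0, 0],
              [0, 1, 0, 0]])
X = np.random.rand(4, 8)      # node feature matrix

# One matrix multiplication: row i of A @ X is exactly the sum of the
# features of node i's neighbors -- message passing with sum aggregation.
summed_neighbors = A @ X

# Identical to an explicit loop over the edge list:
edge_list = [(0, 1), (1, 2), (1, 3)]
looped = np.zeros_like(X)
for i, j in edge_list:
    looped[j] += X[i]
    looped[i] += X[j]
assert np.allclose(summed_neighbors, looped)
```

The loop and the matrix product compute the same thing; the dense form is convenient for exposition, while the edge-list form scales to sparse graphs.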

GNN Playground

What did we learn about GNN design through this exercise? Are there any global insights about GNN architecture design choices that one can draw from this experiment, e.g. does global node help? And do these intuitions line up with some recent works on benchmarking and comparing GNN architectural paradigms, e.g. Dwivedi et al., 2020; You et al., 2020? In the Final Thoughts section, the authors say "We've walked through some of the important design choices that must be made when using these architectures, and hopefully the GNN playground can give an intuition on what the empirical results of these design choices are." The playground is very welcome, but it may be nice to concretely state some of these intuitions. Or even just highlight what the top architectural elements were for this particular dataset. And then discuss whether they align well/are opposed to conventional ideas in the literature.

We definitely agree, and have added a section named "Some empirical GNN design lessons" that has an expanded interactive GNN architecture explorer with some of the insights we might derive from the exercise.

Into the weeds

In general, I would have liked to see more citations to recent work and new ideas in GNN literature in this section.

For each section we have updated citations with some additional notes about recent work.

It may be interesting to speak about the link between inductive biases and generalization/extrapolation beyond training distribution, e.g. recent work on GNNs for neural execution of graph algorithms by groups from DeepMind (Petar Velickovic's work) as well as MIT (Keyulu Xu's work).

We have added two paragraphs about future directions for GNNs that cite these works (and more) at the end of the GNN playground section.

Structured Review

The GNN playground interactive diagram in this article is really worth commending and would fit right in with my understanding of what good Distill articles should do. However, I would have liked to see it accompanied by the authors discussing their findings via their playground tool. I have emphasized this in my long-form review.

We definitely agree, and have added a section named "Some empirical GNN design lessons" that has an expanded interactive GNN architecture explorer with some of the insights we might derive from the exercise.

beangoben commented 3 years ago

We thank the reviewer for their time and attention; we have taken their comments into consideration, and we think our work is stronger because of them.

Next, we summarize most of the changes that we have made based on feedback from all reviewers:

Reviewer 1 made several points on improving the writing and presentation of ideas. This resulted in simplifying the language of several sentences, breaking down paragraphs, and expanding examples for some concepts.

Reviewer 1 also asked us to improve on the "lessons" of the GNN playground. These lessons became the subsection "Some empirical GNN design lessons", which details new interactive visualizations that show some of the larger architecture trends for the playground.

Reviewer 3 made a point about expanding on the connection to Transformers, and also on some of the current limitations of GNNs and message passing frameworks.

All reviewers noted a few typos, LaTeX equation errors, and grammatical mistakes that we have fixed. The bibliography has expanded slightly.

For a more detailed breakdown of the changes: