[RFC] Improve dgl.nn, dgl.data and examples

jermainewang commented 5 years ago

Problem

There have been a number of requests and efforts (PR #744 #748 #753, Issue #719 #742) around our modules, examples and datasets. This RFC discusses some of the details and writes out a roadmap. I know there have been some internal discussions between @mufeili and @VoVAllen about the drug/chemical datasets and applications; I hope you can summarize them a bit and improve this RFC. This RFC is likely to be broken down into several PRs (for example, #748 is a good start), so having a roadmap is important for tracking progress.

The problems with the current DGL examples are: (1) Reusable modules are scattered across many training scripts, making them hard for end users to access. (2) We need to add many more modules and layers due to the increasing demand from the community. This has severely impeded wider adoption of DGL. (3) The organization is not clean. Each example is put in its own dedicated folder. We should have a better hierarchy to organize them (possibly by application type). (4) Example code is not included in CI, making it hard to maintain.

Proposal

To fix this, this RFC proposes the following changes: (1) Move as many layer/module components as possible from the examples to the nn namespace. These modules should be functional, take the graph object as an argument to allow dynamic graphs, and have clear documentation. We also need to implement them for both MXNet and PyTorch. (2) Try to organize modules into the following categories.

Graph convolution (conv.py)
node-level classification/regression
link prediction
graph classification
graph pooling (glob.py, seq.py)
graph generation
loss.py
metrics.py
Other functional APIs (e.g. softmax)
...

Note that we could include any utilities that may not depend on DGL. They could be evaluation metrics, losses, scores or any functions that are common and useful. However, they would still live in the dgl.nn namespace. (3) Curate more datasets. The question here is whether we are confident that our data format is able to cover all of them. @VoVAllen may have more opinions on this. Otherwise, we will continue the current practice: if using a raw dataset, we download it directly from the official site and construct graphs on the fly; otherwise, we may need to preprocess the data to generate graphs and store them in our own data format. (4) Reorganize the examples by application:
Community detection
Recommender system
Graph classification
Network embedding
...

Suggestion from @classicsong: It would be better to have a unified entry point for each application. For example:

python examples/recsys/train.py --dataset=movielens100k --model=gcmc

This allows the users to easily compare baselines.
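A unified entry point like the one suggested above could be sketched roughly as follows. This is only an illustration under assumptions: the `DATASETS` and `MODELS` registries, the `register` decorator and the placeholder loader/builder functions are hypothetical names, not existing DGL code, and the actual training loop is left out.

```python
import argparse

# Hypothetical registries mapping command-line names to factories.
# None of these names exist in DGL; they only illustrate the idea.
DATASETS = {}
MODELS = {}


def register(registry, name):
    """Return a decorator that stores a factory function under `name`."""
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator


@register(DATASETS, "movielens100k")
def load_movielens100k():
    # Placeholder: a real loader would download and parse the ratings.
    return {"name": "movielens100k"}


@register(MODELS, "gcmc")
def build_gcmc(dataset):
    # Placeholder: a real builder would construct the model for this dataset.
    return {"model": "gcmc", "dataset": dataset["name"]}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Unified entry point (sketch)")
    parser.add_argument("--dataset", required=True, choices=sorted(DATASETS))
    parser.add_argument("--model", required=True, choices=sorted(MODELS))
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    dataset = DATASETS[args.dataset]()
    model = MODELS[args.model](dataset)
    # A shared training/evaluation loop would go here.
    return model


# Equivalent to: python train.py --dataset=movielens100k --model=gcmc
result = main(["--dataset=movielens100k", "--model=gcmc"])
```

With this layout, adding a baseline would only mean registering another model factory; the command line and the evaluation code stay the same, which is what makes baselines easy to compare.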

Finally, this is a good time to refactor and write more docstrings for these modules. We could also take this chance to enable the lint check on the dgl.data module and move the sampling routines into the DGL main package.

Check List

@mufeili @VoVAllen @yzh119 @zheng-da and anyone else who is involved: please directly edit this post to add items. We can estimate the workload from this.

dgl.nn

Graph convolution (dgl.nn.<backend>.conv.py)

[ ] RGCN (PR #744 )
[ ] GraphSAGE (PR #748 )
[ ] GAT
[ ] GGNN (PR #748 )
[ ] ChebyNet (PR #748 )
[ ] GIN (PR #748 )
[ ] SGC (PR #748 )

node-level classification/regression TBD

Link prediction

[ ] Bilinear decoder
[ ] Distmult

Graph generation

[ ] DGMGDecoder (?)

loss.py TBD

metrics.py TBD

Functional APIs (dgl.nn.functional) TBD

dgl.data

Community detection

[ ] TBD

Graph classification

[ ] TBD

Recommender system

[ ] MovieLens

Network embedding

[ ] TBD

Examples

Community detection

[ ] TBD

Graph classification

[ ] TBD

Recommender system

[ ] TBD

Network embedding

[ ] Node2vec
[ ] Metapath2vec
[ ] DeepWalk
[ ] Poincare

Point Cloud

[ ] EdgeConv

VoVAllen commented 5 years ago

I think the drug model zoo can be a good starting point for something like python examples/recsys/train.py --dataset=movielens100k --model=gcmc. Also, I think it's time to schedule the next release.

VoVAllen commented 5 years ago

Also, I think nightly builds should be implemented before the next release.

mufeili commented 5 years ago

I think there are two key issues to be addressed.

We need to decide whether examples and model zoos are going to be the same thing. I agree that we need to make code such as layers and loss functions more reusable, but I'm a bit reluctant to make the two completely the same. In my opinion, examples are more like entry points to a new framework. They can be more model driven or simply demos for a task. With 200 lines of code in 2-3 files, one can quickly understand how to implement some known models with the new APIs. In contrast, model zoos are application driven and can be deeply wrapped. Code can easily be scattered over more than 10 files, and they probably also include some application-specific processing. If I want to benchmark models or just want to apply them, I will turn to particular model zoos. But if I'm an engineer or researcher who just got interested in GNNs or is only interested in adapting models for my own use, I'd rather start with examples. PyTorch has separate examples and model zoos (e.g. torchvision).
We need to decide whether model zoos/application-driven stuff will be included in dgl or stand alone. Pure graph-based operations should definitely be included in dgl, but applications can easily get more complex and even involve graph-free stuff like data processing, featurization, evaluation, additional dependency libraries, etc. For example, in chemistry we mostly represent each molecule as a string; we need to featurize atoms and bonds; we cannot even breathe without libraries like RDKit. Including everything in dgl would not only be difficult but would also make the library very heavy. I personally think something like torchvision would be more realistic.

Minor issues:

What you referred to as a model zoo sounds more like blocks/layers.
A nightly build is definitely needed. In addition to the long release cycle, a common mistake is that if you do not uninstall the old version, you might still import it even after building a new version from source.
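The stale-install mistake can be checked for before running anything. The snippet below is a minimal sketch using only the standard library; `describe_install` is a hypothetical helper name, and it assumes the distribution name matches the import name (as it would for a package installed as `dgl`). It is demonstrated on the standard-library `json` package so that it runs anywhere.

```python
import importlib.util
from importlib import metadata


def describe_install(package_name):
    """Report where `package_name` would be imported from, without importing it.

    Returns None if the package cannot be found on the import path.
    Assumes the installed distribution is named like the import package.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None
    try:
        version = metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # Found on the path but not registered as an installed distribution
        # (true for standard-library modules such as json).
        version = None
    return {"origin": spec.origin, "version": version}


# Demonstrated on a standard-library package; pass "dgl" to check DGL.
info = describe_install("json")
print(info)
```

If the reported origin points into site-packages rather than the source tree you just built, the old version is still the one being imported.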

jermainewang commented 5 years ago

@mufeili You are right about the model zoo. I changed the title and text to be more accurate. This RFC is about dgl.nn, dgl.data and the examples folder. I saw you've opened a roadmap for drug model zoos, so the relevant discussion should be moved there. Now the question is whether we should unify the entry point or not. python examples/recsys/train.py --dataset=movielens100k --model=gcmc seems to be more suitable for a model zoo?

Another related question is whether examples should demonstrate the usage of the message passing APIs. The current proposal suggests relying mostly on dgl.nn, which in turn uses the message passing APIs.
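To make the layering concrete, here is a toy sketch in plain Python of a module built on top of message passing, with the graph passed in at call time. Everything in it is a stand-in under assumptions: `TinyGraph`, `propagate`, `copy_src`, `mean_reduce` and `MeanConv` are hypothetical names and do not reflect DGL's actual API or signatures.

```python
class TinyGraph:
    """Hypothetical stand-in for a graph object: directed edges only."""

    def __init__(self, num_nodes, edges):
        self.num_nodes = num_nodes
        self.edges = list(edges)  # (src, dst) pairs

    def propagate(self, message_fn, reduce_fn, feats):
        """Send a message along every edge, then reduce each node's mailbox."""
        mailbox = [[] for _ in range(self.num_nodes)]
        for src, dst in self.edges:
            mailbox[dst].append(message_fn(feats[src]))
        return [reduce_fn(msgs) for msgs in mailbox]


def copy_src(h):
    # Message function: forward the source node's feature unchanged.
    return h


def mean_reduce(msgs):
    # Reduce function: average incoming messages; 0.0 if there are none.
    return sum(msgs) / len(msgs) if msgs else 0.0


class MeanConv:
    """Layer-level module: hides message passing behind a call on (graph, feats)."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, graph, feats):
        # The graph is an argument, so the same layer works on any graph.
        aggregated = graph.propagate(copy_src, mean_reduce, feats)
        return [self.weight * a for a in aggregated]


layer = MeanConv(weight=2.0)
g = TinyGraph(3, [(0, 1), (2, 1), (1, 2)])
out = layer(g, [1.0, 2.0, 3.0])
```

An example script written against the layer only needs the last three lines; the message and reduce functions are what a tutorial on the message passing APIs would expose.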

mufeili commented 5 years ago

@jermainewang @VoVAllen We are now doing python examples/pytorch/model_zoo/chem/property_prediction/classification.py --dataset X --model Y. We should probably reduce the depth of the hierarchy, but the general idea seems to be fine here.

I think we can rely completely on dgl.nn in examples and demonstrate the usage of the message passing APIs, and probably how they are used in an nn.Module, only in tutorials. However, we should make it clearer which minimal set of tutorials new users should go through.

VoVAllen commented 5 years ago

I don't think we need to unify the entry point at the current stage.
