Closed jermainewang closed 4 years ago
I think the drug model zoo can be a good starting point for something like `python examples/recsys/train.py --dataset=movielens100k --model=gcmc`.
Also, I think it's time to schedule the next release.
Also, I think a nightly build should be implemented before the next release.
I think there are two key issues to be addressed. One is whether these components should be included in `dgl` or stand alone. For pure graph-based operations we should definitely include them in `dgl`, but when it comes to applications things can easily get more complex and involve graph-free stuff like data processing, featurization, evaluation, additional dependency libraries, etc. For example, in chemistry we mostly represent each molecule as a string; we need to featurize atoms and bonds; and we cannot even breathe without libraries like RDKit. Including everything in `dgl` will not only be difficult but will also make the library very heavy. I personally think something like torchvision is more realistic.

Minor issues:
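To make the featurization point above concrete, here is a minimal, framework-free sketch of one-hot atom featurization. Everything in it (the `ATOM_TYPES` vocabulary, the function name) is invented for illustration; a real chemistry pipeline would derive atom symbols and bond features from RDKit molecule objects.

```python
# Illustrative sketch only: one-hot featurization of atoms by element symbol.
# ATOM_TYPES is an assumed vocabulary, not part of any DGL or RDKit API.
ATOM_TYPES = ["C", "N", "O", "F", "S"]

def one_hot_atom(symbol):
    """Return a one-hot vector with a trailing 'unknown element' slot."""
    feat = [0.0] * (len(ATOM_TYPES) + 1)
    idx = ATOM_TYPES.index(symbol) if symbol in ATOM_TYPES else len(ATOM_TYPES)
    feat[idx] = 1.0
    return feat
```

The point of the sketch is that this kind of code has nothing to do with graphs, which is why keeping it out of the `dgl` core is attractive.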
@mufeili You are right about the model zoo. I changed the title and text to be more accurate. This RFC is about `dgl.nn`, `dgl.data` and the examples folder. I saw you've opened a roadmap for drug model zoos, so the relevant discussion should be moved there. Now the question is whether we should unify the entry point or not? `python examples/recsys/train.py --dataset=movielens100k --model=gcmc` seems to be more suitable for the model zoo?
Another related question is whether examples should demonstrate the usage of the message passing APIs. The current proposal suggests relying mostly on `dgl.nn`, which in turn utilizes the message passing APIs.
@jermainewang @VoVAllen We are now doing `python examples/pytorch/model_zoo/chem/property_prediction/classification.py --dataset X --model Y`. We should probably reduce the number of hierarchy levels, but the general idea seems fine here.
I think we can rely entirely on `dgl.nn` in the examples, and demonstrate the usage of the message passing APIs (and how they are used in an `nn.Module`) only in tutorials. However, we should make it clearer which tutorials form the minimal set for new users to check.
I don't think we need to unify the entry point at the current stage.
Problem
There have been a bunch of requests and efforts (PRs #744 #748 #753, Issues #719 #742) around our modules, examples and datasets. This RFC would like to discuss some details and write out the roadmap. I know there have been some internal discussions between @mufeili and @VoVAllen about drug/chemical datasets and applications. I hope you could also summarize them a bit and improve this RFC. This RFC is likely to be broken down into several PRs (for example, #748 is a good start), so having a roadmap is quite important for tracking progress.
The problems with the current DGL examples are that: (1) reusable modules are scattered across many training scripts, making them hard for end users to access; (2) we need to add many more modules and layers due to increasing demand from the community, and this has severely impeded wider adoption of DGL; (3) the organization is not clean: each example sits in its own dedicated folder, and we should have a better hierarchy to organize them (possibly by application type); (4) example code is not included in CI, making it hard to maintain.
Proposal
To fix this, this RFC proposes the following changes:

(1) Move as many layer/module components as possible from examples to the `nn` namespace. These modules should be functional, take the graph object as an argument to allow dynamic graphs, and have clear documentation. We also need to implement them for both MXNet and PyTorch.

(2) Try to organize modules into the following categories:
- Graph convolution (`conv.py`)
- Global pooling / readout (`glob.py`, `seq.py`)
- `loss.py`
- `metrics.py`
- ...
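A hedged sketch of the calling convention proposed in (1), where a module's forward pass takes the graph explicitly so the same instance works on dynamic graphs. The class name and the dict-based graph representation are invented for illustration; real `dgl.nn` modules are backend `nn.Module`s operating on `DGLGraph` objects.

```python
# Toy stand-in for a dgl.nn layer; not real DGL code.
class GraphConvSketch:
    def __init__(self, weight):
        self.weight = weight  # a single scalar weight, for illustration

    def forward(self, graph, feats):
        # graph: dict node -> list of in-neighbors. Passing it per call
        # (rather than fixing it at construction time) is what lets one
        # module instance be reused on different or changing graphs.
        return {v: self.weight * sum(feats[u] for u in nbrs)
                for v, nbrs in graph.items()}
```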
Note that we could include utilities that do not depend on DGL, such as evaluation metrics, losses, scores, or any functions that are common and useful. They would still live in the `dgl.nn` namespace.

(3) Curate more datasets. The question here is whether we are confident that our data format can cover all of them; @VoVAllen may have more opinions on this. Otherwise, we will continue the current practice: if using a raw dataset, we download it directly from the official site and construct graphs online; otherwise, we preprocess the data to generate graphs and store them in our own data format.

(4) Reorganize the examples by application. Suggestion from @classicsong: it would be better to have a unified entry point for each application, for example `python examples/recsys/train.py --dataset=movielens100k --model=gcmc`. This allows users to easily compare baselines.
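One way such a unified entry point could be structured is with simple registries behind argparse flags. Everything below (the registry names, the placeholder factories, the `main` signature) is a hypothetical sketch, not the actual DGL example code.

```python
import argparse

# Hypothetical registries mapping CLI names to factory callables.
DATASETS = {"movielens100k": lambda: "ml100k-data"}
MODELS = {"gcmc": lambda: "gcmc-model"}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Unified example entry point")
    parser.add_argument("--dataset", choices=sorted(DATASETS), required=True)
    parser.add_argument("--model", choices=sorted(MODELS), required=True)
    args = parser.parse_args(argv)
    dataset = DATASETS[args.dataset]()
    model = MODELS[args.model]()
    # ... a shared training loop would go here ...
    return dataset, model
```

Adding a new baseline then only requires registering it under a name, which is what makes side-by-side comparisons easy.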
Finally, it is a good time to refactor and write more docstrings for these modules. We could also take this chance to enable lint checks on the `dgl.data` module and move sampling routines into the DGL main package.

Check List
@mufeili @VoVAllen @yzh119 @zheng-da and anyone else who is involved: please directly edit this post to add items. We can estimate the workload from this.
dgl.nn
- Graph convolution (`dgl.nn.<backend>.conv.py`)
- Node-level classification/regression TBD
- Link prediction
- Graph generation
- `loss.py` TBD
- `metrics.py` TBD
- Functional APIs (`dgl.nn.functional`) TBD

dgl.data
- Community detection
- Graph classification
- Recommender system
- Network embedding

Examples
- Community detection
- Graph classification
- Recommender system
- Network embedding
- Point cloud