New GNN model: Boosted Graph Neural Network (BGNN)

nd7141 commented 3 years ago

🚀 Feature

A proposal for a new GNN model from ICLR 2021: https://openreview.net/forum?id=ebS5NUfoMKL

The model is implemented with DGL: https://github.com/nd7141/bgnn

Motivation

This is the first GNN designed to work well on graphs when node features are heterogeneous. Heterogeneous means that each feature has some individual meaning. For example, in a social network each person can have age, income, gender, graduation date, etc. as features. On the other hand, previous GNNs perform well when the node features are homogeneous. For example, node features are pretrained word embeddings or bag-of-words features.

Pitch

What I proposed in the paper is not a new GNN layer but a whole model: a combination of GNN and GBDT models. Ideally, the end user will do something like:

```
from dgl.models import BGNN

bgnn = BGNN(*params)
bgnn.fit(graph, node_features, target_labels)
```

Additional context

The model currently works for node classification and regression tasks.

The model is implemented with DGL: https://github.com/nd7141/bgnn (tested with dgl_cu92==0.5.3).

Importantly, BGNN relies on an installed GBDT package. I tested it with the CatBoost package (https://catboost.ai/). I feel that CatBoost is easier to install across all OS platforms than LightGBM or XGBoost (those are possible options too, but I haven't tested them). So CatBoost can be an optional dependency for users who want to use BGNN.

jermainewang commented 3 years ago

Hi Sergei, awesome work and really appreciate the support! Overall, there are three ways to add a new model to DGL.

1. Add it to dgl.nn as one of the built-in NN modules. This is common for work that proposes new GNN layers and mainly aims to provide building blocks for new architectures.
2. Add it as one of the examples under the examples folder. The examples are more diverse than the NN modules. They are Python scripts that train models end-to-end. They aim to give users a base for trying out different ideas on existing work or for benchmarking the model on different datasets. As a result, we recommend following the practice here to make them more consistent.
3. Add it to a model zoo such as the one in DGL-LifeSci. These usually include a set of pretrained parameters and can produce an accurate prediction with one line of code.

Your paper presents an interesting case. From my understanding (please correct me if I'm wrong), BGNN is a new methodology for training GNNs by combining them with GBDT. BGNN does not care which GNN model is used as its component. This rules out the first and third routes. Having BGNN hosted in DGL's official example set is definitely welcome. I can see merits such as a more consistent README and coding style, or a shorter implementation (if we focus on only some of the data points in your paper), so researchers could potentially follow up on your work more quickly.

Another strategy I'm thinking about is similar to what you proposed -- exposing BGNN as a built-in optimizer in DGL so others can use your algorithm to optimize their own GNN models. For that, I have three questions:

1. Ideally, it should look like PyTorch's optimizer so others can learn it with little burden. However, it is not clear how Alg. 1 of your paper can be rewritten in such a way. Maybe it's not possible?
2. Is it possible to make your algorithm agnostic to the GBDT component? This is related to whether we could avoid a hard dependency.
3. To me, the framework seems to work on non-GNN models as well (like CNN/RNN). Is that correct?

Besides, your experiments use some datasets that DGL is missing (House, County, VK, Avazu). We would really appreciate it if you'd like to contribute them to DGL as well. We have a dedicated user guide chapter about how to do that.

lvjiujin commented 3 years ago

In your GitHub repo, if one wants to switch to their own dataset, the format is very complex. Can you simplify the dataset format?

jermainewang commented 3 years ago

> In your GitHub repo, if one wants to switch to their own dataset, the format is very complex. Can you simplify the dataset format?

Hi @lvjiujin, your question is off-topic. Would you please raise another issue for it so we could move the discussion there? I will hide your comment and mine.

lvjiujin commented 3 years ago

I was saying that the BGNN model needs a special dataset format. How did you understand my question?

jermainewang commented 3 years ago

I thought you were asking whether DGL could further simplify its data preprocessing logic. This thread is more about discussing how to add the BGNN model to DGL. But since I mentioned contributing datasets as well, I will treat this as a community request for a better dataset interface to use together with BGNN. I would encourage @nd7141 to take a look at our dataset pipeline mentioned in the above thread to see whether it is a fit or not.

lvjiujin commented 3 years ago

OK, I see. DGL's dataset format is much easier than PyG's; this is one of DGL's advantages.

nd7141 commented 3 years ago

Hi @jermainewang

Thanks for the update!

I would say adding it as an example is the minimum plan. It's definitely doable and very easy: just copy/paste part of the code that I have. I thought that having the BGNN model as part of dgl.models would be a better option, as it would hide all of the complexity of the BGNN model, but I don't know if you have such a dgl.models namespace.

I'm not sure I understood the point of having BGNN as an optimizer, as Algorithm 1 includes training of both the GBDT and the GNN model. So it's not that you can just pass the model's parameters and it will learn them (you actually need the model's architecture).

> Is it possible to make your algorithm agnostic to the GBDT component?

The GBDT component is necessary, but which type of GBDT to use is open. I use CatBoost, but other possibilities are LightGBM, XGBoost, and maybe something from the sklearn library (but I'm not sure). BGNN is also agnostic to the type of GNN model.

> the framework seems to work on non-GNN model as well (like CNN/RNN). Is it correct?

Yes, potentially it's possible to use it with any NN model that is trained by gradient descent.
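
A toy sketch of why this can extend to any gradient-trained model: on one reading of the approach, the tree side only needs the gradient of the loss with respect to the model's input. The code below is not the paper's Algorithm 1. It assumes a one-feature regression stump as a stand-in for the GBDT and a one-parameter neighbor-averaging model as a stand-in for the GNN; `fit_stump` and `toy_bgnn` are hypothetical names used only for illustration.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on one numeric feature (GBDT stand-in)."""
    best = None
    for thr in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        l_mean = sum(left) / len(left)
        r_mean = sum(right) / len(right) if right else l_mean
        err = sum((t - (l_mean if x <= thr else r_mean)) ** 2
                  for x, t in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, thr, l_mean, r_mean)
    _, thr, l_mean, r_mean = best
    return lambda x: l_mean if x <= thr else r_mean


def toy_bgnn(neighbors, xs, ys, rounds=20, lr=0.1):
    """Alternate between a gradient step on the model and fitting a new stump."""
    n = len(xs)
    stumps = [fit_stump(xs, ys)]  # the first learner is fit on the task labels
    w = 1.0                       # the only trainable "GNN" parameter
    history = []
    for _ in range(rounds):
        # Tree ensemble output becomes the model's input.
        h = [sum(s(x) for s in stumps) for x in xs]
        # Stand-in "GNN": scaled mean over each node's neighborhood.
        agg = [sum(h[j] for j in neighbors[i]) / len(neighbors[i])
               for i in range(n)]
        resid = [w * agg[i] - ys[i] for i in range(n)]
        history.append(sum(r * r for r in resid) / n)  # mean squared error
        # Gradients of the MSE with respect to w and to the input h.
        grad_w = 2.0 / n * sum(resid[i] * agg[i] for i in range(n))
        grad_h = [0.0] * n
        for i in range(n):
            for j in neighbors[i]:
                grad_h[j] += 2.0 / n * resid[i] * w / len(neighbors[i])
        w -= lr * grad_w
        # Later learners are always regressors, fit to the input-gradient step.
        stumps.append(fit_stump(xs, [-lr * g for g in grad_h]))
    return stumps, w, history


# Path graph 0-1-2-3 with self-loops; one numeric feature per node.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3]}
xs = [0.0, 0.0, 1.0, 1.0]
ys = [0.0, 0.0, 1.0, 1.0]
stumps, w, history = toy_bgnn(neighbors, xs, ys)
print(history[0], history[-1])
```

Nothing in the loop depends on the model being a GNN: swapping the neighbor average for any differentiable function of `h` leaves the tree-fitting step unchanged.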

> We would really appreciate it if you'd like to contribute them to DGL as well. We have a dedicated user guide chapter about how to do that.

I will look into how I can move the datasets. However, one important caveat is that node features may include categorical features (not in all datasets, but in some). In this case, it's not clear how to include these categorical features as part of the numerical tensor. One way to do so is through preprocessing. But the whole point of BGNN is that it does not do any preprocessing of input features, but rather provides a CatBoost prediction (a number) to the GNN model. Since CatBoost can operate on categorical features, there is no need to make them part of the numerical features. But again, I will look into how I can make it part of the library.

nd7141 commented 3 years ago

@lvjiujin I agree that the format is somewhat custom. One reason why I didn't make it part of the DGL library is that node features are of various types (int vs. str), and it is not clear whether DGL can support that. The second reason is that I may need the original input features only for the GBDT part. GBDT itself does not care about the graph, so it's worth separating node features and graph. The third reason is that I wanted it to be agnostic to the library. I have tried it with PyG and it works, but the final version uses DGL. But I agree that it may not be the best choice if you want to experiment with existing DGL datasets. I will think about the possibility of changing the format of the datasets.

lvjiujin commented 3 years ago

> @lvjiujin I agree that the format is somewhat custom. One reason why I didn't make it part of the DGL library is that node features are of various types (int vs. str), and it is not clear whether DGL can support that. The second reason is that I may need the original input features only for the GBDT part. GBDT itself does not care about the graph, so it's worth separating node features and graph. The third reason is that I wanted it to be agnostic to the library. I have tried it with PyG and it works, but the final version uses DGL. But I agree that it may not be the best choice if you want to experiment with existing DGL datasets. I will think about the possibility of changing the format of the datasets.

You have tried it with PyG and it works, but the final version uses DGL. Why? Don't you think PyG is much easier than DGL? I prefer PyG.

yzh119 commented 3 years ago

@lvjiujin would you mind elaborating on "Don't you think PyG is much easier than DGL?" We will keep tuning the user experience. By the way, this paper uses GAT, and DGL's GAT is much better than PyG's in terms of speed and GPU memory usage.

jermainewang commented 3 years ago

@lvjiujin @yzh119 Your discussion is off topic. If the points are about the general user experience of PyG vs. DGL, please open another issue for discussion. Let's focus on how to add BGNN to DGL in this thread. Thanks.

BarclayII commented 3 years ago

@jermainewang I can see some "sub"-issues going on here:

1. None of DGL's current models and data pipelines directly handle tabular data (e.g. pandas DataFrames). DGLGraph has the potential to handle tabular data since the node/edge features are dictionaries. The problem then more or less becomes whether we should allow DGLGraph to directly convert node/edge features to/from e.g. pandas DataFrames.
2. BGNN's optimization involves alternating between training GBDTs (which is not a gradient descent based method) and training the GNN with gradient descent. An example might be too heavy, while the training style of BGNN does not really fit as a DGL NN module. So I guess if we want an end-to-end module BGNNOptimizer, it will probably have to have scikit-learn's fit-and-predict interface:
```
from catboost import CatBoostClassifier, CatBoostRegressor
from dgl.nn import GATConv
from dgl.optimizers import BGNNOptimizer  # namespace name is just placeholder

optim = BGNNOptimizer(
    # The GNN model
    gnn_model=GATConv(10, 20, num_heads=5),
    # The first GBDT for either classification or regression
    gbdt_task=CatBoostClassifier,
    # The later GBDTs are always regression
    gbdt_intermediate=CatBoostRegressor)
optim.fit(graph, train_dataframe, target_labels)
result = optim.predict(graph, test_dataframe)
# Get the output GBDTs and GNN model
gbdts, gnn_model = optim.gbdts, optim.gnn_model
```
The optimizer above does not depend on any third-party GBDT package, since we pass in the GBDT class as an argument. One problem I can anticipate is that different GBDT packages may have different training interfaces. I would like to hear more from @nd7141 on this.
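
One way around differing interfaces, sketched under the assumption that the injected class follows the scikit-learn convention of `fit(X, y)` and `predict(X)`, as the regressor classes in CatBoost, LightGBM and XGBoost do. `Regressor`, `MeanRegressor` and `AdditiveEnsemble` are hypothetical names for illustration, not part of DGL or any GBDT package.

```python
from typing import Any, List, Protocol, Sequence


class Regressor(Protocol):
    """Minimal interface assumed of every injected tree model."""

    def fit(self, X: Any, y: Any) -> Any: ...

    def predict(self, X: Any) -> Sequence[float]: ...


class MeanRegressor:
    """Dummy stand-in so the sketch runs without any GBDT package."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class AdditiveEnsemble:
    """Holds one model per boosting stage and sums their predictions."""

    def __init__(self, regressor_cls, **regressor_kwargs):
        # The class is passed in, so no GBDT package is imported here.
        self.regressor_cls = regressor_cls
        self.regressor_kwargs = regressor_kwargs
        self.stages: List[Regressor] = []

    def add_stage(self, X, y):
        model = self.regressor_cls(**self.regressor_kwargs)
        model.fit(X, y)
        self.stages.append(model)
        return model

    def predict(self, X):
        totals = [0.0] * len(X)
        for model in self.stages:
            for i, value in enumerate(model.predict(X)):
                totals[i] += float(value)
        return totals
```

Summing predictions over a list of stages avoids relying on any package-specific helper for merging models, at the cost of keeping every stage in memory.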

jermainewang commented 3 years ago

Thanks @BarclayII for calling this out. The user experience makes sense to me. I think the compatibility of different GBDT packages is less of a concern. One follow-up question: many graph datasets have no node features. A common practice is to have an embedding layer that is trained together with the GNN parameters. How would BGNN adapt to this case? It's OK if BGNN cannot support it; we could emphasize that in the doc if that's the case.

jermainewang commented 3 years ago

Another suggestion is to allow customizing the loss function L. For example, users may want to try out BGNN for link prediction or graph classification too.

nd7141 commented 3 years ago

@BarclayII Your code makes sense to me. I also agree that different GBDT packages can have different interfaces. I guess one option would be to choose CatBoost as the GBDT model (though it would require making it an optional dependency). The reasons for choosing CatBoost are that (a) it is easy to install and use, (b) it works well with categorical features, and (c) it has sum_models, which is used inside BGNN to merge two models into one (it probably exists for other packages; I just haven't found it yet). So overall I like the following syntax:

```
from dgl.nn import GATConv
from dgl.optimizers import BGNNOptimizer  # namespace name is just placeholder

optim = BGNNOptimizer(
    # The GNN model
    gnn_model=GATConv(10, 20, num_heads=5))
optim.fit(graph, train_dataframe, target_labels)
result = optim.predict(graph, test_dataframe)
# Get the output GBDTs and GNN model
gbdts, gnn_model = optim.gbdts, optim.gnn_model
```

Users can choose any GNN model they want, while the GBDT part is taken care of inside the optimizer. What do you think about it?

nd7141 commented 3 years ago

@jermainewang

> Many graph data have no node feature. A common practice is to have an embedding layer that is trained together with the GNN parameters. How would BGNN adapt to this case?

BGNN should not be used in this case, as the whole point of BGNN is to preprocess node features better than a NN can. So if there are no node features, a standard GNN should be used.

> Another suggestion is to allow customizing the loss function L. For example, users may want to try out BGNN for link prediction or graph classification too.

Yes, we can give users the choice of the loss function. However, for graph classification it's not clear what the GBDT component should predict. Currently, GBDT takes (node_features, node_label) and trains the model on them. In the case of graph classification, there are no node labels, so it's not obvious what the target should be. In theory it's possible to use some proxy such as the degree of a node, but we haven't tested it yet.

BarclayII commented 3 years ago

> What do you think about it?

Makes perfect sense. An optional dependency on CatBoost is acceptable.

> for graph classification it's not clear what GBDT component should predict

I assume that's also the case for link prediction? What would I set as the target for GBDT if the task is link prediction?

nd7141 commented 3 years ago

> I assume that's also the case for link prediction? What would I set as the target for GBDT if the task is link prediction?

For link prediction it's possible to use (aggregate([node_A_features, node_B_features]), link_target), where aggregate is sum, concat or something else. Then, in the GNN it's important to take the gradient related to the edge; it could be either the edge embedding gradient or some form of aggregate(node_A_gradient, node_B_gradient). But we haven't tested this either, and in general it's an open question whether the performance will be better.
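
A minimal sketch of how such per-edge training rows could be assembled, assuming node features are plain Python lists keyed by node id. `edge_training_rows` is a hypothetical helper, and the idea itself is untested, as noted above.

```python
def edge_training_rows(node_features, labeled_edges, aggregate="concat"):
    """Build (rows, targets) for a tree model from labeled node pairs.

    node_features: dict mapping node id -> list of feature values
    labeled_edges: iterable of (node_a, node_b, link_target) triples
    aggregate: "concat" keeps both endpoints' features side by side;
               "sum" adds them elementwise (numeric features only)
    """
    rows, targets = [], []
    for node_a, node_b, link_target in labeled_edges:
        feats_a = node_features[node_a]
        feats_b = node_features[node_b]
        if aggregate == "concat":
            rows.append(list(feats_a) + list(feats_b))
        elif aggregate == "sum":
            rows.append([x + y for x, y in zip(feats_a, feats_b)])
        else:
            raise ValueError(f"unknown aggregate: {aggregate!r}")
        targets.append(link_target)
    return rows, targets


features = {"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]}
edges = [("A", "B", 1), ("A", "C", 0)]
rows, targets = edge_training_rows(features, edges, aggregate="concat")
```

Concatenation leaves categorical values untouched, which matters if the tree model is expected to consume them directly; summation only makes sense for numeric columns.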

jermainewang commented 3 years ago

Given the discussion so far, I think we have reached a consensus on:

1. The fit-and-predict style interface for BGNNOptimizer.
2. Having CatBoost as an optional dependency (only needed if the user wants to use BGNNOptimizer).
3. Whether BGNN is effective for link prediction and graph classification tasks is still an open question. It is more reasonable to focus on node classification tasks where node features are available. That said, if we want to allow customizing loss functions, we can assume they are used for node classification tasks.

I think the next step is to decide the interface details and the corresponding docstring of the BGNNOptimizer. After that, we could move on to the PR stage.

Here is a proposal. I changed the name to BGNNClassifier. @nd7141 please feel free to fill in the details.

```
class BGNNClassifier:
    """Node classifier using the Boosted Graph Neural Network (BGNN) algorithm.

    <some descriptions, paper link, etc etc>

    Parameters
    ----------
    gnn_model : torch.nn.Module
        The GNN model to optimize.
    gnn_optim : torch.optim.Optimizer
        The optimizer for optimizing the GNN model by gradient descent.
    loss_fn : callable, optional
        The loss function of the prediction task. Default: cross entropy.

    Attributes
    ----------
    gbdts : ?
        The GBDTs in this classifier.
    gnn_model : torch.nn.Module
        The GNN model in this classifier.
    """
    def __init__(self, gnn_model, gnn_optim, loss_fn=...):
        pass

    def fit(self, graph, train_dataframe, target_labels):
        """Fit the classifier to the provided training data and labels.

        Parameters
        ----------
        graph : dgl.DGLGraph
            ...
        train_dataframe : pandas.DataFrame
            ...
        target_labels : ?
        """
        pass

    def predict(self, graph, test_dataframe):
        """Make prediction on the test data.

        Parameters
        ----------
        graph : dgl.DGLGraph
            ...
        test_dataframe : pandas.DataFrame
            ...

        Returns
        -------
        ?
        """
        pass
```

nd7141 commented 3 years ago

Thanks @jermainewang It looks good to me.

I would maybe change the name from BGNNClassifier to BGNNPredictor, because it can be used for both node classification and node regression. Which type of the problem users want to solve could be an optional parameter with some default value.

I'm not sure about the loss_fn. I guess it could be exposed as a parameter; I just haven't experimented with different loss functions. What other losses do you have in mind? So far I have used cross entropy for classification and RMSE for regression.

I guess there are some other attributes, such as best_epoch to select the best performance metric, but it's more of details.

Do you want me to fill in the gaps in this code proposal?

jermainewang commented 3 years ago

> Thanks @jermainewang It looks good to me.
>
> I would maybe change the name from BGNNClassifier to BGNNPredictor, because it can be used for both node classification and node regression. Which type of the problem users want to solve could be an optional parameter with some default value.

Agreed.

> I'm not sure about the loss_fn. I guess it could be exposed as a parameter; I just haven't experimented with different loss functions. What other losses do you have in mind? So far I have used cross entropy for classification and RMSE for regression.

Examples are regularizers for skewed multiclass classification (e.g., focal loss), hybrid losses combining classification and reconstruction objectives, etc. The idea here is to expose an option for users to configure them. In my mind, it could be just a Python function with two arguments, the prediction from the GNN and the target labels provided via fit, that returns a loss value. Something like the following:

```
def my_loss_function(pred, target_labels):
    l = ...  # can do anything here
    return l

predictor = BGNNPredictor(
    # The GNN model
    gnn_model=GATConv(10, 20, num_heads=5),
    loss_fn=my_loss_function)
```
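
As one concrete instance of such a function, here is a plain-Python sketch of focal loss with the two-argument signature described above. It assumes `pred` holds per-node class probabilities and `target_labels` holds integer class indices; a version usable for training would need to be written with differentiable tensor operations instead. The `gamma` keyword is an extra parameter introduced here for illustration.

```python
import math


def focal_loss(pred, target_labels, gamma=2.0):
    """Mean focal loss: -(1 - p_t)**gamma * log(p_t) per example.

    pred: list of per-class probability lists, one per node
    target_labels: list of integer class indices, one per node
    gamma: focusing parameter; gamma=0 reduces to plain cross entropy
    """
    total = 0.0
    for probs, label in zip(pred, target_labels):
        p_t = probs[label]  # probability assigned to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(target_labels)


example_pred = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]
example_labels = [0, 1, 1]
loss_value = focal_loss(example_pred, example_labels)
```

The `(1 - p_t)**gamma` factor shrinks the contribution of examples the model already classifies confidently, which is why it is often tried on skewed class distributions.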

> I guess there are some other attributes, such as best_epoch to select the best performance metric, but it's more of details.
>
> Do you want me to fill in the gaps in this code proposal?

Yes please go ahead.

jermainewang commented 3 years ago

Hi @nd7141 , any update on this?

classicsong commented 3 years ago

@nd7141 Can you share the raw datasets and the steps used to clean them? In your repo (https://github.com/nd7141/bgnn) the datasets are already pre-processed.

nd7141 commented 3 years ago

@jermainewang Sorry, I was on vacation and will probably be busy the following week. There's no rush on my side, but so you know the timeline, I plan to come back to this in early March. Is that fine?

nd7141 commented 3 years ago

@classicsong Each dataset had its own input format that I had to preprocess independently, so there is no uniform preprocessing step. In general, it should be quite easy to preprocess once you have a graph with the node features: make a separate dataframe with node features and another one for target labels.

jermainewang commented 3 years ago

> @jermainewang Sorry, I was on vacation and will probably be busy the following week. There's no rush on my side, but so you know the timeline, I plan to come back to this in early March. Is that fine?

Sure. Let's come back to this in early March.

jermainewang commented 3 years ago

Hi @nd7141 , do you think it's a good time to resume this effort? We'd love to proceed and have BGNN in DGL :)

nd7141 commented 3 years ago

Yes, definitely. I will look at it next week.

nd7141 commented 3 years ago

I created a pull request: https://github.com/dmlc/dgl/pull/2740

Let's move the discussion there.
