Weird behavior when using GraphConv with norm=right

iamgroot42 commented 3 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Define model with GraphConv layer and set norm=right
Train model and evaluate error/metrics on train data
Metrics logged while training improves as expected, but with the same data and model under model.eval() gives near-random performance
Re-run the same code, but with removed norm=right
As expected, evaluating metrics on train data shows improvement.

From what I can gather, setting norm='right' introduces some form of error somehow (which doesn't make a lot of sense, after a brief look at the implementation). The model itself does not have any sources of non-determinism like Dropout either, so that part is ruled out as well.

Also, the error goes away if I do not set the model to evaluation mode (and let it stay in train mode) while evaluating, which doesn't make any sense: the only difference between the two for this model would be gradient accumulation.

Code snippet to reproduce

from dgl.nn.pytorch import GraphConv
import torch.nn as nn
import torch.optim as optim
import torch as ch
from tqdm import tqdm

class GCN(nn.Module):
    def __init__(self, n_inp, n_hidden, n_layers, n_classes=2, residual=False):
        super(GCN, self).__init__()
        self.layers = nn.ModuleList()
        self.residual = residual

        # input layer
        self.layers.append(
            GraphConv(n_inp, n_hidden, norm='right'))
            # GraphConv(n_inp, n_hidden))

        # hidden layers
        for i in range(n_layers-1):
            self.layers.append(
                GraphConv(n_hidden, n_hidden, norm='right'))
                # GraphConv(n_hidden, n_hidden))

        # output layer
        self.final = GraphConv(n_hidden, n_classes, norm='right')
        # self.final = GraphConv(n_hidden, n_classes)
        self.activation = nn.ReLU()

    def forward(self, g, latent=None):

        if latent is not None:
            if latent < 0 or latent > len(self.layers):
                raise ValueError("Invald interal layer requested")

        x = g.ndata['feat']
        for i, layer in enumerate(self.layers):
            xo = self.activation(layer(g, x))

            # Add prev layer directly, if requested
            if self.residual and i != 0:
                xo = self.activation(xo + x)

            x = xo

            # Return representation, if requested
            if i == latent:
                return x

        return self.final(g, x)

def true_positive(pred, target):
    return (target[pred == 1] == 1).sum().item()

def get_metrics(y, y_pred, threshold=0.5):
    y_ = 1 * (y_pred > threshold)
    tp = true_positive(y_, y)
    precision = tp / ch.sum(y_ == 1)
    recall = tp / ch.sum(y == 1)
    f1 = (2 * precision * recall) / (precision + recall)

    precision = precision.item()
    recall = recall.item()
    f1 = f1.item()

    # Check for NaNs
    if precision != precision:
        precision = 0
    if recall != recall:
        recall = 0
    if f1 != f1:
        f1 = 0

    return (precision, recall, f1)

# @ch.no_grad()
def lmao(model, loader, gpu):
    loss_func = nn.CrossEntropyLoss()

    tot_loss, precision, recall, f1 = 0, 0, 0, 0
    iterator = enumerate(loader)
    iterator = tqdm(iterator, total=len(loader))

    for e, batch in iterator:

        # Shift graph to GPU
        if gpu:
            batch = batch.to('cuda')

        # Get model predictions and get loss
        labels = batch.ndata['y'].long()
        logits = model(batch)
        loss = loss_func(logits, labels)
        probs = ch.softmax(logits, dim=1)[:, 1]

        # Get metrics
        m = get_metrics(labels, probs)
        precision += m[0]
        recall += m[1]
        f1 += m[2]

        tot_loss += loss.item()
        iterator.set_description(
            "Loss: %.5f | Precision: %.3f | Recall: %.3f | F-1: %.3f" %
            (tot_loss / (e+1), precision / (e+1), recall / (e+1), f1 / (e+1)))
    return tot_loss / (e+1)

def epoch(model, loader, gpu, optimizer=None, verbose=False):
    loss_func = nn.CrossEntropyLoss()
    is_train = True
    if optimizer is None:
        is_train = False

    tot_loss, precision, recall, f1 = 0, 0, 0, 0
    iterator = enumerate(loader)
    if verbose:
        iterator = tqdm(iterator, total=len(loader))

    with ch.set_grad_enabled(is_train):
        for e, batch in iterator:

            if gpu:
                # Shift graph to GPU
                batch = batch.to('cuda')

            # Get model predictions and get loss
            labels = batch.ndata['y'].long()
            logits = model(batch)
            loss = loss_func(logits, labels)

            with ch.no_grad():
                probs = ch.softmax(logits, dim=1)[:, 1]

            # Backprop gradients if training
            if is_train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # Get metrics
            m = get_metrics(labels, probs)
            precision += m[0]
            recall += m[1]
            f1 += m[2]

            tot_loss += loss.detach().item()
            if verbose:
                iterator.set_description(
                    "Loss: %.5f | Precision: %.3f | Recall: %.3f | F-1: %.3f" %
                    (tot_loss / (e+1), precision / (e+1), recall / (e+1), f1 / (e+1)))
    return tot_loss / (e+1)

def train_model(net, ds, args):
    train_loader, test_loader = ds.get_loaders(1, shuffle=False)
    optimizer = optim.Adam(net.parameters(), lr=args.lr)

    for e in range(args.epochs):
        # Train
        print("[Train]")
        net.train()
        epoch(net, train_loader, args.gpu, optimizer, verbose=args.verbose)

        # Test
        print("[Eval]")
        net.eval()

        epoch(net, train_loader, args.gpu, None, verbose=args.verbose)
        print()

Expected behavior

Loss/metrics keep improving as the model is trained, so re-evaluating them on the SAME data should indeed show similar performance. Instead, the performance logged while training keeps on improving while checking performance on the same dataset and model in the evaluation model leads to near-random performance. Example of what I'm talking about (evaluation is also done on train data):

Environment

DGL Version: 0.6.1
Backend Library & Version: PyTorch 1.7.1
OS: Linux
How you installed DGL: pip
Python version: 3.6.10
CUDA/cuDNN version: 10.1/7.5.0
GPU models and configuration: NVidia Quadro RTX 4000

Additional context

Error persists without GPU as well (training on CPU)

Rhett-Ying commented 3 years ago

Here's a straight-forward example for GCN with GraphConv: https://github.com/dmlc/dgl/tree/master/examples/pytorch/gcn. I tried to specify norm as 'right' in 'gcn.py', evaluation improvement is expected. didn't hit the issue you mentioned. could you take a look at this?

how may batches in your case(the last for loop in the code snippet below)? only 1?

  def epoch(model, loader, gpu, optimizer=None, verbose=False):
      loss_func = nn.CrossEntropyLoss()
      is_train = True
      if optimizer is None:
          is_train = False

      tot_loss, precision, recall, f1 = 0, 0, 0, 0
      iterator = enumerate(loader)
      if verbose:
          iterator = tqdm(iterator, total=len(loader))

      with ch.set_grad_enabled(is_train):
          for e, batch in iterator:

iamgroot42 commented 3 years ago

I tried this architecture (with 'right' norm specified) as well and got the same behavior. Regarding batch size: I tried 1, 2, and 4, and got the same issue with all of them.

Rhett-Ying commented 3 years ago

I tried this architecture (with 'right' norm specified) as well and got the same behavior.

this architecture here means the model in dgl example on your dataset? or the whole program including main func?

iamgroot42 commented 3 years ago

On my dataset (using the code I provided above). I even tried replacing the data loader with a basic list of graphs to rule out any issues that could have crept in because of a faulty data loader. Even this change does not help at all. It is a bit mind-boggling to me why this issue only appears with norm='right' is used instead of 'both', since there's isn't THAT much of a difference in the model itself?

Rhett-Ying commented 3 years ago

could you paste the dataloader with a basic list of graphs you just mentioned ? then I could repro it

iamgroot42 commented 3 years ago

It's not a standard dataset. Here's the link: https://github.com/harvardnlp/botnet-detection

Rhett-Ying commented 3 years ago

which dataset are you working on? 'chord' or others?

iamgroot42 commented 3 years ago

'chord'

Rhett-Ying commented 3 years ago

I tried with below configs using your code/model, but the precision is always zero. could you share your configs?

g = botnet_dataset_train[0] in_feats = g.ndata['x'].shape[1] n_hidden = 16 n_layers = 2 n_classes = 2 n_epochs = 5

[Train]
Loss: 0.25807 | Precision: 0.000 | Recall: 0.003 | F-1: 0.000: 100%|█| 384/384 [00:34<00:00 ------- total cnt in dataloader: 0 [Eval]
Loss: 0.25234 | Precision: 0.000 | Recall: 0.000 | F-1: 0.000: 92%|▉| 354/384 [00:26<00:02

iamgroot42 commented 3 years ago

The models take a few epochs to start learning, so precision/recall will stay close to 0 for the first 4-5 epochs.

Rhett-Ying commented 3 years ago

I tried to run more epochs, but precision stays at 0.0 even in 15th epoch

iamgroot42 commented 3 years ago

That's strange. Here's the configuration I used:

n_hidden = 32 n_layers = 6

Rhett-Ying commented 3 years ago

I tried to modify the code/model under //dgl/examples/pytorch/gcn/train.py to mock your code/model. the remained main difference is the precision logic. I used below logic which generates 0.93+ precision. does it make sense?

      for e,batch in enumerate(dataloader):
            features = batch.ndata['x']
            labels = batch.ndata['y'].long()
            logits = model(batch)
            loss = loss_fcn(logits, labels)
            _, indices = torch.max(logits, dim=1)
            correct = torch.sum(indices == labels)

            if isTrain:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            c_acc = correct.item() * 1.0 / len(labels)
            print("------- current acc: {}, batch_id: {}, batch_size: {}".format(c_acc, e, batch.num_nodes()))

------- current acc: 0.9307836330978824, batch_id: 289, batch_size: 288949 ------- current acc: 0.9291975247454651, batch_id: 290, batch_size: 282476 ------- current acc: 0.9288959676903277, batch_id: 291, batch_size: 281278 ------- current acc: 0.9310556415444915, batch_id: 292, batch_size: 290089 ------- current acc: 0.9306107296628723, batch_id: 293, batch_size: 288229

iamgroot42 commented 3 years ago

This seems to be computing accuracy, which would be quite high if the model learns to always say 0 (since the data is heavily unbalanced), which is why I chose to look at precision/recall/F-1 score)

Rhett-Ying commented 3 years ago

I tried with your config (n_hiddens=32, n_layers=6), precision stays at 0.0 even more epochs runs. But here I set GraphConv(norm='right').

If I change to norm='both', precision becomes much better in the beginning epochs. But it fluctuates/drops a lot. Precision could drop to 0.000 and seems stay at 0.000 then. Is such behavior consistent with yours? Have u ever hit such issue?

[Train]~Epoch_0 Loss: 0.14862 | Precision: 0.712 | Recall: 0.530 | F-1: 0.579: 100%|█| 384/384 [00:53<00:00 [Eval]~Epoch_0 Loss: 0.04104 | Precision: 0.959 | Recall: 0.914 | F-1: 0.936: 100%|█| 384/384 [00:41<00:00

[Train]~Epoch_1 Loss: 0.03022 | Precision: 0.975 | Recall: 0.938 | F-1: 0.956: 100%|█| 384/384 [00:53<00:00 [Eval]~Epoch_1 Loss: 0.02302 | Precision: 0.986 | Recall: 0.945 | F-1: 0.965: 100%|█| 384/384 [00:41<00:00

[Train]~Epoch_2 Loss: 0.18103 | Precision: 0.399 | Recall: 0.305 | F-1: 0.322: 100%|█| 384/384 [00:53<00:00 [Eval]~Epoch_2 Loss: 0.25238 | Precision: 0.000 | Recall: 0.000 | F-1: 0.000: 100%|█| 384/384 [00:41<00:00

[Train]~Epoch_3 Loss: 0.25235 | Precision: 0.000 | Recall: 0.000 | F-1: 0.000: 100%|█| 384/384 [00:51<00:00 [Eval]~Epoch_3 Loss: 0.25236 | Precision: 0.000 | Recall: 0.000 | F-1: 0.000: 100%|█| 384/384 [00:40<00:00

[Train]~Epoch_4 Loss: 0.25235 | Precision: 0.000 | Recall: 0.000 | F-1: 0.000: 100%|█| 384/384 [00:51<00:00 [Eval]~Epoch_4 Loss: 0.25234 | Precision: 0.000 | Recall: 0.000 | F-1: 0.000: 100%|█| 384/384 [00:41<00:00

[Train]~Epoch_5 Loss: 0.25235 | Precision: 0.000 | Recall: 0.000 | F-1: 0.000: 100%|█| 384/384 [00:52<00:00 [Eval]~Epoch_5 Loss: 0.25235 | Precision: 0.000 | Recall: 0.000 | F-1: 0.000: 100%|█| 384/384 [00:40<00:00

iamgroot42 commented 3 years ago

For the norm='both' case: I didn't observe any such fluctuations- my metrics stayed around 0.9+ pretty consistently. For the norm='right' case: What I observed (the image I attached) was that while training metrics logged would be great, but looking at them in eval mode would show near-random performance. That is the part that I found most concerning- a drastic difference in performance on the same data of the same model

Rhett-Ying commented 3 years ago

Quick questions:

have tried to eval with test_dataloader on the model trained by GraphConv(norm='right')? the results is random low?
the issue exists in any batch_size?
I still cannot repro your issue completely with norm='right'. Precision is always 0.000 even in train epoch. I tried with batch_size_1/2/4/8. norm='both' works well. so could you share your whole code then I could try to repro?

BTW,

have you ever checked below ticket? It's a similar one though not based on DGL and several years ago. As the moderator pointed out, the degrade issue is probably caused by unstable model. https://discuss.pytorch.org/t/performance-highly-degraded-when-eval-is-activated-in-the-test-phase/3323/16

So I am wondering the issue u hit may be related to the model stability (which could be caused by GraphConv(norm='right'), batch_size) or even the metric you're using for precision/recall. And the y==1 is very low compared to y==0 in each graph. I we count the prediction of y==0 into your precision, the model seems work good even with norm='right'.

So if I say: train with GraphConv(norm='right') and measure with your metrics(count only y==1) on a dataset that's heavily unbalanced(len(y==0) >> len(y==1)) is not a proper way, trained model is not stable. Does it make sense?

iamgroot42 commented 3 years ago

Yes, the results were low. My main concern is not with performance on eval data being low- instead, I am surprised by the model having near-random performance on train data when run in eval mode.
Yes- I tried bs=1 (what the authors used), 2, 4, and 8.
Sure! I've added it to a gist here - hope it helps! Running python gist.py --norm both gives an F-1 score of ~0.7 after 2 epochs on both train and test data. However, python gist.py --norm right exhibits the problem I opened this Github issue for:

Regarding the ticket: the issue in the discussion on that thread is related to the working of batch-norm layers. It is not applicable here, since this model does not have any such normalization layers.

As far as evaluation goes, the label=1 class here is the one we care about (and the one in minority). Thus, by definition, the computation for precision/recall should be based on y==1. Here is a reference to the original repository that also uses y==1 for their computations.

BTW I really appreciate all the time and effort you're putting into resolving this :) thank you

Rhett-Ying commented 3 years ago

DROPPUT, DROPOUT, DROPOUT

Finally, issue could be reproduced in my side. The reason why I cannot repro is no dropout is configured in the code snippet you pasted at the top of this post. Dropout is configured in the 'gist.py' you just shared.

As for the issue, I'd like to blame dropout which is the main difference between model.train() and model.eval(). Why do I blame to dropout? If dropout=0.0 when calling model.train() with norm='right', the precision is always ~0.000(this is what I reproduced before), not mention model.eval() on train_loader and test_loader. In other words, if dropout=0.0, model.train() is almost same as model.eval() because no dropout at all. But if dropout=0.5, this takes effect in model.train() which obtains good precision(>0.7) while no dropout at all in model.eval() which results in 0.000 precision.

In short, model is vulnerable and sensitive to dropout if norm='right'. If norm='both', model is more robust and less sensitive to dropout even dropout=0.0, according to my experiment.

I think we'd better train with GraphConv(norm='both'), dropout=0.5 to obtain a robust model in this scenario.

iamgroot42 commented 3 years ago

Ah, I see - I removed dropout to help with debugging, bur didn't realise it would make THAT big a difference.

In terms of the explanation- this doesn't make much sense to me. Having dropout would help with overfitting, if anything.

The fact that the model has drastically different performance between train() and eval() modes for norm='right' is very weird. Even though dropout functionality is off in evaluation mode, training mode already scales the dropout layer so that the eval mode does not cause an issue.

What I feel could be happening is that most activations in the layers are zero for norm='right', so at training-time the non-zero parts that the model retains (for any given forward pass) are scaled up. But in eval mode, the outputs remain pretty similar in terms of L0 norm. Since the dropout scaling assumes a uniform distribution in its assumption while scaling, this problem leads to a shift in expected activations. I'll have to look inside the model activations and their trends to confirm/deny this possibility.

Regarding performance- I too would want to simply use the norm='right' setting. However, the dataset paper I had linked above uses norm='right' in their experiments (along with a logical reasoning, given the dataset structure) and hence I wanted to reproduce their model for experiments for my own project.

Rhett-Ying commented 3 years ago

I tried to check the value in each 'xo' which is after activation but before dropout(set as 0.5) with norm='right'.

For comparison norm='both' ~ train stage: raito of that > 0.0001: [0.41208799799862406, 0.40726871251763364, 0.4741714790029256, 0.46045523484895867, 0.48465742072674967, 0.3326060538147754]
Loss: 0.09133 | Precision: 0.865 | Recall: 0.687 | F-1: 0.764 | Avg-logits(0) : 0.530 | Avg

norm='both' ~ eval stage: raito of that > 0.0001: [0.41358413216756207, 0.3351206151901401, 0.4710416008286808, 0.4643259068085256, 0.4586827135784262, 0.39094376558165667]
Loss: 0.10731 | Precision: 0.894 | Recall: 0.917 | F-1: 0.905 | Avg-logits(0) : -0.689 | A

what do you think of this?

iamgroot42 commented 3 years ago

Hmm - it seems like a good fraction of the activations is indeed nonzero and not as drastically lower than the norm='both' case than I had anticipated. I am even more confused now 🐱 Perhaps a better understanding of the different norm methods could help us out here

Rhett-Ying commented 3 years ago

yes. I will look deep into the implementation of GraphConv(norm='right'). will get back to you if any new/more findings.

BarclayII commented 3 years ago

The reason why the code fails is quite subtle; it is related to how you set the input features and how norm='right' works.

First, note that norm='right' means averaging the messages, while norm='both' divides the messages with a factor of sqrt(d_u * d_v) where d_u and d_v represents the degree of the incident nodes.

Now, I looked into your code and found that the input features are the same for all nodes (a single 1). If norm='right', meaning that you simply average the messages, things certainly won't work because the average of the same thing is still the same. As a result, the model will simply predict the same class for every single node. This is the reason why you always get an F1 of 0 in evaluation, and in training with dropout probability of 0. norm='both', in contrast, does not simply average the messages: it adjusts the weighting of each message according to the degrees of incident nodes, thereby giving you different values for each node.

You also observed that training performance looked quite well when dropout probability is not zero. The reason is that in this case node features will randomly change, which ultimately gives you different values for each node. I think your specific case is feeding in the same feature for every node to a 6-layer GNN, with dropouts between the layers. This is, coincidentally, identical to random node feature initialization with Bernoulli distributions for a 5-layer GNN. It is known that random node feature initialization improves GNNs (see here and here).

So if you absolutely want to use norm='right', you'll need to assign different features to every node - either random features or some other handcrafted features.

Feel free to follow up if you have more questions.

iamgroot42 commented 3 years ago

Thanks a lot for the detailed analysis, @BarclayII ! However, I am not sure I fully agree with your explanation here.

This is a bidirected graph, so the in-degrees and out-degrees are the same. In the norm='right' mode, as visible in the code definition here:

degs = graph.in_degrees().to(feat.device).float().clamp(min=1)
norm = 1.0 / degs

As you can see, the output activation indeed depends on the node degrees. Even if all node features are the same, the graph will output different features based on the degrees of nodes and not the same features for all nodes, as you suggested. Please let me know if I am missing something here :)

The architecture I posted above has been used in existing work (off of which I based this experiment) and reached an F-1 score upwards of 0.9. The only difference between their implementation and this one is the library used: they used torch_geometric, while this code is for dgl.

BarclayII commented 3 years ago

This is a bidirected graph, so the in-degrees and out-degrees are the same.

The in-degree and out-degree are the same for the same node. However the denominator of both is the square root of the product between the out-degree of source node and in-degree of destination node, which are not necessarily the same.

As you can see, the output activation indeed depends on the node degrees. Even if all node features are the same, the graph will output different features based on the degrees of nodes and not the same features for all nodes, as you suggested.

Before the code you showed, the output representation is computed via summing the incoming messages. Since the number of incoming messages of a node is the same as the node's in-degree, the output will be the same value.

The architecture I posted above has been used in existing work (off of which I based this experiment) and reached an F-1 score upwards of 0.9. The only difference between their implementation and this one is the library used: they used torch_geometric, while this code is for dgl.

The difference between their normalization and ours is that they divide the outgoing messages by out-degrees before message passing. That is OK.

If I write down the equations things will get clearer. Assuming that x is the same input feature for all nodes.

Theirs: $\sum_{j\in \mathcal{N}(i)} \frac{1}{d_j} x$
Our both: $\sum_{j\in \mathcal{N}(i)} \frac{1}{\sqrt{d_j d_i}} x$
Our right: $\sum_{j\in \mathcal{N}(i)} \frac{1}{d_i} x = x$

With DGL 0.6+ you can specify your own normalization weights using the EdgeWeightNorm module, though I can add another normalization option in GraphConv if you want to.

iamgroot42 commented 3 years ago

It all makes sense now! I am not very familiar with GCNs, so at first glance (looking at their paper), it seemed that their normalization was the same as DGL's right method. Thanks (to you, as well as @Rhett-Ying) for the clear explanation and for taking out the time for it :)

Knowing this difference now, I think I should be able to implement the appropriate code, but I would, of course, appreciate it if it could be part of the library as well!

iamgroot42 commented 3 years ago

@BarclayII I tried the modified logic for degree normalization as you suggested, but it seems even that does not make any difference? Here is the updated gist. Unless I misunderstood the suggested change, it looks like something is still off?

BarclayII commented 3 years ago

In the updated gist you are still dividing the aggregated result after message passing. What they did is to divide the node representations before message passing. So you will need something like:

feat_src = feat_src / degs
g.ndata['h'] = feat_src
g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h'))

iamgroot42 commented 3 years ago

Great catch! I updated the gist as per your suggestion, but in that case, the loss and logit values shoot up to ridiculously high values, so I'm not sure what's happening here 😅

BarclayII commented 3 years ago

Hmm which version of DGL are you using? I'm using DGL 0.6 and PyTorch 1.7.1 + CUDA 10.1. I ran your code and got 0.7 F1 within two epochs.

iamgroot42 commented 3 years ago

I had the activation turned on at my end of the code (which caused some issues regarding no gradients flowing back, leading to ~0 F-1 consistently, at least in the first 8 epochs - I stopped after that). Nonetheless, the fact that it worked without non-linearities (I checked it at my end) means it is now working as desired. Thanks for all the help! :)

dmlc / dgl