danielegrattarola / spektral

Graph Neural Networks with Keras and Tensorflow 2.
https://graphneural.network
MIT License

Node classification masking on batch mode #360

Open claudiocapanema opened 2 years ago

claudiocapanema commented 2 years ago

Hi!

I performed a node classification task using masking in batch mode, and it runs correctly. However, I also need to compute more metrics for the model using scikit-learn's classification report (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html). I have observed that in my problem there are 12k nodes in total across all graphs, yet the tensor of predictions returned by the model covers far more, about 32k (summing the nodes of all graphs). This is probably due to the added masked (padded) samples. The point is that I need to extract the real predictions from the model's output, since some of them correspond to masked samples; then I can feed the classification report with the predictions and labels of the real samples only. How can I do that?

Thank you!

danielegrattarola commented 2 years ago

There is no clean way to do it. The best solution is to get the batch directly from the BatchLoader and then you can retrieve the mask as:

for batch in loader:
    (x, a), y = batch
    mask = x[..., -1]  # the mask is appended as the last node feature

This must be done before calling the model, because the model will get rid of the dummy feature representing the mask.
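
For example (just a sketch, assuming the loader was created with mask=True and node_level=True, and that model is your Keras model containing the GraphMasking layer):

for batch in loader:
    (x, a), y = batch
    mask = x[..., -1].astype(bool)             # True for real nodes, False for padding
    y_pred = model.predict_on_batch([x, a])    # (batch_size, n_nodes, n_classes)
    y_pred_real = y_pred[mask]                 # predictions for real nodes only
    y_true_real = y[mask]                      # labels for real nodes only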

Cheers

claudiocapanema commented 2 years ago

@danielegrattarola thank you. One more question:

I did this and it works now after some modifications, but I am still having problems. This "for" loop iterates infinitely; I also tried larger batch sizes, and it still runs forever. Therefore, I decided to stop the loop when the batch count reaches "int(len(dataset_te)/batch_size)". Now I am facing another problem: the accuracy shown by "model.fit" (i.e., the accuracy on the test dataset) is different from the accuracy calculated by scikit-learn's "classification_report".

danielegrattarola commented 2 years ago

To do just one pass over the dataset, set epochs=1 when creating the loader, otherwise it will loop forever. For the accuracy, make sure that the inputs to the sklearn function are exactly in the correct format; this kind of discrepancy is usually due to some silent broadcasting that sklearn does.
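
Roughly, the full evaluation could look like this (a sketch, untested; dataset_te, model, and the one-hot label format are assumptions based on your description):

import numpy as np
from sklearn.metrics import classification_report
from spektral.data import BatchLoader

# epochs=1 -> the loader stops after a single pass over the dataset
loader_te = BatchLoader(dataset_te, batch_size=32, epochs=1,
                        mask=True, node_level=True, shuffle=False)

y_true, y_pred = [], []
for (x, a), y in loader_te:
    mask = x[..., -1].astype(bool)         # (batch_size, n_nodes), True for real nodes
    p = model.predict_on_batch([x, a])     # (batch_size, n_nodes, n_classes)
    y_pred.append(p[mask].argmax(-1))      # class indices of real nodes only
    y_true.append(y[mask].argmax(-1))

print(classification_report(np.concatenate(y_true), np.concatenate(y_pred)))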

claudiogscc commented 2 years ago

Hi @danielegrattarola. I am going to present a short course at the end of this month using the Spektral library, which is why I am very interested in this. I think you were referring to the "batch_size" parameter when creating the "BatchLoader", right?

Alright, so I conducted several tests, varying the "batch_size" from 1 to "len(dataset_te)", and I am still having the same problem. I have also investigated whether there is any hidden issue and found two more:

1 - As you said, I need to stop the Loader iteration because it runs infinitely. I could do this by using a count of iterations or by stopping when the current batch is exactly the same as the first one. I believe that this last option is safer. Therefore, I used the code below:

import numpy as np

first_adjacency_bool = True
first_adjacency = []
i = 0
for batch in loader_va:
    (x, a), y = batch
    mask = x[:, :, -1]
    print("iteration: ", i)
    i += 1

    # get first adjacency matrix
    if first_adjacency_bool:
        first_adjacency = a
        first_adjacency_bool = False
    else:
        # stop infinite iteration when first and current adjacency matrices are equal
        if np.equal(a, first_adjacency).all():
            print("--------")
            print("finished: ", i)
            print("current adjacency")
            print(a)
            print("previous adjacency")
            print(first_adjacency)
            print("--------")
            break

The problem is that this loop stops at the second iteration, because the adjacency matrices of the two batches are EQUAL. So the question is: is the loader returning the same batches? I have tested this with different batch sizes.

2 - The total number of nodes across all the graphs that I am using is over 829k. If I set "batch_size=1" and apply the mask to get the real nodes, the total number of nodes is far lower, about 80k. The same problem occurs for other values of "batch_size".

danielegrattarola commented 2 years ago

No, I was referring to the epochs parameter of BatchLoader, which controls the number of times that the loader will loop over the entire dataset. The typical training loop for neural networks is

for epoch in range(epochs):
    for batch in batches:
        train_model(batch)

The epochs parameter of a loader determines the number of iterations of the outer loop, while the batch_size parameter determines the number of samples that make up a batch.
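
For example (a sketch; dataset_va and the batch size are placeholders):

from spektral.data import BatchLoader

# epochs=1: the loader yields each batch exactly once and then stops,
# so a plain `for batch in loader_va:` loop terminates by itself.
loader_va = BatchLoader(dataset_va, batch_size=8, epochs=1,
                        mask=True, node_level=True, shuffle=False)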

It would be helpful if the code you posted were properly indented; as it is, I can't really tell what the problem could be. Anyway, there is no need to check whether the adjacency matrix is equal: just set epochs=1 and the loop will do what you want.

For 2: the total number of nodes in the dataset is not the relevant quantity here. You care about the number of nodes in a single graph, which from what you say would be closer to 80k in your case. Your batches will have batch_size graphs with n_nodes nodes each, where n_nodes is the largest number of nodes of any graph in the batch (smaller graphs are zero-padded). You really should read the documentation of the loader to better understand what's going on: https://graphneural.network/loaders/#batchloader
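
Concretely, you can check the shapes of one batch (a sketch, assuming a loader created with mask=True and node_level=True as above):

for (x, a), y in loader_va:
    # n_max = number of nodes of the largest graph in the batch
    print(x.shape)   # (batch_size, n_max, n_node_features + 1); the extra channel is the mask
    print(a.shape)   # (batch_size, n_max, n_max)
    print(y.shape)   # (batch_size, n_max, n_classes) for node-level labels
    break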

claudiogscc commented 2 years ago

@danielegrattarola Ok, I got it. Now, it works as you suggested.

Now, the main problem is the difference between the accuracy reported during training (80%) and the one computed by scikit-learn's "classification_report" (41%). Just to recap, I am doing a node classification task using the "BatchLoader" with mask=True and node_level=True, and I am also using the GraphMasking layer. I have observed that this layer removes the last dimension of "X", which contains the mask. It seems to me that the model treats the masked (padded) nodes as valid, and that is why there is a difference between the performance shown during training and the one given by the "classification_report".

A very important thing that I am doing is the following: after training the model, I remove the masked nodes from the predictions returned by predictions = model.predict(test_data). This is done to keep only the predictions made for real nodes. I then pass those predictions (and the corresponding labels) to the "classification_report" to get the results.

Don't you think that this should also be done in the training step? That is, the model's predictions should be filtered down to the real nodes before calculating the loss and accuracy during training.

danielegrattarola commented 2 years ago

Yes, I think you might be correct. The GraphMasking layer was mostly designed for graph-level prediction, so I don't think I've ever seen it used for node-level prediction.

My suggestion is still to write your own training loop and take care of the mask manually.

claudiogscc commented 2 years ago

Do you have a clue about how to do so? Here is what I tried:

@tf.function
def train(x, a, y, mask):
    with tf.GradientTape() as tape:
        predictions = model(inputs=[x, a], training=True)
        predictions_masked = predictions[mask]
        loss = loss_fn(y, predictions_masked)
        loss += sum(model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

I cannot do it, because predictions is a tensor of shape (batch_size, N, n_classes), which means that for each graph I have to index/select the real nodes. I imagined doing this with a for loop, but TensorFlow does not let us do that.
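
One way to avoid the per-graph loop (a sketch, not an official recipe; it assumes mask is the last feature channel taken from x before calling the model, and that y has shape (batch_size, N, n_classes)) is tf.boolean_mask, which accepts a 2-D mask and flattens the batch and node axes together:

import tensorflow as tf

@tf.function
def train(x, a, y, mask):
    mask = tf.cast(mask, tf.bool)  # (batch_size, N), True for real nodes
    with tf.GradientTape() as tape:
        predictions = model(inputs=[x, a], training=True)           # (batch_size, N, n_classes)
        # Flatten the batch and node axes, keeping only rows of real nodes
        predictions_masked = tf.boolean_mask(predictions, mask)     # (n_real_nodes, n_classes)
        y_masked = tf.boolean_mask(y, mask)                         # (n_real_nodes, n_classes)
        loss = loss_fn(y_masked, predictions_masked)
        loss += sum(model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

Here tf.boolean_mask replaces the numpy-style predictions[mask] indexing, which TensorFlow tensors do not support with a boolean array, so no explicit loop over graphs is needed.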