danielegrattarola / spektral

Graph Neural Networks with Keras and Tensorflow 2.
https://graphneural.network
MIT License

No supported conversion for types: (dtype('O'),) #259

Open DavideCerbarano opened 3 years ago

DavideCerbarano commented 3 years ago

Hi Daniele and all,

I'm trying to use DisjointLoader to train a GCN for node-level prediction. My dataset consists of a list of graphs with different shapes and large sparse adjacency matrices, which I want to use as a training set so that I can make predictions for individual nodes on a new test graph. The problem is that when I call model.fit(DisjointLoader(dataset, node_level=True).load()), this error occurs:

TypeError: no supported conversion for types: (dtype('O'),) [[{{node EagerPyFunc}}]] [[IteratorGetNext]] [Op:__inference_train_function_763] Function call stack: train_function

I can't work out which element has the dtype('O') that the error refers to. Here is a piece of the code I used:

import os
import pickle
import numpy as np
import pandas as pd
from spektral.data import Dataset, Graph
import networkx as nx
from tqdm import tqdm
from spektral.layers.ops import sp_matrix_to_sp_tensor
from spektral.utils.convolution import gcn_filter
from mygcn import MyGCN
from spektral.data.loaders import DisjointLoader
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from keras import metrics
from sklearn.preprocessing import RobustScaler

def prepare_data(df, adjacency):
    keyDict = df.index
    dict_list = {key: neighbors for key, neighbors in zip(keyDict, adjacency)}

    scaler = RobustScaler()
    node_features = scaler.fit_transform(df)

    labels = df['H2.Volume Fraction']
    labels= log_transform(labels)

    return node_features, labels, dict_list

class Jet_list(Dataset):

    def __init__(self, simulations,  dtype=np.float32, **kwargs):
        if hasattr(dtype, "as_numpy_dtype"):
            # support tf.dtypes
            dtype = dtype.as_numpy_dtype

        self.dtype=dtype
        self.simulations= simulations
        super().__init__(**kwargs)
        return

    def read(self):
        list_of_graphs = []
        for i in tqdm(range(len(self.simulations))):
            if self.simulations[i] in self.simulations:
                path= 'data/' + self.simulations[i] +'/'
                df= pd.read_hdf(path + 'df.h5')
                with open(path + '\\adj.txt', "rb") as fp:  # Unpickling
                    adjacency = pickle.load(fp)
                x, y, dict_list = prepare_data(df=df, adjacency=adjacency) #preprocessing

                a = nx.adjacency_matrix(nx.from_dict_of_lists(dict_list))  # CSR
                a = gcn_filter(a)  # normalize A and add self-loops: D^-0.5 (A+I) D^-0.5
                a = sp_matrix_to_sp_tensor(a)
                a = tf.cast(a, dtype=tf.float32)

                if not os.path.exists(path + '\\IO'):
                    os.makedirs(path + '\\IO')

                list_of_graphs.append(Graph(x=x.astype(self.dtype), y=y.astype(self.dtype), a=a))

        return list_of_graphs

simulations = ['FLU-10','FLU-11','FLU-13', 'FLU-14','FLU-15','FLU-16','FLU-17','FLU-18']

dataset= Jet_list(simulations)

learning_rate = 1e-2
seed = 0 
epochs = 1000
patience = 400 
tf.random.set_seed(seed=seed)

model = MyGCN(n_input_channels=dataset.n_node_features, dropout_rate=0.3,output_activation='relu')
model.compile(loss='mse', optimizer=Adam(learning_rate), metrics=[metrics.mean_squared_error, metrics.mean_absolute_error])

loader= DisjointLoader(dataset, batch_size=1, node_level=True, shuffle=False)

model_history= model.fit(
        loader.load(),
        steps_per_epoch=loader.steps_per_epoch,
        epochs=epochs,
        callbacks=[EarlyStopping(patience=patience, restore_best_weights=True)])

Here, Jet_list subclasses spektral.data.Dataset and generates the list of graphs by reading external data. From what I found elsewhere, the bug seems to be due to incompatible versions of keras, tensorflow and spektral, but I'm not sure. Currently I'm using keras 2.4.3, tensorflow 2.5.0, python 3.7.

Could anyone give me a suggestion?

danielegrattarola commented 3 years ago

Hi,

it's funny, I ran into this very same issue yesterday :) The problem is that you're converting the adjacency matrix to a sparse tensor before storing it in the Graph object, but the DisjointLoader expects the graph attributes to be numpy/scipy objects.

Remove these lines:

a = sp_matrix_to_sp_tensor(a)
a = tf.cast(a, dtype=tf.float32)

and you should be good.
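
A minimal sketch of what the adjacency preprocessing could look like once those two lines are gone, assuming the goal is simply to keep `a` as a SciPy sparse matrix. The `normalize_adjacency` helper is a hypothetical, hand-written stand-in for `gcn_filter` (it is not Spektral's implementation), and the 4-node graph is made up for illustration.

```python
import numpy as np
import scipy.sparse as sp


def normalize_adjacency(a):
    """Hypothetical stand-in for gcn_filter: D^-1/2 (A + I) D^-1/2, kept sparse."""
    a = sp.csr_matrix(a, dtype=np.float32)
    a_hat = a + sp.eye(a.shape[0], dtype=np.float32, format="csr")  # add self-loops
    degrees = np.asarray(a_hat.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(degrees))  # degrees >= 1 thanks to self-loops
    return (d_inv_sqrt @ a_hat @ d_inv_sqrt).tocsr().astype(np.float32)


# Toy path graph 0-1-2-3 (made-up data, only for illustration).
rows = np.array([0, 1, 1, 2, 2, 3])
cols = np.array([1, 0, 2, 1, 3, 2])
a = sp.csr_matrix((np.ones(6), (rows, cols)), shape=(4, 4))

a_norm = normalize_adjacency(a)

# No conversion to tf.SparseTensor here: the Graph would receive a_norm as-is.
print(type(a_norm).__name__, a_norm.dtype, a_norm.nnz)
```

In the script above, the equivalent would be to pass the output of `gcn_filter(a)` straight to `Graph(..., a=a)`.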

Cheers

DavideCerbarano commented 3 years ago

Hi Daniele,

thank you for your answer! It works, but now I get a memory allocation error:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 236. GiB for an array with shape (177958, 177958) and data type float64

Here, (177958, 177958) is the shape of the first sparse adjacency matrix in the list of graphs generated by Jet_list. As I said, my dataset consists of dozens of very large graphs with sparse adjacency matrices. The weird thing is that if I generate a dataset with only one graph, say only the first graph with 177958 nodes, and use a SingleLoader as the data generator, I don't get any problem when I call model.fit(SingleLoader(dataset).load()).
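
The figure in the error message is consistent with a dense float64 array of that shape. A quick back-of-the-envelope check (plain arithmetic, nothing Spektral-specific) is sketched below. The `nnz_per_node` value used for the sparse estimate is an assumed number chosen only for illustration, not something taken from the dataset.

```python
n_nodes = 177958          # from the error message
bytes_per_float64 = 8

dense_bytes = n_nodes * n_nodes * bytes_per_float64
dense_gib = dense_bytes / 2**30
print(f"dense float64: {dense_gib:.1f} GiB")

# Rough CSR estimate under an ASSUMED average of 10 non-zeros per row:
# float32 values + int32 column indices + int32 row pointers.
nnz_per_node = 10  # assumption for illustration only
nnz = n_nodes * nnz_per_node
sparse_bytes = nnz * 4 + nnz * 4 + (n_nodes + 1) * 4
sparse_mib = sparse_bytes / 2**20
print(f"sparse CSR (assumed density): {sparse_mib:.1f} MiB")
```

So the allocation only makes sense if something in the pipeline is densifying the adjacency matrix.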

Here is the working code that I used to generate the dataset with only one graph, which is basically the same as in the previous case:

class Jet(Dataset):
    def __init__(self, simulation, dtype=np.float32, **kwargs):
        if hasattr(dtype, "as_numpy_dtype"):
            dtype = dtype.as_numpy_dtype

        self.dtype = dtype
        self.simulation = simulation
        super().__init__(**kwargs)
        return

    def read(self):
        path = 'data/' + self.simulation + '/'
        df = pd.read_hdf(path + 'df.h5')
        with open(path + '\\adj.txt', "rb") as fp:  # Unpickling
            adjacency = pickle.load(fp)
        x, y, dict_list = prepare_data(df=df, adjacency=adjacency)  # preprocessing

        a = nx.adjacency_matrix(nx.from_dict_of_lists(dict_list))  # CSR
        a=gcn_filter(a) 

        return [Graph(x=x.astype(self.dtype), y=y.astype(self.dtype), a=a)]

dataset = Jet(simulation='FLU-10')
loader= SingleLoader(dataset)
model.fit(loader.load())

In contrast, if I generate a dataset with Jet_list containing only one graph (i.e. taking simulations = ['FLU-10']), the memory error occurs when I call model.fit(DisjointLoader(dataset, node_level=True).load()).

Maybe I'm using DisjointLoader in the wrong way. I've also tried BatchLoader and PackedBatchLoader, but they still give me memory problems. I'm not very experienced with this kind of data structure, so I'd be glad if you could give me some advice :)

Thank you

danielegrattarola commented 3 years ago

Can you post the full stack trace that you get when the error occurs? Thanks

DavideCerbarano commented 3 years ago

Hi,

here it is.

2021-08-02 14:16:57.183435: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (mklcpu) ran out of memory trying to allocate 37.25GiB (rounded to 40000000000)requested by op mean_squared_error/SquaredDifference
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2021-08-02 14:16:57.184401: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for mklcpu
2021-08-02 14:16:57.184546: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (256):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.184808: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.185076: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1024):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.185317: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2048):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.185482: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4096):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.193843: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8192):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.194302: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16384):     Total Chunks: 6, Chunks in use: 6. 96.0KiB allocated for chunks. 96.0KiB in use in bin. 96.0KiB client-requested in use in bin.
2021-08-02 14:16:57.194726: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (32768):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.194903: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (65536):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.195070: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (131072):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.195268: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (262144):    Total Chunks: 1, Chunks in use: 1. 390.8KiB allocated for chunks. 390.8KiB in use in bin. 390.6KiB client-requested in use in bin.
2021-08-02 14:16:57.195761: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (524288):    Total Chunks: 2, Chunks in use: 1. 1.52MiB allocated for chunks. 780.0KiB in use in bin. 390.6KiB client-requested in use in bin.
2021-08-02 14:16:57.196176: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1048576):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.196331: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2097152):   Total Chunks: 1, Chunks in use: 1. 2.29MiB allocated for chunks. 2.29MiB in use in bin. 2.29MiB client-requested in use in bin.
2021-08-02 14:16:57.196593: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4194304):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.196847: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8388608):   Total Chunks: 1, Chunks in use: 1. 8.01MiB allocated for chunks. 8.01MiB in use in bin. 8.01MiB client-requested in use in bin.
2021-08-02 14:16:57.197102: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16777216):  Total Chunks: 3, Chunks in use: 3. 85.70MiB allocated for chunks. 85.70MiB in use in bin. 73.24MiB client-requested in use in bin.
2021-08-02 14:16:57.197324: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (33554432):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.197587: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (67108864):  Total Chunks: 1, Chunks in use: 1. 64.00MiB allocated for chunks. 64.00MiB in use in bin. 32.04MiB client-requested in use in bin.
2021-08-02 14:16:57.197822: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.198052: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.198262: I tensorflow/core/common_runtime/bfc_allocator.cc:1014] Bin for 37.25GiB was 256.00MiB, Chunk State: 
2021-08-02 14:16:57.198403: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 2097152
2021-08-02 14:16:57.198921: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a33080 of size 16384 by op Fill action_count 2166818383707 step 0 next 1
2021-08-02 14:16:57.199063: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a37080 of size 16384 by op Fill action_count 2166818383708 step 0 next 2
2021-08-02 14:16:57.199318: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a3b080 of size 16384 by op Add action_count 2166818383699 step 0 next 3
2021-08-02 14:16:57.199480: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a3f080 of size 16384 by op Add action_count 2166818383704 step 0 next 4
2021-08-02 14:16:57.199672: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a43080 of size 16384 by op Fill action_count 2166818383709 step 0 next 5
2021-08-02 14:16:57.199871: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a47080 of size 16384 by op Fill action_count 2166818383710 step 0 next 6
2021-08-02 14:16:57.200079: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Free  at 1f881a4b080 of size 800000 by op UNUSED action_count 2166818383715 step 0 next 9
2021-08-02 14:16:57.200277: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881b0e580 of size 400128 by op UNKNOWN action_count 2166818383714 step 0 next 10
2021-08-02 14:16:57.200476: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881b70080 of size 798720 by op my_gcn/dense_3/MatMul action_count 2166818383724 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:16:57.200736: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 33554432
2021-08-02 14:16:57.200869: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89a753080 of size 2400000 by op my_gcn/gcn_conv/SparseTensorDenseMatMul/SparseTensorDenseMatMul action_count 2166818383719 step 12783974653594192978 next 14
2021-08-02 14:16:57.201152: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89a99cf80 of size 31154432 by op my_gcn/dense/MatMul action_count 2166818383721 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:16:57.201346: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 67108864
2021-08-02 14:16:57.201430: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89e6bf080 of size 8399616 by op my_gcn/Cast action_count 2166818383716 step 12783974653594192978 next 12
2021-08-02 14:16:57.201708: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89eec1b80 of size 25600000 by op my_gcn/dense_1/MatMul action_count 2166818383722 step 12783974653594192978 next 13
2021-08-02 14:16:57.201859: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f8a072bb80 of size 33109248 by op my_gcn/dense_2/MatMul action_count 2166818383723 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:16:57.202127: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 67108864
2021-08-02 14:16:57.202271: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f8a832c080 of size 67108864 by op SparseReorder action_count 2166818383712 step 0 next 18446744073709551615
2021-08-02 14:16:57.202474: I tensorflow/core/common_runtime/bfc_allocator.cc:1051]      Summary of in-use Chunks by size: 
2021-08-02 14:16:57.202658: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 6 Chunks of size 16384 totalling 96.0KiB
2021-08-02 14:16:57.202862: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 400128 totalling 390.8KiB
2021-08-02 14:16:57.203089: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 798720 totalling 780.0KiB
2021-08-02 14:16:57.203256: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 2400000 totalling 2.29MiB
2021-08-02 14:16:57.203448: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 8399616 totalling 8.01MiB
2021-08-02 14:16:57.203635: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 25600000 totalling 24.41MiB
2021-08-02 14:16:57.203843: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 31154432 totalling 29.71MiB
2021-08-02 14:16:57.203994: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 33109248 totalling 31.58MiB
2021-08-02 14:16:57.204162: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 67108864 totalling 64.00MiB
2021-08-02 14:16:57.204298: I tensorflow/core/common_runtime/bfc_allocator.cc:1058] Sum Total of in-use chunks: 161.24MiB
2021-08-02 14:16:57.204433: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] total_region_allocated_bytes_: 169869312 memory_limit_: 68719476736 available bytes: 68549607424 curr_region_allocation_bytes_: 68719476736
2021-08-02 14:16:57.204647: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats: 
Limit:                     68719476736
InUse:                       169069312
MaxInUse:                    169069312
NumAllocs:                          21
MaxAllocSize:                 67108864
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0
2021-08-02 14:16:57.205043: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ******************xx*************************************xxx*********************xxxxxxxxxxxxxxxxxxx
2021-08-02 14:16:57.205476: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[100000,100000] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu
2021-08-02 14:17:07.218679: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (mklcpu) ran out of memory trying to allocate 37.25GiB (rounded to 40000000000)requested by op sub
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2021-08-02 14:17:07.219238: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for mklcpu
2021-08-02 14:17:07.219334: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (256):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.219609: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.219790: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1024):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.220093: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2048):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.220345: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4096):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.220592: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8192):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.220842: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16384):     Total Chunks: 6, Chunks in use: 6. 96.0KiB allocated for chunks. 96.0KiB in use in bin. 96.0KiB client-requested in use in bin.
2021-08-02 14:17:07.221113: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (32768):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.221351: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (65536):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.221509: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (131072):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.221766: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (262144):    Total Chunks: 1, Chunks in use: 1. 390.8KiB allocated for chunks. 390.8KiB in use in bin. 390.6KiB client-requested in use in bin.
2021-08-02 14:17:07.222025: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (524288):    Total Chunks: 2, Chunks in use: 1. 1.52MiB allocated for chunks. 780.0KiB in use in bin. 390.6KiB client-requested in use in bin.
2021-08-02 14:17:07.222268: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1048576):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.222487: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2097152):   Total Chunks: 1, Chunks in use: 1. 2.29MiB allocated for chunks. 2.29MiB in use in bin. 2.29MiB client-requested in use in bin.
2021-08-02 14:17:07.222744: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4194304):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.222984: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8388608):   Total Chunks: 1, Chunks in use: 1. 8.01MiB allocated for chunks. 8.01MiB in use in bin. 8.01MiB client-requested in use in bin.
2021-08-02 14:17:07.223206: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16777216):  Total Chunks: 3, Chunks in use: 3. 85.70MiB allocated for chunks. 85.70MiB in use in bin. 73.24MiB client-requested in use in bin.
2021-08-02 14:17:07.223495: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (33554432):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.223743: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (67108864):  Total Chunks: 1, Chunks in use: 1. 64.00MiB allocated for chunks. 64.00MiB in use in bin. 32.04MiB client-requested in use in bin.
2021-08-02 14:17:07.223997: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.224252: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.224457: I tensorflow/core/common_runtime/bfc_allocator.cc:1014] Bin for 37.25GiB was 256.00MiB, Chunk State: 
2021-08-02 14:17:07.224577: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 2097152
2021-08-02 14:17:07.224668: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a33080 of size 16384 by op Fill action_count 2166818383707 step 0 next 1
2021-08-02 14:17:07.224788: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a37080 of size 16384 by op Fill action_count 2166818383708 step 0 next 2
2021-08-02 14:17:07.224921: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a3b080 of size 16384 by op Add action_count 2166818383699 step 0 next 3
2021-08-02 14:17:07.225193: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a3f080 of size 16384 by op Add action_count 2166818383704 step 0 next 4
2021-08-02 14:17:07.225385: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a43080 of size 16384 by op Fill action_count 2166818383709 step 0 next 5
2021-08-02 14:17:07.225600: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a47080 of size 16384 by op Fill action_count 2166818383710 step 0 next 6
2021-08-02 14:17:07.225795: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Free  at 1f881a4b080 of size 800000 by op UNUSED action_count 2166818383715 step 0 next 9
2021-08-02 14:17:07.226000: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881b0e580 of size 400128 by op UNKNOWN action_count 2166818383714 step 0 next 10
2021-08-02 14:17:07.226210: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881b70080 of size 798720 by op my_gcn/dense_3/MatMul action_count 2166818383724 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:17:07.226457: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 33554432
2021-08-02 14:17:07.226561: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89a753080 of size 2400000 by op my_gcn/gcn_conv/SparseTensorDenseMatMul/SparseTensorDenseMatMul action_count 2166818383719 step 12783974653594192978 next 14
2021-08-02 14:17:07.226859: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89a99cf80 of size 31154432 by op my_gcn/dense/MatMul action_count 2166818383721 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:17:07.227105: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 67108864
2021-08-02 14:17:07.227229: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89e6bf080 of size 8399616 by op my_gcn/Cast action_count 2166818383716 step 12783974653594192978 next 12
2021-08-02 14:17:07.227459: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89eec1b80 of size 25600000 by op my_gcn/dense_1/MatMul action_count 2166818383722 step 12783974653594192978 next 13
2021-08-02 14:17:07.227663: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f8a072bb80 of size 33109248 by op my_gcn/dense_2/MatMul action_count 2166818383723 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:17:07.227828: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 67108864
2021-08-02 14:17:07.227916: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f8a832c080 of size 67108864 by op SparseReorder action_count 2166818383712 step 0 next 18446744073709551615
2021-08-02 14:17:07.228289: I tensorflow/core/common_runtime/bfc_allocator.cc:1051]      Summary of in-use Chunks by size: 
2021-08-02 14:17:07.228453: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 6 Chunks of size 16384 totalling 96.0KiB
2021-08-02 14:17:07.228628: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 400128 totalling 390.8KiB
2021-08-02 14:17:07.228808: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 798720 totalling 780.0KiB
2021-08-02 14:17:07.228988: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 2400000 totalling 2.29MiB
2021-08-02 14:17:07.229158: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 8399616 totalling 8.01MiB
2021-08-02 14:17:07.229325: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 25600000 totalling 24.41MiB
2021-08-02 14:17:07.229522: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 31154432 totalling 29.71MiB
2021-08-02 14:17:07.229686: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 33109248 totalling 31.58MiB
2021-08-02 14:17:07.229869: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 67108864 totalling 64.00MiB
2021-08-02 14:17:07.230026: I tensorflow/core/common_runtime/bfc_allocator.cc:1058] Sum Total of in-use chunks: 161.24MiB
2021-08-02 14:17:07.230176: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] total_region_allocated_bytes_: 169869312 memory_limit_: 68719476736 available bytes: 68549607424 curr_region_allocation_bytes_: 68719476736
2021-08-02 14:17:07.230414: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats: 
Limit:                     68719476736
InUse:                       169069312
MaxInUse:                    169069312
NumAllocs:                          21
MaxAllocSize:                 67108864
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0
2021-08-02 14:17:07.230873: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ******************xx*************************************xxx*********************xxxxxxxxxxxxxxxxxxx
2021-08-02 14:17:07.231094: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[100000,100000] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/david/OneDrive/Desktop/ML/0_TESI/GITHUB/01_Train all/D_Train all.py", line 22, in <module>
    Train_graph_regression_1(simulations)
  File "C:\Users\david\OneDrive\Desktop\ML\0_TESI\GITHUB\01_Train all\C_Spektral_graph_regression_1.py", line 67, in Train_graph_regression_1
    callbacks=[EarlyStopping(patience=patience, restore_best_weights=True)])
  File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1183, in fit
    tmp_logs = self.train_function(iterator)
  File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 3024, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 1961, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 596, in call
    ctx=ctx)
  File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[100000,100000] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu
     [[node mean_squared_error/SquaredDifference (defined at C:\Users\david\OneDrive\Desktop\ML\0_TESI\GITHUB\01_Train all\C_Spektral_graph_regression_1.py:67) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_train_function_709]
Errors may have originated from an input operation.
Input Source operations connected to node mean_squared_error/SquaredDifference:
 my_gcn/dense_3/MatMul (defined at C:\Users\david\OneDrive\Desktop\ML\0_TESI\GITHUB\01_Train all\mygcn.py:116)
Function call stack:
train_function

Thank you

danielegrattarola commented 3 years ago

That didn't help as much as I'd hoped :)

Are you running on the latest version of Spektral, installed from source? If not, could you try?

git clone https://github.com/danielegrattarola/spektral.git
cd spektral
python setup.py install  # Or 'pip install .'

I would expect the memory error with BatchLoader, but DisjointLoader should not cause issues, since it keeps the adjacency matrix sparse. Can you also make sure that dataset[0].a is a SciPy sparse matrix?
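A quick way to run that check is sketched below. The helper name `describe_adjacency` and the 3-node matrix are made up for illustration; in your script you would pass `dataset[0].a` instead. The only library call it relies on is `scipy.sparse.issparse`.

```python
import numpy as np
import scipy.sparse as sp


def describe_adjacency(a):
    """Return a short description of how an adjacency matrix is stored."""
    if sp.issparse(a):
        return f"sparse ({a.format}), shape={a.shape}, dtype={a.dtype}, nnz={a.nnz}"
    return f"dense ({type(a).__name__}), shape={np.shape(a)}"


# Hypothetical 3-node path graph standing in for dataset[0].a
a_sparse = sp.csr_matrix(
    np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=np.float32)
)
print(describe_adjacency(a_sparse))            # sparse storage
print(describe_adjacency(a_sparse.toarray()))  # dense storage
```

If the second kind of message shows up for any graph, that graph is being stored densely and is a likely candidate for the memory blow-up.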

Cheers

DavideCerbarano commented 3 years ago

I was using spektral 1.0.5. With spektral 1.0.7 another error occurs, but it seems different:

2021-08-04 14:04:18.968597: W tensorflow/core/framework/op_kernel.cc:1755] Invalid argument: TypeError: `generator` yielded an element of ((TensorSpec(shape=(177958, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([177958, 177958]), tf.int32), TensorSpec(shape=(177958,), dtype=tf.int64, name=None)), TensorSpec(shape=(177958, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.int32), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 177958), dtype=tf.float64, name=None)) was expected.
Traceback (most recent call last):
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\ops\script_ops.py", line 247, in __call__
    return func(device, token, args)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\ops\script_ops.py", line 135, in __call__
    ret = self._func(*args)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 645, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 1048, in generator_py_func
    "of %s was expected." % (values_spec, output_signature))
TypeError: `generator` yielded an element of ((TensorSpec(shape=(177958, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([177958, 177958]), tf.int32), TensorSpec(shape=(177958,), dtype=tf.int64, name=None)), TensorSpec(shape=(177958, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.int32), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 177958), dtype=tf.float64, name=None)) was expected.
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/lorti/Desktop/davide/davide/1_git/D_Train all.py", line 20, in <module>
    Train_graph_regression(simulations)
  File "C:\Users\lorti\Desktop\davide\davide\1_git\C_Spektral_graph_regression.py", line 68, in Train_graph_regression
    callbacks=[EarlyStopping(patience=patience, restore_best_weights=True)])
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1183, in fit
    tmp_logs = self.train_function(iterator)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 3024, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 1961, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 596, in call
    ctx=ctx)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  TypeError: `generator` yielded an element of ((TensorSpec(shape=(177958, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([177958, 177958]), tf.int32), TensorSpec(shape=(177958,), dtype=tf.int64, name=None)), TensorSpec(shape=(177958, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.int32), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 177958), dtype=tf.float64, name=None)) was expected.
Traceback (most recent call last):
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\ops\script_ops.py", line 247, in __call__
    return func(device, token, args)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\ops\script_ops.py", line 135, in __call__
    ret = self._func(*args)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 645, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 1048, in generator_py_func
    "of %s was expected." % (values_spec, output_signature))
TypeError: `generator` yielded an element of ((TensorSpec(shape=(177958, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([177958, 177958]), tf.int32), TensorSpec(shape=(177958,), dtype=tf.int64, name=None)), TensorSpec(shape=(177958, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.int32), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 177958), dtype=tf.float64, name=None)) was expected.
     [[{{node EagerPyFunc}}]]
     [[IteratorGetNext]] [Op:__inference_train_function_715]
Function call stack:
train_function

I'm using DisjointLoader. I checked the adjacency matrix of every graph: they are all SciPy csr_matrix objects.

danielegrattarola commented 3 years ago

Huh, ok so it seems that the problem is in the model, not the DisjointLoader. Can you post the code for MyGCN? Or even just the output of model.summary()?

It seems like your GNN is outputting a final prediction as large as the number of nodes, which is probably what was causing your OOM even before you updated to 1.0.7.

DavideCerbarano commented 3 years ago

Here is MyGCN:

import tensorflow as tf
from tensorflow.keras.layers import Dense
from spektral.layers.convolutional import gcn_conv

class MyGCN(tf.keras.Model):

    def __init__(
        self,
        #n_labels,
        channels=16,
        activation="tanh",
        output_activation="relu",
        use_bias=False,
        dropout_rate=0.1,
        l2_reg=2.5e-4,
        n_input_channels=None,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.channels = channels
        self.activation = activation
        self.output_activation = output_activation
        self.use_bias = use_bias
        self.dropout_rate = dropout_rate
        self.l2_reg = l2_reg
        self.n_input_channels = n_input_channels
        reg = tf.keras.regularizers.l2(l2_reg)
        self._gcn0 = gcn_conv.GCNConv(n_input_channels, activation='relu', kernel_regularizer=reg, use_bias=use_bias)
        self._dense0= Dense(units=64, activation=activation, use_bias=use_bias)
        self._dense1 = Dense(units=64, activation=activation, use_bias=use_bias)
        self._dense2 = Dense(units=64, activation=activation, use_bias=use_bias)
        self._dense4 = Dense(units=1, activation='linear', use_bias=use_bias)

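        # Compare versions numerically: as plain strings, "2.10" would sort before "2.2".
        if tuple(int(v) for v in tf.version.VERSION.split(".")[:2]) < (2, 2):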
            if n_input_channels is None:
                raise ValueError("n_input_channels required for tf < 2.2")
            x = tf.keras.Input((n_input_channels,), dtype=tf.float32)
            a = tf.keras.Input((None,), dtype=tf.float32, sparse=True)
            self._set_inputs((x, a))

    def get_config(self):
        return dict(
            #n_labels=self.n_labels,
            channels=self.channels,
            activation=self.activation,
            output_activation=self.output_activation,
            use_bias=self.use_bias,
            dropout_rate=self.dropout_rate,
            l2_reg=self.l2_reg,
            n_input_channels=self.n_input_channels)

    def call(self, inputs):

        if len(inputs) == 2:
            x, a = inputs
        else:
            x, a, _ = inputs  
        if self.n_input_channels is None:
            self.n_input_channels = x.shape[-1]
        else:
            assert self.n_input_channels == x.shape[-1]

        x = self._gcn0([x, a])
        x = self._dense0(x)
        x = self._dense1(x)
        x = self._dense2(x)
        return self._dense4(x)

The task is a node regression. Thank you

nmavesmoore commented 2 years ago

Hi,

I am having a similar issue, but with the built-in model spektral.models.gcn.GCN(). I am just trying to do node classification over multiple graphs, and I am using DisjointLoader with node_level=True. However, whenever I try to train the model I get a TensorFlow input shape error. Was the problem above ever resolved? It is very similar to the one I am having. Thank you!

danielegrattarola commented 2 years ago

@nmavesmoore have you tried the solution at the top of the thread?

nmavesmoore commented 2 years ago

Yes I have, and it results in the following traceback:

TypeError: `generator` yielded an element of ((TensorSpec(shape=(3, 13), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([3, 3]), tf.float64), TensorSpec(shape=(3,), dtype=tf.int64, name=None)), TensorSpec(shape=(3, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 13), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.float64), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 305), dtype=tf.float64, name=None)) was expected.
     [[{{node EagerPyFunc}}]]
     [[IteratorGetNext]] [Op:__inference_train_function_1607]

danielegrattarola commented 2 years ago

@nmavesmoore From the error it seems that your model expects a target of dimension 305 (TensorSpec(shape=(None, 305), dtype=tf.float64, name=None)) but your graphs have one-dimensional labels (TensorSpec(shape=(3, 1), dtype=tf.float64, name=None)).

Could that be a bug on your end?

nmavesmoore commented 2 years ago

I checked the dimension of my label, and for a given graph it is a one-dimensional list of size (number_of_nodes, ). Once I load it into my custom dataset, the dimensionality is preserved, but once I put the dataset into a DisjointLoader with node_level=True and run

batch = loader_tr.__next__()
inputs, target = batch
print(target.shape)

I see that the target (what I assume is my label) is of shape (number_of_nodes, 1).

danielegrattarola commented 2 years ago

That extra 1 dimension is expected, but the problem I think is with the model, because it expects a 305-dimensional label instead of a 1-dimensional label. Can you check where that 305 comes from?

nmavesmoore commented 2 years ago

I just checked and I think it was coming from my model definition. I call

model = GCN(data_tr.n_labels, channels = 12 )

where data_tr.n_labels = 145. When I set the value to 1 (the number of labels at the node level) I get another error:

  File "c:\Users\nathanm\Desktop\projects\2022\NX AI\region-recognition-ai\nxai_env\lib\site-packages\keras\backend.py", line 5283, in binary_crossentropy
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
ValueError: `logits` and `labels` must have the same shape, received ((None, 1) vs (None, 145)).

danielegrattarola commented 2 years ago

I'm not sure I follow, without having access to the full code it's difficult to say. But basically you need to make sure that the number of labels in the data (data_tr.n_labels), the size of the target in a batch (target.shape[-1]) and the size of the output of the model are the same.
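As a rough sketch of that check, something like the following could be run on one batch before calling fit. The function name `check_label_dims` is made up for illustration, and the zero arrays stand in for a real batch target and a real model output.

```python
import numpy as np


def check_label_dims(n_labels, target, output):
    """Raise ValueError unless the dataset label size, the batch target
    and the model output all agree on the last dimension."""
    sizes = {
        "dataset n_labels": int(n_labels),
        "target.shape[-1]": int(np.shape(target)[-1]),
        "output.shape[-1]": int(np.shape(output)[-1]),
    }
    if len(set(sizes.values())) != 1:
        raise ValueError(f"Label dimensions disagree: {sizes}")
    return sizes


# Hypothetical batch of 3 nodes with scalar labels
target = np.zeros((3, 1))
output = np.zeros((3, 1))
print(check_label_dims(1, target, output))
```

Calling it with `data_tr.n_labels`, the `target` from one loader batch, and the model's output on that batch's inputs should show which of the three is the odd one out.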

In the model you posted above, the output is hardcoded to 1, so maybe there's an issue there? Also, what I said before ("That extra 1 dimension is expected") is only true if the labels are scalars; if they are 145-dimensional, the last dimension of the output should be 145.

nmavesmoore commented 2 years ago

data_tr.n_labels does not equal 1 in this case; it equals the number of nodes in the largest graph in the dataset. Could that be the issue? That being said, when I hard-code model = GCN(1, channels = 12 ) I get a separate error. Leaving it blank leads to an error when calling the model.

danielegrattarola commented 2 years ago

Yes, that's probably the issue and possibly related to how you're building the graph data.
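A minimal sketch of one way this can happen, assuming the label size is taken from the last dimension of each graph's `y` array. The helper `infer_n_labels` below is a hypothetical stand-in for that rule, not Spektral code. Under that assumption, a flat `y` of shape (number_of_nodes,) is read as one graph-sized label vector, while a column of shape (number_of_nodes, 1) is read as one scalar label per node.

```python
import numpy as np


def infer_n_labels(y):
    """Hypothetical stand-in: treat the last dimension of y as the label size."""
    shape = np.shape(y)
    return 1 if len(shape) == 0 else shape[-1]


n_nodes = 4
y_flat = np.arange(n_nodes, dtype=np.float32)  # shape (n_nodes,)
y_col = y_flat.reshape(-1, 1)                  # shape (n_nodes, 1)

print(infer_n_labels(y_flat))  # 4: label size grows with the graph
print(infer_n_labels(y_col))   # 1: one scalar label per node
```

So for node-level scalar targets, storing each graph's `y` as a column (`y.reshape(-1, 1)`) when building the dataset would be the thing to try first.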

nmavesmoore commented 2 years ago

That was the problem. Thank you so much!