DavideCerbarano opened this issue 3 years ago
Hi,
it's funny, I ran into this very same issue yesterday :)
The problem is that you're transforming the adjacency matrix to a sparse tensor already in the Graph object, but the DisjointLoader expects the graph attributes to be numpy/scipy objects.
Remove these lines:
a = sp_matrix_to_sp_tensor(a)
a = tf.cast(a, dtype=tf.float32)
and you should be good.
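Concretely, the relevant part of read() should end up looking something like this (just a sketch: x, y, a and the graphs list follow the names in your snippet):

# instead of
#   a = sp_matrix_to_sp_tensor(a)
#   a = tf.cast(a, dtype=tf.float32)
# keep `a` as a Scipy sparse matrix and build the Graph directly:
graphs.append(Graph(x=x.astype(np.float32), a=a, y=y.astype(np.float32)))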
Cheers
Hi Daniele,
thank you for the answer! It works, but now I get a memory allocation error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 236. GiB for an array with shape (177958, 177958) and data type float64
Where (177958, 177958) is the shape of the first sparse adjacency matrix of the list of graphs that have been generated from Jet_list. As I said, my dataset consists of dozens of very large graphs with sparse adjacency matrices.
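For what it's worth, 236 GiB is exactly the size of that adjacency matrix stored as a dense float64 array, so it looks like the sparse matrix is being densified somewhere along the way:

>>> 177958 ** 2 * 8 / 2 ** 30   # number of entries times 8 bytes each, in GiB
235.95...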
The weird thing is that if I generate a dataset with only one graph, say the first graph with 177958 nodes, and use a SingleLoader as the data generator, there is no problem when I call model.fit(SingleLoader(dataset).load()).
Here is the working code I used to generate the dataset with only one graph, which is basically the same as the previous case:
import pickle

import networkx as nx
import numpy as np
import pandas as pd
from spektral.data import Dataset, Graph, SingleLoader
from spektral.utils import gcn_filter


class Jet(Dataset):
    def __init__(self, simulation, dtype=np.float32, **kwargs):
        if hasattr(dtype, "as_numpy_dtype"):
            dtype = dtype.as_numpy_dtype
        self.dtype = dtype
        self.simulation = simulation
        super().__init__(**kwargs)

    def read(self):
        path = 'data/' + self.simulation + '/'
        df = pd.read_hdf(path + 'df.h5')
        with open(path + 'adj.txt', "rb") as fp:  # unpickling the adjacency lists
            adjacency = pickle.load(fp)
        x, y, dict_list = prepare_data(df=df, adjacency=adjacency)  # preprocessing
        a = nx.adjacency_matrix(nx.from_dict_of_lists(dict_list))   # Scipy CSR matrix
        a = gcn_filter(a)
        return [Graph(x=x.astype(self.dtype), y=y.astype(self.dtype), a=a)]


dataset = Jet(simulation='FLU-10')
loader = SingleLoader(dataset)
model.fit(loader.load())
On the contrary, if I generate a dataset with Jet_list containing only one graph (basically taking simulations = ['FLU-10']), the memory error occurs when I call model.fit(DisjointLoader(dataset, node_level=True).load()).
Maybe I'm using DisjointLoader the wrong way. I've also tried BatchLoader and PackedBatchLoader, but they still give me memory problems. I'm not very experienced with this kind of data structure, so I'd be glad if you could give me some advice :)
Thank you
Can you post the full stack trace that you get when the error occurs? Thanks
Hi,
here it is.
2021-08-02 14:16:57.183435: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (mklcpu) ran out of memory trying to allocate 37.25GiB (rounded to 40000000000)requested by op mean_squared_error/SquaredDifference
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2021-08-02 14:16:57.184401: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for mklcpu
2021-08-02 14:16:57.184546: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (256): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.184808: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.185076: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1024): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.185317: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.185482: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.193843: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.194302: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16384): Total Chunks: 6, Chunks in use: 6. 96.0KiB allocated for chunks. 96.0KiB in use in bin. 96.0KiB client-requested in use in bin.
2021-08-02 14:16:57.194726: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.194903: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.195070: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.195268: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (262144): Total Chunks: 1, Chunks in use: 1. 390.8KiB allocated for chunks. 390.8KiB in use in bin. 390.6KiB client-requested in use in bin.
2021-08-02 14:16:57.195761: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (524288): Total Chunks: 2, Chunks in use: 1. 1.52MiB allocated for chunks. 780.0KiB in use in bin. 390.6KiB client-requested in use in bin.
2021-08-02 14:16:57.196176: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.196331: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2097152): Total Chunks: 1, Chunks in use: 1. 2.29MiB allocated for chunks. 2.29MiB in use in bin. 2.29MiB client-requested in use in bin.
2021-08-02 14:16:57.196593: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.196847: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8388608): Total Chunks: 1, Chunks in use: 1. 8.01MiB allocated for chunks. 8.01MiB in use in bin. 8.01MiB client-requested in use in bin.
2021-08-02 14:16:57.197102: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16777216): Total Chunks: 3, Chunks in use: 3. 85.70MiB allocated for chunks. 85.70MiB in use in bin. 73.24MiB client-requested in use in bin.
2021-08-02 14:16:57.197324: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.197587: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (67108864): Total Chunks: 1, Chunks in use: 1. 64.00MiB allocated for chunks. 64.00MiB in use in bin. 32.04MiB client-requested in use in bin.
2021-08-02 14:16:57.197822: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.198052: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (268435456): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:16:57.198262: I tensorflow/core/common_runtime/bfc_allocator.cc:1014] Bin for 37.25GiB was 256.00MiB, Chunk State:
2021-08-02 14:16:57.198403: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 2097152
2021-08-02 14:16:57.198921: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a33080 of size 16384 by op Fill action_count 2166818383707 step 0 next 1
2021-08-02 14:16:57.199063: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a37080 of size 16384 by op Fill action_count 2166818383708 step 0 next 2
2021-08-02 14:16:57.199318: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a3b080 of size 16384 by op Add action_count 2166818383699 step 0 next 3
2021-08-02 14:16:57.199480: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a3f080 of size 16384 by op Add action_count 2166818383704 step 0 next 4
2021-08-02 14:16:57.199672: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a43080 of size 16384 by op Fill action_count 2166818383709 step 0 next 5
2021-08-02 14:16:57.199871: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a47080 of size 16384 by op Fill action_count 2166818383710 step 0 next 6
2021-08-02 14:16:57.200079: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Free at 1f881a4b080 of size 800000 by op UNUSED action_count 2166818383715 step 0 next 9
2021-08-02 14:16:57.200277: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881b0e580 of size 400128 by op UNKNOWN action_count 2166818383714 step 0 next 10
2021-08-02 14:16:57.200476: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881b70080 of size 798720 by op my_gcn/dense_3/MatMul action_count 2166818383724 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:16:57.200736: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 33554432
2021-08-02 14:16:57.200869: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89a753080 of size 2400000 by op my_gcn/gcn_conv/SparseTensorDenseMatMul/SparseTensorDenseMatMul action_count 2166818383719 step 12783974653594192978 next 14
2021-08-02 14:16:57.201152: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89a99cf80 of size 31154432 by op my_gcn/dense/MatMul action_count 2166818383721 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:16:57.201346: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 67108864
2021-08-02 14:16:57.201430: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89e6bf080 of size 8399616 by op my_gcn/Cast action_count 2166818383716 step 12783974653594192978 next 12
2021-08-02 14:16:57.201708: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89eec1b80 of size 25600000 by op my_gcn/dense_1/MatMul action_count 2166818383722 step 12783974653594192978 next 13
2021-08-02 14:16:57.201859: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f8a072bb80 of size 33109248 by op my_gcn/dense_2/MatMul action_count 2166818383723 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:16:57.202127: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 67108864
2021-08-02 14:16:57.202271: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f8a832c080 of size 67108864 by op SparseReorder action_count 2166818383712 step 0 next 18446744073709551615
2021-08-02 14:16:57.202474: I tensorflow/core/common_runtime/bfc_allocator.cc:1051] Summary of in-use Chunks by size:
2021-08-02 14:16:57.202658: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 6 Chunks of size 16384 totalling 96.0KiB
2021-08-02 14:16:57.202862: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 400128 totalling 390.8KiB
2021-08-02 14:16:57.203089: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 798720 totalling 780.0KiB
2021-08-02 14:16:57.203256: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 2400000 totalling 2.29MiB
2021-08-02 14:16:57.203448: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 8399616 totalling 8.01MiB
2021-08-02 14:16:57.203635: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 25600000 totalling 24.41MiB
2021-08-02 14:16:57.203843: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 31154432 totalling 29.71MiB
2021-08-02 14:16:57.203994: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 33109248 totalling 31.58MiB
2021-08-02 14:16:57.204162: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 67108864 totalling 64.00MiB
2021-08-02 14:16:57.204298: I tensorflow/core/common_runtime/bfc_allocator.cc:1058] Sum Total of in-use chunks: 161.24MiB
2021-08-02 14:16:57.204433: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] total_region_allocated_bytes_: 169869312 memory_limit_: 68719476736 available bytes: 68549607424 curr_region_allocation_bytes_: 68719476736
2021-08-02 14:16:57.204647: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats:
Limit: 68719476736
InUse: 169069312
MaxInUse: 169069312
NumAllocs: 21
MaxAllocSize: 67108864
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2021-08-02 14:16:57.205043: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ******************xx*************************************xxx*********************xxxxxxxxxxxxxxxxxxx
2021-08-02 14:16:57.205476: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[100000,100000] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu
2021-08-02 14:17:07.218679: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (mklcpu) ran out of memory trying to allocate 37.25GiB (rounded to 40000000000)requested by op sub
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2021-08-02 14:17:07.219238: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for mklcpu
2021-08-02 14:17:07.219334: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (256): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.219609: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.219790: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1024): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.220093: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.220345: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.220592: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.220842: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16384): Total Chunks: 6, Chunks in use: 6. 96.0KiB allocated for chunks. 96.0KiB in use in bin. 96.0KiB client-requested in use in bin.
2021-08-02 14:17:07.221113: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.221351: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.221509: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.221766: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (262144): Total Chunks: 1, Chunks in use: 1. 390.8KiB allocated for chunks. 390.8KiB in use in bin. 390.6KiB client-requested in use in bin.
2021-08-02 14:17:07.222025: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (524288): Total Chunks: 2, Chunks in use: 1. 1.52MiB allocated for chunks. 780.0KiB in use in bin. 390.6KiB client-requested in use in bin.
2021-08-02 14:17:07.222268: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.222487: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2097152): Total Chunks: 1, Chunks in use: 1. 2.29MiB allocated for chunks. 2.29MiB in use in bin. 2.29MiB client-requested in use in bin.
2021-08-02 14:17:07.222744: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.222984: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8388608): Total Chunks: 1, Chunks in use: 1. 8.01MiB allocated for chunks. 8.01MiB in use in bin. 8.01MiB client-requested in use in bin.
2021-08-02 14:17:07.223206: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16777216): Total Chunks: 3, Chunks in use: 3. 85.70MiB allocated for chunks. 85.70MiB in use in bin. 73.24MiB client-requested in use in bin.
2021-08-02 14:17:07.223495: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.223743: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (67108864): Total Chunks: 1, Chunks in use: 1. 64.00MiB allocated for chunks. 64.00MiB in use in bin. 32.04MiB client-requested in use in bin.
2021-08-02 14:17:07.223997: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.224252: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (268435456): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-08-02 14:17:07.224457: I tensorflow/core/common_runtime/bfc_allocator.cc:1014] Bin for 37.25GiB was 256.00MiB, Chunk State:
2021-08-02 14:17:07.224577: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 2097152
2021-08-02 14:17:07.224668: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a33080 of size 16384 by op Fill action_count 2166818383707 step 0 next 1
2021-08-02 14:17:07.224788: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a37080 of size 16384 by op Fill action_count 2166818383708 step 0 next 2
2021-08-02 14:17:07.224921: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a3b080 of size 16384 by op Add action_count 2166818383699 step 0 next 3
2021-08-02 14:17:07.225193: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a3f080 of size 16384 by op Add action_count 2166818383704 step 0 next 4
2021-08-02 14:17:07.225385: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a43080 of size 16384 by op Fill action_count 2166818383709 step 0 next 5
2021-08-02 14:17:07.225600: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881a47080 of size 16384 by op Fill action_count 2166818383710 step 0 next 6
2021-08-02 14:17:07.225795: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Free at 1f881a4b080 of size 800000 by op UNUSED action_count 2166818383715 step 0 next 9
2021-08-02 14:17:07.226000: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881b0e580 of size 400128 by op UNKNOWN action_count 2166818383714 step 0 next 10
2021-08-02 14:17:07.226210: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f881b70080 of size 798720 by op my_gcn/dense_3/MatMul action_count 2166818383724 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:17:07.226457: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 33554432
2021-08-02 14:17:07.226561: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89a753080 of size 2400000 by op my_gcn/gcn_conv/SparseTensorDenseMatMul/SparseTensorDenseMatMul action_count 2166818383719 step 12783974653594192978 next 14
2021-08-02 14:17:07.226859: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89a99cf80 of size 31154432 by op my_gcn/dense/MatMul action_count 2166818383721 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:17:07.227105: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 67108864
2021-08-02 14:17:07.227229: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89e6bf080 of size 8399616 by op my_gcn/Cast action_count 2166818383716 step 12783974653594192978 next 12
2021-08-02 14:17:07.227459: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f89eec1b80 of size 25600000 by op my_gcn/dense_1/MatMul action_count 2166818383722 step 12783974653594192978 next 13
2021-08-02 14:17:07.227663: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f8a072bb80 of size 33109248 by op my_gcn/dense_2/MatMul action_count 2166818383723 step 12783974653594192978 next 18446744073709551615
2021-08-02 14:17:07.227828: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 67108864
2021-08-02 14:17:07.227916: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 1f8a832c080 of size 67108864 by op SparseReorder action_count 2166818383712 step 0 next 18446744073709551615
2021-08-02 14:17:07.228289: I tensorflow/core/common_runtime/bfc_allocator.cc:1051] Summary of in-use Chunks by size:
2021-08-02 14:17:07.228453: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 6 Chunks of size 16384 totalling 96.0KiB
2021-08-02 14:17:07.228628: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 400128 totalling 390.8KiB
2021-08-02 14:17:07.228808: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 798720 totalling 780.0KiB
2021-08-02 14:17:07.228988: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 2400000 totalling 2.29MiB
2021-08-02 14:17:07.229158: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 8399616 totalling 8.01MiB
2021-08-02 14:17:07.229325: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 25600000 totalling 24.41MiB
2021-08-02 14:17:07.229522: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 31154432 totalling 29.71MiB
2021-08-02 14:17:07.229686: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 33109248 totalling 31.58MiB
2021-08-02 14:17:07.229869: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 67108864 totalling 64.00MiB
2021-08-02 14:17:07.230026: I tensorflow/core/common_runtime/bfc_allocator.cc:1058] Sum Total of in-use chunks: 161.24MiB
2021-08-02 14:17:07.230176: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] total_region_allocated_bytes_: 169869312 memory_limit_: 68719476736 available bytes: 68549607424 curr_region_allocation_bytes_: 68719476736
2021-08-02 14:17:07.230414: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats:
Limit: 68719476736
InUse: 169069312
MaxInUse: 169069312
NumAllocs: 21
MaxAllocSize: 67108864
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2021-08-02 14:17:07.230873: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ******************xx*************************************xxx*********************xxxxxxxxxxxxxxxxxxx
2021-08-02 14:17:07.231094: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[100000,100000] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/david/OneDrive/Desktop/ML/0_TESI/GITHUB/01_Train all/D_Train all.py", line 22, in <module>
Train_graph_regression_1(simulations)
File "C:\Users\david\OneDrive\Desktop\ML\0_TESI\GITHUB\01_Train all\C_Spektral_graph_regression_1.py", line 67, in Train_graph_regression_1
callbacks=[EarlyStopping(patience=patience, restore_best_weights=True)])
File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1183, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\def_function.py", line 889, in __call__
result = self._call(*args, **kwds)
File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 3024, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 1961, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 596, in call
ctx=ctx)
File "C:\Users\david\.conda\envs\Tesi2\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[100000,100000] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator mklcpu
[[node mean_squared_error/SquaredDifference (defined at C:\Users\david\OneDrive\Desktop\ML\0_TESI\GITHUB\01_Train all\C_Spektral_graph_regression_1.py:67) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_train_function_709]
Errors may have originated from an input operation.
Input Source operations connected to node mean_squared_error/SquaredDifference:
my_gcn/dense_3/MatMul (defined at C:\Users\david\OneDrive\Desktop\ML\0_TESI\GITHUB\01_Train all\mygcn.py:116)
Function call stack:
train_function
Thank you
That didn't help as much as I'd hoped :)
Are you running the latest version of Spektral, installed from source? If not, could you try:
git clone https://github.com/danielegrattarola/spektral.git
cd spektral
python setup.py install # Or 'pip install .'
I expect the memory error to happen with BatchLoader, but DisjointLoader should not cause issues (since it keeps the matrix sparse).
Can you also make sure that dataset[0].a is a Scipy sparse matrix?
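For example:

import scipy.sparse as sp
print(type(dataset[0].a))         # should be a csr_matrix (or another Scipy sparse class)
print(sp.issparse(dataset[0].a))  # should print True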
Cheers
I was using spektral 1.0.5. With spektral 1.0.7 another error occurs, but it seems different:
2021-08-04 14:04:18.968597: W tensorflow/core/framework/op_kernel.cc:1755] Invalid argument: TypeError: `generator` yielded an element of ((TensorSpec(shape=(177958, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([177958, 177958]), tf.int32), TensorSpec(shape=(177958,), dtype=tf.int64, name=None)), TensorSpec(shape=(177958, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.int32), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 177958), dtype=tf.float64, name=None)) was expected.
Traceback (most recent call last):
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\ops\script_ops.py", line 247, in __call__
return func(device, token, args)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\ops\script_ops.py", line 135, in __call__
ret = self._func(*args)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 645, in wrapper
return func(*args, **kwargs)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 1048, in generator_py_func
"of %s was expected." % (values_spec, output_signature))
TypeError: `generator` yielded an element of ((TensorSpec(shape=(177958, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([177958, 177958]), tf.int32), TensorSpec(shape=(177958,), dtype=tf.int64, name=None)), TensorSpec(shape=(177958, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.int32), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 177958), dtype=tf.float64, name=None)) was expected.
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2021.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/lorti/Desktop/davide/davide/1_git/D_Train all.py", line 20, in <module>
Train_graph_regression(simulations)
File "C:\Users\lorti\Desktop\davide\davide\1_git\C_Spektral_graph_regression.py", line 68, in Train_graph_regression
callbacks=[EarlyStopping(patience=patience, restore_best_weights=True)])
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1183, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\def_function.py", line 889, in __call__
result = self._call(*args, **kwds)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 3024, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 1961, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\function.py", line 596, in call
ctx=ctx)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: `generator` yielded an element of ((TensorSpec(shape=(177958, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([177958, 177958]), tf.int32), TensorSpec(shape=(177958,), dtype=tf.int64, name=None)), TensorSpec(shape=(177958, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.int32), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 177958), dtype=tf.float64, name=None)) was expected.
Traceback (most recent call last):
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\ops\script_ops.py", line 247, in __call__
return func(device, token, args)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\ops\script_ops.py", line 135, in __call__
ret = self._func(*args)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 645, in wrapper
return func(*args, **kwargs)
File "C:\Users\lorti\anaconda3\envs\davide_tesi2\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 1048, in generator_py_func
"of %s was expected." % (values_spec, output_signature))
TypeError: `generator` yielded an element of ((TensorSpec(shape=(177958, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([177958, 177958]), tf.int32), TensorSpec(shape=(177958,), dtype=tf.int64, name=None)), TensorSpec(shape=(177958, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 6), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.int32), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 177958), dtype=tf.float64, name=None)) was expected.
[[{{node EagerPyFunc}}]]
[[IteratorGetNext]] [Op:__inference_train_function_715]
Function call stack:
train_function
I'm using DisjointLoader.
I checked the adjacency matrix of every graph: they are all Scipy csr_matrix.
Huh, OK, so it seems that the problem is in the model, not the DisjointLoader.
Can you post the code for MyGCN? Or even just the output of model.summary()?
It seems like your GNN is outputting a final prediction as big as the number of nodes, which is probably what was causing your OOM even before updating to 1.0.7.
Here is MyGCN:
import tensorflow as tf
from tensorflow.keras.layers import Dense
from spektral.layers.convolutional import gcn_conv


class MyGCN(tf.keras.Model):
    def __init__(
        self,
        # n_labels,
        channels=16,
        activation="tanh",
        output_activation="relu",
        use_bias=False,
        dropout_rate=0.1,
        l2_reg=2.5e-4,
        n_input_channels=None,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.channels = channels
        self.activation = activation
        self.output_activation = output_activation
        self.use_bias = use_bias
        self.dropout_rate = dropout_rate
        self.l2_reg = l2_reg
        self.n_input_channels = n_input_channels
        reg = tf.keras.regularizers.l2(l2_reg)
        # One GCN layer followed by an MLP head
        self._gcn0 = gcn_conv.GCNConv(n_input_channels, activation='relu', kernel_regularizer=reg, use_bias=use_bias)
        self._dense0 = Dense(units=64, activation=activation, use_bias=use_bias)
        self._dense1 = Dense(units=64, activation=activation, use_bias=use_bias)
        self._dense2 = Dense(units=64, activation=activation, use_bias=use_bias)
        self._dense4 = Dense(units=1, activation='linear', use_bias=use_bias)  # single output per node
        if tf.version.VERSION < "2.2":
            if n_input_channels is None:
                raise ValueError("n_input_channels required for tf < 2.2")
            x = tf.keras.Input((n_input_channels,), dtype=tf.float32)
            a = tf.keras.Input((None,), dtype=tf.float32, sparse=True)
            self._set_inputs((x, a))

    def get_config(self):
        return dict(
            # n_labels=self.n_labels,
            channels=self.channels,
            activation=self.activation,
            output_activation=self.output_activation,
            use_bias=self.use_bias,
            dropout_rate=self.dropout_rate,
            l2_reg=self.l2_reg,
            n_input_channels=self.n_input_channels)

    def call(self, inputs):
        if len(inputs) == 2:
            x, a = inputs
        else:
            x, a, _ = inputs  # DisjointLoader also yields the batch index, which is ignored here
        if self.n_input_channels is None:
            self.n_input_channels = x.shape[-1]
        else:
            assert self.n_input_channels == x.shape[-1]
        x = self._gcn0([x, a])
        x = self._dense0(x)
        x = self._dense1(x)
        x = self._dense2(x)
        return self._dense4(x)
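For completeness, this is roughly how the model is compiled and trained (a sketch: the MSE loss, the EarlyStopping callback and the DisjointLoader call are the ones that appear in the traces above, while the optimizer and the hyperparameter values are just placeholders):

from spektral.data import DisjointLoader
from tensorflow.keras.callbacks import EarlyStopping

model = MyGCN(n_input_channels=6)  # 6 node features, as in the TensorSpec above
model.compile(optimizer='adam', loss='mean_squared_error')

loader = DisjointLoader(dataset, node_level=True)
model.fit(loader.load(),
          steps_per_epoch=loader.steps_per_epoch,
          epochs=100,  # placeholder
          callbacks=[EarlyStopping(patience=10, restore_best_weights=True)])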
The task is a node regression. Thank you
Hi,
I am having a similar issue, but with the built-in model spektral.models.gcn.GCN(). I am just trying to do node classification over multiple graphs, and am using DisjointLoader with node_level=True. However, whenever I try to train the model I get a TensorFlow input shape error. I am wondering whether the problem above was ever resolved, as it is very similar to the one I am having. Thank you!
@nmavesmoore have you tried the solution at the top of the thread?
Yes I have, and it results in the following traceback:
TypeError: `generator` yielded an element of ((TensorSpec(shape=(3, 13), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([3, 3]), tf.float64), TensorSpec(shape=(3,), dtype=tf.int64, name=None)), TensorSpec(shape=(3, 1), dtype=tf.float64, name=None)) where an element of ((TensorSpec(shape=(None, 13), dtype=tf.float64, name=None), SparseTensorSpec(TensorShape([None, None]), tf.float64), TensorSpec(shape=(None,), dtype=tf.int64, name=None)), TensorSpec(shape=(None, 305), dtype=tf.float64, name=None)) was expected.
[[{{node EagerPyFunc}}]]
[[IteratorGetNext]] [Op:__inference_train_function_1607]
@nmavesmoore From the error it seems that your model expects a target of dimension 305 (TensorSpec(shape=(None, 305), dtype=tf.float64, name=None)) but your graphs have one-dimensional labels (TensorSpec(shape=(3, 1), dtype=tf.float64, name=None)).
Could that be a bug on your end?
I checked the dimension of my labels, and for a given graph they are a one-dimensional list of size (number_of_nodes,). Once I load them into my custom dataset, the dimensionality is preserved, but once I put the dataset into a DisjointLoader with node_level=True and run
batch = loader_tr.__next__()
inputs, target = batch
print(target.shape)
I see that the target (what I assume is my label) is of shape (number_of_nodes, 1).
That extra 1 dimension is expected, but I think the problem is with the model, because it expects a 305-dimensional label instead of a 1-dimensional label. Can you check where that 305 comes from?
I just checked and I think it was coming from my model definition. I call
model = GCN(data_tr.n_labels, channels=12)
where data_tr.n_labels = 145. When I set the value to 1 (the number of labels at the node level) I get another error, as follows:
File "c:\Users\nathanm\Desktop\projects\2022\NX AI\region-recognition-ai\nxai_env\lib\site-packages\keras\backend.py", line 5283, in binary_crossentropy return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output) ValueError:
logitsand
labelsmust have the same shape, received ((None, 1) vs (None, 145)).
I'm not sure I follow; without access to the full code it's difficult to say.
But basically you need to make sure that the number of labels in the data (data_tr.n_labels), the size of the target in a batch (target.shape[-1]), and the size of the output of the model are all the same.
In the model posted above, the output is hardcoded to 1, so maybe there's an issue there? Also, what I said before,
"That extra 1 dimension is expected"
is only true if the labels are scalars; if they are 145-dimensional, the last dimension of the output should be 145.
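Something like this should make any mismatch obvious (a rough sketch, reusing the names from your snippets above):

inputs, target = loader_tr.__next__()
print("n_labels:    ", data_tr.n_labels)
print("target shape:", target.shape)         # last dimension should equal n_labels
print("model output:", model(inputs).shape)  # last dimension should match the two above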
data_tr.n_labels does not equal 1 in this case; it equals the number of nodes in the largest graph in the dataset. Could that be the issue? That being said, when I hard-code model = GCN(1, channels=12) I get a separate error. Leaving it blank leads to an error when calling the model.
Yes, that's probably the issue and possibly related to how you're building the graph data.
That was the problem. Thank you so much!
Hi Daniele and all,
I'm trying to use DisjointLoader in order to train a GCN to make node-level predictions. My dataset consists of a list of graphs with different shapes and large sparse adjacency matrices, which I want to use as a training set in order to make predictions at the level of individual nodes on a new test graph. Now, the problem I have encountered is that when I call
model.fit(DisjointLoader(dataset, node_level=True).load())
this error occurs:
TypeError: no supported conversion for types: (dtype('O'),)
[[{{node EagerPyFunc}}]]
[[IteratorGetNext]] [Op:__inference_train_function_763]
Function call stack:
train_function
I can't understand which element has dtype('O') that the error is referring to. Here is a piece of the code I used, where Jet_list generates the list of graphs by reading external data and subclassing the spektral.data.Dataset class. Searching elsewhere, the bug seems to be due to incompatible versions of keras, tensorflow and spektral, but I have no idea about it. Currently I'm using keras 2.4.3, tensorflow 2.5.0, python 3.7. Could anyone give me a suggestion?
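(A quick way to locate the element with dtype('O') is to print the type of every graph attribute, for example with the sketch below; as discussed at the top of the thread, the culprit here turned out to be the adjacency matrix that had already been converted to a SparseTensor inside the Graph.)

import scipy.sparse as sp

for i, g in enumerate(dataset):
    print(i,
          type(g.x), getattr(g.x, 'dtype', None),  # node features: should be a NumPy array
          type(g.a), sp.issparse(g.a),              # adjacency: should be a Scipy sparse matrix
          type(g.y), getattr(g.y, 'dtype', None))   # labels: should be a NumPy array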