danielegrattarola / spektral

Graph Neural Networks with Keras and Tensorflow 2.
https://graphneural.network
MIT License

GATConv compatibility with Disjoint mode? #282

Open · edshui opened this issue 3 years ago

edshui commented 3 years ago

Dear Experts:

I was trying to use GATConv in disjoint mode with a disjoint data loader. But when I run model.fit, I get the following error, which I couldn't figure out how to solve at all: [error screenshot]

Any help would be greatly appreciated, thank you very much! Ed

danielegrattarola commented 3 years ago

Hi,

can you post a minimal example to reproduce the issue?

Thanks

edshui commented 3 years ago

Hi Daniele:

Absolutely. Please see below:

import tensorflow as tf
from tensorflow.keras.layers import Activation, Concatenate, Dense
from tensorflow.keras.models import Model
from spektral.layers import GATConv, GCNConv, GINConv


class GConn(Model):
    def __init__(self, N, n_out, n_layers, activation="relu", dropout=None):
        super().__init__()  # missing from the snippet as originally posted
        # One GIN branch per thresholded adjacency matrix
        self.gins = []
        for _ in range(n_layers):
            self.gins.append(
                GINConv(n_out, epsilon=0, mlp_hidden=[N, n_out, n_out], activation="relu")
            )
        # One GCN branch on the base adjacency matrix
        self.gcn = GCNConv(n_out, activation="relu")
        # One GAT branch per thresholded adjacency matrix
        self.gats = []
        for _ in range(n_layers):
            self.gats.append(
                GATConv(
                    n_out, attn_heads=8, add_self_loops=False,
                    concat_heads=False, dropout_rate=0.5,
                    activation="relu",
                )
            )
        # Restored: used in call() but omitted from the snippet as posted;
        # the Dense width is assumed
        self.dense = Dense(n_out)
        self.acti = Activation(activation)

    def call(self, inputs):
        outs = []
        # Base graph (x, a, batch index) followed by the extra (x_k, a_k) pairs
        x, a, _, *ax_aa_s = inputs
        for idx, gin in enumerate(self.gins):
            x1 = gin([ax_aa_s[2 * idx], ax_aa_s[2 * idx + 1]])
            outs.append(x1)
        for idx, gat in enumerate(self.gats):
            x2 = gat([ax_aa_s[2 * idx], ax_aa_s[2 * idx + 1]])
            outs.append(x2)
        x3 = self.gcn([x, a])
        outs.append(x3)

        if len(outs) > 1:
            out = Concatenate(axis=-1)(outs)
            out = self.dense(out)
        else:
            out = outs[0]  # was `out = x1`, which is undefined when n_layers == 0

        return self.acti(out)

# Build model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import MeanAbsoluteError

model = GConn(N, n_out, len(adj_ran), activation=oACTI)
opt = Adam(lr=learning_rate)
loss_fn = MeanSquaredError()  # alternative: CategoricalCrossentropy()
metrics = MeanAbsoluteError()  # alternative: MeanSquaredError()
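
The snippet stops before the training call; from the stack trace posted later in this thread, it presumably continues roughly like this (the validation arguments are my assumption):

model.compile(optimizer=opt, loss=loss_fn, metrics=["mse"])

history = model.fit(
    loader_tr.load(),
    steps_per_epoch=loader_tr.steps_per_epoch,
    # assumed: validation via the second loader defined further down
    validation_data=loader_va.load(),
    validation_steps=loader_va.steps_per_epoch,
    epochs=epochs,
)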

The problem seems to be with the GAT branch, as the code would have run fine without self.gats.

Many thanks! Ed

danielegrattarola commented 3 years ago

Hi,

sorry, I just took the time to look at this code. I'm not too sure what's going on here:

x, a, _, *ax_aa_s = inputs  

since it seems that you have a model with non-standard inputs (standard would be simply node features and adjacency matrix) and I would need to see the arrays/tensors that you feed to the model when training.

Also, can you post the full stack trace so that I get a sense of where the error is happening in the GAT layer?

Thanks

edshui commented 3 years ago

Hi Daniele:

Thanks for getting back to me indeed!

The reason I have multiple adjacency matrices is that they all come from the same adjacency matrix, masked with different thresholds. They are then fed to different GAT layers, whose outputs are concatenated together and passed to a dense layer (in case you have the time, please see the bottom of this comment for my implementation of the DisjointLoader subclass that takes a list of adjacency matrices as input).
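
A minimal sketch of that masking scheme (the names and thresholds here are illustrative; the real masking is the `tadj > thres` step in the code below):

import numpy as np

rng = np.random.default_rng(0)
adj = rng.random((5, 5))      # stand-in for the real weighted adjacency matrix
adj_ran = [0.25, 0.5, 0.75]   # illustrative thresholds, one per GIN/GAT branch

# Each thresholded copy keeps only the edges whose weight exceeds the
# threshold; each copy then feeds its own convolutional branch.
masked = [(adj > thres).astype(np.float32) for thres in adj_ran]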

I would like to mention that the code runs fine if I only have the GCN and GIN layers, but it fails when I add the GAT layers.

Below please find the full stack trace for your reference:

Epoch 1/600
WARNING:tensorflow:AutoGraph could not transform <bound method GConn.call of <__main__.GConn object at 0x7f1917545220>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method MessagePassing.propagate of <spektral.layers.convolutional.gin_conv.GINConv object at 0x7f191754c550>> and will run it as-is.
Cause: module 'gast' has no attribute 'Index'
WARNING:tensorflow:AutoGraph could not transform <bound method GATConv.call of <spektral.layers.convolutional.gat_conv.GATConv object at 0x7f18e96e82e0>> and will run it as-is.
Cause: module 'gast' has no attribute 'Index'
WARNING:tensorflow:AutoGraph could not transform <bound method GCNConv.call of <spektral.layers.convolutional.gcn_conv.GCNConv object at 0x7f1916e60340>> and will run it as-is.
Cause: module 'gast' has no attribute 'Index'
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-8-e3d69539cea3> in <module>
     34               metrics=["mse"]) #["mse"])
     35 
---> 36 history = model.fit(
     37     loader_tr.load(),
     38     steps_per_epoch=loader_tr.steps_per_epoch,

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1098                 _r=1):
   1099               callbacks.on_train_batch_begin(step)
-> 1100               tmp_logs = self.train_function(iterator)
   1101               if data_handler.should_sync:
   1102                 context.async_wait()

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    826     tracing_count = self.experimental_get_tracing_count()
    827     with trace.Trace(self._name) as tm:
--> 828       result = self._call(*args, **kwds)
    829       compiler = "xla" if self._experimental_compile else "nonXla"
    830       new_tracing_count = self.experimental_get_tracing_count()

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    869       # This is the first call of __call__, so we have to initialize.
    870       initializers = []
--> 871       self._initialize(args, kwds, add_initializers_to=initializers)
    872     finally:
    873       # At this point we know that the initialization is complete (or less

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py in _initialize(self, args, kwds, add_initializers_to)
    723     self._graph_deleter = FunctionDeleter(self._lifted_initializer_graph)
    724     self._concrete_stateful_fn = (
--> 725         self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
    726             *args, **kwds))
    727 

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/function.py in _get_concrete_function_internal_garbage_collected(self, *args, **kwargs)
   2967       args, kwargs = None, None
   2968     with self._lock:
-> 2969       graph_function, _ = self._maybe_define_function(args, kwargs)
   2970     return graph_function
   2971 

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/function.py in _maybe_define_function(self, args, kwargs)
   3359 
   3360           self._function_cache.missed.add(call_context_key)
-> 3361           graph_function = self._create_graph_function(args, kwargs)
   3362           self._function_cache.primary[cache_key] = graph_function
   3363 

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/function.py in _create_graph_function(self, args, kwargs, override_flat_arg_shapes)
   3194     arg_names = base_arg_names + missing_arg_names
   3195     graph_function = ConcreteFunction(
-> 3196         func_graph_module.func_graph_from_py_func(
   3197             self._name,
   3198             self._python_function,

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes)
    988         _, original_func = tf_decorator.unwrap(python_func)
    989 
--> 990       func_outputs = python_func(*func_args, **func_kwargs)
    991 
    992       # invariant: `func_outputs` contains only Tensors, CompositeTensors,

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py in wrapped_fn(*args, **kwds)
    632             xla_context.Exit()
    633         else:
--> 634           out = weak_wrapped_fn().__wrapped__(*args, **kwds)
    635         return out
    636 

~/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    975           except Exception as e:  # pylint:disable=broad-except
    976             if hasattr(e, "ag_error_metadata"):
--> 977               raise e.ag_error_metadata.to_exception(e)
    978             else:
    979               raise

NotImplementedError: in user code:

    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
        return step_function(self, iterator)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/keras/engine/training.py:795 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/distribute/distribute_lib.py:1259 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/distribute/distribute_lib.py:3417 _call_for_each_replica
        return fn(*args, **kwargs)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/keras/engine/training.py:788 run_step  **
        outputs = model.train_step(data)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/keras/engine/training.py:757 train_step
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:496 minimize
        grads_and_vars = self._compute_gradients(
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:548 _compute_gradients
        grads_and_vars = self._get_gradients(tape, loss, var_list, grad_loss)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:441 _get_gradients
        grads = tape.gradient(loss, var_list, grad_loss)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/backprop.py:1080 gradient
        flat_grad = imperative_grad.imperative_grad(
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/imperative_grad.py:71 imperative_grad
        return pywrap_tfe.TFE_Py_TapeGradient(
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/eager/backprop.py:162 _gradient_function
        return grad_fn(mock_op, *out_grads)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/ops/math_grad.py:473 _UnsortedSegmentSumGrad
        return _GatherDropNegatives(grad, op.inputs[1])[0], None, None
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/ops/math_grad.py:439 _GatherDropNegatives
        array_ops.ones([array_ops.rank(gathered)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/ops/array_ops.py:3120 ones
        output = _constant_if_small(one, shape, dtype, name)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/ops/array_ops.py:2804 _constant_if_small
        if np.prod(shape) < 1000:
    <__array_function__ internals>:5 prod

    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3030 prod
        return _wrapreduction(a, np.multiply, 'prod', axis, dtype, out,
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/numpy/core/fromnumeric.py:87 _wrapreduction
        return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
    /home/ehui/anaconda3/envs/hcp/lib/python3.9/site-packages/tensorflow/python/framework/ops.py:852 __array__
        raise NotImplementedError(

    NotImplementedError: Cannot convert a symbolic Tensor (gradient_tape/g_conn/gat_conv/sub:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported
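
The failing frame is NumPy's np.prod being applied to a symbolic shape tensor inside TensorFlow's gradient for unsorted_segment_sum (used by the GAT layer). A minimal, Spektral-free illustration of that class of error (my own sketch, not from the trace):

import numpy as np
import tensorflow as tf

@tf.function  # graph mode: tensors inside are symbolic
def f(x):
    # Passing a symbolic tensor to a NumPy call triggers Tensor.__array__,
    # which raises the same NotImplementedError as in the trace above.
    return np.prod(tf.shape(x))

f(tf.ones((3, 4)))  # NotImplementedError: Cannot convert a symbolic Tensor ...

In eager mode the same call succeeds, since eager tensors convert to NumPy arrays, which is consistent with the behaviour reported later in this thread.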

Below is my loader implementation, which is a subclass of your DisjointLoader:

import tensorflow as tf
from spektral.data import DisjointLoader
from spektral.data.utils import batch_generator, prepend_none, to_tf_signature


class HCPDisjointLoader(DisjointLoader):
    def __init__(self, dataset, node_level=False, batch_size=1, epochs=None, shuffle=True):
        # dataset is a list: [base dataset, thresholded dataset 1, ...]
        self.dataset2 = dataset[1:]
        super().__init__(
            dataset[0], node_level=node_level, batch_size=batch_size,
            epochs=epochs, shuffle=shuffle,
        )
        self._HCPgenerator = [self.HCPgenerator(i) for i in range(len(self.dataset2))]

    def __next__(self):
        nxt = self._generator.__next__()
        nxt2 = [gen.__next__() for gen in self._HCPgenerator]
        return self.collate(nxt, nxt2)

    def HCPgenerator(self, idx):
        return batch_generator(
            self.dataset2[idx],
            batch_size=self.batch_size,
            epochs=self.epochs,
            shuffle=self.shuffle,
        )

    def collate(self, batch, batch2):
        # Base batch collated as usual: (x, a, i), y
        output, y = super().collate(batch)
        output = list(output)

        # Append (x_k, a_k) from each thresholded dataset, dropping its index
        for ba in batch2:
            out, _ = super().collate(ba)
            out = list(out)
            output = output + out[:2]
        output = tuple(output)
        return output, y

    def tf_signature(self):
        n_layers = len(self.dataset2)

        signature = self.dataset.signature
        signature2 = self.dataset2[0].signature
        if "y" in signature:
            signature["y"]["shape"] = prepend_none(signature["y"]["shape"])
        if "a" in signature:
            signature["a"]["spec"] = tf.SparseTensorSpec

        signature["i"] = dict()
        signature["i"]["spec"] = tf.TensorSpec
        signature["i"]["shape"] = (None,)
        signature["i"]["dtype"] = tf.as_dtype(tf.int64)

        # Extra (x_k, a_k) entries, one pair per thresholded dataset
        for idx in range(n_layers):
            x_str = "x" + str(idx + 2)
            a_str = "a" + str(idx + 2)
            signature[x_str] = signature2["x"]
            signature[a_str] = signature2["a"]

        # Note: this assumes a locally modified to_tf_signature that
        # accepts n_layers as a second argument
        return to_tf_signature(signature, n_layers)

# One thresholded dataset per value in adj_ran
adataset_tr = []
adataset_va = []
for thres in adj_ran:
    tdataset = HCPDataset([ax, tadj > thres, y])
    tdataset_tr, tdataset_va = tdataset[idx_tr], tdataset[idx_va]
    adataset_tr.append(tdataset_tr)
    adataset_va.append(tdataset_va)

# Base dataset
dataset = HCPDataset([x, adj, y])
dataset_tr, dataset_va = dataset[idx_tr], dataset[idx_va]

loader_tr = HCPDisjointLoader([dataset_tr, *adataset_tr], batch_size=batch_size, epochs=epochs, node_level=True)
loader_va = HCPDisjointLoader([dataset_va, *adataset_va], batch_size=batch_size, node_level=True)
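
For clarity, my reading of what one batch from this loader looks like (derived from collate above, not verified against the real data):

# Layout of `inputs` in each (inputs, y) batch from HCPDisjointLoader,
# assuming n_layers thresholded datasets:
#
#   inputs = (x, a, i,     # base dataset: node features, adjacency, graph index
#             x2, a2,      # thresholded dataset 1
#             x3, a3,      # thresholded dataset 2
#             ...)
#
# which is exactly what GConn.call unpacks:
x, a, _, *ax_aa_s = inputs  # ax_aa_s == [x2, a2, x3, a3, ...]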

Thanks so much for your help and time!

Ed

edshui commented 2 years ago

Hi Daniele:

May I ask if you've had a chance to take a look at the trace above?

Your help is much appreciated! Thanks!

Ed

danielegrattarola commented 2 years ago

Hi Ed,

I have looked at the code and stack trace, but unfortunately they didn't help. Can you re-run your code, this time adding the following line at the top of the main script?

tf.config.run_functions_eagerly(True)

This should give a stack trace that tells us where the problem happens, so we can debug it. Also, if you were able to reproduce the issue in a more "standard" setting, that would be great; this issue might also have something to do with the custom loader.
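
For reference, a sketch of such a "standard" disjoint-mode setup (TUDataset("PROTEINS") is just an illustrative stock dataset, not from this thread; if this trains cleanly while the custom loader does not, that would point at the loader rather than GATConv itself):

import tensorflow as tf
from spektral.data import DisjointLoader
from spektral.datasets import TUDataset
from spektral.layers import GATConv, GlobalSumPool

tf.config.run_functions_eagerly(True)  # debugging aid suggested above

dataset = TUDataset("PROTEINS")  # illustrative stock dataset
loader = DisjointLoader(dataset, batch_size=8, epochs=1)

class Net(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.gat = GATConv(32, attn_heads=8, concat_heads=False, activation="relu")
        self.pool = GlobalSumPool()
        self.out = tf.keras.layers.Dense(dataset.n_labels, activation="softmax")

    def call(self, inputs):
        x, a, i = inputs  # disjoint mode: node features, sparse adjacency, graph index
        x = self.gat([x, a])
        return self.out(self.pool([x, i]))

model = Net()
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(loader.load(), steps_per_epoch=loader.steps_per_epoch, epochs=1)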

Thanks, Daniele

edshui commented 2 years ago

Hi Daniele:

Many thanks for getting back to me despite your busy schedule.

Interestingly, the script runs when I add tf.config.run_functions_eagerly(True). Do you know what's going on (please excuse my ignorance)?

Many thanks, Ed

danielegrattarola commented 2 years ago

Honestly, I have no idea :D I would need to run the code in a debugger, with your data, to see what input/array is causing the crash in graph mode.

Note that this solution is not optimal, since eager mode will run slower.

edshui commented 2 years ago

Hi Daniele:

No worries, let me try my best to figure out what went wrong;)

It may have to do with what you mentioned previously (my custom loader). Will keep you posted with updates.

Many thanks! Ed
