alibaba / GraphScope

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba
https://graphscope.io
Apache License 2.0
3.26k stars 442 forks

Training or testing GCN on a larger dataset fails with "Shapes of all inputs must match" #1413

Closed zhixiongning closed 2 years ago

zhixiongning commented 2 years ago

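When training GCN on the ogbn-products dataset, the test set contains several million samples. Even when training and validation succeed, testing fails with tensorflow.python.framework.errors_impl.InvalidArgumentError: Shapes of all inputs must match: values[0].shape=[512,100] != values[1].shape=[511,100], where batch_size is 512. When the test set is reduced to 10,000 samples, testing succeeds. Likewise, when the training set contains several million samples, the same error appears after training for a while.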

acezen commented 2 years ago

Thanks for the issue. To help us locate the problem, could you post your test code and a more complete log output?

zhixiongning commented 2 years ago

The script is adapted from the example graph-learn/examples/tf/gcn/train_supervised.py. That example uses the cora dataset, and we converted ogbn-products to the same format. Neither vertices nor edges carry weights, and the number of classes and features matches the dataset. There are 2.4 million training vertices, 24,000 validation vertices, and 24,000 test vertices. batch_size is set to 512 for training, testing, and validation, full_graph_mode is False, and the remaining training parameters are the same as in the example. The error occurs when the iteration count reaches 100 in the first epoch. The error log is roughly as follows:

File "tensorflow/python/client/session.py", line 1377, in _do_call
  return fn(*args)
File "tensorflow/python/client/session.py", line 1360, in _run_fn
  return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
  return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shapes of all inputs must match: values[0].shape=[512,100] != values[1].shape=[511,100]
  [[{{node Sum/input}}]]

Detected at node Sum/input defined at:
File "train_supervised.py"
  trainer.train_and_evaluate()
File "graphlearn/python/model/tf/trainer.py"
  loss,train_iterator=self.model.build()
File "graphlearn/python/model/tf/gcn/gcn.py"
  src_emb=self.encoders['src'].encode(pos_src_ego_tensor)
File "graphlearn/python/model/tf/encoders/ego_graph_encoder.py"
  return self._forward(hiddens)
File "graphlearn/python/model/tf/encoders/ego_graph_encoder.py"
  h=self._conv_layers[layer_idx].forward(src_vecs,)
File "graphlearn/python/model/tf/layers/gcn_conv.py"
  update_vecs=tf.reduce_sum([self_vecs,neigh_vecs],axis=0)
Node: 'Sum/input'
Shapes of all inputs must match: values[0].shape=[512,100] != values[1].shape=[511,100]
  [[{{node Sum/input}}]]
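The failing line in the trace, tf.reduce_sum([self_vecs,neigh_vecs],axis=0), first packs the two inputs into a single tensor, which appears to be why both must have the same shape. The sketch below is only a NumPy analogy of that constraint, not the graph-learn code: the helper name sum_self_and_neighbors is made up for illustration, and np.stack stands in for TensorFlow's packing step.

```python
import numpy as np

batch_size, dim = 512, 100
self_vecs = np.ones((batch_size, dim), dtype=np.float32)
# One row short, mirroring values[1].shape=[511,100] from the log.
neigh_vecs_short = np.ones((batch_size - 1, dim), dtype=np.float32)
neigh_vecs_full = np.ones((batch_size, dim), dtype=np.float32)


def sum_self_and_neighbors(self_vecs, neigh_vecs):
    """Stack both inputs on a new leading axis, then sum over that axis.

    Hypothetical helper: a NumPy stand-in for reducing a Python list of
    tensors, which requires every element to have the same shape.
    """
    return np.stack([self_vecs, neigh_vecs], axis=0).sum(axis=0)


try:
    sum_self_and_neighbors(self_vecs, neigh_vecs_short)
    mismatch_error = None
except ValueError as exc:
    # np.stack refuses inputs whose shapes differ.
    mismatch_error = str(exc)

# With matching shapes the same call succeeds.
ok = sum_self_and_neighbors(self_vecs, neigh_vecs_full)

print("mismatch raised:", mismatch_error is not None)
print("matching shapes give:", ok.shape)
```

So whatever produces neigh_vecs has to return exactly one row per vertex in the batch, which is the property that was lost here.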

zhixiongning commented 2 years ago

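Setting the training parameter neighs_num to a value other than None, which limits the number of neighbors sampled per hop, makes the problem go away. It seems that None causes too many vertices to be sampled.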
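As a rough illustration of what limiting the neighbors sampled per hop can look like, here is a minimal NumPy sketch. It is not graph-learn's sampler: sample_fixed_neighbors, the toy adjacency dict, and the fallback to the vertex itself for isolated vertices are all assumptions made for this example. The point is only that a fixed fan-out yields one fixed-width row per batch vertex, so the aggregated neighbor vectors always line up with self_vecs.

```python
import numpy as np


def sample_fixed_neighbors(adjacency, batch_ids, num_neighbors, rng):
    """Return an int array of shape [len(batch_ids), num_neighbors].

    Hypothetical helper: samples with replacement so that vertices with
    few neighbors still fill a full row, and falls back to the vertex
    itself when it has no neighbors at all.
    """
    out = np.empty((len(batch_ids), num_neighbors), dtype=np.int64)
    for row, vid in enumerate(batch_ids):
        neighbors = adjacency.get(vid, [])
        if len(neighbors) == 0:
            out[row, :] = vid
        else:
            out[row, :] = rng.choice(neighbors, size=num_neighbors, replace=True)
    return out


rng = np.random.default_rng(0)
# Toy graph: vertex 3 is isolated, vertex 2 has a single neighbor.
adjacency = {0: [1, 2, 4, 5], 1: [0], 2: [0], 3: [], 4: [0, 5], 5: [0, 4]}
features = rng.normal(size=(6, 4))

batch_ids = [0, 2, 3, 5]
neigh_ids = sample_fixed_neighbors(adjacency, batch_ids, num_neighbors=3, rng=rng)

self_vecs = features[batch_ids]               # [batch, dim]
neigh_vecs = features[neigh_ids].mean(axis=1)  # [batch, dim], one row per vertex
updated = np.stack([self_vecs, neigh_vecs], axis=0).sum(axis=0)

print(neigh_ids.shape, self_vecs.shape, neigh_vecs.shape, updated.shape)
```

Sampling with replacement is one simple way to keep the row width constant; padding with a sentinel id and masking would be another.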

yecol commented 2 years ago

Thanks for reporting! We will give a reasonable value as default.

acezen commented 2 years ago

@zhixiongning neighs_num will now be set to a reasonable value when it is None. The commit will be released in v0.12.0.