alibaba / euler

A distributed graph deep learning framework.

Hello, after distributed training finishes, exporting the model fails with the error below. How can I resolve this? #182

Open ShangJP opened 4 years ago

ShangJP commented 4 years ago

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

hdfs://ip-xxx-xx-xx-xx.ec2.internal:9000/euler/BR/single_case_embedding/model.ckpt-52.data-00000-of-00001; Invalid argument [[node save/RestoreV2 (defined at python/run_loop.py:153) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](_recv_save/Const_0_S1, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

alinamimi commented 4 years ago

This is most likely due to a mismatch between the current graph and the graph from the checkpoint. It looks like the network you build at infer time does not match the checkpoint.

ShangJP commented 4 years ago

Hello, I do not quite understand what "network" means here; could you explain in more detail? When I run a distributed test on a single machine, training completes fine, but loading the model from HDFS throws the error above, while copying the model to the local filesystem first works without problems. Could you help analyze the cause? Many thanks.

alinamimi commented 4 years ago

By "network" I mean your model. The error says that the model you build at infer time does not match the model stored in the checkpoint.
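
A quick way to see what the "mismatch" actually is (a diagnostic sketch, not part of tf_euler; the checkpoint prefix below is the one from this thread, substitute your own) is to list the variables stored in the checkpoint and compare them with the variables the save_embedding graph wants to restore:

    import tensorflow as tf

    # Checkpoint prefix, i.e. the path without the .data-*/.index suffix.
    ckpt = 'hdfs://ip-xxx-xx-xx-xx.ec2.internal:9000/euler/BR/single_case_embedding/model.ckpt-52'

    # (name, shape) pairs actually stored in the checkpoint files.
    for name, shape in tf.train.list_variables(ckpt):
        print('%s %s' % (name, shape))

    # Inside the save_embedding script, after the model graph has been built,
    # print the variables the graph expects and compare the two lists:
    #     for v in tf.global_variables():
    #         print('%s %s' % (v.op.name, v.shape.as_list()))

If the names or shapes differ (for example because --dim or --max_id changed between train and save_embedding), save/RestoreV2 fails with exactly this kind of message.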

ShangJP commented 4 years ago

@alinamimi @yangsiran @renyi533 @wenshiyang Hello, I ran the tests again and found something. I am doing distributed training on a single node; the instance's memory configuration is as follows:

    [ec2-user@ip-172-40-57-160 ~]$ free -h -m
                  total        used        free      shared  buff/cache   available
    Mem:           123G        3.9G        104G        772K         15G        118G
    Swap:            0B          0B          0B

I ran the following two experiments:

1. Training with 1,246,449 nodes: after training finishes, the model can export embeddings normally.
    The scripts are as follows:
        ps node:
        nohup python -m tf_euler \
          --ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
          --worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
          --model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/line_embedding \
          --job_name=ps --task_index=0 > result &
        worker node:
        nohup python -m tf_euler \
          --ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
          --worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
          --job_name=worker \
          --task_index=0 \
          --data_dir hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/ \
          --model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/line_embedding \
          --euler_zk_addr ip-172-40-93-133.ec2.internal:2181 \
          --euler_zk_path /br_em_test_000 \
          --max_id  1246449  --learning_rate 0.001 \
          --num_epochs 5 \
          --xnet_loss True \
          --batch_size 160000 \
          --log_steps 10 \
          --model line  --mode train  --dim 128 &
        save_embedding:
        python -m tf_euler \
          --ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
          --worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
          --job_name=worker --task_index=0 \
          --data_dir hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/ \
          --model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/line_embedding \
          --euler_zk_addr ip-172-40-93-133.ec2.internal:2181 \
          --euler_zk_path /br_em_test_000 \
          --max_id  1246449 --learning_rate 0.001 \
          --num_epochs 5 --xnet_loss True \
          --batch_size 160000 --log_steps 20 \
          --model line  --mode save_embedding  --dim 128

Embedding generation log:

I1126 07:29:19.592020 42192 remote_graph.cc:106] Retrieve meta info success, shard number: 1
I1126 07:29:19.592037 42192 remote_graph.cc:119] Retrieve meta info success, partition number: 16
I1126 07:29:19.592047 42192 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 1246449.000000
I1126 07:29:19.592054 42192 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:
2019-11-26 07:29:19,761 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/line_embedding/model.ckpt-35
2019-11-26 07:29:19.985588: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 489eb821d1524253 with config: gpu_options { allow_growth: true }
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2019-11-26 07:30:08,755 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:09,169 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:09,583 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:09,968 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:10,366 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:11,231 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

2. Training with 8,428,196 nodes: after training finishes, the model cannot export embeddings and the job fails with an error.
    The scripts are as follows:
        ps node:
        nohup python -m tf_euler \
          --ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
          --worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
          --model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding \
          --job_name=ps --task_index=0 > result &
        worker node:
        nohup python -m tf_euler \
          --ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
          --worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
          --job_name=worker --task_index=0 \
          --data_dir hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/spatk_test_data/ \
          --model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding \
          --euler_zk_addr ip-172-40-93-133.ec2.internal:2181 \
          --euler_zk_path /br_em_test_000 \
          --max_id  8428196  --learning_rate 0.001 \
          --num_epochs 5  --xnet_loss True \
          --batch_size 320000 --log_steps 10 \
          --model line  --mode train  --dim 128 &
        save_embedding:
        python -m tf_euler \
          --ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
          --worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
          --job_name=worker --task_index=0 \
          --data_dir hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/spatk_test_data/ \
          --model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding \
          --euler_zk_addr ip-172-40-93-133.ec2.internal:2181 \
          --euler_zk_path /br_em_test_000 \
          --max_id  8428196 --learning_rate 0.001 \
          --num_epochs 5 --xnet_loss True \
          --batch_size 320000 --log_steps 20 \
          --model line  --mode save_embedding  --dim 128  

        During embedding export an error is thrown; the log is as follows:

I1126 07:55:50.460889 44611 zk_server_monitor.cc:238] Online node: 0#172.40.57.160:1475.
I1126 07:55:50.461107 44487 remote_graph.cc:106] Retrieve meta info success, shard number: 1
I1126 07:55:50.461122 44487 remote_graph.cc:119] Retrieve meta info success, partition number: 16
I1126 07:55:50.461133 44487 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 8428196.000000
I1126 07:55:50.461140 44487 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:
2019-11-26 07:55:50,650 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
INFO:tensorflow:Graph was finalized.
2019-11-26 07:55:50,808 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:55:50,821 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:55:50,823 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
INFO:tensorflow:Restoring parameters from hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding/model.ckpt-130
2019-11-26 07:55:50.869264: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 34680880bb03e835 with config: gpu_options { allow_growth: true }
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/__main__.py", line 28, in <module>
    tf.app.run(run_loop.main)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py", line 403, in main
    run_distributed(flags_obj, run_network_embedding)
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py", line 395, in run_distributed
    run(flags_obj, server.target, flags_obj.task_index == 0)
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py", line 361, in run_network_embedding
    run_save_embedding(model, flags_obj, master, is_chief)
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py", line 198, in run_save_embedding
    config=config) as sess:
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 288, in prepare_session
    config=config)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 218, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1546, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: Read less bytes than requested
  [[node save/RestoreV2 (defined at /usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py:198) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](_recv_save/Const_0_S1, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op u'save/RestoreV2', defined at:
  File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/__main__.py", line 28, in <module>
    tf.app.run(run_loop.main)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py", line 403, in main
    run_distributed(flags_obj, run_network_embedding)
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py", line 395, in run_distributed
    run(flags_obj, server.target, flags_obj.task_index == 0)
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py", line 361, in run_network_embedding
    run_save_embedding(model, flags_obj, master, is_chief)
  File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py", line 198, in run_save_embedding
    config=config) as sess:
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 213, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 886, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): Read less bytes than requested [[node save/RestoreV2 (defined at /usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py:198) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](_recv_save/Const_0_S1, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

alinamimi commented 4 years ago

tensorflow.python.framework.errors_impl.OutOfRangeError: Read less bytes than requested looks like the root cause: the checkpoint is incomplete.
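
One way to check whether the checkpoint itself is truncated (a minimal sketch, assuming the machine can read the same HDFS path; this is not an official Euler tool) is to try reading every tensor back; a truncated .data-* shard typically fails with the same "Read less bytes than requested" error:

    import tensorflow as tf

    ckpt = 'hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding/model.ckpt-130'
    reader = tf.train.NewCheckpointReader(ckpt)

    for name, shape in sorted(reader.get_variable_to_shape_map().items()):
        try:
            reader.get_tensor(name)      # forces the bytes to actually be read
            print('OK  %s %s' % (name, shape))
        except Exception as e:           # a truncated shard shows up here
            print('BAD %s %s: %s' % (name, shape, e))

Comparing the byte size of the .data-* files (hdfs dfs -ls) right after training finishes and again before running save_embedding can also show whether the checkpoint writer finished flushing.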

ShangJP commented 4 years ago

Does a larger model lead to an incomplete checkpoint? I just tested again: with the same 8,428,196 nodes, lowering the dimension to 32 lets the export succeed normally.

alinamimi commented 4 years ago

The error says the checkpoint is incomplete; I am not sure of the exact cause.

ShangJP commented 4 years ago

OK, thank you for your help.

siallen commented 4 years ago

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

hdfs://ip-xxx-xx-xx-xx.ec2.internal:9000/euler/BR/single_case_embedding/model.ckpt-52.data-00000-of-00001; Invalid argument [[node save/RestoreV2 (defined at python/run_loop.py:153) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](_recv_save/Const_0_S1, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

I hit exactly the same problem, and it also goes away after I shrink the training data. How did you solve it in the end? It feels like a checkpoint that is too large can trigger this. I do not seem to get the "OutOfRangeError (see above for traceback): Read less bytes than requested" error, though.

alinamimi commented 4 years ago

If the error disappears after you shrink the training data, I suspect the issue is not that the data is too large but that the data contains values that are out of range.

alinamimi commented 4 years ago

You could try to pinpoint which specific training record triggers the error.
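
For example (a hypothetical sketch: it assumes the pre-conversion graph data is JSON with one record per line and integer id fields named node_id/src_id/dst_id; adjust to your actual format), scanning for ids larger than the --max_id passed to tf_euler could look like this:

    import json
    import sys

    MAX_ID = 8428196  # the value passed as --max_id

    def check(path):
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                rec = json.loads(line)
                for key in ('node_id', 'src_id', 'dst_id'):
                    if key in rec and rec[key] > MAX_ID:
                        print('%s line %d: %s=%d exceeds max_id'
                              % (path, lineno, key, rec[key]))

    if __name__ == '__main__':
        for p in sys.argv[1:]:
            check(p)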

poryfly commented 4 years ago

> Does a larger model lead to an incomplete checkpoint? I just tested again: with the same 8,428,196 nodes, lowering the dimension to 32 lets the export succeed normally.

How did you end up solving this problem?

lixusign commented 4 years ago

This problem definitely exists; could the maintainers please test on a large dataset (1B+ variables)? Also, save_emb uses a huge amount of memory; writing the output incrementally solves that. It would be good if the official code were updated for this as well.

alinamimi commented 4 years ago

If the data volume is large, you also need more TF workers when saving embeddings, so that the number of embeddings each worker actually has to infer stays manageable. You cannot use a single worker to infer all embeddings: it is slow and the memory will blow up.
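
The idea can be sketched like this (not tf_euler's actual sharding code; worker_count and task_index mirror the --worker_hosts and --task_index flags used above): each worker only infers the embeddings for its own contiguous slice of ids, so time and memory per worker shrink as workers are added.

    def id_slice_for_worker(max_id, worker_count, task_index):
        """Return the [start, end) id range this worker is responsible for."""
        total = max_id + 1                          # ids run from 0 to max_id
        per_worker = (total + worker_count - 1) // worker_count
        start = task_index * per_worker
        end = min(start + per_worker, total)
        return start, end

    # e.g. splitting the 8,428,196-node graph over 8 workers:
    print(id_slice_for_worker(8428196, 8, 0))   # (0, 1053525)
    print(id_slice_for_worker(8428196, 8, 7))   # (7374675, 8428197)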

lixusign commented 4 years ago

I just submitted a commit for incremental save; please take a look and help review it. In practice, with 46 GB of embeddings a single worker needs less than 3 GB of memory.
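
For readers of the thread, the incremental idea looks roughly like this (only an illustration of the approach, not the code in the pull request; sess, id_tensor and emb_tensor are assumed to come from the save_embedding graph): flush each batch to the output file as soon as it is computed instead of concatenating all embeddings in memory first.

    import tensorflow as tf

    def save_embeddings_incrementally(sess, id_tensor, emb_tensor, out_path):
        """Stream (id, embedding) rows to disk batch by batch."""
        with open(out_path, 'w') as out:
            while True:
                try:
                    ids, embs = sess.run([id_tensor, emb_tensor])
                except tf.errors.OutOfRangeError:   # input pipeline exhausted
                    break
                for node_id, emb in zip(ids, embs):
                    out.write('%d\t%s\n'
                              % (node_id, ' '.join('%.6f' % x for x in emb)))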

wsnooker commented 4 years ago

> This problem definitely exists; could the maintainers please test on a large dataset (1B+ variables)? Also, save_emb uses a huge amount of memory; writing the output incrementally solves that. It would be good if the official code were updated for this as well.

@lixusign @ShangJP @siallen @alinamimi Has the problem of the model failing to load during save embedding on large datasets been solved?

lixusign commented 4 years ago

Solved. See the pull request I submitted; just copy the code over and it works.

wsnooker commented 4 years ago

> Solved. See the pull request I submitted; just copy the code over and it works.

I tried it and it still does not work. Your PR changes save embedding to be incremental, but my problem is the same as the original poster's: restoring the model fails with a "model mismatch" message, even though the dumped model parameters check out fine. The error is as follows:

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error: hdfs://bjlt-rs367.sy:8888/home/ad/data/graph_data/merchant_p5s_graph_monthnew/model/model.ckpt-5477704.data-00001-of-00002; Invalid argument [[node save/RestoreV2_1 (defined at run_loop.py:210) = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](_recv_save/Const_0_S3, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

lixusign commented 4 years ago

I have not run into this particular problem. If the model itself was not changed, could it be that the TensorFlow version or some other library in the environment used for saving differs from the one used for training?
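
A quick sanity check (a minimal sketch, nothing Euler-specific) is to print the library versions in both the training and the save_embedding environments and confirm they match:

    import tensorflow as tf

    print('tensorflow', tf.__version__)           # should match on both sides
    print('lib path  ', tf.sysconfig.get_lib())   # catches mixed installations

    # From the shell, `pip freeze | grep -Ei 'euler|tensorflow'` covers the rest.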

wsnooker commented 4 years ago

> I have not run into this particular problem. If the model itself was not changed, could it be that the TensorFlow version or some other library in the environment used for saving differs from the one used for training?

It only depends on the size of the graph data; once I reduce the graph scale it works fine.