ShangJP opened this issue 4 years ago
This is most likely due to a mismatch between the current graph and the graph from the checkpoint. It looks like the network used at infer time does not match the checkpoint.
Hello, I don't quite understand what this "network" refers to; could you explain it in more detail? When I run a distributed test on a single machine, loading the model from HDFS after training finishes raises the error above, but if I fetch the model to local disk there is no problem. Could you help me analyze the cause? Many thanks.
The "network" I mentioned means your model; the error says the model at infer time does not match the model in the checkpoint.
@alinamimi @yangsiran @renyi533 @wenshiyang Hello, I ran the test again and found a problem. I am doing distributed training on a single node; the instance's memory configuration is as follows:

[ec2-user@ip-172-40-57-160 ~]$ free -h -m
              total        used        free      shared  buff/cache   available
Mem:           123G        3.9G        104G        772K         15G        118G
Swap:            0B          0B          0B
I ran the following two experiments:
1. Training on 1,246,449 nodes: after training finishes, the model generates embeddings normally.
The scripts are as follows:
ps node:
nohup python -m tf_euler \
--ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
--worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
--model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/line_embedding \
--job_name=ps --task_index=0 > result &
worker node:
nohup python -m tf_euler \
--ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
--worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
--job_name=worker \
--task_index=0 \
--data_dir hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/ \
--model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/line_embedding \
--euler_zk_addr ip-172-40-93-133.ec2.internal:2181 \
--euler_zk_path /br_em_test_000 \
--max_id 1246449 --learning_rate 0.001 \
--num_epochs 5 \
--xnet_loss True \
--batch_size 160000 \
--log_steps 10 \
--model line --mode train --dim 128 &
save_embedding:
python -m tf_euler \
--ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
--worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
--job_name=worker --task_index=0 \
--data_dir hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/ \
--model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/line_embedding \
--euler_zk_addr ip-172-40-93-133.ec2.internal:2181 \
--euler_zk_path /br_em_test_000 \
--max_id 1246449 --learning_rate 0.001 \
--num_epochs 5 --xnet_loss True \
--batch_size 160000 --log_steps 20 \
--model line --mode save_embedding --dim 128
Embedding-generation log:
I1126 07:29:19.592020 42192 remote_graph.cc:106] Retrieve meta info success, shard number: 1
I1126 07:29:19.592037 42192 remote_graph.cc:119] Retrieve meta info success, partition number: 16
I1126 07:29:19.592047 42192 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 1246449.000000
I1126 07:29:19.592054 42192 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:
2019-11-26 07:29:19,761 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from hdfs://ip-172-40-58-54.ec2.internal:9000/euler/test/line_embedding/model.ckpt-35
2019-11-26 07:29:19.985588: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 489eb821d1524253 with config: gpu_options { allow_growth: true }
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2019-11-26 07:30:08,755 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:09,169 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:09,583 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:09,968 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:10,366 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:30:11,231 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2. Training on 8,428,196 nodes: after training finishes, generating embeddings fails with an error.
The scripts are as follows:
ps node:
nohup python -m tf_euler \
--ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
--worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
--model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding \
--job_name=ps --task_index=0 > result &
worker node:
nohup python -m tf_euler \
--ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
--worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
--job_name=worker --task_index=0 \
--data_dir hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/spatk_test_data/ \
--model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding \
--euler_zk_addr ip-172-40-93-133.ec2.internal:2181 \
--euler_zk_path /br_em_test_000 \
--max_id 8428196 --learning_rate 0.001 \
--num_epochs 5 --xnet_loss True \
--batch_size 320000 --log_steps 10 \
--model line --mode train --dim 128 &
save_embedding:
python -m tf_euler \
--ps_hosts=ip-172-40-57-160.ec2.internal:1999 \
--worker_hosts=ip-172-40-57-160.ec2.internal:2000 \
--job_name=worker --task_index=0 \
--data_dir hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/spatk_test_data/ \
--model_dir=hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding \
--euler_zk_addr ip-172-40-93-133.ec2.internal:2181 \
--euler_zk_path /br_em_test_000 \
--max_id 8428196 --learning_rate 0.001 \
--num_epochs 5 --xnet_loss True \
--batch_size 320000 --log_steps 20 \
--model line --mode save_embedding --dim 128
The following error occurs during embedding generation; the log is:
I1126 07:55:50.460889 44611 zk_server_monitor.cc:238] Online node: 0#172.40.57.160:1475.
I1126 07:55:50.461107 44487 remote_graph.cc:106] Retrieve meta info success, shard number: 1
I1126 07:55:50.461122 44487 remote_graph.cc:119] Retrieve meta info success, partition number: 16
I1126 07:55:50.461133 44487 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 8428196.000000
I1126 07:55:50.461140 44487 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:
2019-11-26 07:55:50,650 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
INFO:tensorflow:Graph was finalized.
2019-11-26 07:55:50,808 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:55:50,821 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-11-26 07:55:50,823 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
INFO:tensorflow:Restoring parameters from hdfs://ip-172-40-58-54.ec2.internal:9000/euler/BR/LINE_embedding/model.ckpt-130
2019-11-26 07:55:50.869264: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 34680880bb03e835 with config: gpu_options { allow_growth: true }
Traceback (most recent call last):
File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/main.py", line 28, in <module>
Caused by op u'save/RestoreV2', defined at:
File "/usr/lib64/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/main.py", line 28, in <module>
OutOfRangeError (see above for traceback): Read less bytes than requested [[node save/RestoreV2 (defined at /usr/lib/python2.7/site-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/run_loop.py:198) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](_recv_save/Const_0_S1, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
tensorflow.python.framework.errors_impl.OutOfRangeError: Read less bytes than requested — this looks like the underlying error: the checkpoint is incomplete.
So a larger model can lead to an incomplete checkpoint, right? Because I just tested: on 8,428,196 nodes, lowering the dimension to 32 lets the export finish normally.
From the error it is an incomplete checkpoint; I'm not sure about the exact cause.
OK, thanks for the explanation.
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
hdfs://ip-xxx-xx-xx-xx.ec2.internal:9000/euler/BR/single_case_embedding/model.ckpt-52.data-00000-of-00001; Invalid argument [[node save/RestoreV2 (defined at python/run_loop.py:153) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](_recv_save/Const_0_S1, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
I am hitting exactly the same problem as you, and the error also goes away once I shrink the training data. May I ask how you solved it in the end? It feels like a ckpt that is too large can trigger this. However, I don't seem to get the "OutOfRangeError (see above for traceback): Read less bytes than requested" kind of error.
If shrinking the training data makes the error go away, I don't think the issue is that your training data is too large; more likely the training data contains ids that are out of range.
You can try to locate which specific training record triggers the error.
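A minimal sketch of such a scan, assuming the training data is a plain-text edge list of `src dst` id pairs (the format and function name are illustrative, not Euler's actual input format):

```python
def find_out_of_range_ids(lines, max_id):
    """Scan edge records for node ids outside [0, max_id].

    Ids beyond --max_id index past the embedding table, which can
    corrupt training state and break the later restore.
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        for field in line.split():
            node_id = int(field)
            if node_id < 0 or node_id > max_id:
                bad.append((lineno, node_id))
    return bad

# With --max_id 8428196, an id of 9000000 is out of range:
print(find_out_of_range_ids(["12 34", "8428196 7", "9000000 5"], 8428196))
# → [(3, 9000000)]
```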
So a larger model can lead to an incomplete checkpoint, right? Because I just tested: on 8,428,196 nodes, lowering the dimension to 32 lets the export finish normally.
How did you end up solving this problem?
This problem really does exist. Could the maintainers please test on a large dataset (1B+ variables)? Also, the save_emb method uses a huge amount of memory; once I made it write the data out incrementally, the problem was solved. I think the official code should be changed here as well.
If the data volume is large, you also need more TF workers when saving embeddings, so that the number of embeddings each worker actually has to infer stays manageable. You cannot have a single worker infer all the embeddings: it is too slow, and it will run out of memory.
I just submitted a commit, please take a look: incremental save. Please help review it. Measured on 46 GB of embeddings, a single worker needs less than 3 GB of memory.
@lixusign @ShangJP @siallen @alinamimi Has the problem of model loading failing when saving embeddings on large datasets been solved?
Solved. See the pull request I submitted; just paste the code over and it works.
I tried it, and it still doesn't work. The logic of your PR is to change save embedding to be incremental, but my problem is the same as the OP's: restoring the model fails with a "model mismatch" message, even though I checked the dumped model parameters and they are fine. The error message is as follows: InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error: hdfs://bjlt-rs367.sy:8888/home/ad/data/graph_data/merchant_p5s_graph_monthnew/model/model.ckpt-5477704.data-00001-of-00002; Invalid argument [[node save/RestoreV2_1 (defined at run_loop.py:210) = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](_recv_save/Const_0_S3, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]
I haven't run into a similar problem here. If the model itself wasn't changed, could it be that the TF version, or some library, used when saving differs from the one used for training?
It only depends on the size of the graph data; reducing the graph scale makes it work.