alibaba / x-deeplearning

An industrial deep learning framework for high-dimension sparse data

Huge discrepancy between online tdm serving evaluation results and offline model predict evaluation results #299

Open cuisonghui opened 4 years ago

cuisonghui commented 4 years ago

Using real production data, I have both the online and offline (distributed) TDM att pipelines running end to end (CPU version).
Offline part: x-deeplearning/xdl-algorithm-solution/TDM/script/tdm_ub_att_ubuntu
Online model conversion: x-deeplearning/blaze/tools/example_model/tdm_att
Online evaluation: x-deeplearning/xdl-algorithm-solution/TDMServing/evaluation

The offline model predict results:

predict result:
        global_sample_num: 2548
        global_r_num: 3794
        global_gt_num: 40368
        global_p_num: 254800
        global_r: 0.093985
        global_p: 0.014890
        avg_r: 0.115653
        avg_p: 0.014890

Last week I ran the same test set through tdm serving evaluation and the results were essentially unusable:

avg_precision: 0.00141051
avg_recall: 0.0228302
avg_f1_score: 0.00256338

I checked the tree-retrieval configuration parameters online and offline and they are aligned: 400 nodes are kept per tree level, and the final top 200 are used for evaluation.
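
For reference, a minimal sketch (my own illustration, not the repo's retrieval code) of the layer-wise beam search this configuration describes, assuming tree nodes expose a children list and score_fn scores a node:

import heapq

def tdm_retrieve(root, score_fn, level_topk=400, final_topk=200):
    # Expand level by level: score every child of the current beam and
    # keep the top `level_topk` nodes. Assumes a complete tree whose
    # leaves all sit on the last level.
    beam = [root]
    while beam and beam[0].children:
        children = [c for n in beam for c in n.children]
        beam = heapq.nlargest(level_topk, children, key=score_fn)
    # The surviving leaves are ranked once more for the final top k.
    return heapq.nlargest(final_topk, beam, key=score_fn)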

During model conversion (model_converter_example.sh) I hit the error (error: name is not contained in mxnet data=fc_b_1), and found that fc_b_1 and fc_b_2 had not been saved into the dense file produced by the offline model. At the time I simply deleted the corresponding bias entries from graph_ulf.txt, after which the conversion succeeded.

In hindsight, I assumed this was what caused the online/offline metric mismatch.

Follow-up investigation showed that in the offline model export, fc_b_1 and fc_b_2 were indeed not saved into the checkpoint. I then modified the relevant parts of train.py in tdm_ub_att_ubuntu so that fc_b_1 and fc_b_2 are saved successfully. The change is in FullyConnected3D in tdm_layer_master.py:

- self.bias = mx.sym.ones(shape=(1, self.output_dim)) * 0.1
+ self.bias = mx.sym.var(name='fc_b_%s' % self.version, shape=(1, self.output_dim), init=mx.init.Constant(0.1))
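
As a side note, a minimal MXNet sketch (my own illustration, not repo code) of why the original constant bias never reached the checkpoint: symbols built with mx.sym.ones() are anonymous constants, not named arguments, so anything that walks list_arguments() for export never sees them, while mx.sym.var() registers a named, checkpointable parameter:

import mxnet as mx

x = mx.sym.Variable('data')
const_bias = mx.sym.ones(shape=(1, 8)) * 0.1        # anonymous constant
var_bias = mx.sym.var(name='fc_b_1', shape=(1, 8),
                      init=mx.init.Constant(0.1))   # named parameter

print(mx.sym.broadcast_add(x, const_bias).list_arguments())  # ['data']
print(mx.sym.broadcast_add(x, var_bias).list_arguments())    # ['data', 'fc_b_1']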

After retraining, the offline model predict results were essentially unchanged from before the fix, and model conversion no longer reported the missing-variable error, but the online evaluation results were still very poor.

Also, from the code, the offline predict evaluation uses the last checkpoint, while online serving uses the model files exported by worker 0.

I have also found that in online inference the scores of the final top items are all 1.

For the same test sample, the offline top item scores are:

2949127_T:20463, 0.999499;50142, 0.999482;10959, 0.999434;83256, 0.999244;16435, 0.999200;30899, 0.999035;101882, 0.998858;59799, 0.998675;12190, 0.998552;40287, 0.998492;19108, 0.998490;78018, 0.998479;20540, 0.998329;95618, 0.998156;64900, 0.998137;37469, 0.998074;38073, 0.998044;42087, 0.997949;30902, 0.997931;54074, 0.997561;20843, 0.997425;59043, 0.997383;45780, 0.997378;23563, 0.997245;89596, 0.997124;20541, 0.997077;19778, 0.997012;20465, 0.996992;46399, 0.996713;27555, 0.996561;29483, 0.996416;32650, 0.996198;25304, 0.996149;56910, 0.996009;85521, 0.995945;12093, 0.995805;48153, 0.995743;54072, 0.995560;83236, 0.995437;76131, 0.995380;31814, 0.995288;27888, 0.995282;83262, 0.995143;20504, 0.994918;43837, 0.994917;61074, 0.994676;28350, 0.994396;39025, 0.994335;20473, 0.994318;41505, 0.993801;36632, 0.993600;96427, 0.993572;83223, 0.993534;15597, 0.993285;77203, 0.993166;61453, 0.992992;15603, 0.992714;33379, 0.992648;75105, 0.991894;17636, 0.991852;40495, 0.991694;17307, 0.991462;52800, 0.991134;52674, 0.990924;49948, 0.990409;76146, 0.990066;74603, 0.989750;13273, 0.989550;17716, 0.989269;25253, 0.988323;29477, 0.987356;16444, 0.985858;76168, 0.985582;42078, 0.985433;62847, 0.985287;23562, 0.984219;101883, 0.983762;15600, 0.983568;42089, 0.983296;38671, 0.982821;49946, 0.981332;30764, 0.981068;59793, 0.980869;46385, 0.980765;52840, 0.979820;103712, 0.979248;33037, 0.979136;27609, 0.978070;32640, 0.976367;113510, 0.972216;77519, 0.966582;62773, 0.965298;33772, 0.961509;84408, 0.958087;32643, 0.953578;26821, 0.925344;16467, 0.919517;122714, 0.917185;130521, 0.896764;27470, 0.682379;

The online top item scores are:

I1125 16:10:15.730403 20904 tdm_example.cpp:165] Response: res_code: RC_SUCCESS
result_unit {
  id: 22551
  score: 1
}
result_unit {
  id: 52390
  score: 1
}
result_unit {
  id: 22248
  score: 1
}
result_unit {
  id: 96753
  score: 1
}
result_unit {
  id: 76728
  score: 1
}

...........

Could you please help locate the problem?

@lovickie @zhuhan1236 @songyue1104 @MaButing

MaButing commented 4 years ago

Which version are you running? Which specific commit?

cuisonghui commented 4 years ago

Which version are you running? Which specific commit?

Commit 04cc049 on master.

cuisonghui commented 4 years ago

From one online scoring run, a randomly chosen stretch of scoring results:

I1126 10:42:44.168023 27738 blaze_model.cpp:311] result tensor score6.63089e-08
I1126 10:42:44.168028 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168032 27738 blaze_model.cpp:311] result tensor score4.39856e-32
I1126 10:42:44.168038 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168042 27738 blaze_model.cpp:311] result tensor score7.29002e-06
I1126 10:42:44.168058 27738 blaze_model.cpp:311] result tensor score0.999993
I1126 10:42:44.168064 27738 blaze_model.cpp:311] result tensor score3.96916e-18
I1126 10:42:44.168069 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168073 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168078 27738 blaze_model.cpp:311] result tensor score5.61036e-09
I1126 10:42:44.168083 27738 blaze_model.cpp:311] result tensor score3.29945e-07
I1126 10:42:44.168088 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168092 27738 blaze_model.cpp:311] result tensor score0.000238458
I1126 10:42:44.168097 27738 blaze_model.cpp:311] result tensor score0.999762
I1126 10:42:44.168102 27738 blaze_model.cpp:311] result tensor score0.000439926
I1126 10:42:44.168107 27738 blaze_model.cpp:311] result tensor score0.99956
I1126 10:42:44.168112 27738 blaze_model.cpp:311] result tensor score0.436755
I1126 10:42:44.168117 27738 blaze_model.cpp:311] result tensor score0.563245
I1126 10:42:44.168121 27738 blaze_model.cpp:311] result tensor score0.00551774
I1126 10:42:44.168126 27738 blaze_model.cpp:311] result tensor score0.994482
I1126 10:42:44.168131 27738 blaze_model.cpp:311] result tensor score0.990996
I1126 10:42:44.168136 27738 blaze_model.cpp:311] result tensor score0.00900417
I1126 10:42:44.168140 27738 blaze_model.cpp:311] result tensor score1.69806e-35
I1126 10:42:44.168146 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168151 27738 blaze_model.cpp:311] result tensor score9.43993e-25
I1126 10:42:44.168156 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168161 27738 blaze_model.cpp:311] result tensor score3.46481e-18
I1126 10:42:44.168166 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168170 27738 blaze_model.cpp:311] result tensor score1.57335e-17
I1126 10:42:44.168175 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168180 27738 blaze_model.cpp:311] result tensor score1.46434e-18
I1126 10:42:44.168185 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168190 27738 blaze_model.cpp:311] result tensor score1.11895e-11
I1126 10:42:44.168195 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168200 27738 blaze_model.cpp:311] result tensor score5.23381e-13
I1126 10:42:44.168205 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168208 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168213 27738 blaze_model.cpp:311] result tensor score2.24174e-12
I1126 10:42:44.168218 27738 blaze_model.cpp:311] result tensor score4.20645e-08
I1126 10:42:44.168223 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168228 27738 blaze_model.cpp:311] result tensor score1.9767e-19
I1126 10:42:44.168233 27738 blaze_model.cpp:311] result tensor score1
I1126 10:42:44.168238 27738 blaze_model.cpp:311] result tensor score4.37529e-05
I1126 10:42:44.168242 27738 blaze_model.cpp:311] result tensor score0.999956
I1126 10:42:44.168247 27738 blaze_model.cpp:311] result tensor score0.0252053
I1126 10:42:44.168252 27738 blaze_model.cpp:311] result tensor score0.974795
I1126 10:42:44.168257 27738 blaze_model.cpp:311] result tensor score0.0685788

Apart from the final top 200, whose scores are all 1, the other nodes do receive normal scores.

cuisonghui commented 4 years ago

A while ago I found two problems during offline model predict; I don't know whether they are related to the online/offline metric mismatch:

Problem 1

During offline model predict (the distributed one, which launches a PS), I ran into out-of-bounds array accesses (C++ vector out of range). I don't know whether this is specific to my setup or a general issue, nor whether it is related to the metric mismatch, so I added two bounds checks, as follows:

In the DMPREDICTOP::TDMExpandSample() function: ...........

   std::vector<std::pair<int64_t, float> > unit_expand_ids_props;
  -  for (int i = 0; i < unit_expand_ids.size(); ++i) {
  +  for (int i = 0; i < unit_expand_ids.size() && i < unit_expand_props.size(); ++i) {
     unit_expand_ids_props.push_back(
         std::make_pair(unit_expand_ids.at(i), unit_expand_props.at(i)));
   }

.......

     std::vector<int64_t> final_top_ids;
  -    for (int i = 0; i < final_topk_; ++i) {
  +    for (int i = 0; i < final_topk_ && i < pred_ids.size(); ++i) {
       final_top_ids.push_back(pred_ids.at(i).first);
     }

....

Problem 2

The "predict_io_pause_num": 10000 entry in tdm.json must be set smaller, otherwise the predict process hangs; I suspect a deadlock. After changing it to 100 the hang no longer occurs. I traced predict_io_pause_num through the code: it appears to throttle DoParse relative to DoPack, i.e. if DoParse has produced predict_io_pause_num messages that DoPack has not yet consumed, DoParse waits.
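
As a rough Python analogy (not the XDL code) for that throttling behavior: DoParse and DoPack act like a producer/consumer pair with a bounded queue whose capacity is predict_io_pause_num, so the producer blocks once it is that many messages ahead:

import queue
import threading

predict_io_pause_num = 100  # the tdm.json knob discussed above
q = queue.Queue(maxsize=predict_io_pause_num)

def do_parse():                 # producer, stands in for DoParse
    for msg in range(1000):
        q.put(msg)              # blocks while the queue is full

def do_pack():                  # consumer, stands in for DoPack
    for _ in range(1000):
        q.get()                 # consuming frees a slot for do_parse

p = threading.Thread(target=do_parse)
c = threading.Thread(target=do_pack)
p.start(); c.start()
p.join(); c.join()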

cuisonghui commented 4 years ago

Also, in the model's train.py script I changed one line of code (to fix the PS scheduler exiting abnormally, #286); I don't know whether this affects how the final model output is saved.

....

if is_training:
-    xdl.execute(xdl.ps_synchronize_leave_op(np.array(xdl.get_task_index(), dtype=np.int32)))
+    xdl.execute(xdl.worker_report_finish_op(np.array(xdl.get_task_index(), dtype=np.int32)))
if xdl.get_task_index() == 0:
    print 'start put item_emb'

....

cuisonghui commented 4 years ago

Two experiments:

Experiment 1: after removing all items with score 1, I tested several hundred samples; recall was 0 for all of them.

Experiment 2: after raising the per-level retrieval count from 400 to 4000, recall is quite respectable (the tree has about 1.8 million nodes, including non-leaf nodes):

I1126 16:24:24.639098  5673 tdm_evaluation.cpp:149] recall: 0.307692
I1126 16:25:13.204648  5673 tdm_evaluation.cpp:149] recall: 0.588235
I1126 16:25:43.108530  5673 tdm_evaluation.cpp:149] recall: 0.466667
I1126 16:26:13.128660  5673 tdm_evaluation.cpp:149] recall: 0.285714
I1126 16:26:43.059156  5673 tdm_evaluation.cpp:149] recall: 0.8
I1126 16:27:12.972267  5673 tdm_evaluation.cpp:149] recall: 0.16
I1126 16:27:42.979049  5673 tdm_evaluation.cpp:149] recall: 0.636364
I1126 16:28:12.959107  5673 tdm_evaluation.cpp:149] recall: 0.375
I1126 16:28:42.873072  5673 tdm_evaluation.cpp:149] recall: 0.5
I1126 16:29:12.715451  5673 tdm_evaluation.cpp:149] recall: 0.571429
I1126 16:29:42.633754  5673 tdm_evaluation.cpp:149] recall: 0.6
I1126 16:30:12.719106  5673 tdm_evaluation.cpp:149] recall: 0.375
I1126 16:30:42.685216  5673 tdm_evaluation.cpp:149] recall: 1
I1126 16:31:12.640736  5673 tdm_evaluation.cpp:149] recall: 0.714286
I1126 16:31:42.584805  5673 tdm_evaluation.cpp:149] recall: 0.625
I1126 16:32:12.495752  5673 tdm_evaluation.cpp:149] recall: 0.5
I1126 16:32:42.354529  5673 tdm_evaluation.cpp:149] recall: 0.333333
I1126 16:33:12.234979  5673 tdm_evaluation.cpp:149] recall: 0.535714
I1126 16:33:42.176301  5673 tdm_evaluation.cpp:149] recall: 0.428571
I1126 16:34:12.034087  5673 tdm_evaluation.cpp:149] recall: 0.3
I1126 16:34:41.860433  5673 tdm_evaluation.cpp:149] recall: 0.222222
I1126 16:35:11.825886  5673 tdm_evaluation.cpp:149] recall: 0.285714
I1126 16:35:41.581423  5673 tdm_evaluation.cpp:149] recall: 0.545455
I1126 16:36:11.443192  5673 tdm_evaluation.cpp:149] recall: 0.25
I1126 16:36:41.351567  5673 tdm_evaluation.cpp:149] recall: 0.7
I1126 16:37:11.204872  5673 tdm_evaluation.cpp:149] recall: 0.322581
I1126 16:37:41.035773  5673 tdm_evaluation.cpp:149] recall: 0.5
I1126 16:38:10.762917  5673 tdm_evaluation.cpp:149] recall: 0.173913
I1126 16:38:40.685608  5673 tdm_evaluation.cpp:149] recall: 0.269231
I1126 16:39:10.481540  5673 tdm_evaluation.cpp:149] recall: 0.352941
I1126 16:39:40.303911  5673 tdm_evaluation.cpp:149] recall: 0.5
I1126 16:40:10.253180  5673 tdm_evaluation.cpp:149] recall: 0.315789
I1126 16:40:39.975806  5673 tdm_evaluation.cpp:149] recall: 0.466667
I1126 16:41:09.805229  5673 tdm_evaluation.cpp:149] recall: 0.8
I1126 16:41:39.756392  5673 tdm_evaluation.cpp:149] recall: 0.2
I1126 16:42:09.657151  5673 tdm_evaluation.cpp:149] recall: 0.777778
I1126 16:42:39.501482  5673 tdm_evaluation.cpp:149] recall: 0.6
I1126 16:43:09.410542  5673 tdm_evaluation.cpp:149] recall: 0.6
I1126 16:43:39.299703  5673 tdm_evaluation.cpp:149] recall: 0.40625
I1126 16:44:09.143477  5673 tdm_evaluation.cpp:149] recall: 0.347826
I1126 16:44:39.188099  5673 tdm_evaluation.cpp:149] recall: 0.5
I1126 16:45:09.005282  5673 tdm_evaluation.cpp:149] recall: 0.428571
I1126 16:45:38.810240  5673 tdm_evaluation.cpp:149] recall: 0.6
I1126 16:46:08.935415  5673 tdm_evaluation.cpp:149] recall: 0.75
I1126 16:46:38.868216  5673 tdm_evaluation.cpp:149] recall: 0.6
I1126 16:47:08.953313  5673 tdm_evaluation.cpp:149] recall: 0.666667
I1126 16:47:38.860587  5673 tdm_evaluation.cpp:149] recall: 0.5
I1126 16:48:08.973392  5673 tdm_evaluation.cpp:149] recall: 0.0666667
I1126 16:48:38.796094  5673 tdm_evaluation.cpp:149] recall: 0.428571
I1126 16:49:08.765859  5673 tdm_evaluation.cpp:149] recall: 0.5
I1126 16:49:38.670840  5673 tdm_evaluation.cpp:149] recall: 0.818182
I1126 16:50:08.602705  5673 tdm_evaluation.cpp:149] recall: 0.857143
I1126 16:50:38.489814  5673 tdm_evaluation.cpp:149] recall: 0.285714
cuisonghui commented 4 years ago

Earlier, because CPU resources were insufficient and training was very slow, I changed the tree-sampling config from "tdmop_layer_counts": "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,19,22,25,30,76,200" to "tdmop_layer_counts": "0,1,2,3,4,5,6,7,8,9,1,1,1,1,1,1,1,1,2,2,3,7,2".

I will restore the tdmop_layer_counts config on my side and first try dumping some of the XDL scores to compare the results.

larsoncs commented 4 years ago

Same issue here: offline and online scores are inconsistent. When will this be fixed?

ustcdane commented 4 years ago

@lovickie @zhuhan1236 @songyue1104 @MaButing Please take a look, I have hit the same problem. Thanks!

MaButing commented 4 years ago

@ustcdane @larsoncs @cuisonghui Here is the situation: after the upgrade to XDL 1.2 the model structure changed, making it incompatible with blaze, so the model conversion tool no longer works (#254). The ULF model files in the project are configured for models produced by XDL 1.0 and are likewise incompatible with 1.2; the incorrect scores are most likely due to this (not pinpointed yet).

We have now fixed #254 for the tdm-att model on the branch "blaze-fix-v1.2": it can convert the graph.txt produced by XDL directly into blaze's model format, no longer depending on ULF. Please try this branch and report any problems. If someone could help track down the ULF issue, even better ^v^

cuisonghui commented 4 years ago

OK, thanks. I will give it a try and see whether the scores can be aligned.

cuisonghui commented 4 years ago

[ERROR] [2019-12-03 10:56:10] [3783] [/home/work/data/code/x-deeplearning/blaze/blaze/api/cpp_api/predictor_manager_impl.cc:157] Create Model Predictor failed, eval_data/model/blaze_model/model.dat msg=[failed at workspace.h:90]. [/home/work/data/code/x-deeplearning/blaze/blaze/graph/workspace.h:90] input_name: %s data_type not defined/GetBatch:71

This problem was resolved by adding the following in the TDM att train.py:


        prop = mx.symbol.SoftmaxOutput(data=dout, label=ph_label_click, grad_scale=1.0, use_ignore=True, normalization='valid')
        # model export start
        xdl.graph_tag().set_input(data_io)  # add this line to export the graph input tag
        xdl.graph_tag().set_mx_output(prop)
        args.extend(prop.list_arguments())
        auxs.extend(prop.list_auxiliary_states())
        # model export end

        origin_loss = mx.sym.log(prop) * label
        ph_label_sum = mx.sym.reshape(ph_label_sum, shape=(bs, 1))
        origin_loss = mx.sym.broadcast_mul(origin_loss, ph_label_sum)
        loss = - mx.symbol.sum(origin_loss) / mx.sym.sum(ph_label_sum)
        return prop, loss

    re = dnn_model_define(emb, batch["indicators"][0], unit_id_expand_emb, batch["label"], data_io._batch_size, emb_dim, '20,20,10,10,2,2,2,1,1,1')


cuisonghui commented 4 years ago

tdm_evaluation hit the following problem:

[INFO ] [2019-12-03 16:52:48] [13094] [/home/work/data/code/x-deeplearning/blaze/blaze/api/cpp_api/predictor.cc:133] New PredictorManager
Could not create log file: File exists
COULD NOT CREATE LOGFILE '20191203-165248.13094'!
[INFO ] [2019-12-03 16:52:48] [13094] [/home/work/data/code/x-deeplearning/blaze/blaze/optimizer/optimizer.h:34] optimizer op num: 265 -> 256
terminate called after throwing an instance of 'blaze::Exception'
what(): [failed at reshape_op.h:48]. not equal known_size=33120000 x->size()=847872

MaButing commented 4 years ago

what(): [failed at reshape_op.h:48]. not equal known_size=33120000 x->size()=847872

This error means the tensor sizes before and after the reshape do not match: known_size=33120000 is the size of the result, while x->size()=847872 is the size of the input. You can use the internal_shape interface to check whether each tensor's shape matches expectations.

cuisonghui commented 4 years ago

what(): [failed at reshape_op.h:48]. not equal known_size=33120000 x->size()=847872

This error means the tensor sizes before and after the reshape do not match: known_size=33120000 is the size of the result, while x->size()=847872 is the size of the input. You can use the internal_shape interface to check whether each tensor's shape matches expectations.

I ran it with predictor_example.py; the internal_shape interface gives:

name= reshape0_reshape (3L,)
name= reshape3_reshape (3L,)
name= reshape1_reshape (3L,)
name= reshape2_reshape (3L,)

The internal_asnumpy interface gives:

name= reshape0_reshape [-1 69 24]
name= reshape3_reshape [20000    -1    24]
name= reshape1_reshape [20000    69    24]  # this one is exactly 33120000
name= reshape2_reshape [20000     1    24]

What does the 20000 in these shapes stand for? Is it the number of samples in a batch? My training batch size was indeed set to 20000.

847872 = 512 × 69 × 24, while 33120000 = 20000 × 69 × 24.

And my program dies exactly when predicting the 512 nodes at tree level 9 (I1203 17:51:22.666743 13564 tree_searcher.cpp:115] level 9 need calc, 512 to 400), so that 20000 really is suspicious; there should be no hard-coded value there.

I found that in the graph.txt file, various shapes in the mxnet ops are tied to the training batch size. Isn't that a problem?

Also, in the TDM train.py script, what does the 50000 in

unit_id_expand_emb = xdl.embedding(emb_name, batch["unit_id_expand"], xdl.Normal(stddev=0.001), emb_dim, 50000, emb_combiner, vtype="hash", feature_add_probability=feature_add_probability)

stand for? It seems to have no actual effect. In an earlier training run I changed 50000 to 800000, and I wonder whether that is what caused the problem. The xdl.embedding docs explain the corresponding parameter as: feature_dim: sparse input dimension, for pre-allocate memory. My understanding is that it should be the total number of tree nodes.

And my final tree has about 1.7 million nodes in total.

MaButing commented 4 years ago

At online inference time the batch size is indeed not fixed, so before feeding the input tensors you should use the reshape_input() interface to specify the input tensor shapes, so that blaze knows the batch size and can allocate memory accordingly. Your internal_shape results above look like they were taken without any reshape; I'm not sure whether you called it.

I found that in the graph.txt file, various shapes in the mxnet ops are tied to the training batch size. Isn't that a problem?

That indeed doesn't seem right; please investigate the offline training side (I am only familiar with the blaze part).

cuisonghui commented 4 years ago

At online inference time the batch size is indeed not fixed, so before feeding the input tensors you should use the reshape_input() interface to specify the input tensor shapes, so that blaze knows the batch size and can allocate memory accordingly. Your internal_shape results above look like they were taken without any reshape; I'm not sure whether you called it.

I found that in the graph.txt file, various shapes in the mxnet ops are tied to the training batch size. Isn't that a problem?

That indeed doesn't seem right; please investigate the offline training side (I am only familiar with the blaze part).

Your internal_shape results above look like they were taken without any reshape; I'm not sure whether you called it.

What does "whether you called it" refer to? I did not manually change the graph.txt structure for the results above; they were produced with the following code (predictor_example.py):

    pm = PredictorManager()
    pm.load_sparse_model_weight(qed_file)
    pm.load_model(model_file, optimization_pass)
    predictor = pm.create_predictor(device_type, device_id)
    names = predictor.list_internal_names()
    for name in names:
      #data = predictor.internal_asnumpy(name)
      data = predictor.internal_shape(name)
      print 'name=', name, data
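
For completeness, a hypothetical sketch of what MaButing's suggestion might look like in this script. Only the method name reshape_input comes from the discussion above; the exact signature, tensor name, and shape here are my assumptions:

    # Assumed usage: declare the true runtime batch size for each input
    # tensor before feeding, so blaze re-derives the internal shapes.
    predictor.reshape_input('concat1', [512, 69, 24])  # name/shape illustrative
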
cuisonghui commented 4 years ago

At online inference time the batch size is indeed not fixed, so before feeding the input tensors you should use the reshape_input() interface to specify the input tensor shapes, so that blaze knows the batch size and can allocate memory accordingly. Your internal_shape results above look like they were taken without any reshape; I'm not sure whether you called it.

I found that in the graph.txt file, various shapes in the mxnet ops are tied to the training batch size. Isn't that a problem?

That indeed doesn't seem right; please investigate the offline training side (I am only familiar with the blaze part).

class ReshapeOp : public Operator<Context> {
 public:
  USE_OPERATOR_FUNCTIONS(Context);

  ReshapeOp(const OperatorDef& def, Workspace* workspace) :
      Operator<Context>(def, workspace) { }

  bool RunOnDevice() override {
    Blob* x = this->Input(0);
    Blob* shape_blob = this->Input(1);
    Blob* y = this->Output(0);
.....

Looking at ReshapeOp, this->Input(1) is the tensor read from the blaze model structure, while this->Input(0) is the fed-in tensor (shape (512, 69, 24)). The problem is that this->Input(1) specifies the shape (20000, 69, 24), which causes the error: what(): [failed at reshape_op.h:48]. not equal known_size=33120000 x->size()=847872

cuisonghui commented 4 years ago

I worked around the problem above by hard-coding the network shapes for now, and then hit the following error: what(): [failed at matmul_op.h:135] a_k == b_k. a_k=72 b_k=36

Offline: net_dot = mx.symbol.batch_dot(lhs=bottom_data, rhs=self.weight)

After conversion to the blaze structure:

op {
  type: "MatMul"
  name: "batch_dot0"
  input: "concat1"
  input: "broadcast_to1"
  output: "batch_dot0"
  arg {
    name: "transB"
    i: 1
  }
}

An extra transpose flag appeared, which corrupts a correct matrix by transposing it. Why is there a transpose flag?

The two original matrices have shapes (512, 69, 72) and (512, 72, 36); the (512, 72, 36) one gets transposed into (512, 36, 72).

update:

void MXNetImporter::ProcessBatchdotOp(const JSONNode& node) {
  auto op = AddOperatorDef(node, "MatMul");
  auto transpose_b = GetStrAttr(node, "transpose_b", "True");  // this looks suspicious; changing it to False works around the error
  auto arg = op->add_arg();
  arg->set_name("transB");
  arg->set_i(strcmp(transpose_b, "True") == 0 ? 1 : 0);
}
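
A NumPy analogy (my own illustration, not blaze code) of why the spurious transB flag trips the a_k == b_k check: batched matmul needs the inner dimensions to agree, and transposing the right operand breaks that:

import numpy as np

a = np.zeros((512, 69, 72))
b = np.zeros((512, 72, 36))
print(np.matmul(a, b).shape)             # (512, 69, 36), what batch_dot computes
try:
    np.matmul(a, b.transpose(0, 2, 1))   # right operand becomes (512, 36, 72)
except ValueError as e:
    print(e)                             # inner dims 72 vs 36 no longer match
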
cuisonghui commented 4 years ago

A remaining puzzle: BroadcastToOp cannot adjust its shape dynamically to the batch size of the fed-in data; it reads the shape configuration from the model file, which is a hard-coded value. Having BroadcastTo read the batch size from the next op's input is also unrealistic:

op {
  type: "BroadcastTo"
  name: "broadcast_to1"
  input: "fc_w_1"
  output: "broadcast_to1"
  arg {
    name: "shape"
    ints: 512
    ints: 0
    ints: 0
  }
}

because this op's output is the input of the operator below:

op {
  type: "MatMul"
  name: "batch_dot0"
  input: "concat1"        # shape (batchsize, , )
  input: "broadcast_to1"  # shape (hard-coded, , )
  output: "batch_dot0"
  arg {
    name: "transB"
    i: 0
  }
}

But when I changed the shape entry in the BroadcastTo op config to (0, 0, 0), MatMulOp surprisingly did not error. That amounts to doing matrix multiplication with shape {512, 69, 72} and shape {1, 72, 36}...

update:

Hmm, I later looked at the MatMulOp source; it does support automatic broadcast.
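
A quick NumPy check (again an analogy, not blaze code) of the same broadcasting rule: a leading batch dimension of 1 broadcasts against 512, which is why the (0, 0, 0) workaround can still produce the right shapes:

import numpy as np

a = np.zeros((512, 69, 72))
b = np.zeros((1, 72, 36))     # batch dim 1 broadcasts to 512
print(np.matmul(a, b).shape)  # (512, 69, 36)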

cuisonghui commented 4 years ago

The problems above and my temporary workarounds, for reference. They only get things running for now; the scores still cannot be aligned.

I manually modified the graph.txt exported from the XDL model, mainly the shape attributes of the ReshapeOp/BroadcastOp nodes in the exported mxnet network structure. The modified blaze model can run, but since the broadcast op cannot obtain the batch size of the data fed into the following MatMul op, all the broadcast op shapes had to be changed to (0, 0, 0), as explained above. Although it runs and the recall looks decent, the scores still cannot be aligned with offline.

The diff between the blaze_model exported before and after the graph.txt modification:

  op {
    type: "ConstantFill"
    name: "reshape3_reshape"
    output: "reshape3_reshape"
    arg {
      name: "dtype"
      i: 2
    }
    arg {
      name: "shape"
      ints: 3
    }
    arg {
      name: "value"
-    ints: 20000
-    ints: -1
+   ints: -1
+   ints: 0
      ints: 24
    }
    device_option {
      device_type: 0
    }
  }
op {
    type: "ConstantFill"
    name: "reshape1_reshape"
    output: "reshape1_reshape"
    arg {
      name: "dtype"
      i: 2
    }
    arg {
      name: "shape"
      ints: 3
    }
    arg {
      name: "value"
-     ints: 20000
+    ints: -1
      ints: 69
      ints: 24
    }
    device_option {
      device_type: 0
    }
  }
  op {
    type: "ConstantFill"
    name: "reshape2_reshape"
    output: "reshape2_reshape"
    arg {
      name: "dtype"
      i: 2
    }
    arg {
      name: "shape"
      ints: 3
    }
    arg {
      name: "value"
-     ints: 20000
+    ints: -1
      ints: 1
      ints: 24
    }
    device_option {
      device_type: 0
    }
  }
op {
    type: "BroadcastTo"
    name: "broadcast_to1"
    input: "fc_w_1"
    output: "broadcast_to1"
    arg {
      name: "shape"
-    ints: 20000
+   ints: 0
      ints: 0
      ints: 0
    }
  }
  op {
    type: "BroadcastTo"
    name: "broadcast_to2"
    input: "fc_w_2"
    output: "broadcast_to2"
    arg {
      name: "shape"
-    ints: 20000
+    ints: 0
      ints: 0
      ints: 0
    }
  }

The two operators below were produced via a code change made to fix the transposition problem; the change:

void MXNetImporter::ProcessBatchdotOp(const JSONNode& node) {
  auto op = AddOperatorDef(node, "MatMul");
- auto transpose_b = GetStrAttr(node, "transpose_b", "True");
+  auto transpose_b = GetStrAttr(node, "transpose_b", "False");
  auto arg = op->add_arg();
  arg->set_name("transB");
  arg->set_i(strcmp(transpose_b, "True") == 0 ? 1 : 0);
}
op {
  type: "MatMul"
  name: "batch_dot0"
  input: "concat1"
  input: "broadcast_to1"
  output: "batch_dot0"
  arg {
    name: "transB"
-  i: 1
+  i: 0
  }
}
op {
  type: "MatMul"
  name: "batch_dot1"
  input: "elementwisesum0"
  input: "broadcast_to2"
  output: "batch_dot1"
  arg {
    name: "transB"
-  i: 1
+   i: 0
  }
}
cuisonghui commented 4 years ago

Currently, with the master version of blaze, scores align with the default XDL of the official image (#313); the XDL version in that image has no corresponding version in the code. XDL 1.2 and XDL 1.0 remain to be verified.

cuisonghui commented 4 years ago

Verified: the XDL 1.0 code can be aligned (score differences around 0.001, item ordering identical).

The latest code, i.e. XDL 1.2, cannot be aligned in scoring (this needs re-verification; the check was rather rushed). However, converting to blaze from graph.txt and from graph_ulf.txt yields identical scores for the two, even though neither matches the XDL model's scores.

Two problems still need to be solved:
1. Conversion from graph.txt is broken and requires manual edits to the graph.txt file.
2. With XDL 1.2, XDL and blaze scores cannot be aligned.