IvanaXu / iDeepRec

DeepRec For Me https://github.com/alibaba/DeepRec
https://deeprec.readthedocs.io/zh/latest/index.html
Apache License 2.0

🔗 https://github.com/IvanaXu/iDeepRec/tree/main/pro/DeepRec/tianchi/DLRM#stand-alone-training #62

Open IvanaXu opened 1 year ago

IvanaXu commented 1 year ago
docker pull alideeprec/deeprec-release-modelzoo:latest
docker run -it alideeprec/deeprec-release-modelzoo:latest /bin/bash
cd /root/modelzoo/dlrm

python train.py

# Optional: preload jemalloc to speed up memory allocation.
# The required environment variable `MALLOC_CONF` is already set in the code.
LD_PRELOAD=../libjemalloc.so.2.5.1 python train.py
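The jemalloc variant above differs from the plain run only in the `LD_PRELOAD` environment variable passed to the Python process. A minimal sketch of the same launch from Python, assuming the working directory is `/root/modelzoo/dlrm` as in the commands above. The helper names `jemalloc_env` and `launch` are hypothetical and are not part of DeepRec.

```python
import os
import subprocess
import sys

# Library path copied from the command above; it is relative to the model directory.
JEMALLOC_LIB = "../libjemalloc.so.2.5.1"


def jemalloc_env(lib_path=JEMALLOC_LIB, base_env=None):
    """Return a copy of the environment with LD_PRELOAD set to the jemalloc library.

    Note: any LD_PRELOAD value already present in the base environment is overwritten.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = lib_path
    return env


def launch(script="train.py", use_jemalloc=True):
    """Run the training script in a child process (hypothetical helper)."""
    env = jemalloc_env() if use_jemalloc else dict(os.environ)
    # check=True raises CalledProcessError if training exits with a non-zero status.
    return subprocess.run([sys.executable, script], env=env, check=True)
```

Calling `launch(use_jemalloc=False)` would correspond to the plain `python train.py` run, and `launch()` to the `LD_PRELOAD` run.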
IvanaXu commented 1 year ago
INFO:tensorflow:global_step/sec: 142.617
INFO:tensorflow:loss = 0.5084822, steps = 15501 (0.701 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:global_step/sec: 137.715
INFO:tensorflow:loss = 0.42516267, steps = 15601 (0.726 sec)
INFO:tensorflow:Saving checkpoints for 15625 into ./result/model_DLRM_1667521630/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_DLRM_1667521630/model.ckpt-15625
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Evaluation complate:[1000/3907]
Evaluation complate:[2000/3907]
Evaluation complate:[3000/3907]
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complate:[3907/3907]
ACC = 0.7830795049667358
AUC = 0.7807814478874207
IvanaXu commented 1 year ago
INFO:tensorflow:global_step/sec: 148.565
INFO:tensorflow:loss = 0.4433312, steps = 15501 (0.673 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:global_step/sec: 144.373
INFO:tensorflow:loss = 0.41600972, steps = 15601 (0.693 sec)
INFO:tensorflow:Saving checkpoints for 15625 into ./result/model_DLRM_1667521882/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_DLRM_1667521882/model.ckpt-15625
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Evaluation complate:[1000/3907]
Evaluation complate:[2000/3907]
Evaluation complate:[3000/3907]
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complate:[3907/3907]
ACC = 0.7839229702949524
AUC = 0.7821462154388428
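To compare the two DLRM runs above without reading the logs by eye, the throughput and metric lines can be pulled out with a couple of regular expressions. This is a rough sketch: the function name `summarize` is hypothetical, and the excerpts contain only the lines quoted in this thread, so the averages cover just two throughput samples per run.

```python
import re

# Patterns for the log lines shown above.
RATE_RE = re.compile(r"global_step/sec: ([0-9.]+)")
METRIC_RE = re.compile(r"^(ACC|AUC) = ([0-9.]+)$")


def summarize(log_text):
    """Return the mean global_step/sec and the last ACC/AUC found in a log excerpt."""
    rates = []
    metrics = {}
    for line in log_text.splitlines():
        line = line.strip()
        rate = RATE_RE.search(line)
        if rate:
            rates.append(float(rate.group(1)))
        metric = METRIC_RE.match(line)
        if metric:
            metrics[metric.group(1)] = float(metric.group(2))
    mean_rate = sum(rates) / len(rates) if rates else None
    return {"mean_steps_per_sec": mean_rate, **metrics}


# Excerpts copied from the two DLRM runs above.
RUN_1 = """\
INFO:tensorflow:global_step/sec: 142.617
INFO:tensorflow:global_step/sec: 137.715
ACC = 0.7830795049667358
AUC = 0.7807814478874207
"""
RUN_2 = """\
INFO:tensorflow:global_step/sec: 148.565
INFO:tensorflow:global_step/sec: 144.373
ACC = 0.7839229702949524
AUC = 0.7821462154388428
"""

first, second = summarize(RUN_1), summarize(RUN_2)
print(first)
print(second)
```

With so few samples the difference in mean throughput is only indicative; a full log would give a more reliable average.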
IvanaXu commented 1 year ago
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_DIEN_1667522105/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
INFO:tensorflow:loss = 1.1077905, steps = 1
INFO:tensorflow:global_step/sec: 1.02738
INFO:tensorflow:loss = 0.95943296, steps = 101 (97.337 sec)
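The `global_step/sec` value and the elapsed time in the `steps = 101 (97.337 sec)` line appear to describe the same logging window, so one can be cross-checked against the other. A small sketch, assuming the logging interval is 100 steps (inferred from the step counter moving from 1 to 101); `steps_per_sec` is a hypothetical helper name.

```python
def steps_per_sec(elapsed_sec, interval_steps=100):
    """Convert the elapsed time of one logging window into steps per second."""
    return interval_steps / elapsed_sec


# Elapsed time copied from the DIEN log line above.
dien_rate = steps_per_sec(97.337)

# The logged value is 1.02738; the small gap is expected because the two
# numbers are rounded separately in the log.
print(round(dien_rate, 4), abs(dien_rate - 1.02738) < 1e-3)
```

The same check can be applied to any other `global_step/sec` line in this thread by substituting the elapsed time from the `steps = ...` line that follows it.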
IvanaXu commented 1 year ago
INFO:tensorflow:loss = 0.5911528, steps = 1801 (10.451 sec)
INFO:tensorflow:global_step/sec: 9.69216
INFO:tensorflow:loss = 0.5946928, steps = 1901 (10.318 sec)
INFO:tensorflow:global_step/sec: 9.74381
INFO:tensorflow:loss = 0.55903524, steps = 2001 (10.263 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:global_step/sec: 9.72112
INFO:tensorflow:loss = 0.58782303, steps = 2101 (10.287 sec)
INFO:tensorflow:Saving checkpoints for 2122 into ./result/model_DIN_1667522357/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_DIN_1667522357/model.ckpt-2122
2022-11-04 00:43:12.791780: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/head/loss/xentropy/Select_grad/zeros_like
2022-11-04 00:43:12.791838: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1
2022-11-04 00:43:12.791854: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1
2022-11-04 00:43:12.791910: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/attention_layer/Select_grad/zeros_like
2022-11-04 00:43:12.791930: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/attention_layer/Select_grad/Select_1
2022-11-04 00:43:12.791956: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/attention_layer/Select_grad/tuple/control_dependency_1
2022-11-04 00:43:12.792539: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/zeros_like
2022-11-04 00:43:12.792583: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/Select
2022-11-04 00:43:12.792602: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/tuple/control_dependency
2022-11-04 00:43:12.793145: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/input_layer/UID_embedding/UID_embedding_weights][new_name:fused_op_1_select_then_scalar]
2022-11-04 00:43:12.793766: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar] match op[head/loss/xentropy/Select][new_name:fused_op_1_select_else_scalar]
2022-11-04 00:43:12.794364: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_grad/Select][new_name:fused_op_1_select_else_scalar_in_grad]
2022-11-04 00:43:12.794402: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select][new_name:fused_op_2_select_else_scalar_in_grad]
2022-11-04 00:43:12.794434: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/attention_layer/Select_grad/Select][new_name:fused_op_3_select_else_scalar_in_grad]
2022-11-04 00:43:12.795014: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select_1][new_name:fused_op_1_select_then_scalar_in_grad]
2022-11-04 00:43:12.795058: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/Select_1][new_name:fused_op_2_select_then_scalar_in_grad]
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Evaluation complate:[100/237]
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complate:[200/237]
Evaluation complate:[237/237]
ACC = 0.6883002519607544
AUC = 0.7631368041038513
IvanaXu commented 1 year ago
INFO:tensorflow:global_step/sec: 74.8738
INFO:tensorflow:loss = 0.16757868, steps = 15501 (1.336 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:global_step/sec: 71.7669
INFO:tensorflow:loss = 0.1547018, steps = 15601 (1.393 sec)
INFO:tensorflow:Saving checkpoints for 15625 into ./result/model_DeepFM_1667522871/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_DeepFM_1667522871/model.ckpt-15625
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Evaluation complate:[1000/3907]
Evaluation complate:[2000/3907]
Evaluation complate:[3000/3907]
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complate:[3907/3907]
ACC = 0.7833489775657654
AUC = 0.777617335319519
IvanaXu commented 1 year ago
INFO:tensorflow:loss = 0.08075554, steps = 97501 (1.134 sec)
INFO:tensorflow:global_step/sec: 89.501
INFO:tensorflow:loss = 0.11768236, steps = 97601 (1.117 sec)
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Saving checkpoints for 97657 into ./result/model_MMOE_1667523186/model.ckpt.
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_MMOE_1667523186/model.ckpt-97657
2022-11-04 01:11:50.191893: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/head/loss/xentropy/Select_grad/zeros_like
2022-11-04 01:11:50.192006: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1
2022-11-04 01:11:50.192036: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1
2022-11-04 01:11:50.195364: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar] match op[head/loss/xentropy/Select][new_name:fused_op_1_select_else_scalar]
2022-11-04 01:11:50.196469: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_grad/Select][new_name:fused_op_1_select_else_scalar_in_grad]
2022-11-04 01:11:50.196501: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select][new_name:fused_op_2_select_else_scalar_in_grad]
2022-11-04 01:11:50.197561: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select_1][new_name:fused_op_1_select_then_scalar_in_grad]
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
Evaluation complete:[20/20]
ACC = 0.9731500148773193
AUC = 0.7530704736709595
IvanaXu commented 1 year ago
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_WIDE_AND_DEEP_1667524447/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
INFO:tensorflow:loss = 0.71554154, steps = 1
2022-11-04 01:14:27.671448: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled
INFO:tensorflow:global_step/sec: 5.86205
INFO:tensorflow:loss = 0.52297497, steps = 101 (17.060 sec)
INFO:tensorflow:global_step/sec: 5.81553
INFO:tensorflow:loss = 0.47905684, steps = 201 (17.195 sec)
INFO:tensorflow:global_step/sec: 5.83783
INFO:tensorflow:loss = 0.5278871, steps = 301 (17.130 sec)
INFO:tensorflow:global_step/sec: 5.99602