dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Update for Block API #1261

Closed · leezu closed this 4 years ago

leezu commented 4 years ago

CI will pass once https://github.com/apache/incubator-mxnet/pull/18619 is merged

Please review the API changes. This PR does not update the scripts, but at least some of them will be updated and included here once fine-tuning performance and NMT training have been verified. Note that re-generating the parameter files is not required.

Thanks to @acphile for his hard work on the Gluon API refactor on the MXNet side (https://github.com/apache/incubator-mxnet/commit/cb54a4a99463b23b8abaa2629661954c4ba3c60b)
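
For reviewers who haven't followed the MXNet-side refactor, here is a minimal before/after sketch of the Block API change (the class is illustrative, not code from this PR, and assumes the MXNet 2.0 behavior from the commit above): Block.__init__ no longer takes prefix/params, name_scope() is gone, hybrid_forward(self, F, ...) becomes forward(self, ...), and parameters are named by their attribute path.

from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()

# Old (MXNet 1.x) style:
#
# class Net(nn.HybridBlock):
#     def __init__(self, prefix=None, params=None):
#         super().__init__(prefix=prefix, params=params)
#         with self.name_scope():
#             self.dense = nn.Dense(16)  # param name: '<prefix>dense0_weight'

# New (MXNet 2.0) style:
class Net(nn.HybridBlock):
    def __init__(self):
        super().__init__()        # no prefix/params arguments
        self.dense = nn.Dense(16)

    def forward(self, x):         # replaces hybrid_forward(self, F, x)
        return self.dense(x)

net = Net()
net.initialize()
net(np.ones((2, 8)))
print(list(net.collect_params()))  # ['dense.weight', 'dense.bias']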

sxjscience commented 4 years ago

@ZheyuYe @hymzoque Would you also help review?

leezu commented 4 years ago
export VERSION=2.0
export MODEL_NAME=google_albert_base_v2
python -m pdb -c continue run_squad.py \
    --model_name ${MODEL_NAME} \
    --data_dir squad \
    --output_dir fintune_${MODEL_NAME}_squad_${VERSION} \
    --version ${VERSION} \
    --do_eval \
    --do_train \
    --batch_size 4 \
    --num_accumulated 3 \
    --gpus 0,1,2,3 \
    --epochs 3 \
    --lr 2e-5 \
    --warmup_ratio 0.1 \
    --wd=0.01 \
    --max_seq_length 512 \
    --max_grad_norm 0.1 \
    --overwrite_cache
All Logs will be saved to fintune_google_albert_base_v2_squad_2.0/finetune_squad2.0.log
[00:17:07] ../src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[00:17:09] ../src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[00:17:12] ../src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[00:17:14] ../src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
2020-07-11 00:17:19,894 - root - INFO - Loading Backbone Model from /home/ubuntu/.mxnet/models/nlp/google_albert_base_v2/model-125be477.params, with total/fixd parameters=11092992/0
/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py:568: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py:568: UserWarning: Parameter 'bias' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py:568: UserWarning: Parameter 'gamma' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py:568: UserWarning: Parameter 'beta' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 442/442 [00:00<00:00, 1235.96it/s]
2020-07-11 00:17:20,959 - root - INFO - Load data from squad, Version=2.0
2020-07-11 00:17:20,959 - root - INFO - Tokenize Training Data:
2020-07-11 00:17:43,094 - root - INFO - Done! Time spent:22.14 seconds
2020-07-11 00:17:55,256 - root - INFO - Processing the Training data:
2020-07-11 00:18:02,455 - root - INFO - Done! #Unreliable Span=18 / #Mismatched Answer=30 / #Total=130319
2020-07-11 00:18:02,518 - root - INFO - Before Chunking, #Train/Is Impossible = 130319/43498
2020-07-11 00:18:02,519 - root - INFO - After Chunking, #Train Sample/Is Impossible = 130614/43737
2020-07-11 00:18:02,519 - root - INFO - Using gradient accumulation. Effective global batch size = 48
2020-07-11 00:18:02,570 - root - INFO - #Total Training Steps=8164, Warmup=816, Save Interval=2721
[00:18:04] ../src/kvstore/././comm.h:757: only 0 out of 12 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[00:18:04] ../src/kvstore/././comm.h:766: ....
[00:18:04] ../src/kvstore/././comm.h:766: ....
[00:18:04] ../src/kvstore/././comm.h:766: ....
[00:18:04] ../src/kvstore/././comm.h:766: ....
2020-07-11 00:19:27,266 - root - INFO - Epoch: 1, Batch: 300/8164, Loss span/answer/total=3.5320/0.3241/3.8561, LR=0.00000245, grad_norm=45.7169. Time cost=84.68, Throughput=56.68 samples/s ETA=1.90h
2020-07-11 00:20:55,263 - root - INFO - Epoch: 1, Batch: 600/8164, Loss span/answer/total=1.5043/0.3069/1.8112, LR=0.00000490, grad_norm=28.4342. Time cost=88.00, Throughput=54.55 samples/s ETA=1.91h
2020-07-11 00:22:27,184 - root - INFO - Epoch: 1, Batch: 900/8164, Loss span/answer/total=1.1916/0.2865/1.4781, LR=0.00000735, grad_norm=25.8383. Time cost=91.92, Throughput=52.22 samples/s ETA=1.93h
2020-07-11 00:23:56,512 - root - INFO - Epoch: 1, Batch: 1200/8164, Loss span/answer/total=1.0523/0.2417/1.2940, LR=0.00000980, grad_norm=33.3647. Time cost=89.33, Throughput=53.73 samples/s ETA=1.91h
2020-07-11 00:25:25,283 - root - INFO - Epoch: 1, Batch: 1500/8164, Loss span/answer/total=0.9825/0.2295/1.2121, LR=0.00001225, grad_norm=34.0635. Time cost=88.77, Throughput=54.07 samples/s ETA=1.88h
2020-07-11 00:26:55,038 - root - INFO - Epoch: 1, Batch: 1800/8164, Loss span/answer/total=0.9403/0.2245/1.1648, LR=0.00001471, grad_norm=31.0572. Time cost=89.75, Throughput=53.48 samples/s ETA=1.86h
2020-07-11 00:28:24,433 - root - INFO - Epoch: 1, Batch: 2100/8164, Loss span/answer/total=0.8861/0.2158/1.1019, LR=0.00001716, grad_norm=22.4757. Time cost=89.39, Throughput=53.69 samples/s ETA=1.84h
2020-07-11 00:29:53,440 - root - INFO - Epoch: 1, Batch: 2400/8164, Loss span/answer/total=0.8841/0.2107/1.0948, LR=0.00001961, grad_norm=24.0204. Time cost=89.01, Throughput=53.93 samples/s ETA=1.82h
2020-07-11 00:31:22,515 - root - INFO - Epoch: 1, Batch: 2700/8164, Loss span/answer/total=0.8298/0.2144/1.0442, LR=0.00001977, grad_norm=30.5798. Time cost=89.07, Throughput=53.89 samples/s ETA=1.79h
2020-07-11 00:32:50,504 - root - INFO - Epoch: 1, Batch: 3000/8164, Loss span/answer/total=0.8359/0.1936/1.0295, LR=0.00001950, grad_norm=22.0126. Time cost=87.99, Throughput=54.55 samples/s ETA=1.77h
2020-07-11 00:34:18,924 - root - INFO - Epoch: 1, Batch: 3300/8164, Loss span/answer/total=0.7798/0.1823/0.9621, LR=0.00001923, grad_norm=23.6958. Time cost=88.42, Throughput=54.29 samples/s ETA=1.74h
2020-07-11 00:35:47,793 - root - INFO - Epoch: 1, Batch: 3600/8164, Loss span/answer/total=0.7839/0.1830/0.9670, LR=0.00001895, grad_norm=16.4897. Time cost=88.87, Throughput=54.01 samples/s ETA=1.72h
2020-07-11 00:37:18,464 - root - INFO - Epoch: 1, Batch: 3900/8164, Loss span/answer/total=0.7729/0.1773/0.9502, LR=0.00001868, grad_norm=21.5925. Time cost=90.67, Throughput=52.94 samples/s ETA=1.70h
2020-07-11 00:38:47,375 - root - INFO - Epoch: 1, Batch: 4200/8164, Loss span/answer/total=0.7711/0.1762/0.9473, LR=0.00001841, grad_norm=20.0461. Time cost=88.91, Throughput=53.99 samples/s ETA=1.67h
2020-07-11 00:40:14,275 - root - INFO - Epoch: 1, Batch: 4500/8164, Loss span/answer/total=0.7453/0.1691/0.9144, LR=0.00001814, grad_norm=30.3925. Time cost=86.90, Throughput=55.24 samples/s ETA=1.64h
2020-07-11 00:41:35,982 - root - INFO - Epoch: 1, Batch: 4800/8164, Loss span/answer/total=0.7216/0.1741/0.8957, LR=0.00001787, grad_norm=18.9427. Time cost=81.71, Throughput=58.75 samples/s ETA=1.61h
2020-07-11 00:42:55,136 - root - INFO - Epoch: 1, Batch: 5100/8164, Loss span/answer/total=0.7020/0.1604/0.8624, LR=0.00001759, grad_norm=23.6912. Time cost=79.15, Throughput=60.64 samples/s ETA=1.58h
2020-07-11 00:44:15,843 - root - INFO - Epoch: 1, Batch: 5400/8164, Loss span/answer/total=0.7096/0.1699/0.8795, LR=0.00001732, grad_norm=21.2201. Time cost=80.71, Throughput=59.47 samples/s ETA=1.55h
2020-07-11 00:45:40,243 - root - INFO - Epoch: 1, Batch: 5700/8164, Loss span/answer/total=0.6996/0.1570/0.8566, LR=0.00001705, grad_norm=26.3110. Time cost=84.40, Throughput=56.87 samples/s ETA=1.52h
2020-07-11 00:47:10,151 - root - INFO - Epoch: 1, Batch: 6000/8164, Loss span/answer/total=0.7188/0.1689/0.8877, LR=0.00001678, grad_norm=22.5355. Time cost=89.91, Throughput=53.39 samples/s ETA=1.50h
2020-07-11 00:48:41,080 - root - INFO - Epoch: 1, Batch: 6300/8164, Loss span/answer/total=0.6882/0.1557/0.8439, LR=0.00001651, grad_norm=29.5829. Time cost=90.93, Throughput=52.79 samples/s ETA=1.47h
2020-07-11 00:50:09,563 - root - INFO - Epoch: 1, Batch: 6600/8164, Loss span/answer/total=0.6520/0.1574/0.8094, LR=0.00001623, grad_norm=23.4113. Time cost=88.48, Throughput=54.25 samples/s ETA=1.45h
2020-07-11 00:51:37,894 - root - INFO - Epoch: 1, Batch: 6900/8164, Loss span/answer/total=0.6585/0.1519/0.8104, LR=0.00001596, grad_norm=26.2093. Time cost=88.33, Throughput=54.34 samples/s ETA=1.43h
2020-07-11 00:53:08,867 - root - INFO - Epoch: 1, Batch: 7200/8164, Loss span/answer/total=0.6442/0.1497/0.7940, LR=0.00001569, grad_norm=20.8646. Time cost=90.97, Throughput=52.76 samples/s ETA=1.41h
2020-07-11 00:54:36,239 - root - INFO - Epoch: 1, Batch: 7500/8164, Loss span/answer/total=0.6603/0.1520/0.8124, LR=0.00001542, grad_norm=19.0104. Time cost=87.37, Throughput=54.94 samples/s ETA=1.38h
2020-07-11 00:56:03,071 - root - INFO - Epoch: 1, Batch: 7800/8164, Loss span/answer/total=0.6424/0.1477/0.7901, LR=0.00001514, grad_norm=17.0360. Time cost=86.83, Throughput=55.28 samples/s ETA=1.36h
2020-07-11 00:57:33,741 - root - INFO - Epoch: 1, Batch: 8100/8164, Loss span/answer/total=0.6654/0.1567/0.8222, LR=0.00001487, grad_norm=23.4842. Time cost=90.67, Throughput=52.94 samples/s ETA=1.33h
2020-07-11 00:57:52,195 - root - INFO - Params saved in: fintune_google_albert_base_v2_squad_2.0/google_albert_base_v2_squad2.0_2721.params
2020-07-11 00:57:52,471 - root - INFO - Epoch: 1, #Samples: 130614, Throughput=54.65 samples/s
2020-07-11 00:59:02,331 - root - INFO - Epoch: 2, Batch: 234/8164, Loss span/answer/total=0.5789/0.1344/0.7133, LR=0.00001460, grad_norm=16.2924. Time cost=69.86, Throughput=68.11 samples/s ETA=1.31h
2020-07-11 01:00:32,047 - root - INFO - Epoch: 2, Batch: 534/8164, Loss span/answer/total=0.5617/0.1154/0.6771, LR=0.00001433, grad_norm=20.1867. Time cost=89.72, Throughput=53.50 samples/s ETA=1.29h
2020-07-11 01:02:02,831 - root - INFO - Epoch: 2, Batch: 834/8164, Loss span/answer/total=0.5653/0.1218/0.6871, LR=0.00001406, grad_norm=22.1036. Time cost=90.78, Throughput=52.87 samples/s ETA=1.26h
2020-07-11 01:03:30,693 - root - INFO - Epoch: 2, Batch: 1134/8164, Loss span/answer/total=0.5546/0.1236/0.6782, LR=0.00001378, grad_norm=30.1931. Time cost=87.86, Throughput=54.63 samples/s ETA=1.24h
2020-07-11 01:05:00,728 - root - INFO - Epoch: 2, Batch: 1434/8164, Loss span/answer/total=0.5737/0.1272/0.7009, LR=0.00001351, grad_norm=22.7749. Time cost=90.03, Throughput=53.31 samples/s ETA=1.21h
2020-07-11 01:06:30,931 - root - INFO - Epoch: 2, Batch: 1734/8164, Loss span/answer/total=0.5298/0.1088/0.6386, LR=0.00001324, grad_norm=24.9865. Time cost=90.20, Throughput=53.21 samples/s ETA=1.19h
2020-07-11 01:07:59,987 - root - INFO - Epoch: 2, Batch: 2034/8164, Loss span/answer/total=0.5410/0.1196/0.6605, LR=0.00001297, grad_norm=15.9100. Time cost=89.06, Throughput=53.90 samples/s ETA=1.17h
2020-07-11 01:09:27,384 - root - INFO - Epoch: 2, Batch: 2334/8164, Loss span/answer/total=0.5435/0.1265/0.6700, LR=0.00001269, grad_norm=27.5773. Time cost=87.40, Throughput=54.92 samples/s ETA=1.14h
2020-07-11 01:10:56,891 - root - INFO - Epoch: 2, Batch: 2634/8164, Loss span/answer/total=0.5214/0.1204/0.6418, LR=0.00001242, grad_norm=20.9572. Time cost=89.51, Throughput=53.63 samples/s ETA=1.12h
2020-07-11 01:12:26,424 - root - INFO - Epoch: 2, Batch: 2934/8164, Loss span/answer/total=0.5504/0.1133/0.6637, LR=0.00001215, grad_norm=20.2055. Time cost=89.53, Throughput=53.61 samples/s ETA=1.09h
2020-07-11 01:13:55,887 - root - INFO - Epoch: 2, Batch: 3234/8164, Loss span/answer/total=0.5361/0.1150/0.6511, LR=0.00001188, grad_norm=17.7540. Time cost=89.46, Throughput=53.65 samples/s ETA=1.07h
2020-07-11 01:15:25,570 - root - INFO - Epoch: 2, Batch: 3534/8164, Loss span/answer/total=0.5298/0.1160/0.6458, LR=0.00001161, grad_norm=25.7555. Time cost=89.68, Throughput=53.52 samples/s ETA=1.05h
2020-07-11 01:16:55,303 - root - INFO - Epoch: 2, Batch: 3834/8164, Loss span/answer/total=0.5634/0.1215/0.6848, LR=0.00001133, grad_norm=16.5189. Time cost=89.73, Throughput=53.49 samples/s ETA=1.02h
2020-07-11 01:18:24,537 - root - INFO - Epoch: 2, Batch: 4134/8164, Loss span/answer/total=0.5251/0.1104/0.6355, LR=0.00001106, grad_norm=33.4706. Time cost=89.23, Throughput=53.79 samples/s ETA=1.00h
2020-07-11 01:19:57,105 - root - INFO - Epoch: 2, Batch: 4434/8164, Loss span/answer/total=0.5365/0.1209/0.6574, LR=0.00001079, grad_norm=19.0388. Time cost=92.57, Throughput=51.85 samples/s ETA=0.97h
2020-07-11 01:21:26,840 - root - INFO - Epoch: 2, Batch: 4734/8164, Loss span/answer/total=0.5208/0.1207/0.6414, LR=0.00001052, grad_norm=19.5382. Time cost=89.73, Throughput=53.49 samples/s ETA=0.95h
2020-07-11 01:22:57,169 - root - INFO - Epoch: 2, Batch: 5034/8164, Loss span/answer/total=0.5307/0.1079/0.6386, LR=0.00001024, grad_norm=27.5455. Time cost=90.33, Throughput=53.14 samples/s ETA=0.93h
2020-07-11 01:24:23,953 - root - INFO - Epoch: 2, Batch: 5334/8164, Loss span/answer/total=0.5372/0.1120/0.6493, LR=0.00000997, grad_norm=26.2957. Time cost=86.78, Throughput=55.31 samples/s ETA=0.90h
2020-07-11 01:25:52,240 - root - INFO - Epoch: 2, Batch: 5634/8164, Loss span/answer/total=0.5274/0.1071/0.6344, LR=0.00000970, grad_norm=24.8889. Time cost=88.29, Throughput=54.37 samples/s ETA=0.88h
2020-07-11 01:27:21,081 - root - INFO - Epoch: 2, Batch: 5934/8164, Loss span/answer/total=0.5284/0.1129/0.6413, LR=0.00000943, grad_norm=18.0035. Time cost=88.84, Throughput=54.03 samples/s ETA=0.85h
2020-07-11 01:28:49,901 - root - INFO - Epoch: 2, Batch: 6234/8164, Loss span/answer/total=0.5144/0.1081/0.6225, LR=0.00000916, grad_norm=21.3835. Time cost=88.82, Throughput=54.04 samples/s ETA=0.83h
2020-07-11 01:30:20,586 - root - INFO - Epoch: 2, Batch: 6534/8164, Loss span/answer/total=0.5259/0.1100/0.6360, LR=0.00000888, grad_norm=18.5121. Time cost=90.68, Throughput=52.93 samples/s ETA=0.80h
2020-07-11 01:31:49,923 - root - INFO - Epoch: 2, Batch: 6834/8164, Loss span/answer/total=0.5204/0.1085/0.6289, LR=0.00000861, grad_norm=21.6928. Time cost=89.34, Throughput=53.73 samples/s ETA=0.78h
2020-07-11 01:33:19,798 - root - INFO - Epoch: 2, Batch: 7134/8164, Loss span/answer/total=0.4833/0.1047/0.5880, LR=0.00000834, grad_norm=24.0331. Time cost=89.87, Throughput=53.41 samples/s ETA=0.75h
2020-07-11 01:34:49,695 - root - INFO - Epoch: 2, Batch: 7434/8164, Loss span/answer/total=0.5246/0.1093/0.6339, LR=0.00000807, grad_norm=24.4552. Time cost=89.90, Throughput=53.39 samples/s ETA=0.73h
2020-07-11 01:36:22,013 - root - INFO - Epoch: 2, Batch: 7734/8164, Loss span/answer/total=0.5304/0.1147/0.6451, LR=0.00000780, grad_norm=19.8521. Time cost=92.32, Throughput=51.99 samples/s ETA=0.71h
2020-07-11 01:37:51,216 - root - INFO - Epoch: 2, Batch: 8034/8164, Loss span/answer/total=0.5066/0.1048/0.6114, LR=0.00000752, grad_norm=15.9122. Time cost=89.20, Throughput=53.81 samples/s ETA=0.68h
2020-07-11 01:38:29,329 - root - INFO - Params saved in: fintune_google_albert_base_v2_squad_2.0/google_albert_base_v2_squad2.0_5442.params
2020-07-11 01:38:30,315 - root - INFO - Epoch: 2, #Samples: 130614, Throughput=53.58 samples/s
2020-07-11 01:39:20,575 - root - INFO - Epoch: 3, Batch: 168/8164, Loss span/answer/total=0.4661/0.0828/0.5489, LR=0.00000725, grad_norm=24.4564. Time cost=50.26, Throughput=94.67 samples/s ETA=0.66h
2020-07-11 01:40:49,763 - root - INFO - Epoch: 3, Batch: 468/8164, Loss span/answer/total=0.4246/0.0746/0.4992, LR=0.00000698, grad_norm=22.9640. Time cost=89.19, Throughput=53.82 samples/s ETA=0.63h
2020-07-11 01:42:18,911 - root - INFO - Epoch: 3, Batch: 768/8164, Loss span/answer/total=0.4387/0.0672/0.5058, LR=0.00000671, grad_norm=23.1538. Time cost=89.15, Throughput=53.84 samples/s ETA=0.61h
2020-07-11 01:43:48,681 - root - INFO - Epoch: 3, Batch: 1068/8164, Loss span/answer/total=0.4437/0.0735/0.5172, LR=0.00000643, grad_norm=28.1258. Time cost=89.77, Throughput=53.47 samples/s ETA=0.58h
2020-07-11 01:45:18,576 - root - INFO - Epoch: 3, Batch: 1368/8164, Loss span/answer/total=0.4201/0.0618/0.4819, LR=0.00000616, grad_norm=16.5424. Time cost=89.89, Throughput=53.40 samples/s ETA=0.56h
2020-07-11 01:46:48,423 - root - INFO - Epoch: 3, Batch: 1668/8164, Loss span/answer/total=0.4129/0.0766/0.4895, LR=0.00000589, grad_norm=32.6500. Time cost=89.85, Throughput=53.42 samples/s ETA=0.53h
2020-07-11 01:48:16,620 - root - INFO - Epoch: 3, Batch: 1968/8164, Loss span/answer/total=0.4147/0.0735/0.4882, LR=0.00000562, grad_norm=15.0941. Time cost=88.20, Throughput=54.42 samples/s ETA=0.51h
2020-07-11 01:49:45,692 - root - INFO - Epoch: 3, Batch: 2268/8164, Loss span/answer/total=0.4097/0.0627/0.4724, LR=0.00000535, grad_norm=23.8714. Time cost=89.07, Throughput=53.89 samples/s ETA=0.48h
2020-07-11 01:51:12,197 - root - INFO - Epoch: 3, Batch: 2568/8164, Loss span/answer/total=0.4157/0.0662/0.4819, LR=0.00000507, grad_norm=39.7953. Time cost=86.50, Throughput=55.49 samples/s ETA=0.46h
2020-07-11 01:52:42,568 - root - INFO - Epoch: 3, Batch: 2868/8164, Loss span/answer/total=0.4254/0.0686/0.4940, LR=0.00000480, grad_norm=26.7246. Time cost=90.37, Throughput=53.11 samples/s ETA=0.43h
2020-07-11 01:54:10,287 - root - INFO - Epoch: 3, Batch: 3168/8164, Loss span/answer/total=0.4014/0.0623/0.4637, LR=0.00000453, grad_norm=27.3254. Time cost=87.72, Throughput=54.72 samples/s ETA=0.41h
2020-07-11 01:55:41,791 - root - INFO - Epoch: 3, Batch: 3468/8164, Loss span/answer/total=0.4173/0.0689/0.4862, LR=0.00000426, grad_norm=22.7920. Time cost=91.50, Throughput=52.46 samples/s ETA=0.39h
2020-07-11 01:57:12,912 - root - INFO - Epoch: 3, Batch: 3768/8164, Loss span/answer/total=0.4064/0.0651/0.4715, LR=0.00000398, grad_norm=15.8398. Time cost=91.12, Throughput=52.68 samples/s ETA=0.36h
2020-07-11 01:58:41,666 - root - INFO - Epoch: 3, Batch: 4068/8164, Loss span/answer/total=0.4026/0.0652/0.4678, LR=0.00000371, grad_norm=26.2612. Time cost=88.75, Throughput=54.08 samples/s ETA=0.34h
2020-07-11 02:00:10,853 - root - INFO - Epoch: 3, Batch: 4368/8164, Loss span/answer/total=0.4241/0.0716/0.4958, LR=0.00000344, grad_norm=21.3658. Time cost=89.19, Throughput=53.82 samples/s ETA=0.31h
2020-07-11 02:01:41,080 - root - INFO - Epoch: 3, Batch: 4668/8164, Loss span/answer/total=0.4013/0.0650/0.4663, LR=0.00000317, grad_norm=28.3836. Time cost=90.23, Throughput=53.20 samples/s ETA=0.29h
2020-07-11 02:03:10,414 - root - INFO - Epoch: 3, Batch: 4968/8164, Loss span/answer/total=0.4187/0.0633/0.4820, LR=0.00000290, grad_norm=18.7006. Time cost=89.33, Throughput=53.73 samples/s ETA=0.26h
2020-07-11 02:04:39,266 - root - INFO - Epoch: 3, Batch: 5268/8164, Loss span/answer/total=0.4143/0.0760/0.4903, LR=0.00000262, grad_norm=26.4106. Time cost=88.85, Throughput=54.02 samples/s ETA=0.24h
2020-07-11 02:06:10,332 - root - INFO - Epoch: 3, Batch: 5568/8164, Loss span/answer/total=0.4046/0.0643/0.4688, LR=0.00000235, grad_norm=19.0901. Time cost=91.07, Throughput=52.71 samples/s ETA=0.21h
2020-07-11 02:07:38,445 - root - INFO - Epoch: 3, Batch: 5868/8164, Loss span/answer/total=0.3997/0.0698/0.4695, LR=0.00000208, grad_norm=23.9702. Time cost=88.11, Throughput=54.48 samples/s ETA=0.19h
2020-07-11 02:09:06,757 - root - INFO - Epoch: 3, Batch: 6168/8164, Loss span/answer/total=0.4099/0.0662/0.4761, LR=0.00000181, grad_norm=10.8356. Time cost=88.31, Throughput=54.35 samples/s ETA=0.16h
2020-07-11 02:10:36,438 - root - INFO - Epoch: 3, Batch: 6468/8164, Loss span/answer/total=0.4174/0.0617/0.4791, LR=0.00000154, grad_norm=24.9383. Time cost=89.68, Throughput=53.52 samples/s ETA=0.14h
2020-07-11 02:12:06,180 - root - INFO - Epoch: 3, Batch: 6768/8164, Loss span/answer/total=0.3898/0.0629/0.4527, LR=0.00000126, grad_norm=20.0646. Time cost=89.74, Throughput=53.49 samples/s ETA=0.11h
2020-07-11 02:13:35,502 - root - INFO - Epoch: 3, Batch: 7068/8164, Loss span/answer/total=0.3961/0.0620/0.4581, LR=0.00000099, grad_norm=17.5214. Time cost=89.32, Throughput=53.74 samples/s ETA=0.09h
2020-07-11 02:15:06,476 - root - INFO - Epoch: 3, Batch: 7368/8164, Loss span/answer/total=0.4007/0.0623/0.4631, LR=0.00000072, grad_norm=23.8710. Time cost=90.97, Throughput=52.76 samples/s ETA=0.07h
2020-07-11 02:16:37,243 - root - INFO - Epoch: 3, Batch: 7668/8164, Loss span/answer/total=0.3994/0.0652/0.4646, LR=0.00000045, grad_norm=23.1507. Time cost=90.77, Throughput=52.88 samples/s ETA=0.04h
2020-07-11 02:18:04,668 - root - INFO - Epoch: 3, Batch: 7968/8164, Loss span/answer/total=0.4066/0.0667/0.4733, LR=0.00000017, grad_norm=26.5468. Time cost=87.42, Throughput=54.90 samples/s ETA=0.02h
2020-07-11 02:18:59,955 - root - INFO - Params saved in: fintune_google_albert_base_v2_squad_2.0/google_albert_base_v2_squad2.0_8163.params
2020-07-11 02:19:01,060 - root - INFO - Params saved in: fintune_google_albert_base_v2_squad_2.0/google_albert_base_v2_squad2.0_8164.params
2020-07-11 02:19:01,060 - root - INFO - Finish training step: 8164
2020-07-11 02:19:01,061 - root - INFO - Epoch: 3, #Samples: 130560, Throughput=53.71 samples/s
2020-07-11 02:19:02,585 - root - INFO - Loading Backbone Model from /home/ubuntu/.mxnet/models/nlp/google_albert_base_v2/model-125be477.params, with total/fixd parameters=11092992/0
/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py:568: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py:568: UserWarning: Parameter 'bias' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py:568: UserWarning: Parameter 'gamma' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py:568: UserWarning: Parameter 'beta' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:00<00:00, 1828.18it/s]
2020-07-11 02:19:02,666 - root - INFO - Tokenize Dev Data:
2020-07-11 02:19:05,886 - root - INFO - Done! Time spent:3.22 seconds
2020-07-11 02:19:07,676 - root - INFO - Starting evaluate the checkpoint google_albert_base_v2_squad2.0_8164.params
2020-07-11 02:21:22,663 - root - INFO - [batch 100], Time cost=134.79, Throughput=47.48 samples/s, ETA=0.03h
2020-07-11 02:23:13,264 - root - INFO - Time cost=245.395731 s, Thoughput=49.02 samples/s
2020-07-11 02:23:19,879 - root - INFO - The evaluated results are {"exact": 41.767034447907015, "f1": 45.275399862631694, "total": 11873, "HasAns_exact": 83.65384615384616, "HasAns_f1": 90.68063808519335, "HasAns_total": 5928, "NoAns_exact": 0.0, "NoAns_f1": 0.0, "NoAns_total": 5945, "best_exact": 79.27229849237766, "best_exact_thresh": -2.0645291805267334, "best_f1": 82.0919530894437, "best_f1_thresh": -1.8526396751403809}
2020-07-11 02:23:19,879 - root - INFO - The evaluated files are saved in fintune_google_albert_base_v2_squad_2.0
2020-07-11 02:23:20,927 - root - INFO - The best evaluated results are {"exact": 41.767034447907015, "f1": 45.275399862631694, "total": 11873, "HasAns_exact": 83.65384615384616, "HasAns_f1": 90.68063808519335, "HasAns_total": 5928, "NoAns_exact": 0.0, "NoAns_f1": 0.0, "NoAns_total": 5945, "best_exact": 79.27229849237766, "best_exact_thresh": -2.0645291805267334, "best_f1": 82.0919530894437, "best_f1_thresh": -1.8526396751403809, "best_ckpt": "google_albert_base_v2_squad2.0_8164.params"}
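
As a cross-check on the run configuration above: the effective global batch size is batch_size × num_accumulated × #GPUs = 4 × 3 × 4 = 48, and the schedule follows from it: ceil(130614 training samples × 3 epochs / 48) = 8164 total steps, of which warmup_ratio 0.1 gives 816 warmup steps, matching the INFO lines logged at 00:18:02.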
szha commented 4 years ago

@dmlc/gluon-nlp-committers let's halt other merges in the numpy branch to yield for this change.

szha commented 4 years ago

> 2020-07-11 02:23:20,927 - root - INFO - The best evaluated results are {"exact": 41.767034447907015, "f1": 45.275399862631694, "total": 11873, "HasAns_exact": 83.65384615384616, "HasAns_f1": 90.68063808519335, "HasAns_total": 5928, "NoAns_exact": 0.0, "NoAns_f1": 0.0, "NoAns_total": 5945, "best_exact": 79.27229849237766, "best_exact_thresh": -2.0645291805267334, "best_f1": 82.0919530894437, "best_f1_thresh": -1.8526396751403809, "best_ckpt": "google_albert_base_v2_squad2.0_8164.params"}

@sxjscience is this performance as expected?

sxjscience commented 4 years ago

> 2020-07-11 02:23:20,927 - root - INFO - The best evaluated results are {"exact": 41.767034447907015, "f1": 45.275399862631694, "total": 11873, "HasAns_exact": 83.65384615384616, "HasAns_f1": 90.68063808519335, "HasAns_total": 5928, "NoAns_exact": 0.0, "NoAns_f1": 0.0, "NoAns_total": 5945, "best_exact": 79.27229849237766, "best_exact_thresh": -2.0645291805267334, "best_f1": 82.0919530894437, "best_f1_thresh": -1.8526396751403809, "best_ckpt": "google_albert_base_v2_squad2.0_8164.params"}
>
> @sxjscience is this performance as expected?

Yes, we need to check "best_exact" and "best_f1".
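
For context on why the raw numbers look low: at the default no-answer threshold the model predicts a span for every question (hence NoAns_exact=0.0), while best_exact/best_f1 re-score the same predictions after sweeping a threshold on the no-answer score, which is why best_exact_thresh/best_f1_thresh are reported alongside them. A simplified sketch of that sweep, in the spirit of the official SQuAD 2.0 evaluation (not the exact code in run_squad.py):

def find_best_threshold(na_scores, span_scores, has_answer):
    """na_scores[qid]: model's no-answer score (higher means more likely
    unanswerable); span_scores[qid]: EM or F1 of the predicted span;
    has_answer[qid]: whether the question is answerable.
    Simplification vs. the official script: an unanswerable question always
    loses its point once a span is predicted for it."""
    # Start from the score of predicting "no answer" for every question.
    cur = best = float(sum(1 for qid in has_answer if not has_answer[qid]))
    best_thresh = 0.0
    # A question is answered with its span once the threshold exceeds its
    # no-answer score, so sweep questions in increasing na_score order.
    for qid in sorted(na_scores, key=na_scores.get):
        cur += span_scores[qid] if has_answer[qid] else -1.0
        if cur > best:
            best, best_thresh = cur, na_scores[qid]
    return 100.0 * best / len(has_answer), best_thresh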

zheyuye commented 4 years ago

@szha @sxjscience Yes, the best_f1 and best_exact are reasonable compared to the previous results: https://github.com/dmlc/gluon-nlp/tree/numpy/scripts/question_answering#results

leezu commented 4 years ago

> Thanks to @leezu for revising, this is generally good but may require some extra effort on the conversion toolkits, which are highly dependent on the prefix, as in https://github.com/leezu/gluon-nlp/blob/a79101da3a40d5212e419fa1f46a40e9ad3e7eb3/scripts/conversion_toolkits/convert_tf_hub_model.py#L134-L166

@ZheyuYe Generally the scripts can be updated by replacing the prefix with the respective attribute name of the Python block. In this codebase the two are mostly identical or very similar, such as _rel_pos_embed vs rel_pos_embed:

self._rel_pos_embed = BucketPositionalEmbedding(
    units=num_heads,
    num_buckets=self._num_buckets,
    max_distance=self._max_distance,
    bidirectional=self._bidirectional,
    prefix='rel_pos_embed_')
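
After the refactor the prefix argument disappears and the parameter is addressed through the attribute path instead, so (a sketch of the resulting change, not the exact diff) the snippet above becomes:

self._rel_pos_embed = BucketPositionalEmbedding(
    units=num_heads,
    num_buckets=self._num_buckets,
    max_distance=self._max_distance,
    bidirectional=self._bidirectional)
# old parameter name: 'rel_pos_embed_weight'
# new parameter name (attribute path, assumed): '_rel_pos_embed.weight'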

It can be done in a separate PR if we decide to keep the conversion scripts in the release (which may require adding tests).

zheyuye commented 4 years ago

@leezu This sounds great. Let's leave the conversion scripts alone for now and revisit them with some useful test cases once this PR is merged.

leezu commented 4 years ago

2020-07-16 18:12:12,077 - root - INFO - Time Spent: 212.10484337806702, #Sent=2737, SacreBlEU=26.621931302568633 Avg NLL=1.37640975370957, Perplexity=3.9606563390205163

train_transformer.log
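
As a quick sanity check on that log line: the reported perplexity is the exponential of the average NLL.

import math
math.exp(1.37640975370957)  # -> 3.9606563390205163, the logged Perplexity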

sxjscience commented 4 years ago

@leezu It's expected. I propose to merge this in.

codecov[bot] commented 4 years ago

Codecov Report

Merging #1261 into numpy will increase coverage by 0.01%. The diff coverage is 86.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##            numpy    #1261      +/-   ##
==========================================
+ Coverage   82.52%   82.53%   +0.01%     
==========================================
  Files          38       38              
  Lines        5500     5446      -54     
==========================================
- Hits         4539     4495      -44     
+ Misses        961      951      -10     
Impacted Files                         Coverage Δ
src/gluonnlp/lr_scheduler.py           45.45% <0.00%> (ø)
src/gluonnlp/models/bert.py            84.42% <66.66%> (+0.10%) ↑
src/gluonnlp/models/mobilebert.py      81.35% <76.81%> (-0.10%) ↓
src/gluonnlp/attention_cell.py         79.91% <80.95%> (-0.09%) ↓
src/gluonnlp/models/transformer_xl.py  82.71% <81.25%> (-0.22%) ↓
src/gluonnlp/models/electra.py         78.86% <88.23%> (+1.61%) ↑
src/gluonnlp/layers.py                 86.78% <92.98%> (-0.45%) ↓
src/gluonnlp/models/transformer.py     95.95% <94.91%> (-0.04%) ↓
src/gluonnlp/models/roberta.py         93.64% <95.65%> (+0.38%) ↑
src/gluonnlp/data/loading.py           83.39% <100.00%> (ø)
... and 6 more