Thank you for your interest in our work. We typically get the best results when the training loss (closs in the log) settles at around 1. In your case, the loss looks considerably higher than that. Try increasing the batch size or lowering the learning rate. Our training log for the alpaca_llava.yaml data config, pasted below, may be a useful reference. That run used 8 GPUs with an effective batch size of 4096 and a peak learning rate of 3e-5.
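To compare a run against these settings, a minimal Python sketch follows. It assumes the effective batch size is the product of per-GPU batch size, gradient-accumulation steps, and GPU count; the names `per_gpu_batch` and `accum_iter` are hypothetical, and the values 16 and 32 are only one combination that multiplies out to 4096 on 8 GPUs (the actual values are not stated here). The second helper pulls the `closs` values out of log lines in the format pasted below; the number in parentheses is assumed to be a running average.

```python
import math
import re


def effective_batch_size(per_gpu_batch: int, accum_iter: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step (assumed definition)."""
    return per_gpu_batch * accum_iter * num_gpus


def accum_iter_for_target(target: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Smallest accumulation count whose effective batch size reaches `target`."""
    return math.ceil(target / (per_gpu_batch * num_gpus))


# Matches e.g. "[40/1504] lr: 0.000003 closs: 1.5428 (1.5405)".
ENTRY = re.compile(r"\[(\d+)/(\d+)\]\s+lr:\s+\S+\s+closs:\s+([\d.]+)\s+\(([\d.]+)\)")


def parse_closs(log_text: str) -> list[tuple[int, float, float]]:
    """Return (iteration, closs, value in parentheses) for each log entry found."""
    return [
        (int(m.group(1)), float(m.group(3)), float(m.group(4)))
        for m in ENTRY.finditer(log_text)
    ]


if __name__ == "__main__":
    # Hypothetical split of 4096 across 8 GPUs.
    print(effective_batch_size(per_gpu_batch=16, accum_iter=32, num_gpus=8))
    # Accumulation steps needed to reach 4096 with only 2 GPUs at the same per-GPU batch.
    print(accum_iter_for_target(4096, per_gpu_batch=16, num_gpus=2))
```

With fewer GPUs, raising the accumulation count keeps the effective batch size at 4096 at the cost of longer wall-clock time per optimizer step.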
[01:33:16.135115] effective batch size: 4096
[01:33:16.140969] FusedAdam (
Parameter Group 0
    betas: (0.9, 0.95)
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    weight_decay: 0.0
Parameter Group 1
    betas: (0.9, 0.95)
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    weight_decay: 0.02
)
[01:33:16.141096] read dataset config from configs/data/finetune/mm/alpaca_llava.yaml
[01:33:16.142847] DATASET CONFIG:
[01:33:16.142890] {'META': [['../data/alpaca_gpt4_data.json', 'text'], ['./data/annotations/llava_instruct_150k.json', 'image_text']]}
[01:33:16.414220] ../data/alpaca_gpt4_data.json, typetext: len 52002
[01:33:17.422696] ./data/annotations/llava_instruct_150k.json, typeimage_text: len 144794
[01:33:17.433995] total length: 196796
[01:33:17.448037] <data.alpaca.FinetuneDataset object at 0x7f5c0cf52350>
[01:33:17.452383] Start training for 3 epochs
[01:33:17.461609] log_dir: ./output_dir
[01:33:27.164930] Epoch: [0] [0/1504] lr: 0.000000 closs: 1.5440 (1.5440) time: 9.7016 data: 1.4028 max mem: 26869
[01:34:12.610327] Epoch: [0] [10/1504] lr: 0.000000 closs: 1.5367 (1.5368) time: 5.0133 data: 0.1277 max mem: 39297
[01:34:58.290992] Epoch: [0] [20/1504] lr: 0.000000 closs: 1.5362 (1.5382) time: 4.5562 data: 0.0002 max mem: 39297
[01:35:43.927782] Epoch: [0] [30/1504] lr: 0.000000 closs: 1.5318 (1.5380) time: 4.5658 data: 0.0002 max mem: 39297
[01:36:29.695125] Epoch: [0] [40/1504] lr: 0.000003 closs: 1.5428 (1.5405) grad_norm: 5.2397 (5.2397) time: 4.5701 data: 0.0002 max mem: 51708
[01:37:15.306165] Epoch: [0] [50/1504] lr: 0.000003 closs: 1.5234 (1.5350) grad_norm: 5.2397 (5.2397) time: 4.5688 data: 0.0002 max mem: 51708
[01:38:00.980330] Epoch: [0] [60/1504] lr: 0.000003 closs: 1.5201 (1.5346) grad_norm: 5.2397 (5.2397) time: 4.5642 data: 0.0002 max mem: 51708
[01:38:46.891616] Epoch: [0] [70/1504] lr: 0.000006 closs: 1.5242 (1.5309) grad_norm: 5.2397 (5.2410) time: 4.5792 data: 0.0002 max mem: 51710
[01:39:32.533681] Epoch: [0] [80/1504] lr: 0.000006 closs: 1.5069 (1.5298) grad_norm: 5.2397 (5.2410) time: 4.5776 data: 0.0002 max mem: 51710
[01:40:18.172497] Epoch: [0] [90/1504] lr: 0.000006 closs: 1.5069 (1.5286) grad_norm: 5.2397 (5.2410) time: 4.5639 data: 0.0002 max mem: 51710
[01:41:03.857179] Epoch: [0] [100/1504] lr: 0.000010 closs: 1.4768 (1.5214) grad_norm: 5.2397 (5.1645) time: 4.5661 data: 0.0002 max mem: 51710
[01:41:49.476173] Epoch: [0] [110/1504] lr: 0.000010 closs: 1.4001 (1.5096) grad_norm: 5.2397 (5.1645) time: 4.5651 data: 0.0002 max mem: 51710
[01:42:35.055249] Epoch: [0] [120/1504] lr: 0.000010 closs: 1.4148 (1.5009) grad_norm: 5.2397 (5.1645) time: 4.5598 data: 0.0002 max mem: 51710
[01:43:20.758113] Epoch: [0] [130/1504] lr: 0.000013 closs: 1.3779 (1.4888) grad_norm: 5.0116 (4.5031) time: 4.5640 data: 0.0002 max mem: 51710
[01:44:06.360168] Epoch: [0] [140/1504] lr: 0.000013 closs: 1.3035 (1.4744) grad_norm: 5.0116 (4.5031) time: 4.5652 data: 0.0002 max mem: 51710
[01:44:51.924395] Epoch: [0] [150/1504] lr: 0.000013 closs: 1.3035 (1.4641) grad_norm: 5.0116 (4.5031) time: 4.5582 data: 0.0002 max mem: 51710
[01:45:37.602991] Epoch: [0] [160/1504] lr: 0.000016 closs: 1.3219 (1.4553) grad_norm: 5.0116 (4.0958) time: 4.5621 data: 0.0002 max mem: 51710
[01:46:23.519850] Epoch: [0] [170/1504] lr: 0.000016 closs: 1.2827 (1.4446) grad_norm: 5.0116 (4.0958) time: 4.5797 data: 0.0002 max mem: 51710
[01:47:09.184635] Epoch: [0] [180/1504] lr: 0.000016 closs: 1.2814 (1.4364) grad_norm: 5.0116 (4.0958) time: 4.5790 data: 0.0001 max mem: 51710
[01:47:54.851141] Epoch: [0] [190/1504] lr: 0.000016 closs: 1.2894 (1.4276) grad_norm: 5.0116 (4.0958) time: 4.5665 data: 0.0002 max mem: 51710
[01:48:36.760105] Epoch: [0] [200/1504] lr: 0.000019 closs: 1.2155 (1.4103) grad_norm: 2.5190 (3.7287) time: 4.3787 data: 0.0002 max mem: 51710
[01:49:18.164817] Epoch: [0] [210/1504] lr: 0.000019 closs: 1.0563 (1.3920) grad_norm: 2.5190 (3.7287) time: 4.1655 data: 0.0002 max mem: 51710
[01:49:59.529031] Epoch: [0] [220/1504] lr: 0.000019 closs: 1.0462 (1.3771) grad_norm: 2.5190 (3.7287) time: 4.1383 data: 0.0002 max mem: 51710
[01:50:43.998510] Epoch: [0] [230/1504] lr: 0.000022 closs: 1.1248 (1.3708) grad_norm: 2.8653 (3.6053) time: 4.2916 data: 0.0002 max mem: 51710
[01:51:29.888639] Epoch: [0] [240/1504] lr: 0.000022 closs: 1.2674 (1.3672) grad_norm: 2.8653 (3.6053) time: 4.5179 data: 0.0002 max mem: 51710
[01:52:15.495147] Epoch: [0] [250/1504] lr: 0.000022 closs: 1.2743 (1.3640) grad_norm: 2.8653 (3.6053) time: 4.5747 data: 0.0001 max mem: 51710
[01:53:01.169871] Epoch: [0] [260/1504] lr: 0.000026 closs: 1.2684 (1.3605) grad_norm: 2.5190 (3.3888) time: 4.5639 data: 0.0002 max mem: 51710
[01:53:46.764507] Epoch: [0] [270/1504] lr: 0.000026 closs: 1.2660 (1.3568) grad_norm: 2.5190 (3.3888) time: 4.5634 data: 0.0002 max mem: 51710
[01:54:32.334670] Epoch: [0] [280/1504] lr: 0.000026 closs: 1.2660 (1.3535) grad_norm: 2.5190 (3.3888) time: 4.5581 data: 0.0002 max mem: 51710
[01:55:16.702948] Epoch: [0] [290/1504] lr: 0.000029 closs: 1.2382 (1.3478) grad_norm: 2.5190 (3.1731) time: 4.4968 data: 0.0002 max mem: 51710
[01:55:58.053955] Epoch: [0] [300/1504] lr: 0.000029 closs: 1.0619 (1.3367) grad_norm: 2.5190 (3.1731) time: 4.2859 data: 0.0002 max mem: 51710
[01:56:39.727558] Epoch: [0] [310/1504] lr: 0.000029 closs: 1.0160 (1.3265) grad_norm: 2.5190 (3.1731) time: 4.1512 data: 0.0002 max mem: 51710
[01:57:21.651061] Epoch: [0] [320/1504] lr: 0.000030 closs: 0.9916 (1.3163) grad_norm: 2.4667 (3.0052) time: 4.1798 data: 0.0002 max mem: 51710
[01:58:07.329527] Epoch: [0] [330/1504] lr: 0.000030 closs: 1.1698 (1.3141) grad_norm: 2.4667 (3.0052) time: 4.3800 data: 0.0002 max mem: 51710
[01:58:53.220553] Epoch: [0] [340/1504] lr: 0.000030 closs: 1.2272 (1.3119) grad_norm: 2.4667 (3.0052) time: 4.5784 data: 0.0002 max mem: 51710
[01:59:39.119161] Epoch: [0] [350/1504] lr: 0.000030 closs: 1.2272 (1.3101) grad_norm: 2.4667 (3.0052) time: 4.5894 data: 0.0002 max mem: 51710
[02:00:24.905296] Epoch: [0] [360/1504] lr: 0.000030 closs: 1.2304 (1.3076) grad_norm: 2.4667 (2.8327) time: 4.5841 data: 0.0002 max mem: 51710
[02:01:10.551769] Epoch: [0] [370/1504] lr: 0.000030 closs: 1.2304 (1.3054) grad_norm: 2.4667 (2.8327) time: 4.5715 data: 0.0002 max mem: 51710
[02:01:56.252432] Epoch: [0] [380/1504] lr: 0.000030 closs: 1.2293 (1.3033) grad_norm: 2.4667 (2.8327) time: 4.5673 data: 0.0002 max mem: 51710
[02:02:39.206068] Epoch: [0] [390/1504] lr: 0.000030 closs: 1.1506 (1.2965) grad_norm: 1.8928 (2.6668) time: 4.4326 data: 0.0002 max mem: 51710
[02:03:20.622436] Epoch: [0] [400/1504] lr: 0.000030 closs: 0.9877 (1.2888) grad_norm: 1.8928 (2.6668) time: 4.2184 data: 0.0002 max mem: 51710
[02:04:02.060869] Epoch: [0] [410/1504] lr: 0.000030 closs: 0.9718 (1.2813) grad_norm: 1.8928 (2.6668) time: 4.1427 data: 0.0001 max mem: 51710
[02:04:45.957882] Epoch: [0] [420/1504] lr: 0.000030 closs: 1.0121 (1.2768) grad_norm: 1.8928 (2.5815) time: 4.2667 data: 0.0001 max mem: 51710
[02:05:31.773505] Epoch: [0] [430/1504] lr: 0.000030 closs: 1.1706 (1.2746) grad_norm: 1.8928 (2.5815) time: 4.4856 data: 0.0002 max mem: 51710
[02:06:17.447776] Epoch: [0] [440/1504] lr: 0.000030 closs: 1.1876 (1.2729) grad_norm: 1.8928 (2.5815) time: 4.5744 data: 0.0002 max mem: 51710
[02:07:03.213373] Epoch: [0] [450/1504] lr: 0.000030 closs: 1.1917 (1.2711) grad_norm: 1.8728 (2.4497) time: 4.5719 data: 0.0002 max mem: 51710
[02:07:49.045082] Epoch: [0] [460/1504] lr: 0.000030 closs: 1.1905 (1.2693) grad_norm: 1.8728 (2.4497) time: 4.5798 data: 0.0002 max mem: 51710
[02:08:34.750488] Epoch: [0] [470/1504] lr: 0.000030 closs: 1.1597 (1.2668) grad_norm: 1.8728 (2.4497) time: 4.5768 data: 0.0002 max mem: 51710
[02:09:20.534491] Epoch: [0] [480/1504] lr: 0.000030 closs: 1.1649 (1.2649) grad_norm: 1.8728 (2.3280) time: 4.5744 data: 0.0002 max mem: 51710
[02:10:06.389679] Epoch: [0] [490/1504] lr: 0.000030 closs: 1.1661 (1.2630) grad_norm: 1.8728 (2.3280) time: 4.5819 data: 0.0002 max mem: 51710
[02:10:52.295231] Epoch: [0] [500/1504] lr: 0.000030 closs: 1.1522 (1.2610) grad_norm: 1.8728 (2.3280) time: 4.5880 data: 0.0002 max mem: 51710
[02:11:37.932013] Epoch: [0] [510/1504] lr: 0.000030 closs: 1.1706 (1.2594) grad_norm: 1.8728 (2.3280) time: 4.5770 data: 0.0002 max mem: 51710
[02:12:23.683626] Epoch: [0] [520/1504] lr: 0.000030 closs: 1.1695 (1.2577) grad_norm: 1.5589 (2.2127) time: 4.5693 data: 0.0002 max mem: 51710
[02:13:09.363386] Epoch: [0] [530/1504] lr: 0.000030 closs: 1.1486 (1.2556) grad_norm: 1.5589 (2.2127) time: 4.5715 data: 0.0002 max mem: 51710
[02:13:55.054600] Epoch: [0] [540/1504] lr: 0.000030 closs: 1.1439 (1.2539) grad_norm: 1.5589 (2.2127) time: 4.5685 data: 0.0002 max mem: 51710
[02:14:40.845058] Epoch: [0] [550/1504] lr: 0.000030 closs: 1.1698 (1.2523) grad_norm: 1.5589 (2.1127) time: 4.5740 data: 0.0002 max mem: 51710
[02:15:26.688088] Epoch: [0] [560/1504] lr: 0.000030 closs: 1.1532 (1.2504) grad_norm: 1.5589 (2.1127) time: 4.5816 data: 0.0002 max mem: 51710
[02:16:12.399214] Epoch: [0] [570/1504] lr: 0.000030 closs: 1.1532 (1.2491) grad_norm: 1.5589 (2.1127) time: 4.5776 data: 0.0002 max mem: 51710
[02:16:59.298380] Epoch: [0] [580/1504] lr: 0.000030 closs: 1.1372 (1.2468) grad_norm: 1.4939 (2.0214) time: 4.6304 data: 0.0002 max mem: 51710
[02:17:45.254815] Epoch: [0] [590/1504] lr: 0.000030 closs: 1.1306 (1.2450) grad_norm: 1.4939 (2.0214) time: 4.6426 data: 0.0002 max mem: 51710
[02:18:30.947760] Epoch: [0] [600/1504] lr: 0.000030 closs: 1.1530 (1.2434) grad_norm: 1.4939 (2.0214) time: 4.5823 data: 0.0002 max mem: 51710
[02:19:16.651804] Epoch: [0] [610/1504] lr: 0.000030 closs: 1.1238 (1.2414) grad_norm: 1.4939 (1.9414) time: 4.5697 data: 0.0002 max mem: 51710
[02:20:02.360981] Epoch: [0] [620/1504] lr: 0.000030 closs: 1.1290 (1.2399) grad_norm: 1.4939 (1.9414) time: 4.5705 data: 0.0002 max mem: 51710
[02:20:48.042935] Epoch: [0] [630/1504] lr: 0.000030 closs: 1.1437 (1.2383) grad_norm: 1.4939 (1.9414) time: 4.5694 data: 0.0002 max mem: 51710
[02:21:33.393964] Epoch: [0] [640/1504] lr: 0.000030 closs: 1.1410 (1.2365) grad_norm: 1.4480 (1.8668) time: 4.5515 data: 0.0003 max mem: 51710
[02:22:14.901191] Epoch: [0] [650/1504] lr: 0.000030 closs: 1.0589 (1.2326) grad_norm: 1.4480 (1.8668) time: 4.3428 data: 0.0003 max mem: 51710
[02:22:57.071098] Epoch: [0] [660/1504] lr: 0.000030 closs: 0.9899 (1.2286) grad_norm: 1.4480 (1.8668) time: 4.1837 data: 0.0003 max mem: 51710
[02:23:38.630928] Epoch: [0] [670/1504] lr: 0.000030 closs: 0.9728 (1.2247) grad_norm: 1.4480 (1.8668) time: 4.1864 data: 0.0003 max mem: 51710
[02:24:20.950217] Epoch: [0] [680/1504] lr: 0.000030 closs: 0.9449 (1.2205) grad_norm: 1.2513 (1.8375) time: 4.1938 data: 0.0003 max mem: 51710
[02:25:02.683804] Epoch: [0] [690/1504] lr: 0.000030 closs: 0.9253 (1.2163) grad_norm: 1.2513 (1.8375) time: 4.2025 data: 0.0003 max mem: 51710
[02:25:44.212032] Epoch: [0] [700/1504] lr: 0.000030 closs: 0.9388 (1.2129) grad_norm: 1.2513 (1.8375) time: 4.1629 data: 0.0003 max mem: 51710
[02:26:25.863592] Epoch: [0] [710/1504] lr: 0.000029 closs: 0.9572 (1.2090) grad_norm: 1.1074 (1.7838) time: 4.1588 data: 0.0003 max mem: 51710
[02:27:07.362960] Epoch: [0] [720/1504] lr: 0.000029 closs: 0.9188 (1.2051) grad_norm: 1.1074 (1.7838) time: 4.1574 data: 0.0003 max mem: 51710
[02:27:49.356961] Epoch: [0] [730/1504] lr: 0.000029 closs: 0.9234 (1.2015) grad_norm: 1.1074 (1.7838) time: 4.1745 data: 0.0003 max mem: 51710
[02:28:33.263611] Epoch: [0] [740/1504] lr: 0.000029 closs: 0.9438 (1.1995) grad_norm: 0.8416 (1.7277) time: 4.2949 data: 0.0003 max mem: 51710
[02:29:19.015150] Epoch: [0] [750/1504] lr: 0.000029 closs: 1.1603 (1.1993) grad_norm: 0.8416 (1.7277) time: 4.4828 data: 0.0003 max mem: 51710
[02:30:04.976669] Epoch: [0] [760/1504] lr: 0.000029 closs: 1.1682 (1.1989) grad_norm: 0.8416 (1.7277) time: 4.5855 data: 0.0002 max mem: 51710
[02:30:51.494024] Epoch: [0] [770/1504] lr: 0.000029 closs: 1.1542 (1.1982) grad_norm: 0.8416 (1.7108) time: 4.6238 data: 0.0003 max mem: 51710
[02:31:37.247699] Epoch: [0] [780/1504] lr: 0.000029 closs: 1.1345 (1.1973) grad_norm: 0.8416 (1.7108) time: 4.6134 data: 0.0003 max mem: 51710
[02:32:22.991293] Epoch: [0] [790/1504] lr: 0.000029 closs: 1.1147 (1.1965) grad_norm: 0.8416 (1.7108) time: 4.5747 data: 0.0003 max mem: 51710
[02:33:08.991834] Epoch: [0] [800/1504] lr: 0.000029 closs: 1.1153 (1.1956) grad_norm: 0.7359 (1.6628) time: 4.5870 data: 0.0003 max mem: 51710
[02:33:54.758587] Epoch: [0] [810/1504] lr: 0.000029 closs: 1.1066 (1.1945) grad_norm: 0.7359 (1.6628) time: 4.5882 data: 0.0002 max mem: 51710
[02:34:40.525510] Epoch: [0] [820/1504] lr: 0.000029 closs: 1.1210 (1.1942) grad_norm: 0.7359 (1.6628) time: 4.5765 data: 0.0003 max mem: 51710
[02:35:27.361669] Epoch: [0] [830/1504] lr: 0.000029 closs: 1.1700 (1.1936) grad_norm: 0.7359 (1.6628) time: 4.6300 data: 0.0002 max mem: 51710
[02:36:09.447388] Epoch: [0] [840/1504] lr: 0.000029 closs: 1.0668 (1.1909) grad_norm: 0.7039 (1.6259) time: 4.4460 data: 0.0002 max mem: 51710
[02:36:50.978470] Epoch: [0] [850/1504] lr: 0.000029 closs: 0.9470 (1.1882) grad_norm: 0.7039 (1.6259) time: 4.1807 data: 0.0002 max mem: 51710
[02:37:32.551768] Epoch: [0] [860/1504] lr: 0.000029 closs: 0.9366 (1.1853) grad_norm: 0.7039 (1.6259) time: 4.1550 data: 0.0003 max mem: 51710
[02:38:17.135672] Epoch: [0] [870/1504] lr: 0.000029 closs: 0.9822 (1.1842) grad_norm: 0.7039 (1.6399) time: 4.3077 data: 0.0003 max mem: 51710
[02:39:02.873962] Epoch: [0] [880/1504] lr: 0.000029 closs: 1.1222 (1.1836) grad_norm: 0.7039 (1.6399) time: 4.5159 data: 0.0003 max mem: 51710
[02:39:48.741472] Epoch: [0] [890/1504] lr: 0.000029 closs: 1.1222 (1.1831) grad_norm: 0.7039 (1.6399) time: 4.5801 data: 0.0003 max mem: 51710
[02:40:34.759871] Epoch: [0] [900/1504] lr: 0.000029 closs: 1.1207 (1.1822) grad_norm: 0.7039 (1.6110) time: 4.5941 data: 0.0003 max mem: 51710
[02:41:21.141682] Epoch: [0] [910/1504] lr: 0.000029 closs: 1.1207 (1.1815) grad_norm: 0.7039 (1.6110) time: 4.6199 data: 0.0003 max mem: 51710
[02:42:06.900489] Epoch: [0] [920/1504] lr: 0.000029 closs: 1.1332 (1.1811) grad_norm: 0.7039 (1.6110) time: 4.6069 data: 0.0002 max mem: 51710
[02:42:53.278669] Epoch: [0] [930/1504] lr: 0.000029 closs: 1.1278 (1.1805) grad_norm: 0.6564 (1.5756) time: 4.6067 data: 0.0003 max mem: 51710
[02:43:38.979076] Epoch: [0] [940/1504] lr: 0.000029 closs: 1.1158 (1.1797) grad_norm: 0.6564 (1.5756) time: 4.6038 data: 0.0003 max mem: 51710
[02:44:24.904207] Epoch: [0] [950/1504] lr: 0.000029 closs: 1.0933 (1.1789) grad_norm: 0.6564 (1.5756) time: 4.5811 data: 0.0002 max mem: 51710
[02:45:10.710130] Epoch: [0] [960/1504] lr: 0.000029 closs: 1.1078 (1.1784) grad_norm: 0.6237 (1.5380) time: 4.5863 data: 0.0002 max mem: 51710
[02:45:56.443279] Epoch: [0] [970/1504] lr: 0.000029 closs: 1.1168 (1.1778) grad_norm: 0.6237 (1.5380) time: 4.5767 data: 0.0003 max mem: 51710
[02:46:42.368440] Epoch: [0] [980/1504] lr: 0.000029 closs: 1.1057 (1.1770) grad_norm: 0.6237 (1.5380) time: 4.5828 data: 0.0003 max mem: 51710
[02:47:28.111227] Epoch: [0] [990/1504] lr: 0.000029 closs: 1.1047 (1.1764) grad_norm: 0.6237 (1.5380) time: 4.5832 data: 0.0003 max mem: 51710
[02:48:10.106111] Epoch: [0] [1000/1504] lr: 0.000028 closs: 1.0493 (1.1737) grad_norm: 0.5832 (1.5045) time: 4.3867 data: 0.0002 max mem: 51710
[02:48:52.370192] Epoch: [0] [1010/1504] lr: 0.000028 closs: 0.9139 (1.1712) grad_norm: 0.5832 (1.5045) time: 4.2128 data: 0.0002 max mem: 51710
[02:49:34.234564] Epoch: [0] [1020/1504] lr: 0.000028 closs: 0.9296 (1.1689) grad_norm: 0.5832 (1.5045) time: 4.2063 data: 0.0002 max mem: 51710
[02:50:15.823348] Epoch: [0] [1030/1504] lr: 0.000028 closs: 0.9049 (1.1664) grad_norm: 0.5832 (1.4772) time: 4.1725 data: 0.0003 max mem: 51710
[02:50:57.316320] Epoch: [0] [1040/1504] lr: 0.000028 closs: 0.9012 (1.1640) grad_norm: 0.5832 (1.4772) time: 4.1539 data: 0.0003 max mem: 51710
[02:51:39.006193] Epoch: [0] [1050/1504] lr: 0.000028 closs: 0.8929 (1.1612) grad_norm: 0.5832 (1.4772) time: 4.1590 data: 0.0002 max mem: 51710
[02:52:20.782455] Epoch: [0] [1060/1504] lr: 0.000028 closs: 0.8751 (1.1586) grad_norm: 0.5133 (1.4465) time: 4.1731 data: 0.0002 max mem: 51710
[02:53:02.237813] Epoch: [0] [1070/1504] lr: 0.000028 closs: 0.8842 (1.1561) grad_norm: 0.5133 (1.4465) time: 4.1614 data: 0.0002 max mem: 51710
[02:53:43.670978] Epoch: [0] [1080/1504] lr: 0.000028 closs: 0.8964 (1.1536) grad_norm: 0.5133 (1.4465) time: 4.1443 data: 0.0002 max mem: 51710
[02:54:26.810451] Epoch: [0] [1090/1504] lr: 0.000028 closs: 0.9051 (1.1519) grad_norm: 0.5113 (1.4129) time: 4.2285 data: 0.0002 max mem: 51710
[02:55:12.780580] Epoch: [0] [1100/1504] lr: 0.000028 closs: 1.1025 (1.1518) grad_norm: 0.5113 (1.4129) time: 4.4553 data: 0.0003 max mem: 51710
[02:55:58.667617] Epoch: [0] [1110/1504] lr: 0.000028 closs: 1.1652 (1.1518) grad_norm: 0.5113 (1.4129) time: 4.5927 data: 0.0003 max mem: 51710
[02:56:44.552628] Epoch: [0] [1120/1504] lr: 0.000028 closs: 1.1489 (1.1517) grad_norm: 0.5113 (1.3956) time: 4.5884 data: 0.0002 max mem: 51710
[02:57:30.207689] Epoch: [0] [1130/1504] lr: 0.000028 closs: 1.1134 (1.1514) grad_norm: 0.5113 (1.3956) time: 4.5768 data: 0.0003 max mem: 51710
[02:58:16.499151] Epoch: [0] [1140/1504] lr: 0.000028 closs: 1.1545 (1.1515) grad_norm: 0.5113 (1.3956) time: 4.5972 data: 0.0003 max mem: 51710
[02:59:02.191078] Epoch: [0] [1150/1504] lr: 0.000028 closs: 1.1561 (1.1514) grad_norm: 0.5113 (1.3956) time: 4.5990 data: 0.0003 max mem: 51710
[02:59:48.097943] Epoch: [0] [1160/1504] lr: 0.000028 closs: 1.1357 (1.1511) grad_norm: 0.5133 (1.3717) time: 4.5798 data: 0.0002 max mem: 51710
[03:00:33.940094] Epoch: [0] [1170/1504] lr: 0.000028 closs: 1.1177 (1.1506) grad_norm: 0.5133 (1.3717) time: 4.5873 data: 0.0002 max mem: 51710
[03:01:19.782604] Epoch: [0] [1180/1504] lr: 0.000028 closs: 1.0880 (1.1502) grad_norm: 0.5133 (1.3717) time: 4.5841 data: 0.0003 max mem: 51710
[03:02:05.882630] Epoch: [0] [1190/1504] lr: 0.000027 closs: 1.0922 (1.1498) grad_norm: 0.5113 (1.3452) time: 4.5970 data: 0.0003 max mem: 51710
[03:02:51.823169] Epoch: [0] [1200/1504] lr: 0.000027 closs: 1.0958 (1.1494) grad_norm: 0.5113 (1.3452) time: 4.6019 data: 0.0003 max mem: 51710
[03:03:37.568225] Epoch: [0] [1210/1504] lr: 0.000027 closs: 1.1030 (1.1489) grad_norm: 0.5113 (1.3452) time: 4.5842 data: 0.0002 max mem: 51710
[03:04:23.347093] Epoch: [0] [1220/1504] lr: 0.000027 closs: 1.0936 (1.1484) grad_norm: 0.5113 (1.3223) time: 4.5761 data: 0.0002 max mem: 51710
[03:05:09.105734] Epoch: [0] [1230/1504] lr: 0.000027 closs: 1.0774 (1.1479) grad_norm: 0.5113 (1.3223) time: 4.5768 data: 0.0003 max mem: 51710
[03:05:55.157620] Epoch: [0] [1240/1504] lr: 0.000027 closs: 1.1016 (1.1477) grad_norm: 0.5113 (1.3223) time: 4.5904 data: 0.0003 max mem: 51710
[03:06:40.105678] Epoch: [0] [1250/1504] lr: 0.000027 closs: 1.1123 (1.1470) grad_norm: 0.5113 (1.3013) time: 4.5499 data: 0.0003 max mem: 51710
[03:07:21.736953] Epoch: [0] [1260/1504] lr: 0.000027 closs: 0.9396 (1.1451) grad_norm: 0.5113 (1.3013) time: 4.3289 data: 0.0003 max mem: 51710
[03:08:03.434408] Epoch: [0] [1270/1504] lr: 0.000027 closs: 0.9396 (1.1435) grad_norm: 0.5113 (1.3013) time: 4.1663 data: 0.0003 max mem: 51710
[03:08:45.766915] Epoch: [0] [1280/1504] lr: 0.000027 closs: 0.9523 (1.1419) grad_norm: 0.5359 (1.2929) time: 4.2013 data: 0.0003 max mem: 51710
[03:09:31.440710] Epoch: [0] [1290/1504] lr: 0.000027 closs: 1.0482 (1.1417) grad_norm: 0.5359 (1.2929) time: 4.4001 data: 0.0002 max mem: 51710
[03:10:17.345359] Epoch: [0] [1300/1504] lr: 0.000027 closs: 1.1189 (1.1416) grad_norm: 0.5359 (1.2929) time: 4.5787 data: 0.0002 max mem: 51710
[03:11:03.184512] Epoch: [0] [1310/1504] lr: 0.000027 closs: 1.1189 (1.1413) grad_norm: 0.5359 (1.2929) time: 4.5870 data: 0.0003 max mem: 51710
[03:11:49.267315] Epoch: [0] [1320/1504] lr: 0.000027 closs: 1.1061 (1.1411) grad_norm: 0.5359 (1.2757) time: 4.5960 data: 0.0003 max mem: 51710
[03:12:35.121575] Epoch: [0] [1330/1504] lr: 0.000027 closs: 1.1125 (1.1409) grad_norm: 0.5359 (1.2757) time: 4.5967 data: 0.0003 max mem: 51710
[03:13:20.928661] Epoch: [0] [1340/1504] lr: 0.000027 closs: 1.1089 (1.1405) grad_norm: 0.5359 (1.2757) time: 4.5829 data: 0.0003 max mem: 51710
[03:14:04.305535] Epoch: [0] [1350/1504] lr: 0.000026 closs: 1.0430 (1.1391) grad_norm: 0.5281 (1.2579) time: 4.4591 data: 0.0003 max mem: 51710
[03:14:45.779248] Epoch: [0] [1360/1504] lr: 0.000026 closs: 0.9102 (1.1374) grad_norm: 0.5281 (1.2579) time: 4.2424 data: 0.0003 max mem: 51710
[03:15:27.266473] Epoch: [0] [1370/1504] lr: 0.000026 closs: 0.8950 (1.1355) grad_norm: 0.5281 (1.2579) time: 4.1479 data: 0.0003 max mem: 51710
[03:16:10.971808] Epoch: [0] [1380/1504] lr: 0.000026 closs: 0.9011 (1.1344) grad_norm: 0.5281 (1.2371) time: 4.2595 data: 0.0003 max mem: 51710
[03:16:57.038825] Epoch: [0] [1390/1504] lr: 0.000026 closs: 1.0915 (1.1341) grad_norm: 0.5281 (1.2371) time: 4.4885 data: 0.0003 max mem: 51710
[03:17:42.852467] Epoch: [0] [1400/1504] lr: 0.000026 closs: 1.0960 (1.1338) grad_norm: 0.5281 (1.2371) time: 4.5939 data: 0.0003 max mem: 51710
[03:18:28.726486] Epoch: [0] [1410/1504] lr: 0.000026 closs: 1.0980 (1.1335) grad_norm: 0.5113 (1.2193) time: 4.5842 data: 0.0003 max mem: 51710
[03:19:14.594822] Epoch: [0] [1420/1504] lr: 0.000026 closs: 1.0980 (1.1333) grad_norm: 0.5113 (1.2193) time: 4.5870 data: 0.0002 max mem: 51710
[03:20:00.672666] Epoch: [0] [1430/1504] lr: 0.000026 closs: 1.0889 (1.1329) grad_norm: 0.5113 (1.2193) time: 4.5972 data: 0.0002 max mem: 51710
[03:20:46.446086] Epoch: [0] [1440/1504] lr: 0.000026 closs: 1.1014 (1.1328) grad_norm: 0.5015 (1.2012) time: 4.5924 data: 0.0002 max mem: 51710
[03:21:32.538074] Epoch: [0] [1450/1504] lr: 0.000026 closs: 1.1070 (1.1324) grad_norm: 0.5015 (1.2012) time: 4.5932 data: 0.0002 max mem: 51710
[03:22:18.282330] Epoch: [0] [1460/1504] lr: 0.000026 closs: 1.0970 (1.1322) grad_norm: 0.5015 (1.2012) time: 4.5917 data: 0.0002 max mem: 51710
[03:23:04.012022] Epoch: [0] [1470/1504] lr: 0.000026 closs: 1.0970 (1.1320) grad_norm: 0.5015 (1.2012) time: 4.5735 data: 0.0002 max mem: 51710
[03:23:49.951215] Epoch: [0] [1480/1504] lr: 0.000026 closs: 1.0800 (1.1317) grad_norm: 0.4995 (1.1826) time: 4.5833 data: 0.0002 max mem: 51710
[03:24:35.668475] Epoch: [0] [1490/1504] lr: 0.000026 closs: 1.0779 (1.1312) grad_norm: 0.4995 (1.1826) time: 4.5827 data: 0.0002 max mem: 51710
[03:25:21.899546] Epoch: [0] [1500/1504] lr: 0.000026 closs: 1.0741 (1.1309) grad_norm: 0.4995 (1.1826) time: 4.5973 data: 0.0003 max mem: 51710
[03:25:36.277038] Epoch: [0] Total time: 1:52:18
[03:25:36.360888] Averaged stats: lr: 0.000026 closs: 1.0759 (1.1304) grad_norm: 0.4759 (1.1655)
[03:29:07.917266] optimizer saved
[03:29:07.921978] other rank-common saved
[03:29:07.929714] rank-specific saved
[03:29:07.942103] log_dir: ./output_dir
[03:29:13.729627] Epoch: [1] [0/1504] lr: 0.000025 closs: 1.0072 (1.0072) time: 5.7865 data: 1.2589 max mem: 51710
[03:29:59.478134] Epoch: [1] [10/1504] lr: 0.000025 closs: 1.0340 (1.0370) time: 4.6849 data: 0.1147 max mem: 51710
[03:30:45.570149] Epoch: [1] [20/1504] lr: 0.000025 closs: 1.0340 (1.0384) time: 4.5919 data: 0.0002 max mem: 51710
[03:31:31.198840] Epoch: [1] [30/1504] lr: 0.000025 closs: 1.0396 (1.0409) time: 4.5859 data: 0.0002 max mem: 51710
[03:32:17.224065] Epoch: [1] [40/1504] lr: 0.000025 closs: 1.0549 (1.0420) grad_norm: 0.3791 (0.3791) time: 4.5826 data: 0.0002 max mem: 51710
[03:33:03.150098] Epoch: [1] [50/1504] lr: 0.000025 closs: 1.0231 (1.0365) grad_norm: 0.3791 (0.3791) time: 4.5975 data: 0.0002 max mem: 51710
[03:33:49.261754] Epoch: [1] [60/1504] lr: 0.000025 closs: 1.0178 (1.0327) grad_norm: 0.3791 (0.3791) time: 4.6018 data: 0.0002 max mem: 51710
[03:34:32.049065] Epoch: [1] [70/1504] lr: 0.000025 closs: 0.9453 (1.0075) grad_norm: 0.3407 (0.3599) time: 4.4448 data: 0.0003 max mem: 51710
[03:35:13.979108] Epoch: [1] [80/1504] lr: 0.000025 closs: 0.7916 (0.9819) grad_norm: 0.3407 (0.3599) time: 4.2357 data: 0.0003 max mem: 51710
[03:35:55.460760] Epoch: [1] [90/1504] lr: 0.000025 closs: 0.7915 (0.9615) grad_norm: 0.3407 (0.3599) time: 4.1705 data: 0.0002 max mem: 51710
[03:36:37.246122] Epoch: [1] [100/1504] lr: 0.000025 closs: 0.7880 (0.9452) grad_norm: 0.3791 (0.4475) time: 4.1633 data: 0.0002 max mem: 51710
[03:37:18.893773] Epoch: [1] [110/1504] lr: 0.000025 closs: 0.7880 (0.9336) grad_norm: 0.3791 (0.4475) time: 4.1715 data: 0.0002 max mem: 51710
[03:38:00.339433] Epoch: [1] [120/1504] lr: 0.000025 closs: 0.8209 (0.9240) grad_norm: 0.3791 (0.4475) time: 4.1545 data: 0.0002 max mem: 51710
[03:38:41.988212] Epoch: [1] [130/1504] lr: 0.000024 closs: 0.7977 (0.9143) grad_norm: 0.3791 (0.4388) time: 4.1546 data: 0.0002 max mem: 51710
[03:39:23.859124] Epoch: [1] [140/1504] lr: 0.000024 closs: 0.7851 (0.9054) grad_norm: 0.3791 (0.4388) time: 4.1759 data: 0.0002 max mem: 51710
[03:40:05.482266] Epoch: [1] [150/1504] lr: 0.000024 closs: 0.7809 (0.8976) grad_norm: 0.3791 (0.4388) time: 4.1746 data: 0.0002 max mem: 51710
[03:40:47.730197] Epoch: [1] [160/1504] lr: 0.000024 closs: 0.7870 (0.8928) grad_norm: 0.3791 (0.4082) time: 4.1934 data: 0.0002 max mem: 51710
[03:41:33.436296] Epoch: [1] [170/1504] lr: 0.000024 closs: 0.9871 (0.9031) grad_norm: 0.3791 (0.4082) time: 4.3976 data: 0.0002 max mem: 51710
[03:42:19.337006] Epoch: [1] [180/1504] lr: 0.000024 closs: 1.0519 (0.9103) grad_norm: 0.3791 (0.4082) time: 4.5802 data: 0.0002 max mem: 51710
[03:43:05.040038] Epoch: [1] [190/1504] lr: 0.000024 closs: 1.0405 (0.9181) grad_norm: 0.3791 (0.4082) time: 4.5801 data: 0.0002 max mem: 51710
[03:43:50.818086] Epoch: [1] [200/1504] lr: 0.000024 closs: 1.0391 (0.9236) grad_norm: 0.3791 (0.4761) time: 4.5740 data: 0.0002 max mem: 51710
[03:44:37.143693] Epoch: [1] [210/1504] lr: 0.000024 closs: 1.0218 (0.9274) grad_norm: 0.3791 (0.4761) time: 4.6051 data: 0.0002 max mem: 51710
[03:45:23.004394] Epoch: [1] [220/1504] lr: 0.000024 closs: 1.0186 (0.9312) grad_norm: 0.3791 (0.4761) time: 4.6092 data: 0.0002 max mem: 51710
[03:46:08.776124] Epoch: [1] [230/1504] lr: 0.000024 closs: 1.0275 (0.9358) grad_norm: 0.4130 (0.4722) time: 4.5815 data: 0.0002 max mem: 51710
[03:46:54.468595] Epoch: [1] [240/1504] lr: 0.000024 closs: 1.0346 (0.9392) grad_norm: 0.4130 (0.4722) time: 4.5731 data: 0.0002 max mem: 51710
[03:47:40.183430] Epoch: [1] [250/1504] lr: 0.000024 closs: 1.0068 (0.9420) grad_norm: 0.4130 (0.4722) time: 4.5703 data: 0.0002 max mem: 51710
[03:48:26.297609] Epoch: [1] [260/1504] lr: 0.000023 closs: 1.0068 (0.9451) grad_norm: 0.4130 (0.4817) time: 4.5914 data: 0.0002 max mem: 51710
[03:49:12.024875] Epoch: [1] [270/1504] lr: 0.000023 closs: 1.0232 (0.9479) grad_norm: 0.4130 (0.4817) time: 4.5920 data: 0.0002 max mem: 51710
[03:49:58.268822] Epoch: [1] [280/1504] lr: 0.000023 closs: 1.0167 (0.9506) grad_norm: 0.4130 (0.4817) time: 4.5985 data: 0.0002 max mem: 51710
[03:50:43.985122] Epoch: [1] [290/1504] lr: 0.000023 closs: 1.0125 (0.9527) grad_norm: 0.4487 (0.4881) time: 4.5979 data: 0.0002 max mem: 51710
[03:51:29.632624] Epoch: [1] [300/1504] lr: 0.000023 closs: 1.0099 (0.9544) grad_norm: 0.4487 (0.4881) time: 4.5681 data: 0.0002 max mem: 51710
[03:52:15.462589] Epoch: [1] [310/1504] lr: 0.000023 closs: 1.0209 (0.9564) grad_norm: 0.4487 (0.4881) time: 4.5738 data: 0.0002 max mem: 51710
[03:53:01.162949] Epoch: [1] [320/1504] lr: 0.000023 closs: 0.9765 (0.9570) grad_norm: 0.4337 (0.4826) time: 4.5764 data: 0.0002 max mem: 51710
[03:53:42.747086] Epoch: [1] [330/1504] lr: 0.000023 closs: 0.8785 (0.9529) grad_norm: 0.4337 (0.4826) time: 4.3641 data: 0.0002 max mem: 51710
[03:54:24.285651] Epoch: [1] [340/1504] lr: 0.000023 closs: 0.8090 (0.9488) grad_norm: 0.4337 (0.4826) time: 4.1560 data: 0.0002 max mem: 51710
[03:55:06.042333] Epoch: [1] [350/1504] lr: 0.000023 closs: 0.8304 (0.9458) grad_norm: 0.4337 (0.4826) time: 4.1646 data: 0.0002 max mem: 51710
[03:55:51.341660] Epoch: [1] [360/1504] lr: 0.000022 closs: 0.8927 (0.9478) grad_norm: 0.4487 (0.5018) time: 4.3527 data: 0.0002 max mem: 51710
[03:56:36.973428] Epoch: [1] [370/1504] lr: 0.000022 closs: 1.0269 (0.9494) grad_norm: 0.4487 (0.5018) time: 4.5464 data: 0.0002 max mem: 51710
[03:57:22.758825] Epoch: [1] [380/1504] lr: 0.000022 closs: 1.0231 (0.9514) grad_norm: 0.4487 (0.5018) time: 4.5708 data: 0.0002 max mem: 51710
[03:58:08.851928] Epoch: [1] [390/1504] lr: 0.000022 closs: 1.0343 (0.9541) grad_norm: 0.4487 (0.5230) time: 4.5938 data: 0.0002 max mem: 51710
[03:58:54.901330] Epoch: [1] [400/1504] lr: 0.000022 closs: 1.0354 (0.9561) grad_norm: 0.4487 (0.5230) time: 4.6070 data: 0.0002 max mem: 51710
[03:59:40.533551] Epoch: [1] [410/1504] lr: 0.000022 closs: 1.0294 (0.9576) grad_norm: 0.4487 (0.5230) time: 4.5840 data: 0.0002 max mem: 51710
[04:00:26.255471] Epoch: [1] [420/1504] lr: 0.000022 closs: 1.0194 (0.9591) grad_norm: 0.5391 (0.5255) time: 4.5676 data: 0.0002 max mem: 51710
[04:01:12.028110] Epoch: [1] [430/1504] lr: 0.000022 closs: 1.0061 (0.9597) grad_norm: 0.5391 (0.5255) time: 4.5746 data: 0.0002 max mem: 51710
[04:01:57.808245] Epoch: [1] [440/1504] lr: 0.000022 closs: 0.9886 (0.9608) grad_norm: 0.5391 (0.5255) time: 4.5776 data: 0.0002 max mem: 51710
[04:02:43.660850] Epoch: [1] [450/1504] lr: 0.000022 closs: 1.0134 (0.9620) grad_norm: 0.4487 (0.5191) time: 4.5816 data: 0.0002 max mem: 51710
[04:03:29.609946] Epoch: [1] [460/1504] lr: 0.000022 closs: 1.0187 (0.9629) grad_norm: 0.4487 (0.5191) time: 4.5900 data: 0.0002 max mem: 51710
[04:04:15.479485] Epoch: [1] [470/1504] lr: 0.000022 closs: 1.0187 (0.9641) grad_norm: 0.4487 (0.5191) time: 4.5908 data: 0.0002 max mem: 51710
[04:05:01.210057] Epoch: [1] [480/1504] lr: 0.000021 closs: 1.0095 (0.9650) grad_norm: 0.4769 (0.5163) time: 4.5799 data: 0.0002 max mem: 51710
[04:05:47.006816] Epoch: [1] [490/1504] lr: 0.000021 closs: 1.0095 (0.9661) grad_norm: 0.4769 (0.5163) time: 4.5763 data: 0.0002 max mem: 51710
[04:06:33.005069] Epoch: [1] [500/1504] lr: 0.000021 closs: 1.0219 (0.9670) grad_norm: 0.4769 (0.5163) time: 4.5897 data: 0.0002 max mem: 51710
[04:07:18.826298] Epoch: [1] [510/1504] lr: 0.000021 closs: 1.0020 (0.9678) grad_norm: 0.4769 (0.5163) time: 4.5909 data: 0.0002 max mem: 51710
[04:08:04.590892] Epoch: [1] [520/1504] lr: 0.000021 closs: 1.0042 (0.9688) grad_norm: 0.4487 (0.5099) time: 4.5792 data: 0.0002 max mem: 51710
[04:08:50.399797] Epoch: [1] [530/1504] lr: 0.000021 closs: 1.0087 (0.9697) grad_norm: 0.4487 (0.5099) time: 4.5785 data: 0.0002 max mem: 51710
[04:09:36.425124] Epoch: [1] [540/1504] lr: 0.000021 closs: 1.0067 (0.9705) grad_norm: 0.4487 (0.5099) time: 4.5916 data: 0.0002 max mem: 51710
[04:10:22.103740] Epoch: [1] [550/1504] lr: 0.000021 closs: 1.0067 (0.9714) grad_norm: 0.4487 (0.5045) time: 4.5851 data: 0.0002 max mem: 51710
[04:11:08.089807] Epoch: [1] [560/1504] lr: 0.000021 closs: 1.0038 (0.9720) grad_norm: 0.4487 (0.5045) time: 4.5831 data: 0.0002 max mem: 51710
[04:11:53.794739] Epoch: [1] [570/1504] lr: 0.000021 closs: 0.9925 (0.9724) grad_norm: 0.4487 (0.5045) time: 4.5845 data: 0.0002 max mem: 51710
[04:12:37.466841] Epoch: [1] [580/1504] lr: 0.000021 closs: 0.9850 (0.9716) grad_norm: 0.4357 (0.4989) time: 4.4688 data: 0.0002 max mem: 51710
[04:13:19.231480] Epoch: [1] [590/1504] lr: 0.000021 closs: 0.8275 (0.9685) grad_norm: 0.4357 (0.4989) time: 4.2717 data: 0.0002 max mem: 51710
[04:14:00.650759] Epoch: [1] [600/1504] lr: 0.000021 closs: 0.7838 (0.9652) grad_norm: 0.4357 (0.4989) time: 4.1591 data: 0.0002 max mem: 51710
[04:14:43.779355] Epoch: [1] [610/1504] lr: 0.000020 closs: 0.7911 (0.9630) grad_norm: 0.4487 (0.4975) time: 4.2273 data: 0.0002 max mem: 51710
[04:15:29.632892] Epoch: [1] [620/1504] lr: 0.000020 closs: 0.9626 (0.9636) grad_norm: 0.4487 (0.4975) time: 4.4490 data: 0.0002 max mem: 51710
[04:16:15.575821] Epoch: [1] [630/1504] lr: 0.000020 closs: 1.0050 (0.9645) grad_norm: 0.4487 (0.4975) time: 4.5897 data: 0.0002 max mem: 51710
[04:17:00.903247] Epoch: [1] [640/1504] lr: 0.000020 closs: 1.0236 (0.9653) grad_norm: 0.4487 (0.5016) time: 4.5634 data: 0.0002 max mem: 51710
[04:17:42.461340] Epoch: [1] [650/1504] lr: 0.000020 closs: 0.9004 (0.9629) grad_norm: 0.4487 (0.5016) time: 4.3442 data: 0.0002 max mem: 51710
[04:18:24.149056] Epoch: [1] [660/1504] lr: 0.000020 closs: 0.7941 (0.9603) grad_norm: 0.4487 (0.5016) time: 4.1622 data: 0.0002 max mem: 51710
[04:19:05.552642] Epoch: [1] [670/1504] lr: 0.000020 closs: 0.7675 (0.9573) grad_norm: 0.4487 (0.5016) time: 4.1544 data: 0.0001 max mem: 51710
[04:19:51.259648] Epoch: [1] [680/1504] lr: 0.000020 closs: 0.8071 (0.9576) grad_norm: 0.4487 (0.4892) time: 4.3554 data: 0.0002 max mem: 51710
[04:20:37.069611] Epoch: [1] [690/1504] lr: 0.000020 closs: 1.0059 (0.9587) grad_norm: 0.4487 (0.4892) time: 4.5758 data: 0.0002 max mem: 51710
[04:21:22.878381] Epoch: [1] [700/1504] lr: 0.000020 closs: 1.0078 (0.9591) grad_norm: 0.4487 (0.4892) time: 4.5808 data: 0.0002 max mem: 51710
[04:22:05.601950] Epoch: [1] [710/1504] lr: 0.000019 closs: 0.9670 (0.9579) grad_norm: 0.4487 (0.4853) time: 4.4265 data: 0.0002 max mem: 51710
[04:22:47.138170] Epoch: [1] [720/1504] lr: 0.000019 closs: 0.7891 (0.9554) grad_norm: 0.4487 (0.4853) time: 4.2129 data: 0.0002 max mem: 51710
[04:23:28.669547] Epoch: [1] [730/1504] lr: 0.000019 closs: 0.7856 (0.9532) grad_norm: 0.4487 (0.4853) time: 4.1533 data: 0.0002 max mem: 51710
[04:24:12.271524] Epoch: [1] [740/1504] lr: 0.000019 closs: 0.7931 (0.9523) grad_norm: 0.4357 (0.4780) time: 4.2566 data: 0.0002 max mem: 51710
[04:24:58.060641] Epoch: [1] [750/1504] lr: 0.000019 closs: 0.9850 (0.9530) grad_norm: 0.4357 (0.4780) time: 4.4694 data: 0.0002 max mem: 51710
[04:25:43.976338] Epoch: [1] [760/1504] lr: 0.000019 closs: 1.0115 (0.9538) grad_norm: 0.4357 (0.4780) time: 4.5852 data: 0.0002 max mem: 51710
[04:26:29.889000] Epoch: [1] [770/1504] lr: 0.000019 closs: 0.9954 (0.9542) grad_norm: 0.4357 (0.4757) time: 4.5913 data: 0.0002 max mem: 51710
[04:27:15.664111] Epoch: [1] [780/1504] lr: 0.000019 closs: 0.9810 (0.9547) grad_norm: 0.4357 (0.4757) time: 4.5843 data: 0.0002 max mem: 51710
[04:28:01.288860] Epoch: [1] [790/1504] lr: 0.000019 closs: 0.9845 (0.9552) grad_norm: 0.4357 (0.4757) time: 4.5699 data: 0.0002 max mem: 51710
[04:28:46.548885] Epoch: [1] [800/1504] lr: 0.000018 closs: 1.0017 (0.9554) grad_norm: 0.4357 (0.4698) time: 4.5441 data: 0.0002 max mem: 51710
[04:29:28.250181] Epoch: [1] [810/1504] lr: 0.000018 closs: 0.8264 (0.9534) grad_norm: 0.4357 (0.4698) time: 4.3480 data: 0.0002 max mem: 51710
[04:30:09.773073] Epoch: [1] [820/1504] lr: 0.000018 closs: 0.7911 (0.9513) grad_norm: 0.4357 (0.4698) time: 4.1611 data: 0.0002 max mem: 51710
[04:30:51.540908] Epoch: [1] [830/1504] lr: 0.000018 closs: 0.7911 (0.9493) grad_norm: 0.4357 (0.4698) time: 4.1645 data: 0.0002 max mem: 51710
[04:31:37.232293] Epoch: [1] [840/1504] lr: 0.000018 closs: 0.8154 (0.9498) grad_norm: 0.4337 (0.4634) time: 4.3729 data: 0.0002 max mem: 51710
[04:32:23.071573] Epoch: [1] [850/1504] lr: 0.000018 closs: 1.0115 (0.9504) grad_norm: 0.4337 (0.4634) time: 4.5764 data: 0.0002 max mem: 51710
[04:33:08.795956] Epoch: [1] [860/1504] lr: 0.000018 closs: 1.0039 (0.9510) grad_norm: 0.4337 (0.4634) time: 4.5781 data: 0.0002 max mem: 51710
[04:33:51.621331] Epoch: [1] [870/1504] lr: 0.000018 closs: 0.9575 (0.9497) grad_norm: 0.4212 (0.4599) time: 4.4274 data: 0.0002 max mem: 51710
[04:34:33.021656] Epoch: [1] [880/1504] lr: 0.000018 closs: 0.7566 (0.9475) grad_norm: 0.4212 (0.4599) time: 4.2112 data: 0.0002 max mem: 51710
[04:35:14.602334] Epoch: [1] [890/1504] lr: 0.000018 closs: 0.7661 (0.9457) grad_norm: 0.4212 (0.4599) time: 4.1490 data: 0.0002 max mem: 51710
[04:35:56.454596] Epoch: [1] [900/1504] lr: 0.000018 closs: 0.7862 (0.9439) grad_norm: 0.4173 (0.4522) time: 4.1716 data: 0.0002 max mem: 51710
[04:36:38.224774] Epoch: [1] [910/1504] lr: 0.000018 closs: 0.7912 (0.9420) grad_norm: 0.4173 (0.4522) time: 4.1810 data: 0.0002 max mem: 51710
[04:37:19.822099] Epoch: [1] [920/1504] lr: 0.000018 closs: 0.7880 (0.9404) grad_norm: 0.4173 (0.4522)
time: 4.1683 data: 0.0002 max mem: 51710 [04:38:02.616680] Epoch: [1] [930/1504] lr: 0.000017 closs: 0.7821 (0.9393) grad_norm: 0.4143 (0.4459) time: 4.2195 data: 0.0002 max mem: 51710 [04:38:48.474213] Epoch: [1] [940/1504] lr: 0.000017 closs: 0.9247 (0.9398) grad_norm: 0.4143 (0.4459) time: 4.4325 data: 0.0002 max mem: 51710 [04:39:34.140096] Epoch: [1] [950/1504] lr: 0.000017 closs: 1.0032 (0.9405) grad_norm: 0.4143 (0.4459) time: 4.5761 data: 0.0002 max mem: 51710 [04:40:20.062057] Epoch: [1] [960/1504] lr: 0.000017 closs: 0.9919 (0.9411) grad_norm: 0.4143 (0.4452) time: 4.5793 data: 0.0002 max mem: 51710 [04:41:05.808839] Epoch: [1] [970/1504] lr: 0.000017 closs: 0.9924 (0.9417) grad_norm: 0.4143 (0.4452) time: 4.5834 data: 0.0002 max mem: 51710 [04:41:51.842402] Epoch: [1] [980/1504] lr: 0.000017 closs: 0.9928 (0.9422) grad_norm: 0.4143 (0.4452) time: 4.5889 data: 0.0002 max mem: 51710 [04:42:38.005711] Epoch: [1] [990/1504] lr: 0.000017 closs: 1.0095 (0.9430) grad_norm: 0.4143 (0.4452) time: 4.6098 data: 0.0002 max mem: 51710 [04:43:23.704302] Epoch: [1] [1000/1504] lr: 0.000017 closs: 1.0103 (0.9434) grad_norm: 0.4037 (0.4404) time: 4.5930 data: 0.0002 max mem: 51710 [04:44:09.946022] Epoch: [1] [1010/1504] lr: 0.000017 closs: 0.9920 (0.9441) grad_norm: 0.4037 (0.4404) time: 4.5969 data: 0.0002 max mem: 51710 [04:44:55.652116] Epoch: [1] [1020/1504] lr: 0.000017 closs: 0.9880 (0.9446) grad_norm: 0.4037 (0.4404) time: 4.5973 data: 0.0002 max mem: 51710 [04:45:41.567057] Epoch: [1] [1030/1504] lr: 0.000016 closs: 0.9887 (0.9450) grad_norm: 0.4031 (0.4381) time: 4.5809 data: 0.0002 max mem: 51710 [04:46:27.236398] Epoch: [1] [1040/1504] lr: 0.000016 closs: 1.0038 (0.9457) grad_norm: 0.4031 (0.4381) time: 4.5791 data: 0.0002 max mem: 51710 [04:47:13.111650] Epoch: [1] [1050/1504] lr: 0.000016 closs: 1.0038 (0.9460) grad_norm: 0.4031 (0.4381) time: 4.5771 data: 0.0002 max mem: 51710 [04:47:59.161651] Epoch: [1] [1060/1504] lr: 0.000016 closs: 0.9789 (0.9466) 
grad_norm: 0.3690 (0.4353) time: 4.5962 data: 0.0002 max mem: 51710 [04:48:44.863531] Epoch: [1] [1070/1504] lr: 0.000016 closs: 0.9969 (0.9472) grad_norm: 0.3690 (0.4353) time: 4.5875 data: 0.0002 max mem: 51710 [04:49:31.051492] Epoch: [1] [1080/1504] lr: 0.000016 closs: 1.0014 (0.9477) grad_norm: 0.3690 (0.4353) time: 4.5944 data: 0.0002 max mem: 51710 [04:50:16.762838] Epoch: [1] [1090/1504] lr: 0.000016 closs: 0.9922 (0.9482) grad_norm: 0.3656 (0.4325) time: 4.5949 data: 0.0002 max mem: 51710 [04:51:02.580568] Epoch: [1] [1100/1504] lr: 0.000016 closs: 0.9994 (0.9488) grad_norm: 0.3656 (0.4325) time: 4.5763 data: 0.0002 max mem: 51710 [04:51:48.424432] Epoch: [1] [1110/1504] lr: 0.000016 closs: 1.0002 (0.9491) grad_norm: 0.3656 (0.4325) time: 4.5830 data: 0.0002 max mem: 51710 [04:52:33.728334] Epoch: [1] [1120/1504] lr: 0.000015 closs: 0.9888 (0.9494) grad_norm: 0.3656 (0.4307) time: 4.5573 data: 0.0002 max mem: 51710 [04:53:15.478799] Epoch: [1] [1130/1504] lr: 0.000015 closs: 0.8650 (0.9480) grad_norm: 0.3656 (0.4307) time: 4.3526 data: 0.0002 max mem: 51710 [04:53:57.079405] Epoch: [1] [1140/1504] lr: 0.000015 closs: 0.7868 (0.9467) grad_norm: 0.3656 (0.4307) time: 4.1675 data: 0.0002 max mem: 51710 [04:54:38.632661] Epoch: [1] [1150/1504] lr: 0.000015 closs: 0.7785 (0.9451) grad_norm: 0.3656 (0.4307) time: 4.1576 data: 0.0002 max mem: 51710 [04:55:23.924599] Epoch: [1] [1160/1504] lr: 0.000015 closs: 0.8491 (0.9453) grad_norm: 0.3478 (0.4284) time: 4.3422 data: 0.0002 max mem: 51710 [04:56:09.850361] Epoch: [1] [1170/1504] lr: 0.000015 closs: 0.9720 (0.9456) grad_norm: 0.3478 (0.4284) time: 4.5607 data: 0.0002 max mem: 51710 [04:56:55.918281] Epoch: [1] [1180/1504] lr: 0.000015 closs: 0.9755 (0.9459) grad_norm: 0.3478 (0.4284) time: 4.5996 data: 0.0002 max mem: 51710 [04:57:41.812891] Epoch: [1] [1190/1504] lr: 0.000015 closs: 0.9677 (0.9462) grad_norm: 0.3458 (0.4256) time: 4.5980 data: 0.0002 max mem: 51710 [04:58:27.437155] Epoch: [1] [1200/1504] lr: 
0.000015 closs: 0.9804 (0.9466) grad_norm: 0.3458 (0.4256) time: 4.5758 data: 0.0002 max mem: 51710 [04:59:13.362485] Epoch: [1] [1210/1504] lr: 0.000015 closs: 0.9804 (0.9468) grad_norm: 0.3458 (0.4256) time: 4.5774 data: 0.0002 max mem: 51710 [04:59:59.221865] Epoch: [1] [1220/1504] lr: 0.000015 closs: 0.9729 (0.9472) grad_norm: 0.3410 (0.4216) time: 4.5891 data: 0.0003 max mem: 51710 [05:00:44.816427] Epoch: [1] [1230/1504] lr: 0.000015 closs: 0.9815 (0.9475) grad_norm: 0.3410 (0.4216) time: 4.5726 data: 0.0003 max mem: 51710 [05:01:30.785949] Epoch: [1] [1240/1504] lr: 0.000015 closs: 0.9815 (0.9480) grad_norm: 0.3410 (0.4216) time: 4.5781 data: 0.0003 max mem: 51710 [05:02:16.760814] Epoch: [1] [1250/1504] lr: 0.000014 closs: 0.9903 (0.9482) grad_norm: 0.3297 (0.4169) time: 4.5971 data: 0.0003 max mem: 51710 [05:03:02.439461] Epoch: [1] [1260/1504] lr: 0.000014 closs: 0.9749 (0.9484) grad_norm: 0.3297 (0.4169) time: 4.5826 data: 0.0003 max mem: 51710 [05:03:48.242935] Epoch: [1] [1270/1504] lr: 0.000014 closs: 0.9720 (0.9487) grad_norm: 0.3297 (0.4169) time: 4.5739 data: 0.0003 max mem: 51710 [05:04:34.156474] Epoch: [1] [1280/1504] lr: 0.000014 closs: 0.9807 (0.9489) grad_norm: 0.3249 (0.4128) time: 4.5857 data: 0.0002 max mem: 51710 [05:05:19.974515] Epoch: [1] [1290/1504] lr: 0.000014 closs: 0.9882 (0.9491) grad_norm: 0.3249 (0.4128) time: 4.5865 data: 0.0003 max mem: 51710 [05:06:05.855140] Epoch: [1] [1300/1504] lr: 0.000014 closs: 0.9882 (0.9495) grad_norm: 0.3249 (0.4128) time: 4.5848 data: 0.0002 max mem: 51710 [05:06:51.717930] Epoch: [1] [1310/1504] lr: 0.000014 closs: 0.9870 (0.9496) grad_norm: 0.3249 (0.4128) time: 4.5871 data: 0.0002 max mem: 51710 [05:07:37.662670] Epoch: [1] [1320/1504] lr: 0.000014 closs: 0.9816 (0.9499) grad_norm: 0.3249 (0.4087) time: 4.5903 data: 0.0003 max mem: 51710 [05:08:23.423518] Epoch: [1] [1330/1504] lr: 0.000014 closs: 0.9842 (0.9502) grad_norm: 0.3249 (0.4087) time: 4.5851 data: 0.0003 max mem: 51710 
[05:09:09.407723] Epoch: [1] [1340/1504] lr: 0.000014 closs: 0.9874 (0.9505) grad_norm: 0.3249 (0.4087) time: 4.5871 data: 0.0002 max mem: 51710 [05:09:55.395044] Epoch: [1] [1350/1504] lr: 0.000013 closs: 0.9826 (0.9507) grad_norm: 0.3184 (0.4049) time: 4.5984 data: 0.0002 max mem: 51710 [05:10:41.348894] Epoch: [1] [1360/1504] lr: 0.000013 closs: 0.9916 (0.9511) grad_norm: 0.3184 (0.4049) time: 4.5969 data: 0.0003 max mem: 51710 [05:11:26.949029] Epoch: [1] [1370/1504] lr: 0.000013 closs: 1.0128 (0.9516) grad_norm: 0.3184 (0.4049) time: 4.5776 data: 0.0003 max mem: 51710 [05:12:12.801339] Epoch: [1] [1380/1504] lr: 0.000013 closs: 1.0022 (0.9519) grad_norm: 0.3028 (0.4008) time: 4.5725 data: 0.0003 max mem: 51710 [05:12:58.643486] Epoch: [1] [1390/1504] lr: 0.000013 closs: 0.9800 (0.9520) grad_norm: 0.3028 (0.4008) time: 4.5846 data: 0.0003 max mem: 51710 [05:13:44.525506] Epoch: [1] [1400/1504] lr: 0.000013 closs: 0.9825 (0.9524) grad_norm: 0.3028 (0.4008) time: 4.5861 data: 0.0003 max mem: 51710 [05:14:30.362250] Epoch: [1] [1410/1504] lr: 0.000013 closs: 0.9943 (0.9527) grad_norm: 0.2983 (0.3970) time: 4.5858 data: 0.0003 max mem: 51710 [05:15:16.024105] Epoch: [1] [1420/1504] lr: 0.000013 closs: 0.9888 (0.9530) grad_norm: 0.2983 (0.3970) time: 4.5748 data: 0.0003 max mem: 51710 [05:16:02.057287] Epoch: [1] [1430/1504] lr: 0.000013 closs: 0.9711 (0.9531) grad_norm: 0.2983 (0.3970) time: 4.5846 data: 0.0003 max mem: 51710 [05:16:47.823915] Epoch: [1] [1440/1504] lr: 0.000013 closs: 0.9694 (0.9533) grad_norm: 0.2735 (0.3936) time: 4.5898 data: 0.0003 max mem: 51710 [05:17:33.719233] Epoch: [1] [1450/1504] lr: 0.000013 closs: 0.9694 (0.9535) grad_norm: 0.2735 (0.3936) time: 4.5829 data: 0.0003 max mem: 51710 [05:18:19.518523] Epoch: [1] [1460/1504] lr: 0.000013 closs: 0.9909 (0.9538) grad_norm: 0.2735 (0.3936) time: 4.5846 data: 0.0002 max mem: 51710 [05:19:05.128443] Epoch: [1] [1470/1504] lr: 0.000013 closs: 0.9846 (0.9539) grad_norm: 0.2735 (0.3936) time: 
4.5704 data: 0.0002 max mem: 51710 [05:19:47.352788] Epoch: [1] [1480/1504] lr: 0.000012 closs: 0.9233 (0.9531) grad_norm: 0.2695 (0.3902) time: 4.3916 data: 0.0002 max mem: 51710 [05:20:29.012456] Epoch: [1] [1490/1504] lr: 0.000012 closs: 0.8147 (0.9521) grad_norm: 0.2695 (0.3902) time: 4.1941 data: 0.0002 max mem: 51710 [05:21:10.402311] Epoch: [1] [1500/1504] lr: 0.000012 closs: 0.7704 (0.9509) grad_norm: 0.2695 (0.3902) time: 4.1524 data: 0.0003 max mem: 51710 [05:21:23.350951] Epoch: [1] Total time: 1:52:15 [05:21:23.351943] Averaged stats: lr: 0.000012 closs: 0.7626 (0.9498) grad_norm: 0.2695 (0.3900) [05:22:31.064605] model saved [05:25:00.133942] optimizer saved [05:25:00.138234] other rank-common saved [05:25:00.148210] rank-specific saved [05:25:00.165526] log_dir: ./output_dir [05:25:06.033566] Epoch: [2] [0/1504] lr: 0.000012 closs: 0.9249 (0.9249) time: 5.8674 data: 1.3491 max mem: 51710 [05:25:52.210291] Epoch: [2] [10/1504] lr: 0.000012 closs: 0.9249 (0.9352) time: 4.7312 data: 0.1228 max mem: 51710 [05:26:38.097359] Epoch: [2] [20/1504] lr: 0.000012 closs: 0.9240 (0.9312) time: 4.6031 data: 0.0002 max mem: 51710 [05:27:23.812437] Epoch: [2] [30/1504] lr: 0.000012 closs: 0.9175 (0.9249) time: 4.5800 data: 0.0002 max mem: 51710 [05:28:05.711620] Epoch: [2] [40/1504] lr: 0.000012 closs: 0.8764 (0.8756) grad_norm: 0.3828 (0.3828) time: 4.3806 data: 0.0002 max mem: 51710 [05:28:47.298010] Epoch: [2] [50/1504] lr: 0.000012 closs: 0.7257 (0.8453) grad_norm: 0.3828 (0.3828) time: 4.1742 data: 0.0002 max mem: 51710 [05:29:29.085215] Epoch: [2] [60/1504] lr: 0.000012 closs: 0.7186 (0.8192) grad_norm: 0.3828 (0.3828) time: 4.1686 data: 0.0002 max mem: 51710 [05:30:13.713792] Epoch: [2] [70/1504] lr: 0.000012 closs: 0.7056 (0.8220) grad_norm: 0.3828 (0.4040) time: 4.3207 data: 0.0002 max mem: 51710 [05:30:59.945537] Epoch: [2] [80/1504] lr: 0.000012 closs: 0.9058 (0.8334) grad_norm: 0.3828 (0.4040) time: 4.5429 data: 0.0002 max mem: 51710 [05:31:46.114146] 
Epoch: [2] [90/1504] lr: 0.000012 closs: 0.9193 (0.8419) grad_norm: 0.3828 (0.4040) time: 4.6199 data: 0.0002 max mem: 51710 [05:32:32.039507] Epoch: [2] [100/1504] lr: 0.000011 closs: 0.9180 (0.8482) grad_norm: 0.3828 (0.3885) time: 4.6046 data: 0.0002 max mem: 51710 [05:33:17.655921] Epoch: [2] [110/1504] lr: 0.000011 closs: 0.9177 (0.8540) grad_norm: 0.3828 (0.3885) time: 4.5770 data: 0.0002 max mem: 51710 [05:34:03.289153] Epoch: [2] [120/1504] lr: 0.000011 closs: 0.9190 (0.8593) grad_norm: 0.3828 (0.3885) time: 4.5624 data: 0.0002 max mem: 51710 [05:34:49.172170] Epoch: [2] [130/1504] lr: 0.000011 closs: 0.9049 (0.8618) grad_norm: 0.3692 (0.3837) time: 4.5757 data: 0.0002 max mem: 51710 [05:35:34.809288] Epoch: [2] [140/1504] lr: 0.000011 closs: 0.9002 (0.8652) grad_norm: 0.3692 (0.3837) time: 4.5759 data: 0.0002 max mem: 51710 [05:36:20.618478] Epoch: [2] [150/1504] lr: 0.000011 closs: 0.9152 (0.8684) grad_norm: 0.3692 (0.3837) time: 4.5722 data: 0.0002 max mem: 51710 [05:37:07.353946] Epoch: [2] [160/1504] lr: 0.000011 closs: 0.9152 (0.8717) grad_norm: 0.3692 (0.3685) time: 4.6272 data: 0.0001 max mem: 51710 [05:37:53.135519] Epoch: [2] [170/1504] lr: 0.000011 closs: 0.8994 (0.8726) grad_norm: 0.3692 (0.3685) time: 4.6258 data: 0.0001 max mem: 51710 [05:38:38.765868] Epoch: [2] [180/1504] lr: 0.000011 closs: 0.8976 (0.8745) grad_norm: 0.3692 (0.3685) time: 4.5705 data: 0.0001 max mem: 51710 [05:39:24.601266] Epoch: [2] [190/1504] lr: 0.000011 closs: 0.9066 (0.8765) grad_norm: 0.3692 (0.3685) time: 4.5732 data: 0.0001 max mem: 51710 [05:40:10.565141] Epoch: [2] [200/1504] lr: 0.000011 closs: 0.9142 (0.8788) grad_norm: 0.3577 (0.3561) time: 4.5899 data: 0.0002 max mem: 51710 [05:40:56.212651] Epoch: [2] [210/1504] lr: 0.000011 closs: 0.9149 (0.8806) grad_norm: 0.3577 (0.3561) time: 4.5805 data: 0.0002 max mem: 51710 [05:41:42.031997] Epoch: [2] [220/1504] lr: 0.000011 closs: 0.9230 (0.8830) grad_norm: 0.3577 (0.3561) time: 4.5733 data: 0.0002 max mem: 51710 
[05:42:24.945543] Epoch: [2] [230/1504] lr: 0.000010 closs: 0.8895 (0.8777) grad_norm: 0.3577 (0.3461) time: 4.4366 data: 0.0002 max mem: 51710 [05:43:06.701094] Epoch: [2] [240/1504] lr: 0.000010 closs: 0.7001 (0.8706) grad_norm: 0.3577 (0.3461) time: 4.2334 data: 0.0002 max mem: 51710 [05:43:48.457529] Epoch: [2] [250/1504] lr: 0.000010 closs: 0.6857 (0.8629) grad_norm: 0.3577 (0.3461) time: 4.1755 data: 0.0001 max mem: 51710 [05:44:30.155206] Epoch: [2] [260/1504] lr: 0.000010 closs: 0.6804 (0.8564) grad_norm: 0.3577 (0.3714) time: 4.1726 data: 0.0001 max mem: 51710 [05:45:11.530649] Epoch: [2] [270/1504] lr: 0.000010 closs: 0.6921 (0.8505) grad_norm: 0.3577 (0.3714) time: 4.1536 data: 0.0001 max mem: 51710 [05:45:53.271429] Epoch: [2] [280/1504] lr: 0.000010 closs: 0.6659 (0.8436) grad_norm: 0.3577 (0.3714) time: 4.1557 data: 0.0001 max mem: 51710 [05:46:34.755582] Epoch: [2] [290/1504] lr: 0.000010 closs: 0.6621 (0.8375) grad_norm: 0.3577 (0.3602) time: 4.1612 data: 0.0001 max mem: 51710 [05:47:16.131149] Epoch: [2] [300/1504] lr: 0.000010 closs: 0.6641 (0.8327) grad_norm: 0.3577 (0.3602) time: 4.1429 data: 0.0001 max mem: 51710 [05:47:57.786406] Epoch: [2] [310/1504] lr: 0.000010 closs: 0.6770 (0.8273) grad_norm: 0.3577 (0.3602) time: 4.1514 data: 0.0001 max mem: 51710 [05:48:39.972787] Epoch: [2] [320/1504] lr: 0.000010 closs: 0.6949 (0.8248) grad_norm: 0.3259 (0.3568) time: 4.1920 data: 0.0002 max mem: 51710 [05:49:25.798635] Epoch: [2] [330/1504] lr: 0.000010 closs: 0.8502 (0.8274) grad_norm: 0.3259 (0.3568) time: 4.4005 data: 0.0002 max mem: 51710 [05:50:11.624892] Epoch: [2] [340/1504] lr: 0.000010 closs: 0.9181 (0.8303) grad_norm: 0.3259 (0.3568) time: 4.5825 data: 0.0002 max mem: 51710 [05:50:57.234059] Epoch: [2] [350/1504] lr: 0.000010 closs: 0.9181 (0.8329) grad_norm: 0.3259 (0.3568) time: 4.5717 data: 0.0002 max mem: 51710 [05:51:43.350910] Epoch: [2] [360/1504] lr: 0.000009 closs: 0.9179 (0.8354) grad_norm: 0.3577 (0.3877) time: 4.5862 data: 
0.0002 max mem: 51710 [05:52:29.020706] Epoch: [2] [370/1504] lr: 0.000009 closs: 0.9326 (0.8377) grad_norm: 0.3577 (0.3877) time: 4.5892 data: 0.0002 max mem: 51710 [05:53:14.826590] Epoch: [2] [380/1504] lr: 0.000009 closs: 0.9297 (0.8398) grad_norm: 0.3577 (0.3877) time: 4.5737 data: 0.0002 max mem: 51710 [05:54:00.592920] Epoch: [2] [390/1504] lr: 0.000009 closs: 0.9293 (0.8419) grad_norm: 0.3577 (0.3981) time: 4.5785 data: 0.0002 max mem: 51710 [05:54:46.440978] Epoch: [2] [400/1504] lr: 0.000009 closs: 0.9244 (0.8436) grad_norm: 0.3577 (0.3981) time: 4.5806 data: 0.0002 max mem: 51710 [05:55:32.264994] Epoch: [2] [410/1504] lr: 0.000009 closs: 0.9118 (0.8454) grad_norm: 0.3577 (0.3981) time: 4.5835 data: 0.0002 max mem: 51710 [05:56:18.451289] Epoch: [2] [420/1504] lr: 0.000009 closs: 0.9079 (0.8467) grad_norm: 0.3692 (0.4000) time: 4.6004 data: 0.0002 max mem: 51710 [05:57:04.316582] Epoch: [2] [430/1504] lr: 0.000009 closs: 0.8959 (0.8483) grad_norm: 0.3692 (0.4000) time: 4.6025 data: 0.0002 max mem: 51710 [05:57:50.008176] Epoch: [2] [440/1504] lr: 0.000009 closs: 0.9195 (0.8500) grad_norm: 0.3692 (0.4000) time: 4.5777 data: 0.0002 max mem: 51710 [05:58:35.782767] Epoch: [2] [450/1504] lr: 0.000009 closs: 0.9187 (0.8514) grad_norm: 0.3692 (0.3987) time: 4.5732 data: 0.0002 max mem: 51710 [05:59:21.655290] Epoch: [2] [460/1504] lr: 0.000009 closs: 0.9056 (0.8530) grad_norm: 0.3692 (0.3987) time: 4.5823 data: 0.0002 max mem: 51710 [06:00:07.326533] Epoch: [2] [470/1504] lr: 0.000009 closs: 0.9179 (0.8546) grad_norm: 0.3692 (0.3987) time: 4.5771 data: 0.0002 max mem: 51710 [06:00:53.419812] Epoch: [2] [480/1504] lr: 0.000008 closs: 0.9184 (0.8558) grad_norm: 0.3826 (0.3992) time: 4.5881 data: 0.0002 max mem: 51710 [06:01:39.459063] Epoch: [2] [490/1504] lr: 0.000008 closs: 0.9097 (0.8570) grad_norm: 0.3826 (0.3992) time: 4.6065 data: 0.0002 max mem: 51710 [06:02:25.396556] Epoch: [2] [500/1504] lr: 0.000008 closs: 0.8990 (0.8580) grad_norm: 0.3826 (0.3992) 
time: 4.5987 data: 0.0002 max mem: 51710 [06:03:11.248982] Epoch: [2] [510/1504] lr: 0.000008 closs: 0.8990 (0.8591) grad_norm: 0.3826 (0.3992) time: 4.5894 data: 0.0001 max mem: 51710 [06:03:57.181231] Epoch: [2] [520/1504] lr: 0.000008 closs: 0.8890 (0.8599) grad_norm: 0.3826 (0.4037) time: 4.5891 data: 0.0001 max mem: 51710 [06:04:42.872151] Epoch: [2] [530/1504] lr: 0.000008 closs: 0.8890 (0.8605) grad_norm: 0.3826 (0.4037) time: 4.5811 data: 0.0002 max mem: 51710 [06:05:28.798368] Epoch: [2] [540/1504] lr: 0.000008 closs: 0.8999 (0.8616) grad_norm: 0.3826 (0.4037) time: 4.5808 data: 0.0002 max mem: 51710 [06:06:14.910040] Epoch: [2] [550/1504] lr: 0.000008 closs: 0.9169 (0.8629) grad_norm: 0.3828 (0.4046) time: 4.6018 data: 0.0002 max mem: 51710 [06:07:00.616616] Epoch: [2] [560/1504] lr: 0.000008 closs: 0.9169 (0.8637) grad_norm: 0.3828 (0.4046) time: 4.5908 data: 0.0002 max mem: 51710 [06:07:46.295037] Epoch: [2] [570/1504] lr: 0.000008 closs: 0.9152 (0.8646) grad_norm: 0.3828 (0.4046) time: 4.5692 data: 0.0002 max mem: 51710 [06:08:32.636553] Epoch: [2] [580/1504] lr: 0.000008 closs: 0.9155 (0.8653) grad_norm: 0.3826 (0.4024) time: 4.6009 data: 0.0001 max mem: 51710 [06:09:18.613557] Epoch: [2] [590/1504] lr: 0.000008 closs: 0.8844 (0.8657) grad_norm: 0.3826 (0.4024) time: 4.6158 data: 0.0001 max mem: 51710 [06:10:04.204644] Epoch: [2] [600/1504] lr: 0.000008 closs: 0.8851 (0.8663) grad_norm: 0.3826 (0.4024) time: 4.5783 data: 0.0002 max mem: 51710 [06:10:50.140576] Epoch: [2] [610/1504] lr: 0.000008 closs: 0.9041 (0.8669) grad_norm: 0.3826 (0.3979) time: 4.5763 data: 0.0002 max mem: 51710 [06:11:35.836722] Epoch: [2] [620/1504] lr: 0.000008 closs: 0.9052 (0.8675) grad_norm: 0.3826 (0.3979) time: 4.5815 data: 0.0002 max mem: 51710 [06:12:21.688039] Epoch: [2] [630/1504] lr: 0.000008 closs: 0.9128 (0.8680) grad_norm: 0.3826 (0.3979) time: 4.5773 data: 0.0002 max mem: 51710 [06:13:07.055878] Epoch: [2] [640/1504] lr: 0.000008 closs: 0.9050 (0.8684) grad_norm: 
0.3692 (0.3923) time: 4.5609 data: 0.0002 max mem: 51710 [06:13:48.994094] Epoch: [2] [650/1504] lr: 0.000008 closs: 0.7517 (0.8655) grad_norm: 0.3692 (0.3923) time: 4.3652 data: 0.0001 max mem: 51710 [06:14:30.553163] Epoch: [2] [660/1504] lr: 0.000008 closs: 0.6655 (0.8627) grad_norm: 0.3692 (0.3923) time: 4.1748 data: 0.0001 max mem: 51710 [06:15:12.177839] Epoch: [2] [670/1504] lr: 0.000008 closs: 0.7011 (0.8602) grad_norm: 0.3692 (0.3923) time: 4.1591 data: 0.0001 max mem: 51710 [06:15:57.679154] Epoch: [2] [680/1504] lr: 0.000007 closs: 0.7504 (0.8606) grad_norm: 0.3692 (0.3938) time: 4.3562 data: 0.0001 max mem: 51710 [06:16:43.300569] Epoch: [2] [690/1504] lr: 0.000007 closs: 0.9047 (0.8612) grad_norm: 0.3692 (0.3938) time: 4.5560 data: 0.0001 max mem: 51710 [06:17:29.100806] Epoch: [2] [700/1504] lr: 0.000007 closs: 0.9051 (0.8619) grad_norm: 0.3692 (0.3938) time: 4.5710 data: 0.0002 max mem: 51710 [06:18:14.981297] Epoch: [2] [710/1504] lr: 0.000007 closs: 0.9077 (0.8626) grad_norm: 0.3692 (0.3942) time: 4.5839 data: 0.0002 max mem: 51710 [06:19:00.595034] Epoch: [2] [720/1504] lr: 0.000007 closs: 0.8909 (0.8630) grad_norm: 0.3692 (0.3942) time: 4.5746 data: 0.0002 max mem: 51710 [06:19:46.342372] Epoch: [2] [730/1504] lr: 0.000007 closs: 0.8936 (0.8635) grad_norm: 0.3692 (0.3942) time: 4.5679 data: 0.0002 max mem: 51710 [06:20:32.564629] Epoch: [2] [740/1504] lr: 0.000007 closs: 0.8885 (0.8639) grad_norm: 0.3795 (0.3935) time: 4.5984 data: 0.0002 max mem: 51710 [06:21:18.374706] Epoch: [2] [750/1504] lr: 0.000007 closs: 0.8785 (0.8642) grad_norm: 0.3795 (0.3935) time: 4.6015 data: 0.0002 max mem: 51710 [06:22:04.214782] Epoch: [2] [760/1504] lr: 0.000007 closs: 0.9097 (0.8650) grad_norm: 0.3795 (0.3935) time: 4.5824 data: 0.0001 max mem: 51710 [06:22:50.011569] Epoch: [2] [770/1504] lr: 0.000007 closs: 0.9019 (0.8652) grad_norm: 0.3795 (0.3906) time: 4.5818 data: 0.0001 max mem: 51710 [06:23:36.048087] Epoch: [2] [780/1504] lr: 0.000007 closs: 0.8827 
(0.8656) grad_norm: 0.3795 (0.3906) time: 4.5916 data: 0.0002 max mem: 51710 [06:24:21.913752] Epoch: [2] [790/1504] lr: 0.000007 closs: 0.9135 (0.8663) grad_norm: 0.3795 (0.3906) time: 4.5950 data: 0.0002 max mem: 51710 [06:25:07.565895] Epoch: [2] [800/1504] lr: 0.000007 closs: 0.9105 (0.8662) grad_norm: 0.3795 (0.3865) time: 4.5758 data: 0.0002 max mem: 51710 [06:25:49.122179] Epoch: [2] [810/1504] lr: 0.000007 closs: 0.7177 (0.8640) grad_norm: 0.3795 (0.3865) time: 4.3603 data: 0.0002 max mem: 51710 [06:26:30.537123] Epoch: [2] [820/1504] lr: 0.000007 closs: 0.6724 (0.8615) grad_norm: 0.3795 (0.3865) time: 4.1485 data: 0.0001 max mem: 51710 [06:27:11.952825] Epoch: [2] [830/1504] lr: 0.000007 closs: 0.6724 (0.8596) grad_norm: 0.3795 (0.3865) time: 4.1414 data: 0.0001 max mem: 51710 [06:27:57.523698] Epoch: [2] [840/1504] lr: 0.000007 closs: 0.8017 (0.8601) grad_norm: 0.3795 (0.3818) time: 4.3492 data: 0.0002 max mem: 51710 [06:28:43.387460] Epoch: [2] [850/1504] lr: 0.000007 closs: 0.9038 (0.8608) grad_norm: 0.3795 (0.3818) time: 4.5716 data: 0.0002 max mem: 51710 [06:29:29.208961] Epoch: [2] [860/1504] lr: 0.000007 closs: 0.9012 (0.8612) grad_norm: 0.3795 (0.3818) time: 4.5842 data: 0.0002 max mem: 51710 [06:30:14.891827] Epoch: [2] [870/1504] lr: 0.000006 closs: 0.9012 (0.8619) grad_norm: 0.3795 (0.3794) time: 4.5751 data: 0.0002 max mem: 51710 [06:31:00.861353] Epoch: [2] [880/1504] lr: 0.000006 closs: 0.9106 (0.8622) grad_norm: 0.3795 (0.3794) time: 4.5825 data: 0.0002 max mem: 51710 [06:31:46.679214] Epoch: [2] [890/1504] lr: 0.000006 closs: 0.9051 (0.8626) grad_norm: 0.3795 (0.3794) time: 4.5893 data: 0.0002 max mem: 51710 [06:32:30.614001] Epoch: [2] [900/1504] lr: 0.000006 closs: 0.8931 (0.8619) grad_norm: 0.3661 (0.3756) time: 4.4875 data: 0.0002 max mem: 51710 [06:33:12.143381] Epoch: [2] [910/1504] lr: 0.000006 closs: 0.6940 (0.8599) grad_norm: 0.3661 (0.3756) time: 4.2731 data: 0.0002 max mem: 51710 [06:33:53.492707] Epoch: [2] [920/1504] lr: 
0.000006 closs: 0.6870 (0.8582) grad_norm: 0.3661 (0.3756) time: 4.1438 data: 0.0002 max mem: 51710 [06:34:36.556522] Epoch: [2] [930/1504] lr: 0.000006 closs: 0.7075 (0.8572) grad_norm: 0.3661 (0.3714) time: 4.2206 data: 0.0002 max mem: 51710 [06:35:22.393208] Epoch: [2] [940/1504] lr: 0.000006 closs: 0.8815 (0.8577) grad_norm: 0.3661 (0.3714) time: 4.4449 data: 0.0002 max mem: 51710 [06:36:08.011978] Epoch: [2] [950/1504] lr: 0.000006 closs: 0.8997 (0.8581) grad_norm: 0.3661 (0.3714) time: 4.5727 data: 0.0002 max mem: 51710 [06:36:54.100215] Epoch: [2] [960/1504] lr: 0.000006 closs: 0.9030 (0.8586) grad_norm: 0.3661 (0.3699) time: 4.5853 data: 0.0002 max mem: 51710 [06:37:39.692404] Epoch: [2] [970/1504] lr: 0.000006 closs: 0.9100 (0.8591) grad_norm: 0.3661 (0.3699) time: 4.5839 data: 0.0002 max mem: 51710 [06:38:25.536929] Epoch: [2] [980/1504] lr: 0.000006 closs: 0.9295 (0.8597) grad_norm: 0.3661 (0.3699) time: 4.5717 data: 0.0002 max mem: 51710 [06:39:11.388741] Epoch: [2] [990/1504] lr: 0.000006 closs: 0.9278 (0.8604) grad_norm: 0.3661 (0.3699) time: 4.5847 data: 0.0002 max mem: 51710 [06:39:57.134285] Epoch: [2] [1000/1504] lr: 0.000006 closs: 0.9118 (0.8608) grad_norm: 0.3268 (0.3679) time: 4.5798 data: 0.0002 max mem: 51710 [06:40:42.803200] Epoch: [2] [1010/1504] lr: 0.000006 closs: 0.8957 (0.8612) grad_norm: 0.3268 (0.3679) time: 4.5706 data: 0.0002 max mem: 51710 [06:41:28.664597] Epoch: [2] [1020/1504] lr: 0.000006 closs: 0.8947 (0.8616) grad_norm: 0.3268 (0.3679) time: 4.5764 data: 0.0002 max mem: 51710 [06:42:15.117044] Epoch: [2] [1030/1504] lr: 0.000006 closs: 0.9004 (0.8620) grad_norm: 0.3240 (0.3641) time: 4.6156 data: 0.0002 max mem: 51710 [06:43:00.928553] Epoch: [2] [1040/1504] lr: 0.000006 closs: 0.9004 (0.8624) grad_norm: 0.3240 (0.3641) time: 4.6131 data: 0.0002 max mem: 51710 [06:43:46.942742] Epoch: [2] [1050/1504] lr: 0.000006 closs: 0.9114 (0.8630) grad_norm: 0.3240 (0.3641) time: 4.5912 data: 0.0002 max mem: 51710 [06:44:32.652469] 
Epoch: [2] [1060/1504] lr: 0.000006 closs: 0.9122 (0.8635) grad_norm: 0.3166 (0.3602) time: 4.5861 data: 0.0002 max mem: 51710 [06:45:18.358274] Epoch: [2] [1070/1504] lr: 0.000006 closs: 0.9079 (0.8638) grad_norm: 0.3166 (0.3602) time: 4.5707 data: 0.0002 max mem: 51710 [06:46:03.963217] Epoch: [2] [1080/1504] lr: 0.000006 closs: 0.9079 (0.8643) grad_norm: 0.3166 (0.3602) time: 4.5655 data: 0.0002 max mem: 51710 [06:46:48.715529] Epoch: [2] [1090/1504] lr: 0.000006 closs: 0.8991 (0.8640) grad_norm: 0.3158 (0.3571) time: 4.5178 data: 0.0002 max mem: 51710 [06:47:30.302242] Epoch: [2] [1100/1504] lr: 0.000006 closs: 0.7102 (0.8624) grad_norm: 0.3158 (0.3571) time: 4.3168 data: 0.0002 max mem: 51710 [06:48:12.193895] Epoch: [2] [1110/1504] lr: 0.000006 closs: 0.6913 (0.8609) grad_norm: 0.3158 (0.3571) time: 4.1738 data: 0.0001 max mem: 51710 [06:48:54.047929] Epoch: [2] [1120/1504] lr: 0.000006 closs: 0.6842 (0.8593) grad_norm: 0.3158 (0.3560) time: 4.1872 data: 0.0002 max mem: 51710 [06:49:40.022238] Epoch: [2] [1130/1504] lr: 0.000006 closs: 0.8650 (0.8599) grad_norm: 0.3158 (0.3560) time: 4.3913 data: 0.0002 max mem: 51710 [06:50:25.718754] Epoch: [2] [1140/1504] lr: 0.000006 closs: 0.9237 (0.8603) grad_norm: 0.3158 (0.3560) time: 4.5835 data: 0.0002 max mem: 51710 [06:51:11.410615] Epoch: [2] [1150/1504] lr: 0.000006 closs: 0.9123 (0.8608) grad_norm: 0.3158 (0.3560) time: 4.5693 data: 0.0002 max mem: 51710 [06:51:53.294075] Epoch: [2] [1160/1504] lr: 0.000005 closs: 0.8985 (0.8598) grad_norm: 0.3088 (0.3539) time: 4.3787 data: 0.0002 max mem: 51710 [06:52:34.720208] Epoch: [2] [1170/1504] lr: 0.000005 closs: 0.7217 (0.8585) grad_norm: 0.3088 (0.3539) time: 4.1654 data: 0.0002 max mem: 51710 [06:53:16.642525] Epoch: [2] [1180/1504] lr: 0.000005 closs: 0.6747 (0.8569) grad_norm: 0.3088 (0.3539) time: 4.1673 data: 0.0001 max mem: 51710 [06:53:58.569882] Epoch: [2] [1190/1504] lr: 0.000005 closs: 0.6744 (0.8555) grad_norm: 0.2875 (0.3504) time: 4.1924 data: 0.0002 
max mem: 51710 [06:54:40.306054] Epoch: [2] [1200/1504] lr: 0.000005 closs: 0.6892 (0.8540) grad_norm: 0.2875 (0.3504) time: 4.1831 data: 0.0002 max mem: 51710 [06:55:21.875547] Epoch: [2] [1210/1504] lr: 0.000005 closs: 0.6778 (0.8525) grad_norm: 0.2875 (0.3504) time: 4.1652 data: 0.0002 max mem: 51710 [06:56:05.380644] Epoch: [2] [1220/1504] lr: 0.000005 closs: 0.7021 (0.8520) grad_norm: 0.2857 (0.3462) time: 4.2536 data: 0.0002 max mem: 51710 [06:56:50.959357] Epoch: [2] [1230/1504] lr: 0.000005 closs: 0.8716 (0.8524) grad_norm: 0.2857 (0.3462) time: 4.4541 data: 0.0002 max mem: 51710 [06:57:36.560891] Epoch: [2] [1240/1504] lr: 0.000005 closs: 0.9205 (0.8529) grad_norm: 0.2857 (0.3462) time: 4.5589 data: 0.0002 max mem: 51710 [06:58:22.582086] Epoch: [2] [1250/1504] lr: 0.000005 closs: 0.9072 (0.8533) grad_norm: 0.2857 (0.3450) time: 4.5810 data: 0.0002 max mem: 51710 [06:59:08.345922] Epoch: [2] [1260/1504] lr: 0.000005 closs: 0.9062 (0.8538) grad_norm: 0.2857 (0.3450) time: 4.5892 data: 0.0002 max mem: 51710 [06:59:54.244709] Epoch: [2] [1270/1504] lr: 0.000005 closs: 0.9130 (0.8542) grad_norm: 0.2857 (0.3450) time: 4.5830 data: 0.0002 max mem: 51710 [07:00:40.591132] Epoch: [2] [1280/1504] lr: 0.000005 closs: 0.8806 (0.8546) grad_norm: 0.2799 (0.3429) time: 4.6122 data: 0.0002 max mem: 51710 [07:01:26.274730] Epoch: [2] [1290/1504] lr: 0.000005 closs: 0.8761 (0.8549) grad_norm: 0.2799 (0.3429) time: 4.6014 data: 0.0002 max mem: 51710 [07:02:11.989442] Epoch: [2] [1300/1504] lr: 0.000005 closs: 0.9099 (0.8554) grad_norm: 0.2799 (0.3429) time: 4.5698 data: 0.0002 max mem: 51710 [07:02:57.644517] Epoch: [2] [1310/1504] lr: 0.000005 closs: 0.9099 (0.8558) grad_norm: 0.2799 (0.3429) time: 4.5684 data: 0.0002 max mem: 51710 [07:03:43.427708] Epoch: [2] [1320/1504] lr: 0.000005 closs: 0.8722 (0.8560) grad_norm: 0.2733 (0.3405) time: 4.5718 data: 0.0002 max mem: 51710 [07:04:29.253501] Epoch: [2] [1330/1504] lr: 0.000005 closs: 0.8803 (0.8564) grad_norm: 0.2733 
(0.3405) time: 4.5804 data: 0.0002 max mem: 51710 [07:05:15.251180] Epoch: [2] [1340/1504] lr: 0.000005 closs: 0.9316 (0.8570) grad_norm: 0.2733 (0.3405) time: 4.5911 data: 0.0001 max mem: 51710 [07:06:01.535692] Epoch: [2] [1350/1504] lr: 0.000005 closs: 0.9115 (0.8574) grad_norm: 0.2637 (0.3385) time: 4.6140 data: 0.0002 max mem: 51710 [07:06:47.386768] Epoch: [2] [1360/1504] lr: 0.000005 closs: 0.9110 (0.8578) grad_norm: 0.2637 (0.3385) time: 4.6067 data: 0.0002 max mem: 51710 [07:07:33.033070] Epoch: [2] [1370/1504] lr: 0.000005 closs: 0.9067 (0.8582) grad_norm: 0.2637 (0.3385) time: 4.5748 data: 0.0001 max mem: 51710 [07:08:18.955419] Epoch: [2] [1380/1504] lr: 0.000005 closs: 0.8988 (0.8585) grad_norm: 0.2614 (0.3361) time: 4.5784 data: 0.0002 max mem: 51710 [07:09:04.667593] Epoch: [2] [1390/1504] lr: 0.000005 closs: 0.8906 (0.8589) grad_norm: 0.2614 (0.3361) time: 4.5816 data: 0.0002 max mem: 51710 [07:09:50.325903] Epoch: [2] [1400/1504] lr: 0.000005 closs: 0.8887 (0.8591) grad_norm: 0.2614 (0.3361) time: 4.5684 data: 0.0002 max mem: 51710 [07:10:34.793249] Epoch: [2] [1410/1504] lr: 0.000005 closs: 0.8887 (0.8590) grad_norm: 0.2537 (0.3336) time: 4.5062 data: 0.0002 max mem: 51710 [07:11:16.568745] Epoch: [2] [1420/1504] lr: 0.000005 closs: 0.7251 (0.8578) grad_norm: 0.2537 (0.3336) time: 4.3120 data: 0.0002 max mem: 51710 [07:11:58.450335] Epoch: [2] [1430/1504] lr: 0.000005 closs: 0.6839 (0.8565) grad_norm: 0.2537 (0.3336) time: 4.1828 data: 0.0001 max mem: 51710 [07:12:40.570389] Epoch: [2] [1440/1504] lr: 0.000005 closs: 0.6994 (0.8556) grad_norm: 0.2537 (0.3342) time: 4.2000 data: 0.0002 max mem: 51710 [07:13:26.227325] Epoch: [2] [1450/1504] lr: 0.000005 closs: 0.8798 (0.8560) grad_norm: 0.2537 (0.3342) time: 4.3888 data: 0.0002 max mem: 51710 [07:14:11.971575] Epoch: [2] [1460/1504] lr: 0.000005 closs: 0.8964 (0.8564) grad_norm: 0.2537 (0.3342) time: 4.5700 data: 0.0001 max mem: 51710 [07:14:57.550907] Epoch: [2] [1470/1504] lr: 0.000005 closs: 
0.8964 (0.8568) grad_norm: 0.2537 (0.3342) time: 4.5661 data: 0.0001 max mem: 51710 [07:15:39.400163] Epoch: [2] [1480/1504] lr: 0.000005 closs: 0.8662 (0.8558) grad_norm: 0.2537 (0.3329) time: 4.3713 data: 0.0001 max mem: 51710 [07:16:21.232598] Epoch: [2] [1490/1504] lr: 0.000005 closs: 0.6968 (0.8547) grad_norm: 0.2537 (0.3329) time: 4.1840 data: 0.0001 max mem: 51710 [07:17:02.964445] Epoch: [2] [1500/1504] lr: 0.000005 closs: 0.6923 (0.8536) grad_norm: 0.2537 (0.3329) time: 4.1781 data: 0.0002 max mem: 51710 [07:17:15.926522] Epoch: [2] Total time: 1:52:15 [07:17:15.963773] Averaged stats: lr: 0.000005 closs: 0.6884 (0.8518) grad_norm: 0.2534 (0.3311) [07:18:20.563967] model saved [07:20:32.378761] optimizer saved [07:20:32.382744] other rank-common saved [07:20:32.388955] rank-specific saved [07:20:32.390498] Training time 5:47:14
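To compare your own run against this log, it can help to pull the `closs` values out programmatically. The following is a minimal sketch, not part of the repository: the regular expression and the helper name `parse_closs` are illustrative assumptions based only on the line format visible in the log above. It assumes the first `closs` number is a recent (windowed) value and the number in parentheses is a running average, which is how the values appear to behave in this log.

```python
import re

# Matches one progress entry such as:
# "Epoch: [2] [1500/1504] lr: 0.000005 closs: 0.6923 (0.8536) ..."
# \s+ is used between fields so entries wrapped across lines still match.
ENTRY = re.compile(
    r"Epoch:\s+\[(?P<epoch>\d+)\]\s+\[(?P<it>\d+)/(?P<total>\d+)\]\s+"
    r"lr:\s+(?P<lr>[\d.eE+-]+)\s+"
    r"closs:\s+(?P<recent>[\d.]+)\s+\((?P<avg>[\d.]+)\)"
)


def parse_closs(log_text):
    """Return (epoch, iteration, recent_closs, average_closs) tuples.

    Hypothetical helper: the field names are assumptions about the
    log layout, not an API of the training code.
    """
    rows = []
    for m in ENTRY.finditer(log_text):
        rows.append(
            (int(m["epoch"]), int(m["it"]), float(m["recent"]), float(m["avg"]))
        )
    return rows


sample = (
    "[07:17:02.964445] Epoch: [2] [1500/1504] lr: 0.000005 "
    "closs: 0.6923 (0.8536) grad_norm: 0.2537 (0.3329) "
    "time: 4.1781 data: 0.0002 max mem: 51710"
)
print(parse_closs(sample))
```

Summary lines such as `Averaged stats:` do not carry the `Epoch: [e] [i/N]` prefix, so this pattern skips them.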
This information is really helpful. Much appreciated!
Hi,
I am running main_finetune.py on the llava150k dataset. The training loss is decreasing, but what is an ideal loss value for the model to output sensible answers? For example, the c_loss has decreased from 10.5 to 2.5 after 80K examples in my current training run. Is this rate of change expected/normal?
When I load TensorBoard to view the loss graph, the step value is 510 at the point where 80K of the 150K examples have been trained. What is the numerical relationship between the step value and the number of iterations/examples run?
Thanks for being so patient and helpful with my questions!!
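On the step question, the thread itself does not confirm how the TensorBoard step is computed. One convention used in some training loops is to log against "epoch times 1000" rather than the raw iteration count, so curves stay comparable when the batch size changes. If this codebase follows that convention (an assumption, not something stated here), a step of 510 would mean about 0.51 of an epoch, which is roughly consistent with 80K of 150K examples. The sketch below illustrates that arithmetic; the function names are hypothetical.

```python
def epoch_1000x(iter_in_epoch, iters_per_epoch, epoch):
    """Assumed logging step: fraction of training in epochs, times 1000."""
    return int((iter_in_epoch / iters_per_epoch + epoch) * 1000)


def examples_seen(step, dataset_size):
    """Approximate number of examples processed at a given logged step.

    Only valid under the epoch-times-1000 assumption above.
    """
    return step * dataset_size // 1000


# Step 510 on a dataset of about 150K examples:
print(examples_seen(510, 150_000))  # 76500

# Halfway through the first epoch of a 1504-iteration epoch:
print(epoch_1000x(752, 1504, 0))  # 500
```

If the numbers in your run do not line up with this, check how the step argument is built where the training loop writes its scalars.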