gaopengcuhk / Stable-Pix2Seq

A full-fledged version of Pix2Seq
Apache License 2.0

CUDA Out-of-memory using V100 #14

Open allanj opened 1 year ago

allanj commented 1 year ago

I'm training on V100 GPUs, but the run still hits a CUDA out-of-memory error partway through the first epoch. I'm not sure what the cause is at the moment.


Namespace(aux_loss=True, backbone='resnet50', batch_size=4, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='./coco2017/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=1024, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0005, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='./output', position_embedding='sine', pre_norm=False, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=8)
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/tiger/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:09<00:00, 10.3MB/s]
number of params: 36104659
loading annotations into memory...
Done (t=13.57s)
creating index...
index created!
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
Start training
Epoch: [0]  [   0/3696]  eta: 2:32:25  lr: 0.000100  loss: 7.6000 (7.6000)  at: 7.6000 (7.6000)  at_unscaled: 7.6000 (7.6000)  time: 2.4743  data: 0.5030  max mem: 14737
Epoch: [0]  [  10/3696]  eta: 0:59:14  lr: 0.000100  loss: 7.5261 (7.5307)  at: 7.5261 (7.5307)  at_unscaled: 7.5261 (7.5307)  time: 0.9643  data: 0.0806  max mem: 25656
Epoch: [0]  [  20/3696]  eta: 0:56:49  lr: 0.000100  loss: 7.4746 (7.4774)  at: 7.4746 (7.4774)  at_unscaled: 7.4746 (7.4774)  time: 0.8501  data: 0.0390  max mem: 25656
Epoch: [0]  [  30/3696]  eta: 0:54:22  lr: 0.000100  loss: 7.3449 (7.4215)  at: 7.3449 (7.4215)  at_unscaled: 7.3449 (7.4215)  time: 0.8489  data: 0.0374  max mem: 25656
Epoch: [0]  [  40/3696]  eta: 0:54:59  lr: 0.000100  loss: 7.2054 (7.3429)  at: 7.2054 (7.3429)  at_unscaled: 7.2054 (7.3429)  time: 0.8761  data: 0.0356  max mem: 25656
Epoch: [0]  [  50/3696]  eta: 0:53:30  lr: 0.000100  loss: 7.0288 (7.2657)  at: 7.0288 (7.2657)  at_unscaled: 7.0288 (7.2657)  time: 0.8662  data: 0.0362  max mem: 25656
Epoch: [0]  [  60/3696]  eta: 0:53:44  lr: 0.000100  loss: 6.8423 (7.1774)  at: 6.8423 (7.1774)  at_unscaled: 6.8423 (7.1774)  time: 0.8553  data: 0.0368  max mem: 26623
Epoch: [0]  [  70/3696]  eta: 0:53:36  lr: 0.000100  loss: 6.6867 (7.0967)  at: 6.6867 (7.0967)  at_unscaled: 6.6867 (7.0967)  time: 0.9036  data: 0.0359  max mem: 26623
Epoch: [0]  [  80/3696]  eta: 0:52:42  lr: 0.000100  loss: 6.5043 (7.0184)  at: 6.5043 (7.0184)  at_unscaled: 6.5043 (7.0184)  time: 0.8368  data: 0.0351  max mem: 26623
Epoch: [0]  [  90/3696]  eta: 0:52:17  lr: 0.000100  loss: 6.4531 (6.9577)  at: 6.4531 (6.9577)  at_unscaled: 6.4531 (6.9577)  time: 0.8094  data: 0.0362  max mem: 26623
Epoch: [0]  [ 100/3696]  eta: 0:51:33  lr: 0.000100  loss: 6.4151 (6.8982)  at: 6.4151 (6.8982)  at_unscaled: 6.4151 (6.8982)  time: 0.8019  data: 0.0386  max mem: 26623
Epoch: [0]  [ 110/3696]  eta: 0:51:10  lr: 0.000100  loss: 6.3319 (6.8437)  at: 6.3319 (6.8437)  at_unscaled: 6.3319 (6.8437)  time: 0.7937  data: 0.0392  max mem: 26623
Epoch: [0]  [ 120/3696]  eta: 0:50:56  lr: 0.000100  loss: 6.2714 (6.7969)  at: 6.2714 (6.7969)  at_unscaled: 6.2714 (6.7969)  time: 0.8268  data: 0.0377  max mem: 26623
Epoch: [0]  [ 130/3696]  eta: 0:50:36  lr: 0.000100  loss: 6.2584 (6.7519)  at: 6.2584 (6.7519)  at_unscaled: 6.2584 (6.7519)  time: 0.8254  data: 0.0372  max mem: 26623
Epoch: [0]  [ 140/3696]  eta: 0:50:25  lr: 0.000100  loss: 6.2035 (6.7111)  at: 6.2035 (6.7111)  at_unscaled: 6.2035 (6.7111)  time: 0.8266  data: 0.0372  max mem: 29528
Epoch: [0]  [ 150/3696]  eta: 0:49:55  lr: 0.000100  loss: 6.1476 (6.6716)  at: 6.1476 (6.6716)  at_unscaled: 6.1476 (6.6716)  time: 0.8011  data: 0.0375  max mem: 29528
Epoch: [0]  [ 160/3696]  eta: 0:49:27  lr: 0.000100  loss: 6.0711 (6.6330)  at: 6.0711 (6.6330)  at_unscaled: 6.0711 (6.6330)  time: 0.7585  data: 0.0372  max mem: 29528
Epoch: [0]  [ 170/3696]  eta: 0:49:10  lr: 0.000100  loss: 6.0247 (6.5969)  at: 6.0247 (6.5969)  at_unscaled: 6.0247 (6.5969)  time: 0.7769  data: 0.0358  max mem: 29528
Epoch: [0]  [ 180/3696]  eta: 0:49:27  lr: 0.000100  loss: 5.9822 (6.5631)  at: 5.9822 (6.5631)  at_unscaled: 5.9822 (6.5631)  time: 0.8812  data: 0.0361  max mem: 29528
Epoch: [0]  [ 190/3696]  eta: 0:49:06  lr: 0.000100  loss: 5.9351 (6.5278)  at: 5.9351 (6.5278)  at_unscaled: 5.9351 (6.5278)  time: 0.8712  data: 0.0371  max mem: 29528
Epoch: [0]  [ 200/3696]  eta: 0:48:45  lr: 0.000100  loss: 5.8904 (6.4953)  at: 5.8904 (6.4953)  at_unscaled: 5.8904 (6.4953)  time: 0.7744  data: 0.0355  max mem: 29528
Epoch: [0]  [ 210/3696]  eta: 0:48:35  lr: 0.000100  loss: 5.8645 (6.4635)  at: 5.8645 (6.4635)  at_unscaled: 5.8645 (6.4635)  time: 0.7968  data: 0.0348  max mem: 29528
Epoch: [0]  [ 220/3696]  eta: 0:48:17  lr: 0.000100  loss: 5.8032 (6.4343)  at: 5.8032 (6.4343)  at_unscaled: 5.8032 (6.4343)  time: 0.7998  data: 0.0354  max mem: 29528
Epoch: [0]  [ 230/3696]  eta: 0:47:58  lr: 0.000100  loss: 5.7949 (6.4067)  at: 5.7949 (6.4067)  at_unscaled: 5.7949 (6.4067)  time: 0.7687  data: 0.0362  max mem: 29528
Epoch: [0]  [ 240/3696]  eta: 0:47:45  lr: 0.000100  loss: 5.7568 (6.3776)  at: 5.7568 (6.3776)  at_unscaled: 5.7568 (6.3776)  time: 0.7808  data: 0.0371  max mem: 29528
Epoch: [0]  [ 250/3696]  eta: 0:47:30  lr: 0.000100  loss: 5.7063 (6.3502)  at: 5.7063 (6.3502)  at_unscaled: 5.7063 (6.3502)  time: 0.7889  data: 0.0366  max mem: 29528
Epoch: [0]  [ 260/3696]  eta: 0:47:11  lr: 0.000100  loss: 5.6821 (6.3225)  at: 5.6821 (6.3225)  at_unscaled: 5.6821 (6.3225)  time: 0.7617  data: 0.0362  max mem: 29528
Epoch: [0]  [ 270/3696]  eta: 0:47:00  lr: 0.000100  loss: 5.6091 (6.2965)  at: 5.6091 (6.2965)  at_unscaled: 5.6091 (6.2965)  time: 0.7725  data: 0.0366  max mem: 29528
Epoch: [0]  [ 280/3696]  eta: 0:46:48  lr: 0.000100  loss: 5.6024 (6.2713)  at: 5.6024 (6.2713)  at_unscaled: 5.6024 (6.2713)  time: 0.7982  data: 0.0366  max mem: 29528
Epoch: [0]  [ 290/3696]  eta: 0:46:48  lr: 0.000100  loss: 5.5578 (6.2455)  at: 5.5578 (6.2455)  at_unscaled: 5.5578 (6.2455)  time: 0.8433  data: 0.0370  max mem: 29528
Epoch: [0]  [ 300/3696]  eta: 0:46:36  lr: 0.000100  loss: 5.5396 (6.2221)  at: 5.5396 (6.2221)  at_unscaled: 5.5396 (6.2221)  time: 0.8398  data: 0.0373  max mem: 29528
Epoch: [0]  [ 310/3696]  eta: 0:46:23  lr: 0.000100  loss: 5.5059 (6.1994)  at: 5.5059 (6.1994)  at_unscaled: 5.5059 (6.1994)  time: 0.7842  data: 0.0374  max mem: 29528
Epoch: [0]  [ 320/3696]  eta: 0:46:12  lr: 0.000100  loss: 5.4888 (6.1767)  at: 5.4888 (6.1767)  at_unscaled: 5.4888 (6.1767)  time: 0.7882  data: 0.0370  max mem: 29528
Epoch: [0]  [ 330/3696]  eta: 0:45:58  lr: 0.000100  loss: 5.4756 (6.1560)  at: 5.4756 (6.1560)  at_unscaled: 5.4756 (6.1560)  time: 0.7820  data: 0.0365  max mem: 29528
Epoch: [0]  [ 340/3696]  eta: 0:45:49  lr: 0.000100  loss: 5.4458 (6.1354)  at: 5.4458 (6.1354)  at_unscaled: 5.4458 (6.1354)  time: 0.7886  data: 0.0363  max mem: 29528
Epoch: [0]  [ 350/3696]  eta: 0:45:42  lr: 0.000100  loss: 5.4504 (6.1157)  at: 5.4504 (6.1157)  at_unscaled: 5.4504 (6.1157)  time: 0.8230  data: 0.0364  max mem: 29528
Epoch: [0]  [ 360/3696]  eta: 0:45:34  lr: 0.000100  loss: 5.4683 (6.0973)  at: 5.4683 (6.0973)  at_unscaled: 5.4683 (6.0973)  time: 0.8292  data: 0.0370  max mem: 29528
Epoch: [0]  [ 370/3696]  eta: 0:45:30  lr: 0.000100  loss: 5.4665 (6.0802)  at: 5.4665 (6.0802)  at_unscaled: 5.4665 (6.0802)  time: 0.8410  data: 0.0357  max mem: 29528
Epoch: [0]  [ 380/3696]  eta: 0:45:22  lr: 0.000100  loss: 5.4943 (6.0647)  at: 5.4943 (6.0647)  at_unscaled: 5.4943 (6.0647)  time: 0.8443  data: 0.0360  max mem: 29528
Epoch: [0]  [ 390/3696]  eta: 0:45:13  lr: 0.000100  loss: 5.4801 (6.0489)  at: 5.4801 (6.0489)  at_unscaled: 5.4801 (6.0489)  time: 0.8209  data: 0.0371  max mem: 29528
Epoch: [0]  [ 400/3696]  eta: 0:45:14  lr: 0.000100  loss: 5.4442 (6.0338)  at: 5.4442 (6.0338)  at_unscaled: 5.4442 (6.0338)  time: 0.8706  data: 0.0372  max mem: 29528
Epoch: [0]  [ 410/3696]  eta: 0:45:03  lr: 0.000100  loss: 5.4351 (6.0182)  at: 5.4351 (6.0182)  at_unscaled: 5.4351 (6.0182)  time: 0.8613  data: 0.0376  max mem: 29528
Epoch: [0]  [ 420/3696]  eta: 0:44:50  lr: 0.000100  loss: 5.3845 (6.0028)  at: 5.3845 (6.0028)  at_unscaled: 5.3845 (6.0028)  time: 0.7759  data: 0.0373  max mem: 29528
Epoch: [0]  [ 430/3696]  eta: 0:45:03  lr: 0.000100  loss: 5.3922 (5.9884)  at: 5.3922 (5.9884)  at_unscaled: 5.3922 (5.9884)  time: 0.9318  data: 0.0361  max mem: 29528
Epoch: [0]  [ 440/3696]  eta: 0:44:50  lr: 0.000100  loss: 5.4115 (5.9759)  at: 5.4115 (5.9759)  at_unscaled: 5.4115 (5.9759)  time: 0.9331  data: 0.0361  max mem: 29528
Epoch: [0]  [ 450/3696]  eta: 0:44:43  lr: 0.000100  loss: 5.4180 (5.9631)  at: 5.4180 (5.9631)  at_unscaled: 5.4180 (5.9631)  time: 0.8017  data: 0.0359  max mem: 29528
Epoch: [0]  [ 460/3696]  eta: 0:44:29  lr: 0.000100  loss: 5.3881 (5.9501)  at: 5.3881 (5.9501)  at_unscaled: 5.3881 (5.9501)  time: 0.7948  data: 0.0355  max mem: 29528
Epoch: [0]  [ 470/3696]  eta: 0:44:18  lr: 0.000100  loss: 5.3906 (5.9391)  at: 5.3906 (5.9391)  at_unscaled: 5.3906 (5.9391)  time: 0.7668  data: 0.0371  max mem: 29528
Epoch: [0]  [ 480/3696]  eta: 0:44:10  lr: 0.000100  loss: 5.3906 (5.9277)  at: 5.3906 (5.9277)  at_unscaled: 5.3906 (5.9277)  time: 0.8013  data: 0.0390  max mem: 29528
Epoch: [0]  [ 490/3696]  eta: 0:44:03  lr: 0.000100  loss: 5.4143 (5.9179)  at: 5.4143 (5.9179)  at_unscaled: 5.4143 (5.9179)  time: 0.8300  data: 0.0391  max mem: 29528
Epoch: [0]  [ 500/3696]  eta: 0:43:54  lr: 0.000100  loss: 5.4093 (5.9075)  at: 5.4093 (5.9075)  at_unscaled: 5.4093 (5.9075)  time: 0.8303  data: 0.0378  max mem: 29528
Epoch: [0]  [ 510/3696]  eta: 0:43:43  lr: 0.000100  loss: 5.3890 (5.8972)  at: 5.3890 (5.8972)  at_unscaled: 5.3890 (5.8972)  time: 0.7958  data: 0.0367  max mem: 29528
Epoch: [0]  [ 520/3696]  eta: 0:43:31  lr: 0.000100  loss: 5.3959 (5.8872)  at: 5.3959 (5.8872)  at_unscaled: 5.3959 (5.8872)  time: 0.7730  data: 0.0355  max mem: 29528
Epoch: [0]  [ 530/3696]  eta: 0:43:22  lr: 0.000100  loss: 5.3743 (5.8775)  at: 5.3743 (5.8775)  at_unscaled: 5.3743 (5.8775)  time: 0.7915  data: 0.0358  max mem: 29528
Epoch: [0]  [ 540/3696]  eta: 0:43:12  lr: 0.000100  loss: 5.3725 (5.8675)  at: 5.3725 (5.8675)  at_unscaled: 5.3725 (5.8675)  time: 0.8013  data: 0.0355  max mem: 29528
Epoch: [0]  [ 550/3696]  eta: 0:43:02  lr: 0.000100  loss: 5.3403 (5.8580)  at: 5.3403 (5.8580)  at_unscaled: 5.3403 (5.8580)  time: 0.7922  data: 0.0349  max mem: 29528
Epoch: [0]  [ 560/3696]  eta: 0:42:52  lr: 0.000100  loss: 5.3460 (5.8494)  at: 5.3460 (5.8494)  at_unscaled: 5.3460 (5.8494)  time: 0.7893  data: 0.0355  max mem: 29528
Epoch: [0]  [ 570/3696]  eta: 0:42:43  lr: 0.000100  loss: 5.3509 (5.8408)  at: 5.3509 (5.8408)  at_unscaled: 5.3509 (5.8408)  time: 0.7901  data: 0.0359  max mem: 29528
Epoch: [0]  [ 580/3696]  eta: 0:42:31  lr: 0.000100  loss: 5.3509 (5.8328)  at: 5.3509 (5.8328)  at_unscaled: 5.3509 (5.8328)  time: 0.7762  data: 0.0358  max mem: 29528
Epoch: [0]  [ 590/3696]  eta: 0:42:22  lr: 0.000100  loss: 5.3572 (5.8243)  at: 5.3572 (5.8243)  at_unscaled: 5.3572 (5.8243)  time: 0.7785  data: 0.0351  max mem: 29528
Epoch: [0]  [ 600/3696]  eta: 0:42:11  lr: 0.000100  loss: 5.3541 (5.8163)  at: 5.3541 (5.8163)  at_unscaled: 5.3541 (5.8163)  time: 0.7857  data: 0.0343  max mem: 29528
Epoch: [0]  [ 610/3696]  eta: 0:41:59  lr: 0.000100  loss: 5.3445 (5.8085)  at: 5.3445 (5.8085)  at_unscaled: 5.3445 (5.8085)  time: 0.7585  data: 0.0351  max mem: 29528
Epoch: [0]  [ 620/3696]  eta: 0:41:54  lr: 0.000100  loss: 5.3499 (5.8015)  at: 5.3499 (5.8015)  at_unscaled: 5.3499 (5.8015)  time: 0.8055  data: 0.0354  max mem: 29528
Epoch: [0]  [ 630/3696]  eta: 0:41:42  lr: 0.000100  loss: 5.3499 (5.7940)  at: 5.3499 (5.7940)  at_unscaled: 5.3499 (5.7940)  time: 0.8031  data: 0.0343  max mem: 29528
Epoch: [0]  [ 640/3696]  eta: 0:41:31  lr: 0.000100  loss: 5.3273 (5.7865)  at: 5.3273 (5.7865)  at_unscaled: 5.3273 (5.7865)  time: 0.7553  data: 0.0356  max mem: 29528
Epoch: [0]  [ 650/3696]  eta: 0:41:22  lr: 0.000100  loss: 5.3314 (5.7792)  at: 5.3314 (5.7792)  at_unscaled: 5.3314 (5.7792)  time: 0.7825  data: 0.0378  max mem: 29528
Epoch: [0]  [ 660/3696]  eta: 0:41:16  lr: 0.000100  loss: 5.3259 (5.7719)  at: 5.3259 (5.7719)  at_unscaled: 5.3259 (5.7719)  time: 0.8199  data: 0.0371  max mem: 29528
Epoch: [0]  [ 670/3696]  eta: 0:41:06  lr: 0.000100  loss: 5.2930 (5.7651)  at: 5.2930 (5.7651)  at_unscaled: 5.2930 (5.7651)  time: 0.8170  data: 0.0351  max mem: 29528
Epoch: [0]  [ 680/3696]  eta: 0:40:57  lr: 0.000100  loss: 5.2930 (5.7582)  at: 5.2930 (5.7582)  at_unscaled: 5.2930 (5.7582)  time: 0.7851  data: 0.0354  max mem: 29528
Epoch: [0]  [ 690/3696]  eta: 0:40:49  lr: 0.000100  loss: 5.2727 (5.7514)  at: 5.2727 (5.7514)  at_unscaled: 5.2727 (5.7514)  time: 0.8068  data: 0.0353  max mem: 29528
Epoch: [0]  [ 700/3696]  eta: 0:40:41  lr: 0.000100  loss: 5.2917 (5.7451)  at: 5.2917 (5.7451)  at_unscaled: 5.2917 (5.7451)  time: 0.8184  data: 0.0348  max mem: 29528
Epoch: [0]  [ 710/3696]  eta: 0:40:31  lr: 0.000100  loss: 5.2949 (5.7387)  at: 5.2949 (5.7387)  at_unscaled: 5.2949 (5.7387)  time: 0.7904  data: 0.0358  max mem: 29528
Epoch: [0]  [ 720/3696]  eta: 0:40:21  lr: 0.000100  loss: 5.2874 (5.7325)  at: 5.2874 (5.7325)  at_unscaled: 5.2874 (5.7325)  time: 0.7719  data: 0.0376  max mem: 29528
Epoch: [0]  [ 730/3696]  eta: 0:40:10  lr: 0.000100  loss: 5.2801 (5.7262)  at: 5.2801 (5.7262)  at_unscaled: 5.2801 (5.7262)  time: 0.7581  data: 0.0372  max mem: 29528
Epoch: [0]  [ 740/3696]  eta: 0:40:02  lr: 0.000100  loss: 5.2634 (5.7196)  at: 5.2634 (5.7196)  at_unscaled: 5.2634 (5.7196)  time: 0.7769  data: 0.0357  max mem: 29528
Epoch: [0]  [ 750/3696]  eta: 0:39:53  lr: 0.000100  loss: 5.2367 (5.7135)  at: 5.2367 (5.7135)  at_unscaled: 5.2367 (5.7135)  time: 0.8039  data: 0.0365  max mem: 29528
Epoch: [0]  [ 760/3696]  eta: 0:39:43  lr: 0.000100  loss: 5.2874 (5.7082)  at: 5.2874 (5.7082)  at_unscaled: 5.2874 (5.7082)  time: 0.7800  data: 0.0367  max mem: 29528
Epoch: [0]  [ 770/3696]  eta: 0:39:33  lr: 0.000100  loss: 5.2954 (5.7024)  at: 5.2954 (5.7024)  at_unscaled: 5.2954 (5.7024)  time: 0.7681  data: 0.0356  max mem: 29528
Epoch: [0]  [ 780/3696]  eta: 0:39:23  lr: 0.000100  loss: 5.3127 (5.6975)  at: 5.3127 (5.6975)  at_unscaled: 5.3127 (5.6975)  time: 0.7632  data: 0.0361  max mem: 29528
Epoch: [0]  [ 790/3696]  eta: 0:39:14  lr: 0.000100  loss: 5.3130 (5.6919)  at: 5.3130 (5.6919)  at_unscaled: 5.3130 (5.6919)  time: 0.7715  data: 0.0359  max mem: 29528
Epoch: [0]  [ 800/3696]  eta: 0:39:06  lr: 0.000100  loss: 5.2498 (5.6860)  at: 5.2498 (5.6860)  at_unscaled: 5.2498 (5.6860)  time: 0.7954  data: 0.0369  max mem: 29528
Epoch: [0]  [ 810/3696]  eta: 0:38:58  lr: 0.000100  loss: 5.2336 (5.6804)  at: 5.2336 (5.6804)  at_unscaled: 5.2336 (5.6804)  time: 0.8095  data: 0.0380  max mem: 29528
Epoch: [0]  [ 820/3696]  eta: 0:38:50  lr: 0.000100  loss: 5.2354 (5.6755)  at: 5.2354 (5.6755)  at_unscaled: 5.2354 (5.6755)  time: 0.8130  data: 0.0356  max mem: 29528
Epoch: [0]  [ 830/3696]  eta: 0:38:39  lr: 0.000100  loss: 5.2691 (5.6704)  at: 5.2691 (5.6704)  at_unscaled: 5.2691 (5.6704)  time: 0.7757  data: 0.0355  max mem: 29528
Epoch: [0]  [ 840/3696]  eta: 0:38:31  lr: 0.000100  loss: 5.2588 (5.6653)  at: 5.2588 (5.6653)  at_unscaled: 5.2588 (5.6653)  time: 0.7692  data: 0.0369  max mem: 29528
Epoch: [0]  [ 850/3696]  eta: 0:38:23  lr: 0.000100  loss: 5.2564 (5.6606)  at: 5.2564 (5.6606)  at_unscaled: 5.2564 (5.6606)  time: 0.8133  data: 0.0363  max mem: 29528
Epoch: [0]  [ 860/3696]  eta: 0:38:15  lr: 0.000100  loss: 5.2448 (5.6556)  at: 5.2448 (5.6556)  at_unscaled: 5.2448 (5.6556)  time: 0.8129  data: 0.0352  max mem: 29528
Epoch: [0]  [ 870/3696]  eta: 0:38:05  lr: 0.000100  loss: 5.2326 (5.6506)  at: 5.2326 (5.6506)  at_unscaled: 5.2326 (5.6506)  time: 0.7795  data: 0.0351  max mem: 29528
Epoch: [0]  [ 880/3696]  eta: 0:37:56  lr: 0.000100  loss: 5.2049 (5.6456)  at: 5.2049 (5.6456)  at_unscaled: 5.2049 (5.6456)  time: 0.7750  data: 0.0364  max mem: 29528
Epoch: [0]  [ 890/3696]  eta: 0:37:47  lr: 0.000100  loss: 5.2049 (5.6407)  at: 5.2049 (5.6407)  at_unscaled: 5.2049 (5.6407)  time: 0.7812  data: 0.0367  max mem: 29528
Epoch: [0]  [ 900/3696]  eta: 0:37:37  lr: 0.000100  loss: 5.1690 (5.6354)  at: 5.1690 (5.6354)  at_unscaled: 5.1690 (5.6354)  time: 0.7607  data: 0.0348  max mem: 29528
Epoch: [0]  [ 910/3696]  eta: 0:37:31  lr: 0.000100  loss: 5.1836 (5.6309)  at: 5.1836 (5.6309)  at_unscaled: 5.1836 (5.6309)  time: 0.8035  data: 0.0355  max mem: 29528
Epoch: [0]  [ 920/3696]  eta: 0:37:22  lr: 0.000100  loss: 5.2129 (5.6261)  at: 5.2129 (5.6261)  at_unscaled: 5.2129 (5.6261)  time: 0.8221  data: 0.0381  max mem: 29528
Epoch: [0]  [ 930/3696]  eta: 0:37:13  lr: 0.000100  loss: 5.1586 (5.6210)  at: 5.1586 (5.6210)  at_unscaled: 5.1586 (5.6210)  time: 0.7758  data: 0.0377  max mem: 29528
Epoch: [0]  [ 940/3696]  eta: 0:37:05  lr: 0.000100  loss: 5.1586 (5.6162)  at: 5.1586 (5.6162)  at_unscaled: 5.1586 (5.6162)  time: 0.7975  data: 0.0355  max mem: 29528
Epoch: [0]  [ 950/3696]  eta: 0:36:56  lr: 0.000100  loss: 5.1713 (5.6120)  at: 5.1713 (5.6120)  at_unscaled: 5.1713 (5.6120)  time: 0.7970  data: 0.0358  max mem: 29528
Epoch: [0]  [ 960/3696]  eta: 0:36:47  lr: 0.000100  loss: 5.1839 (5.6077)  at: 5.1839 (5.6077)  at_unscaled: 5.1839 (5.6077)  time: 0.7714  data: 0.0367  max mem: 29528
Epoch: [0]  [ 970/3696]  eta: 0:36:38  lr: 0.000100  loss: 5.1800 (5.6036)  at: 5.1800 (5.6036)  at_unscaled: 5.1800 (5.6036)  time: 0.7812  data: 0.0363  max mem: 29528
Epoch: [0]  [ 980/3696]  eta: 0:36:30  lr: 0.000100  loss: 5.2028 (5.5995)  at: 5.2028 (5.5995)  at_unscaled: 5.2028 (5.5995)  time: 0.7996  data: 0.0349  max mem: 29528
Epoch: [0]  [ 990/3696]  eta: 0:36:23  lr: 0.000100  loss: 5.2028 (5.5954)  at: 5.2028 (5.5954)  at_unscaled: 5.2028 (5.5954)  time: 0.8110  data: 0.0353  max mem: 29528
Epoch: [0]  [1000/3696]  eta: 0:36:14  lr: 0.000100  loss: 5.1880 (5.5914)  at: 5.1880 (5.5914)  at_unscaled: 5.1880 (5.5914)  time: 0.7950  data: 0.0369  max mem: 29528
Epoch: [0]  [1010/3696]  eta: 0:36:04  lr: 0.000100  loss: 5.1773 (5.5870)  at: 5.1773 (5.5870)  at_unscaled: 5.1773 (5.5870)  time: 0.7645  data: 0.0368  max mem: 29528
Epoch: [0]  [1020/3696]  eta: 0:35:57  lr: 0.000100  loss: 5.2493 (5.5836)  at: 5.2493 (5.5836)  at_unscaled: 5.2493 (5.5836)  time: 0.7915  data: 0.0360  max mem: 29528
Epoch: [0]  [1030/3696]  eta: 0:35:49  lr: 0.000100  loss: 5.1982 (5.5793)  at: 5.1982 (5.5793)  at_unscaled: 5.1982 (5.5793)  time: 0.8164  data: 0.0363  max mem: 29528
Epoch: [0]  [1040/3696]  eta: 0:35:41  lr: 0.000100  loss: 5.1446 (5.5754)  at: 5.1446 (5.5754)  at_unscaled: 5.1446 (5.5754)  time: 0.8053  data: 0.0375  max mem: 29528
Epoch: [0]  [1050/3696]  eta: 0:35:31  lr: 0.000100  loss: 5.1319 (5.5714)  at: 5.1319 (5.5714)  at_unscaled: 5.1319 (5.5714)  time: 0.7766  data: 0.0359  max mem: 29528
Epoch: [0]  [1060/3696]  eta: 0:35:22  lr: 0.000100  loss: 5.2017 (5.5679)  at: 5.2017 (5.5679)  at_unscaled: 5.2017 (5.5679)  time: 0.7481  data: 0.0365  max mem: 29528
Epoch: [0]  [1070/3696]  eta: 0:35:13  lr: 0.000100  loss: 5.2017 (5.5642)  at: 5.2017 (5.5642)  at_unscaled: 5.2017 (5.5642)  time: 0.7754  data: 0.0387  max mem: 29528
Epoch: [0]  [1080/3696]  eta: 0:35:03  lr: 0.000100  loss: 5.1192 (5.5603)  at: 5.1192 (5.5603)  at_unscaled: 5.1192 (5.5603)  time: 0.7605  data: 0.0383  max mem: 29528
Epoch: [0]  [1090/3696]  eta: 0:34:56  lr: 0.000100  loss: 5.1105 (5.5560)  at: 5.1105 (5.5560)  at_unscaled: 5.1105 (5.5560)  time: 0.7700  data: 0.0379  max mem: 29528
Epoch: [0]  [1100/3696]  eta: 0:34:47  lr: 0.000100  loss: 5.1321 (5.5524)  at: 5.1321 (5.5524)  at_unscaled: 5.1321 (5.5524)  time: 0.8007  data: 0.0380  max mem: 29528
Epoch: [0]  [1110/3696]  eta: 0:34:39  lr: 0.000100  loss: 5.1603 (5.5489)  at: 5.1603 (5.5489)  at_unscaled: 5.1603 (5.5489)  time: 0.7850  data: 0.0382  max mem: 29528
Epoch: [0]  [1120/3696]  eta: 0:34:30  lr: 0.000100  loss: 5.1443 (5.5452)  at: 5.1443 (5.5452)  at_unscaled: 5.1443 (5.5452)  time: 0.7765  data: 0.0383  max mem: 29528
Epoch: [0]  [1130/3696]  eta: 0:34:21  lr: 0.000100  loss: 5.1185 (5.5413)  at: 5.1185 (5.5413)  at_unscaled: 5.1185 (5.5413)  time: 0.7790  data: 0.0372  max mem: 29528
Epoch: [0]  [1140/3696]  eta: 0:34:13  lr: 0.000100  loss: 5.0800 (5.5374)  at: 5.0800 (5.5374)  at_unscaled: 5.0800 (5.5374)  time: 0.7986  data: 0.0356  max mem: 29528
Epoch: [0]  [1150/3696]  eta: 0:34:04  lr: 0.000100  loss: 5.1101 (5.5337)  at: 5.1101 (5.5337)  at_unscaled: 5.1101 (5.5337)  time: 0.7654  data: 0.0345  max mem: 29528
Epoch: [0]  [1160/3696]  eta: 0:33:56  lr: 0.000100  loss: 5.1744 (5.5307)  at: 5.1744 (5.5307)  at_unscaled: 5.1744 (5.5307)  time: 0.7695  data: 0.0344  max mem: 29528
Epoch: [0]  [1170/3696]  eta: 0:33:47  lr: 0.000100  loss: 5.1829 (5.5277)  at: 5.1829 (5.5277)  at_unscaled: 5.1829 (5.5277)  time: 0.7968  data: 0.0362  max mem: 29528
Epoch: [0]  [1180/3696]  eta: 0:33:40  lr: 0.000100  loss: 5.1845 (5.5246)  at: 5.1845 (5.5246)  at_unscaled: 5.1845 (5.5246)  time: 0.8120  data: 0.0374  max mem: 29528
Epoch: [0]  [1190/3696]  eta: 0:33:32  lr: 0.000100  loss: 5.1798 (5.5216)  at: 5.1798 (5.5216)  at_unscaled: 5.1798 (5.5216)  time: 0.8169  data: 0.0371  max mem: 29528
Epoch: [0]  [1200/3696]  eta: 0:33:23  lr: 0.000100  loss: 5.1929 (5.5188)  at: 5.1929 (5.5188)  at_unscaled: 5.1929 (5.5188)  time: 0.7739  data: 0.0361  max mem: 29528
Epoch: [0]  [1210/3696]  eta: 0:33:16  lr: 0.000100  loss: 5.1929 (5.5158)  at: 5.1929 (5.5158)  at_unscaled: 5.1929 (5.5158)  time: 0.7985  data: 0.0340  max mem: 29528
Epoch: [0]  [1220/3696]  eta: 0:33:07  lr: 0.000100  loss: 5.1322 (5.5126)  at: 5.1322 (5.5126)  at_unscaled: 5.1322 (5.5126)  time: 0.8027  data: 0.0350  max mem: 29528
Epoch: [0]  [1230/3696]  eta: 0:32:59  lr: 0.000100  loss: 5.1595 (5.5096)  at: 5.1595 (5.5096)  at_unscaled: 5.1595 (5.5096)  time: 0.7881  data: 0.0374  max mem: 29528
Epoch: [0]  [1240/3696]  eta: 0:32:50  lr: 0.000100  loss: 5.1620 (5.5067)  at: 5.1620 (5.5067)  at_unscaled: 5.1620 (5.5067)  time: 0.7849  data: 0.0365  max mem: 29528
Epoch: [0]  [1250/3696]  eta: 0:32:42  lr: 0.000100  loss: 5.1620 (5.5038)  at: 5.1620 (5.5038)  at_unscaled: 5.1620 (5.5038)  time: 0.7893  data: 0.0357  max mem: 29528
Epoch: [0]  [1260/3696]  eta: 0:32:34  lr: 0.000100  loss: 5.1245 (5.5005)  at: 5.1245 (5.5005)  at_unscaled: 5.1245 (5.5005)  time: 0.8002  data: 0.0359  max mem: 29528
Epoch: [0]  [1270/3696]  eta: 0:32:26  lr: 0.000100  loss: 5.1023 (5.4975)  at: 5.1023 (5.4975)  at_unscaled: 5.1023 (5.4975)  time: 0.8015  data: 0.0362  max mem: 29528
Epoch: [0]  [1280/3696]  eta: 0:32:17  lr: 0.000100  loss: 5.1132 (5.4946)  at: 5.1132 (5.4946)  at_unscaled: 5.1132 (5.4946)  time: 0.7906  data: 0.0349  max mem: 29528
Epoch: [0]  [1290/3696]  eta: 0:32:09  lr: 0.000100  loss: 5.1292 (5.4918)  at: 5.1292 (5.4918)  at_unscaled: 5.1292 (5.4918)  time: 0.7743  data: 0.0334  max mem: 29528
Epoch: [0]  [1300/3696]  eta: 0:32:01  lr: 0.000100  loss: 5.1292 (5.4890)  at: 5.1292 (5.4890)  at_unscaled: 5.1292 (5.4890)  time: 0.7875  data: 0.0339  max mem: 29528
Epoch: [0]  [1310/3696]  eta: 0:31:54  lr: 0.000100  loss: 5.1232 (5.4863)  at: 5.1232 (5.4863)  at_unscaled: 5.1232 (5.4863)  time: 0.8117  data: 0.0343  max mem: 29528
Epoch: [0]  [1320/3696]  eta: 0:31:45  lr: 0.000100  loss: 5.1016 (5.4832)  at: 5.1016 (5.4832)  at_unscaled: 5.1016 (5.4832)  time: 0.8161  data: 0.0341  max mem: 29528
Epoch: [0]  [1330/3696]  eta: 0:31:38  lr: 0.000100  loss: 5.0905 (5.4805)  at: 5.0905 (5.4805)  at_unscaled: 5.0905 (5.4805)  time: 0.8149  data: 0.0343  max mem: 29528
Traceback (most recent call last):
  File "main.py", line 257, in <module>
    main(args)
  File "main.py", line 207, in main
    args.clip_max_norm, learning_rate_schedule)
  File "/opt/tiger/intro/Stable-Pix2Seq/engine.py", line 98, in train_one_epoch
    losses.backward()
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 7; 31.75 GiB total capacity; 29.63 GiB already allocated; 213.75 MiB free; 29.95 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/tiger/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main.py', '--coco_path', './coco2017/', '--batch_size', '4', '--lr', '0.0005', '--output_dir', './output']' returned non-zero exit status 1.
Killing subprocess 5627
Killing subprocess 5628
Killing subprocess 5629
Killing subprocess 5630
Killing subprocess 5631
Killing subprocess 5632
Killing subprocess 5633
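
For anyone reading the log above, here is a rough sketch of how its numbers can be pulled out and compared. It is not part of the repository: the helper names `peak_memory_steps` and `parse_oom` are hypothetical, the log lines are abridged copies of the ones above, and it assumes `max mem` is reported in MiB (as in DETR-style loggers). Under that assumption the logged peak plateaus at 29528 MiB, about 28.8 GiB of the 31.75 GiB card, so there is very little headroom left when a heavier batch arrives.

```python
import re

# Abridged lines from the training log above (only the fields used here are kept).
LOG_LINES = [
    "Epoch: [0]  [   0/3696]  eta: 2:32:25  lr: 0.000100  max mem: 14737",
    "Epoch: [0]  [  10/3696]  eta: 0:59:14  lr: 0.000100  max mem: 25656",
    "Epoch: [0]  [  60/3696]  eta: 0:53:44  lr: 0.000100  max mem: 26623",
    "Epoch: [0]  [ 140/3696]  eta: 0:50:25  lr: 0.000100  max mem: 29528",
    "Epoch: [0]  [1330/3696]  eta: 0:31:38  lr: 0.000100  max mem: 29528",
]

# The error message exactly as it appears in the traceback above.
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 216.00 MiB "
    "(GPU 7; 31.75 GiB total capacity; 29.63 GiB already allocated; "
    "213.75 MiB free; 29.95 GiB reserved in total by PyTorch)"
)

ITER_RE = re.compile(r"\[\s*(\d+)/\d+\].*max mem: (\d+)")


def peak_memory_steps(lines):
    """Return (iteration, max_mem) pairs at which the logged peak increased."""
    steps = []
    last = -1
    for line in lines:
        match = ITER_RE.search(line)
        if match is None:
            continue
        iteration, mem = int(match.group(1)), int(match.group(2))
        if mem > last:
            steps.append((iteration, mem))
            last = mem
    return steps


def parse_oom(message):
    """Extract the sizes reported in a PyTorch CUDA OOM message, in MiB."""
    to_mib = {"MiB": 1.0, "GiB": 1024.0}

    def grab(pattern):
        value, unit = re.search(pattern, message).groups()
        return float(value) * to_mib[unit]

    return {
        "requested": grab(r"Tried to allocate ([\d.]+) (MiB|GiB)"),
        "total": grab(r"([\d.]+) (MiB|GiB) total capacity"),
        "allocated": grab(r"([\d.]+) (MiB|GiB) already allocated"),
        "free": grab(r"([\d.]+) (MiB|GiB) free"),
        "reserved": grab(r"([\d.]+) (MiB|GiB) reserved"),
    }


steps = peak_memory_steps(LOG_LINES)
oom = parse_oom(OOM_MESSAGE)
shortfall = oom["requested"] - oom["free"]            # how far the request overshot
cached_unused = oom["reserved"] - oom["allocated"]    # held by PyTorch, not in tensors
print(steps)
print(f"shortfall: {shortfall:.2f} MiB, reserved but unallocated: {cached_unused:.0f} MiB")
```

Read this way, the failing allocation missed by only a couple of MiB, and the peak had already jumped several times earlier in the epoch. That pattern would be consistent with per-batch memory varying with input size, though the log alone does not confirm the cause.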
allanj commented 1 year ago

Reducing the batch size from 4 to 3 works for me, though. 😞
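
A small sketch of what that workaround changes, using only numbers from the log (`world_size=8`, `batch_size=4`, and the `3696` iteration counter). The function names are hypothetical, and the sample count is a lower bound that assumes every iteration uses a full batch on every GPU.

```python
def global_batch(per_gpu_batch, world_size):
    """Images processed per optimizer step across all GPUs."""
    return per_gpu_batch * world_size


def iterations_per_epoch(samples_per_epoch, per_gpu_batch, world_size):
    """Full-batch iterations needed to cover one epoch."""
    return samples_per_epoch // global_batch(per_gpu_batch, world_size)


WORLD_SIZE = 8            # world_size=8 in the Namespace dump
ITERS_AT_BATCH_4 = 3696   # from the "[   0/3696]" progress counter

# Lower bound on samples per epoch, assuming full batches throughout.
samples = ITERS_AT_BATCH_4 * global_batch(4, WORLD_SIZE)

print("global batch at 4 per GPU:", global_batch(4, WORLD_SIZE))
print("global batch at 3 per GPU:", global_batch(3, WORLD_SIZE))
print("iterations per epoch at 3 per GPU:",
      iterations_per_epoch(samples, 3, WORLD_SIZE))
```

So the smaller per-GPU batch lowers the global batch from 32 to 24 and lengthens each epoch by about a third in iterations. Whether the learning rate should be adjusted to match is a separate question that this thread does not answer.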