Closed (yongrenx closed this 1 week ago)
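The code is already up to date. Could you describe the problem you are running into? I'll see whether I can help solve it.
Could you help me figure out what is going wrong here?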
| distributed init (rank 0): env://, gpu 0
[rank0]:[W1121 19:09:41.973342975 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
| distributed init (rank 3): env://, gpu 3
[rank3]:[W1121 19:09:41.177365832 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
| distributed init (rank 2): env://, gpu 2
| distributed init (rank 1): env://, gpu 1
[rank2]:[W1121 19:09:42.301023264 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W1121 19:09:42.307820029 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
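The NCCL warning above names two remedies: pass `device_ids` to `barrier()`, or call `init_process_group()` with a `device_id`. A minimal sketch of both, assuming the job is launched with a launcher that sets `LOCAL_RANK` (for example `torchrun`) and a PyTorch version whose `init_process_group` accepts `device_id`, as the warning text indicates. The helper names `local_rank_from_env` and `init_distributed` are hypothetical and are not taken from this repository.

```python
import os


def local_rank_from_env(env=None):
    """Read the per-node GPU index set by the launcher (0 if unset)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", 0))


def init_distributed():
    """Bind this process to one GPU before any collective runs."""
    # Imported here so the helper above stays usable without a GPU build.
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Remedy 1 from the warning: tell the process group its device up front.
    dist.init_process_group(backend="nccl", init_method="env://", device_id=device)

    # Remedy 2 from the warning: name the device explicitly in the barrier.
    dist.barrier(device_ids=[local_rank])
```

The warning is not necessarily related to the `nan` values later in the log; it only concerns the rank-to-GPU mapping used by the barrier.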
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
Weights of BEiT3ForVisualQuestionAnswering not initialized from pretrained model: ['fusion.img_norm.weight', 'fusion.img_norm.bias', 'fusion.text_norm.weight', 'fusion.text_norm.bias', 'fusion.dense.weight', 'fusion.dense.bias', 'head.0.weight', 'head.0.bias', 'head.1.weight', 'head.1.bias', 'head.3.weight', 'head.3.bias']
Weights from pretrained model not used in BEiT3ForVisualQuestionAnswering: ['mlm_head.weight', 'mlm_head.bias', 'mim_head.weight', 'mim_head.bias', 'beit3.encoder.layer_norm.A.weight', 'beit3.encoder.layer_norm.A.bias', 'beit3.encoder.layer_norm.B.weight', 'beit3.encoder.layer_norm.B.bias']
Load 255 image-text pairs from ./data/modal-missing-non-iid/client7-img-text.jsonl.
Load 5000 image-text pairs from ./data/vqa.rest_val.jsonl.
Load 255 image-text pairs from ./data/modal-missing-non-iid/client7-img-text.jsonl.
Global Epoch: [0] [0/5] eta: 0:00:12 lr: 0.000600 min_lr: 0.000030 loss: 215.4926 (215.4926) ce_loss: 215.4926 (215.4926) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 2.4047 data: 0.7685 max mem: 18964
Global Epoch: [0] [1/5] eta: 0:00:05 lr: 0.000600 min_lr: 0.000030 loss: 215.2807 (215.3866) ce_loss: 215.2807 (215.3866) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.4926 data: 0.3843 max mem: 18965
Global Epoch: [0] [2/5] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 215.4926 (215.5178) ce_loss: 215.4926 (215.5178) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.1871 data: 0.2562 max mem: 18965
Global Epoch: [0] [3/5] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 215.4681 (215.5054) ce_loss: 215.4681 (215.5054) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0325 data: 0.1922 max mem: 18965
Global Epoch: [0] [4/5] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 215.4681 (215.3552) ce_loss: 215.4681 (215.3552) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9396 data: 0.1538 max mem: 18965
Global Epoch: [0] Total time: 0:00:04 (0.9643 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 215.4681 (215.1745) ce_loss: 215.4681 (215.1745) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [1] [0/5] eta: 0:00:07 lr: 0.000600 min_lr: 0.000030 loss: 215.1533 (215.1533) ce_loss: 215.1533 (215.1533) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.4513 data: 0.7296 max mem: 18965
Global Epoch: [1] [1/5] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 214.8118 (214.9826) ce_loss: 214.8118 (214.9826) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0181 data: 0.3649 max mem: 18965
Global Epoch: [1] [2/5] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 214.8118 (214.9155) ce_loss: 214.8118 (214.9155) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8760 data: 0.2433 max mem: 18965
Global Epoch: [1] [3/5] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 214.8118 (214.9168) ce_loss: 214.8118 (214.9168) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8027 data: 0.1825 max mem: 18965
Global Epoch: [1] [4/5] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 214.9208 (214.9445) ce_loss: 214.9208 (214.9445) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.7564 data: 0.1460 max mem: 18965
Global Epoch: [1] Total time: 0:00:04 (0.8083 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 214.9208 (215.2461) ce_loss: 214.9208 (215.2461) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [2] [0/5] eta: 0:00:07 lr: 0.000600 min_lr: 0.000030 loss: 215.2972 (215.2972) ce_loss: 215.2972 (215.2972) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.5053 data: 0.9381 max mem: 18965
Global Epoch: [2] [1/5] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 215.2972 (215.5354) ce_loss: 215.2972 (215.5354) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0682 data: 0.4691 max mem: 18965
Global Epoch: [2] [2/5] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 215.2972 (215.2773) ce_loss: 215.2972 (215.2773) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9010 data: 0.3128 max mem: 18965
Global Epoch: [2] [3/5] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 215.2972 (215.3691) ce_loss: 215.2972 (215.3691) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8172 data: 0.2346 max mem: 18965
Global Epoch: [2] [4/5] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 215.6109 (215.4174) ce_loss: 215.6109 (215.4174) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.7681 data: 0.1877 max mem: 18965
Global Epoch: [2] Total time: 0:00:04 (0.8123 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 215.6109 (215.2190) ce_loss: 215.6109 (215.2190) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [3] [0/5] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 214.8503 (214.8503) ce_loss: 214.8503 (214.8503) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.3878 data: 0.7353 max mem: 18965
Global Epoch: [3] [1/5] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 214.8503 (215.1706) ce_loss: 214.8503 (215.1706) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9855 data: 0.3677 max mem: 18965
Global Epoch: [3] [2/5] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 214.8708 (215.0707) ce_loss: 214.8708 (215.0707) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8454 data: 0.2451 max mem: 18965
Global Epoch: [3] [3/5] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 214.8708 (215.1463) ce_loss: 214.8708 (215.1463) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.7753 data: 0.1839 max mem: 18965
Global Epoch: [3] [4/5] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 214.9007 (215.0972) ce_loss: 214.9007 (215.0972) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.7330 data: 0.1471 max mem: 18965
Global Epoch: [3] Total time: 0:00:03 (0.7821 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 214.9007 (215.1539) ce_loss: 214.9007 (215.1539) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [4] [0/5] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 215.2302 (215.2302) ce_loss: 215.2302 (215.2302) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.3065 data: 0.7379 max mem: 18965
Global Epoch: [4] [1/5] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 215.2302 (215.3452) ce_loss: 215.2302 (215.3452) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9420 data: 0.3690 max mem: 18965
Global Epoch: [4] [2/5] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 215.2302 (215.2216) ce_loss: 215.2302 (215.2216) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8176 data: 0.2460 max mem: 18965
Global Epoch: [4] [3/5] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 215.2302 (215.2778) ce_loss: 215.2302 (215.2778) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.7553 data: 0.1845 max mem: 18965
Global Epoch: [4] [4/5] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 215.4464 (215.3375) ce_loss: 215.4464 (215.3375) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.7179 data: 0.1476 max mem: 18965
Global Epoch: [4] Total time: 0:00:03 (0.7667 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 215.4464 (215.2650) ce_loss: 215.4464 (215.2650) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
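A note on reading the `loss_scale` column above: if the number in parentheses is the running mean over the steps logged so far in the epoch (an assumption about the logger, not something confirmed in this thread), the per-step values can be recovered from it. A small sketch of that arithmetic, using the five parenthesised `loss_scale` values from one epoch of the log; `per_step_values` is a hypothetical helper name.

```python
def per_step_values(running_means):
    """Invert running means: x_n = n * mean_n - (n - 1) * mean_{n-1}."""
    values = []
    previous_sum = 0.0
    for n, mean in enumerate(running_means, start=1):
        current_sum = n * mean
        values.append(current_sum - previous_sum)
        previous_sum = current_sum
    return values


# Parenthesised loss_scale values from the five steps of one epoch above.
logged = [32768.0, 24576.0, 19114.6667, 15360.0, 12697.6]
recovered = [round(v) for v in per_step_values(logged)]
print(recovered)
```

Under that assumption the recovered sequence is 32768, 16384, 8192, 4096, 2048, so the scale halves at every step. That would be consistent with a dynamic loss scaler backing off on each iteration because the gradients overflow, which matches `grad_norm: nan` on every line for this client.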
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
Weights of BEiT3ForVisualQuestionAnswering not initialized from pretrained model: ['fusion.img_norm.weight', 'fusion.img_norm.bias', 'fusion.text_norm.weight', 'fusion.text_norm.bias', 'fusion.dense.weight', 'fusion.dense.bias', 'head.0.weight', 'head.0.bias', 'head.1.weight', 'head.1.bias', 'head.3.weight', 'head.3.bias']
Weights from pretrained model not used in BEiT3ForVisualQuestionAnswering: ['mlm_head.weight', 'mlm_head.bias', 'mim_head.weight', 'mim_head.bias', 'beit3.encoder.layer_norm.A.weight', 'beit3.encoder.layer_norm.A.bias', 'beit3.encoder.layer_norm.B.weight', 'beit3.encoder.layer_norm.B.bias']
Load 365 image-text pairs from ./data/modal-missing-non-iid/client1-img-text.jsonl.
Load 5000 image-text pairs from ./data/vqa.rest_val.jsonl.
Load 365 image-text pairs from ./data/modal-missing-non-iid/client1-img-text.jsonl.
Global Epoch: [0] [0/7] eta: 0:00:14 lr: 0.000600 min_lr: 0.000030 loss: 221.6281 (221.6281) ce_loss: 221.6281 (221.6281) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 2.1297 data: 1.4842 max mem: 18965
Global Epoch: [0] [1/7] eta: 0:00:08 lr: 0.000600 min_lr: 0.000030 loss: 221.6281 (221.7444) ce_loss: 221.6281 (221.7444) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.3509 data: 0.7421 max mem: 18965
Global Epoch: [0] [2/7] eta: 0:00:05 lr: 0.000600 min_lr: 0.000030 loss: 221.6281 (221.6842) ce_loss: 221.6281 (221.6842) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0941 data: 0.4948 max mem: 18965
Global Epoch: [0] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 221.5824 (221.6588) ce_loss: 221.5824 (221.6588) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9633 data: 0.3711 max mem: 18965
Global Epoch: [0] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 221.5824 (221.6075) ce_loss: 221.5824 (221.6075) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8838 data: 0.2969 max mem: 18965
Global Epoch: [0] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 221.5638 (221.5924) ce_loss: 221.5638 (221.5924) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 4096.0000 (10922.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8485 data: 0.2474 max mem: 18965
Global Epoch: [0] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 221.5638 (211.1339) ce_loss: 221.5638 (211.1339) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 4096.0000 (9654.8571) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8108 data: 0.2121 max mem: 20688
Global Epoch: [0] Total time: 0:00:05 (0.8460 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 221.5638 (211.0493) ce_loss: 221.5638 (211.0493) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 4096.0000 (9654.8571) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [1] [0/7] eta: 0:00:10 lr: 0.000600 min_lr: 0.000030 loss: 97.1743 (97.1743) ce_loss: 97.1743 (97.1743) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 1.5370 data: 0.9114 max mem: 20688
Global Epoch: [1] [1/7] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 97.1743 (97.6943) ce_loss: 97.1743 (97.6943) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 1.0528 data: 0.4558 max mem: 20688
Global Epoch: [1] [2/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 97.9733 (97.7873) ce_loss: 97.9733 (97.7873) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 0.8931 data: 0.3039 max mem: 20688
Global Epoch: [1] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 97.9733 (97.8713) ce_loss: 97.9733 (97.8713) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (16384.0000) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 0.8184 data: 0.2279 max mem: 20688
Global Epoch: [1] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 97.9733 (91.0852) ce_loss: 97.9733 (91.0852) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (14745.6000) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 0.7727 data: 0.1824 max mem: 20688
Global Epoch: [1] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 97.1743 (82.7252) ce_loss: 97.1743 (82.7252) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (13653.3333) weight_decay: 0.0001 (0.0001) grad_norm: 147.4982 (inf) time: 0.7426 data: 0.1520 max mem: 20688
Global Epoch: [1] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 97.1743 (74.9574) ce_loss: 97.1743 (74.9574) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12873.1429) weight_decay: 0.0001 (0.0001) grad_norm: 147.4982 (inf) time: 0.7216 data: 0.1303 max mem: 20688
Global Epoch: [1] Total time: 0:00:05 (0.7431 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 97.1743 (75.0668) ce_loss: 97.1743 (75.0668) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12873.1429) weight_decay: 0.0001 (0.0001) grad_norm: 147.4982 (inf)
Global Epoch: [2] [0/7] eta: 0:00:10 lr: 0.000600 min_lr: 0.000030 loss: 19.9189 (19.9189) ce_loss: 19.9189 (19.9189) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 29.5659 (29.5659) time: 1.5648 data: 0.9713 max mem: 20688
Global Epoch: [2] [1/7] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 15.1009 (17.5099) ce_loss: 15.1009 (17.5099) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 21.9824 (25.7742) time: 1.0788 data: 0.4857 max mem: 20688
Global Epoch: [2] [2/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 15.1009 (15.5517) ce_loss: 15.1009 (15.5517) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 21.9824 (21.2771) time: 0.9159 data: 0.3238 max mem: 20688
Global Epoch: [2] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 11.6354 (13.8066) ce_loss: 11.6354 (13.8066) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 12.2829 (18.1327) time: 0.8343 data: 0.2429 max mem: 20688
Global Epoch: [2] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 11.6354 (12.4827) ce_loss: 11.6354 (12.4827) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 12.2829 (15.7667) time: 0.7859 data: 0.1943 max mem: 20688
Global Epoch: [2] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 8.5711 (11.3981) ce_loss: 8.5711 (11.3981) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 8.6997 (14.0813) time: 0.7550 data: 0.1620 max mem: 20688
Global Epoch: [2] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 8.5711 (10.7365) ce_loss: 8.5711 (10.7365) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 8.6997 (13.1349) time: 0.7315 data: 0.1388 max mem: 20688
Global Epoch: [2] Total time: 0:00:05 (0.7528 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 8.5711 (10.2501) ce_loss: 8.5711 (10.2501) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 8.6997 (13.1349)
Global Epoch: [3] [0/7] eta: 0:00:12 lr: 0.000600 min_lr: 0.000030 loss: 5.4811 (5.4811) ce_loss: 5.4811 (5.4811) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.6083 (5.6083) time: 1.7163 data: 1.0286 max mem: 20688
Global Epoch: [3] [1/7] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 5.4811 (5.7907) ce_loss: 5.4811 (5.7907) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.6083 (6.8227) time: 1.1543 data: 0.5144 max mem: 20688
Global Epoch: [3] [2/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 6.1004 (6.3381) ce_loss: 6.1004 (6.3381) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 8.0372 (7.2595) time: 0.9676 data: 0.3429 max mem: 20688
Global Epoch: [3] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 5.4811 (5.8912) ce_loss: 5.4811 (5.8912) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.6083 (6.6321) time: 0.8742 data: 0.2572 max mem: 20688
Global Epoch: [3] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 5.4811 (5.7098) ce_loss: 5.4811 (5.7098) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 7.8914 (6.8839) time: 0.8178 data: 0.2058 max mem: 20688
Global Epoch: [3] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 4.9838 (5.4080) ce_loss: 4.9838 (5.4080) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.6155 (6.6725) time: 0.7801 data: 0.1715 max mem: 20688
Global Epoch: [3] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 4.9838 (5.2308) ce_loss: 4.9838 (5.2308) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 6.4635 (6.6427) time: 0.7534 data: 0.1470 max mem: 20688
Global Epoch: [3] Total time: 0:00:05 (0.7813 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 4.9838 (5.5885) ce_loss: 4.9838 (5.5885) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 6.4635 (6.6427)
Global Epoch: [4] [0/7] eta: 0:00:10 lr: 0.000600 min_lr: 0.000030 loss: 6.1900 (6.1900) ce_loss: 6.1900 (6.1900) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 4.2300 (4.2300) time: 1.5441 data: 0.9371 max mem: 20688
Global Epoch: [4] [1/7] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 4.1064 (5.1482) ce_loss: 4.1064 (5.1482) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 4.2300 (5.6808) time: 1.0679 data: 0.4686 max mem: 20688
Global Epoch: [4] [2/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 4.2894 (4.8619) ce_loss: 4.2894 (4.8619) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.3645 (5.5754) time: 0.9091 data: 0.3124 max mem: 20688
Global Epoch: [4] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 4.1064 (4.5241) ce_loss: 4.1064 (4.5241) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.3645 (5.5617) time: 0.8305 data: 0.2344 max mem: 20688
Global Epoch: [4] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 4.2894 (4.7881) ce_loss: 4.2894 (4.7881) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.5208 (5.7364) time: 0.7836 data: 0.1875 max mem: 20688
Global Epoch: [4] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 4.2894 (4.7641) ce_loss: 4.2894 (4.7641) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.5208 (5.7940) time: 0.7507 data: 0.1563 max mem: 20688
Global Epoch: [4] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 4.2894 (4.6589) ce_loss: 4.2894 (4.6589) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 6.0819 (5.8655) time: 0.7280 data: 0.1340 max mem: 20688
Global Epoch: [4] Total time: 0:00:05 (0.7576 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 4.2894 (4.7009) ce_loss: 4.2894 (4.7009) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 6.0819 (5.8655)
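To see at a glance which metrics are non-finite on each line of a log like the one above, a small parser can help. This is a minimal sketch, assuming every metric is printed as `name: value (running_value)`; the function name `non_finite_metrics` is hypothetical and is not part of the training code.

```python
import math
import re

# Matches "name: value (running)" pairs such as "grad_norm: nan (nan)".
METRIC = re.compile(r"(\w+): (\S+) \((\S+)\)")


def non_finite_metrics(line):
    """Return the names of metrics whose current value is nan or inf."""
    bad = []
    for name, value, _running in METRIC.findall(line):
        try:
            number = float(value)
        except ValueError:
            continue  # not a numeric field
        if not math.isfinite(number):
            bad.append(name)
    return bad


sample = (
    "loss: 97.1743 (97.1743) ce_loss: 97.1743 (97.1743) "
    "clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) "
    "loss_scale: 32768.0000 (32768.0000) grad_norm: inf (inf) time: 1.5370"
)
print(non_finite_metrics(sample))
```

Run over the whole log, this would show that the three `clip_loss_*` terms are `nan` on every line, including the later epochs where `loss` and `grad_norm` are finite.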
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
Weights of BEiT3ForVisualQuestionAnswering not initialized from pretrained model: ['fusion.img_norm.weight', 'fusion.img_norm.bias', 'fusion.text_norm.weight', 'fusion.text_norm.bias', 'fusion.dense.weight', 'fusion.dense.bias', 'head.0.weight', 'head.0.bias', 'head.1.weight', 'head.1.bias', 'head.3.weight', 'head.3.bias']
Weights from pretrained model not used in BEiT3ForVisualQuestionAnswering: ['mlm_head.weight', 'mlm_head.bias', 'mim_head.weight', 'mim_head.bias', 'beit3.encoder.layer_norm.A.weight', 'beit3.encoder.layer_norm.A.bias', 'beit3.encoder.layer_norm.B.weight', 'beit3.encoder.layer_norm.B.bias']
Load 359 image-text pairs from ./data/modal-missing-non-iid/client17-img-text.jsonl.
Load 5000 image-text pairs from ./data/vqa.rest_val.jsonl.
Load 359 image-text pairs from ./data/modal-missing-non-iid/client17-img-text.jsonl.
Global Epoch: [0] [0/7] eta: 0:00:15 lr: 0.000600 min_lr: 0.000030 loss: 218.5943 (218.5943) ce_loss: 218.5943 (218.5943) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 2.1652 data: 1.5474 max mem: 20688
Global Epoch: [0] [1/7] eta: 0:00:08 lr: 0.000600 min_lr: 0.000030 loss: 217.6906 (218.1424) ce_loss: 217.6906 (218.1424) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.3674 data: 0.7738 max mem: 20688
Global Epoch: [0] [2/7] eta: 0:00:05 lr: 0.000600 min_lr: 0.000030 loss: 218.1424 (218.1424) ce_loss: 218.1424 (218.1424) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.1032 data: 0.5159 max mem: 20688
Global Epoch: [0] [3/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 218.1424 (218.4266) ce_loss: 218.1424 (218.4266) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0150 data: 0.3869 max mem: 20688
Global Epoch: [0] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 218.5943 (218.4881) ce_loss: 218.5943 (218.4881) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12697.6000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9285 data: 0.3096 max mem: 20688
Global Epoch: [0] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 218.1424 (218.3958) ce_loss: 218.1424 (218.3958) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 4096.0000 (10922.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8734 data: 0.2580 max mem: 20688
Global Epoch: [0] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 218.1424 (208.0763) ce_loss: 218.1424 (208.0763) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 4096.0000 (9654.8571) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8336 data: 0.2211 max mem: 20688
Global Epoch: [0] Total time: 0:00:06 (0.8575 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 218.1424 (208.0300) ce_loss: 218.1424 (208.0300) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 4096.0000 (9654.8571) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [1] [0/7] eta: 0:00:11 lr: 0.000600 min_lr: 0.000030 loss: 95.0201 (95.0201) ce_loss: 95.0201 (95.0201) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 1.6556 data: 1.0849 max mem: 20688
Global Epoch: [1] [1/7] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 95.0201 (95.3561) ce_loss: 95.0201 (95.3561) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 1.1113 data: 0.5425 max mem: 20688
Global Epoch: [1] [2/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 95.6921 (95.4817) ce_loss: 95.6921 (95.4817) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 0.9296 data: 0.3617 max mem: 20688
Global Epoch: [1] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 95.6921 (95.5851) ce_loss: 95.6921 (95.5851) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (16384.0000) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 0.8471 data: 0.2713 max mem: 20688
Global Epoch: [1] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 95.6921 (89.2722) ce_loss: 95.6921 (89.2722) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (14745.6000) weight_decay: 0.0001 (0.0001) grad_norm: inf (inf) time: 0.7955 data: 0.2170 max mem: 20688
Global Epoch: [1] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 95.0201 (81.2781) ce_loss: 95.0201 (81.2781) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (13653.3333) weight_decay: 0.0001 (0.0001) grad_norm: 146.7662 (inf) time: 0.7615 data: 0.1809 max mem: 20688
Global Epoch: [1] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 95.0201 (73.6258) ce_loss: 95.0201 (73.6258) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12873.1429) weight_decay: 0.0001 (0.0001) grad_norm: 146.7662 (inf) time: 0.7372 data: 0.1551 max mem: 20688
Global Epoch: [1] Total time: 0:00:05 (0.7715 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 95.0201 (73.7733) ce_loss: 95.0201 (73.7733) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (12873.1429) weight_decay: 0.0001 (0.0001) grad_norm: 146.7662 (inf)
Global Epoch: [2] [0/7] eta: 0:00:11 lr: 0.000600 min_lr: 0.000030 loss: 19.7824 (19.7824) ce_loss: 19.7824 (19.7824) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 30.3173 (30.3173) time: 1.5812 data: 0.9808 max mem: 20688
Global Epoch: [2] [1/7] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 14.5544 (17.1684) ce_loss: 14.5544 (17.1684) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 21.4851 (25.9012) time: 1.0861 data: 0.4905 max mem: 20688
Global Epoch: [2] [2/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 14.5544 (14.8083) ce_loss: 14.5544 (14.8083) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 21.4851 (21.5537) time: 0.9214 data: 0.3270 max mem: 20688
Global Epoch: [2] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 10.0880 (12.9549) ce_loss: 10.0880 (12.9549) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 12.8588 (18.1486) time: 0.8395 data: 0.2453 max mem: 20688
Global Epoch: [2] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 10.0880 (12.0011) ce_loss: 10.0880 (12.0011) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 12.8588 (15.6265) time: 0.7896 data: 0.1962 max mem: 20688
Global Epoch: [2] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 8.1861 (11.2029) ce_loss: 8.1861 (11.2029) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 7.9332 (13.8998) time: 0.7575 data: 0.1635 max mem: 20688
Global Epoch: [2] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 8.1861 (10.5456) ce_loss: 8.1861 (10.5456) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 7.9332 (12.5942) time: 0.7338 data: 0.1402 max mem: 20688
Global Epoch: [2] Total time: 0:00:05 (0.7592 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 8.1861 (10.2791) ce_loss: 8.1861 (10.2791) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 7.9332 (12.5942)
Global Epoch: [3] [0/7] eta: 0:00:11 lr: 0.000600 min_lr: 0.000030 loss: 7.3481 (7.3481) ce_loss: 7.3481 (7.3481) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.0433 (5.0433) time: 1.6394 data: 1.0412 max mem: 20688
Global Epoch: [3] [1/7] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 5.6700 (6.5090) ce_loss: 5.6700 (6.5090) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.0433 (7.0181) time: 1.1179 data: 0.5206 max mem: 20688
Global Epoch: [3] [2/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 5.6700 (5.7509) ce_loss: 5.6700 (5.7509) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 8.9929 (7.7100) time: 0.9430 data: 0.3471 max mem: 20688
Global Epoch: [3] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 4.6579 (5.4777) ce_loss: 4.6579 (5.4777) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.2847 (7.1037) time: 0.8564 data: 0.2604 max mem: 20688
Global Epoch: [3] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 4.6579 (5.3037) ce_loss: 4.6579 (5.3037) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.2920 (6.7413) time: 0.8021 data: 0.2083 max mem: 20688
Global Epoch: [3] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 4.6579 (5.2751) ce_loss: 4.6579 (5.2751) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.2847 (6.4172) time: 0.7672 data: 0.1736 max mem: 20688
Global Epoch: [3] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 5.1320 (5.3420) ce_loss: 5.1320 (5.3420) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.2847 (6.1713) time: 0.7422 data: 0.1488 max mem: 20688
Global Epoch: [3] Total time: 0:00:05 (0.7663 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 5.1320 (5.7851) ce_loss: 5.1320 (5.7851) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.2847 (6.1713)
Global Epoch: [4] [0/7] eta: 0:00:11 lr: 0.000600 min_lr: 0.000030 loss: 5.5126 (5.5126) ce_loss: 5.5126 (5.5126) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 4.7203 (4.7203) time: 1.6391 data: 1.0418 max mem: 20688
Global Epoch: [4] [1/7] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 4.6149 (5.0638) ce_loss: 4.6149 (5.0638) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 4.7203 (5.9164) time: 1.1152 data: 0.5210 max mem: 20688
Global Epoch: [4] [2/7] eta: 0:00:04 lr: 0.000600 min_lr: 0.000030 loss: 5.5126 (5.5190) ce_loss: 5.5126 (5.5190) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 6.2344 (6.0224) time: 0.9405 data: 0.3473 max mem: 20688
Global Epoch: [4] [3/7] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 5.5126 (5.6324) ce_loss: 5.5126 (5.6324) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.5637 (5.9077) time: 0.8535 data: 0.2605 max mem: 20688
Global Epoch: [4] [4/7] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 5.5126 (5.5544) ce_loss: 5.5126 (5.5544) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.5637 (5.8137) time: 0.8013 data: 0.2084 max mem: 20688
Global Epoch: [4] [5/7] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 5.2424 (5.2477) ce_loss: 5.2424 (5.2477) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.4376 (5.4850) time: 0.7671 data: 0.1737 max mem: 20688
Global Epoch: [4] [6/7] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 5.5126 (5.3062) ce_loss: 5.5126 (5.3062) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.4376 (5.3677) time: 0.7434 data: 0.1489 max mem: 20688
Global Epoch: [4] Total time: 0:00:05 (0.7799 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 5.5126 (4.9296) ce_loss: 5.5126 (4.9296) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0001 (0.0001) grad_norm: 5.4376 (5.3677)
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
Weights of BEiT3ForVisualQuestionAnswering not initialized from pretrained model: ['fusion.img_norm.weight', 'fusion.img_norm.bias', 'fusion.text_norm.weight', 'fusion.text_norm.bias', 'fusion.dense.weight', 'fusion.dense.bias', 'head.0.weight', 'head.0.bias', 'head.1.weight', 'head.1.bias', 'head.3.weight', 'head.3.bias']
Weights from pretrained model not used in BEiT3ForVisualQuestionAnswering: ['mlm_head.weight', 'mlm_head.bias', 'mim_head.weight', 'mim_head.bias', 'beit3.encoder.layer_norm.A.weight', 'beit3.encoder.layer_norm.A.bias', 'beit3.encoder.layer_norm.B.weight', 'beit3.encoder.layer_norm.B.bias']
Load 203 image-text pairs from ./data/modal-missing-non-iid/client15-img-text.jsonl.
Load 5000 image-text pairs from ./data/vqa.rest_val.jsonl.
Load 203 image-text pairs from ./data/modal-missing-non-iid/client15-img-text.jsonl.
Global Epoch: [0] [0/4] eta: 0:00:07 lr: 0.000600 min_lr: 0.000030 loss: 220.0424 (220.0424) ce_loss: 220.0424 (220.0424) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.9733 data: 1.2767 max mem: 20688
Global Epoch: [0] [1/4] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 219.9564 (219.9994) ce_loss: 219.9564 (219.9994) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.2723 data: 0.6384 max mem: 20688
Global Epoch: [0] [2/4] eta: 0:00:02 lr: 0.000600 min_lr: 0.000030 loss: 219.9564 (219.7878) ce_loss: 219.9564 (219.7878) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0441 data: 0.4256 max mem: 20688
Global Epoch: [0] [3/4] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 219.8833 (219.8117) ce_loss: 219.8833 (219.8117) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9313 data: 0.3192 max mem: 20688
Global Epoch: [0] Total time: 0:00:04 (1.0079 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 219.8833 (219.9186) ce_loss: 219.8833 (219.9186) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [1] [0/4] eta: 0:00:07 lr: 0.000600 min_lr: 0.000030 loss: 219.3388 (219.3388) ce_loss: 219.3388 (219.3388) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.7811 data: 0.9837 max mem: 20688
Global Epoch: [1] [1/4] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 219.3388 (220.1800) ce_loss: 219.3388 (220.1800) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.1746 data: 0.4919 max mem: 20688
Global Epoch: [1] [2/4] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 219.5009 (219.9536) ce_loss: 219.5009 (219.9536) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9735 data: 0.3280 max mem: 20688
Global Epoch: [1] [3/4] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 219.3388 (219.7400) ce_loss: 219.3388 (219.7400) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8729 data: 0.2460 max mem: 20688
Global Epoch: [1] Total time: 0:00:03 (0.9504 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 219.3388 (219.7564) ce_loss: 219.3388 (219.7564) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [2] [0/4] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 219.1128 (219.1128) ce_loss: 219.1128 (219.1128) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.5478 data: 0.9429 max mem: 20688
Global Epoch: [2] [1/4] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 219.1128 (219.8755) ce_loss: 219.1128 (219.8755) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0600 data: 0.4715 max mem: 20688
Global Epoch: [2] [2/4] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 219.1128 (219.4936) ce_loss: 219.1128 (219.4936) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8964 data: 0.3144 max mem: 20688
Global Epoch: [2] [3/4] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 219.1128 (219.7707) ce_loss: 219.1128 (219.7707) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8148 data: 0.2358 max mem: 20688
Global Epoch: [2] Total time: 0:00:03 (0.8918 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 219.1128 (219.6505) ce_loss: 219.1128 (219.6505) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [3] [0/4] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 220.1984 (220.1984) ce_loss: 220.1984 (220.1984) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.5722 data: 0.9653 max mem: 20688
Global Epoch: [3] [1/4] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 220.1984 (220.3004) ce_loss: 220.1984 (220.3004) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0714 data: 0.4827 max mem: 20688
Global Epoch: [3] [2/4] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 220.1984 (220.1004) ce_loss: 220.1984 (220.1004) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9056 data: 0.3218 max mem: 20688
Global Epoch: [3] [3/4] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 220.0198 (220.0802) ce_loss: 220.0198 (220.0802) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8223 data: 0.2414 max mem: 20688
Global Epoch: [3] Total time: 0:00:03 (0.8977 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 220.0198 (219.9062) ce_loss: 220.0198 (219.9062) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
Global Epoch: [4] [0/4] eta: 0:00:06 lr: 0.000600 min_lr: 0.000030 loss: 219.9427 (219.9427) ce_loss: 219.9427 (219.9427) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 32768.0000 (32768.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.5965 data: 1.0258 max mem: 20688
Global Epoch: [4] [1/4] eta: 0:00:03 lr: 0.000600 min_lr: 0.000030 loss: 218.9389 (219.4408) ce_loss: 218.9389 (219.4408) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (24576.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 1.0849 data: 0.5130 max mem: 20688
Global Epoch: [4] [2/4] eta: 0:00:01 lr: 0.000600 min_lr: 0.000030 loss: 219.9427 (219.7398) ce_loss: 219.9427 (219.7398) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 16384.0000 (19114.6667) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.9119 data: 0.3420 max mem: 20688
Global Epoch: [4] [3/4] eta: 0:00:00 lr: 0.000600 min_lr: 0.000030 loss: 219.5852 (219.7011) ce_loss: 219.5852 (219.7011) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan) time: 0.8280 data: 0.2565 max mem: 20688
Global Epoch: [4] Total time: 0:00:03 (0.9037 s / it)
Averaged stats: lr: 0.000600 min_lr: 0.000030 loss: 219.5852 (219.7801) ce_loss: 219.5852 (219.7801) clip_loss_i: nan (nan) clip_loss_t: nan (nan) clip_loss_f: nan (nan) loss_scale: 8192.0000 (15360.0000) weight_decay: 0.0001 (0.0001) grad_norm: nan (nan)
[rank3]: Traceback (most recent call last):
[rank3]: File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 197, in <module>
[rank3]: main(opts)
[rank3]: File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 186, in main
[rank3]: algo.run(
[rank3]: File "/data/home/yuqiwei/Multimodal_Federated/src/algorithm/MMFL.py", line 78, in run
[rank3]: steps_per_epoch = self.sample_num_dict[client_id]["total"] // global_batch_size
[rank3]: KeyError: 34
[rank2]: Traceback (most recent call last):
[rank2]: File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 197, in <module>
[rank2]: main(opts)
[rank2]: File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 186, in main
[rank2]: algo.run(
[rank2]: File "/data/home/yuqiwei/Multimodal_Federated/src/algorithm/MMFL.py", line 78, in run
[rank2]: steps_per_epoch = self.sample_num_dict[client_id]["total"] // global_batch_size
[rank2]: KeyError: 34
[rank1]: Traceback (most recent call last):
[rank1]: File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 197, in <module>
[rank1]: main(opts)
[rank1]: File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 186, in main
[rank1]: algo.run(
[rank1]: File "/data/home/yuqiwei/Multimodal_Federated/src/algorithm/MMFL.py", line 78, in run
[rank1]: steps_per_epoch = self.sample_num_dict[client_id]["total"] // global_batch_size
[rank1]: KeyError: 34
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
W1121 19:12:02.257825 856815 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 856891 closing signal SIGTERM
W1121 19:12:02.261298 856815 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 856892 closing signal SIGTERM
W1121 19:12:02.261759 856815 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 856893 closing signal SIGTERM
E1121 19:12:02.827662 856815 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 856896) of binary: /data/home/yuqiwei/.conda/envs/pytorch/bin/python
Traceback (most recent call last):
File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 923, in <module>
main()
File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-21_19:12:02
host : hello-DSS8440
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 856896)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
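The tracebacks above all end in `KeyError: 34` at `self.sample_num_dict[client_id]["total"]`, so key `34` was not in `sample_num_dict` when `run` reached that client. The log does not show why it is missing. One possibility (an assumption, not confirmed by the log) is that the dict was loaded from JSON and its keys are strings rather than ints. A minimal sketch of a lookup that fails with a more descriptive message; `steps_per_epoch_for` is a hypothetical helper, not part of the repo:

```python
def steps_per_epoch_for(sample_num_dict, client_id, global_batch_size):
    """Hypothetical helper: same lookup as MMFL.py line 78, with a clearer error.

    Raises KeyError listing the known ids and their key types, so a missing
    client or an int/str key mismatch is visible at a glance.
    """
    if client_id not in sample_num_dict:
        known = sorted(sample_num_dict, key=str)
        key_types = sorted({type(k).__name__ for k in sample_num_dict})
        raise KeyError(
            f"client_id {client_id!r} ({type(client_id).__name__}) not found; "
            f"known ids: {known}, key types: {key_types}"
        )
    return sample_num_dict[client_id]["total"] // global_batch_size
```

Calling it with the sampled `client_id` in place of the direct indexing would show whether `34` is absent entirely or only present under a different key type.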
I found the cause over the last couple of days; there are no problems for now.
I ran it a few times and fixed some bugs, but I still couldn't get it to run all the way through 😢
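Separately from the `KeyError`, the stats lines in the log show `clip_loss_i`, `clip_loss_t` and `clip_loss_f` as `nan` throughout. In the client15 run `grad_norm` is also `nan`, `loss_scale` drops from 32768 to 8192 within each epoch, and the loss stays around 219 to 220. That pattern would be consistent with optimizer steps being skipped by dynamic loss scaling, though the log alone does not confirm it. A small sketch for scanning such lines, assuming the `name: value (average)` format seen above; `non_finite_metrics` is a hypothetical helper:

```python
import math
import re

# Matches "name: value (average)" pairs as printed in the stats lines.
METRIC = re.compile(r"(\w+): (\S+) \((\S+)\)")


def non_finite_metrics(log_line):
    """Return the names of metrics whose current value is nan or inf."""
    bad = []
    for name, value, _average in METRIC.findall(log_line):
        try:
            number = float(value)
        except ValueError:
            continue  # not a numeric field, skip it
        if not math.isfinite(number):
            bad.append(name)
    return bad


# Excerpt of one stats line from the log above.
line = (
    "loss: 219.5852 (219.7011) ce_loss: 219.5852 (219.7011) "
    "clip_loss_i: nan (nan) loss_scale: 8192.0000 (15360.0000) "
    "grad_norm: nan (nan) time: 0.8280 data: 0.2565"
)
print(non_finite_metrics(line))  # ['clip_loss_i', 'grad_norm']
```

Running this over the whole log would show at which client and epoch `grad_norm` first becomes non-finite.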