Noirebao / Multimodal_Federated

Official Implementation of paper "Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast"
MIT License

Has the code been updated? #1

Closed yongrenx closed 1 week ago

yongrenx commented 1 week ago

I ran it several times and fixed a few bugs, but I still can't get it to run successfully 😢

Noirebao commented 1 week ago

The code is already up to date. Try describing the problem you ran into, and I'll see whether I can solve it.

yongrenx commented 1 week ago

Could you take a look and help me figure out what's wrong?
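
A quick way to summarize a log in the format pasted below is to scan each `Global Epoch` line for metrics whose first reported value (the one before the parentheses) is `nan` or `inf`. This is only a minimal sketch, assuming the `name: value (average)` layout seen in this issue; the helper name `nonfinite_metrics` and the regular expression are illustrative and are not part of the repository.

```python
import math
import re

# Matches "name: value" pairs such as "clip_loss_i: nan" or "loss: 215.4926".
# Values in parentheses are not matched, because they follow "(" rather than ": ".
METRIC = re.compile(r"(\w+): (nan|inf|-inf|-?\d+(?:\.\d+)?)")


def nonfinite_metrics(line):
    """Return names of metrics on one log line whose first value is NaN or infinite.

    Hypothetical helper for reading logs shaped like the one in this issue.
    """
    bad = []
    for name, value in METRIC.findall(line):
        if not math.isfinite(float(value)) and name not in bad:
            bad.append(name)
    return bad


# Shortened copy of one line from the log in this issue.
sample = (
    "Global Epoch: [0]  [0/5]  eta: 0:00:12  lr: 0.000600  "
    "loss: 215.4926 (215.4926)  clip_loss_i: nan (nan)  "
    "clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  "
    "loss_scale: 32768.0000 (32768.0000)  grad_norm: nan (nan)  time: 2.4047"
)
print(nonfinite_metrics(sample))
# → ['clip_loss_i', 'clip_loss_t', 'clip_loss_f', 'grad_norm']
```

Applied to the lines shown here, a check like this would flag `clip_loss_i`, `clip_loss_t`, and `clip_loss_f` on every `Global Epoch` line, while `grad_norm` first reports a finite value late in epoch 1 for client 1 and client 17.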

| distributed init (rank 0): env://, gpu 0
[rank0]:[W1121 19:09:41.973342975 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
| distributed init (rank 3): env://, gpu 3
[rank3]:[W1121 19:09:41.177365832 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
| distributed init (rank 2): env://, gpu 2
| distributed init (rank 1): env://, gpu 1
[rank2]:[W1121 19:09:42.301023264 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W1121 19:09:42.307820029 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
Weights of BEiT3ForVisualQuestionAnswering not initialized from pretrained model: ['fusion.img_norm.weight', 'fusion.img_norm.bias', 'fusion.text_norm.weight', 'fusion.text_norm.bias', 'fusion.dense.weight', 'fusion.dense.bias', 'head.0.weight', 'head.0.bias', 'head.1.weight', 'head.1.bias', 'head.3.weight', 'head.3.bias']
Weights from pretrained model not used in BEiT3ForVisualQuestionAnswering: ['mlm_head.weight', 'mlm_head.bias', 'mim_head.weight', 'mim_head.bias', 'beit3.encoder.layer_norm.A.weight', 'beit3.encoder.layer_norm.A.bias', 'beit3.encoder.layer_norm.B.weight', 'beit3.encoder.layer_norm.B.bias']
Load 255 image-text pairs from ./data/modal-missing-non-iid/client7-img-text.jsonl. 
Load 5000 image-text pairs from ./data/vqa.rest_val.jsonl. 
Load 255 image-text pairs from ./data/modal-missing-non-iid/client7-img-text.jsonl. 
Global Epoch: [0]  [0/5]  eta: 0:00:12  lr: 0.000600  min_lr: 0.000030  loss: 215.4926 (215.4926)  ce_loss: 215.4926 (215.4926)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 2.4047  data: 0.7685  max mem: 18964
Global Epoch: [0]  [1/5]  eta: 0:00:05  lr: 0.000600  min_lr: 0.000030  loss: 215.2807 (215.3866)  ce_loss: 215.2807 (215.3866)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.4926  data: 0.3843  max mem: 18965
Global Epoch: [0]  [2/5]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 215.4926 (215.5178)  ce_loss: 215.4926 (215.5178)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.1871  data: 0.2562  max mem: 18965
Global Epoch: [0]  [3/5]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 215.4681 (215.5054)  ce_loss: 215.4681 (215.5054)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0325  data: 0.1922  max mem: 18965
Global Epoch: [0]  [4/5]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 215.4681 (215.3552)  ce_loss: 215.4681 (215.3552)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9396  data: 0.1538  max mem: 18965
Global Epoch: [0] Total time: 0:00:04 (0.9643 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 215.4681 (215.1745)  ce_loss: 215.4681 (215.1745)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [1]  [0/5]  eta: 0:00:07  lr: 0.000600  min_lr: 0.000030  loss: 215.1533 (215.1533)  ce_loss: 215.1533 (215.1533)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.4513  data: 0.7296  max mem: 18965
Global Epoch: [1]  [1/5]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 214.8118 (214.9826)  ce_loss: 214.8118 (214.9826)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0181  data: 0.3649  max mem: 18965
Global Epoch: [1]  [2/5]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 214.8118 (214.9155)  ce_loss: 214.8118 (214.9155)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8760  data: 0.2433  max mem: 18965
Global Epoch: [1]  [3/5]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 214.8118 (214.9168)  ce_loss: 214.8118 (214.9168)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8027  data: 0.1825  max mem: 18965
Global Epoch: [1]  [4/5]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 214.9208 (214.9445)  ce_loss: 214.9208 (214.9445)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.7564  data: 0.1460  max mem: 18965
Global Epoch: [1] Total time: 0:00:04 (0.8083 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 214.9208 (215.2461)  ce_loss: 214.9208 (215.2461)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [2]  [0/5]  eta: 0:00:07  lr: 0.000600  min_lr: 0.000030  loss: 215.2972 (215.2972)  ce_loss: 215.2972 (215.2972)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.5053  data: 0.9381  max mem: 18965
Global Epoch: [2]  [1/5]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 215.2972 (215.5354)  ce_loss: 215.2972 (215.5354)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0682  data: 0.4691  max mem: 18965
Global Epoch: [2]  [2/5]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 215.2972 (215.2773)  ce_loss: 215.2972 (215.2773)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9010  data: 0.3128  max mem: 18965
Global Epoch: [2]  [3/5]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 215.2972 (215.3691)  ce_loss: 215.2972 (215.3691)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8172  data: 0.2346  max mem: 18965
Global Epoch: [2]  [4/5]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 215.6109 (215.4174)  ce_loss: 215.6109 (215.4174)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.7681  data: 0.1877  max mem: 18965
Global Epoch: [2] Total time: 0:00:04 (0.8123 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 215.6109 (215.2190)  ce_loss: 215.6109 (215.2190)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [3]  [0/5]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 214.8503 (214.8503)  ce_loss: 214.8503 (214.8503)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.3878  data: 0.7353  max mem: 18965
Global Epoch: [3]  [1/5]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 214.8503 (215.1706)  ce_loss: 214.8503 (215.1706)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9855  data: 0.3677  max mem: 18965
Global Epoch: [3]  [2/5]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 214.8708 (215.0707)  ce_loss: 214.8708 (215.0707)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8454  data: 0.2451  max mem: 18965
Global Epoch: [3]  [3/5]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 214.8708 (215.1463)  ce_loss: 214.8708 (215.1463)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.7753  data: 0.1839  max mem: 18965
Global Epoch: [3]  [4/5]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 214.9007 (215.0972)  ce_loss: 214.9007 (215.0972)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.7330  data: 0.1471  max mem: 18965
Global Epoch: [3] Total time: 0:00:03 (0.7821 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 214.9007 (215.1539)  ce_loss: 214.9007 (215.1539)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [4]  [0/5]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 215.2302 (215.2302)  ce_loss: 215.2302 (215.2302)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.3065  data: 0.7379  max mem: 18965
Global Epoch: [4]  [1/5]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 215.2302 (215.3452)  ce_loss: 215.2302 (215.3452)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9420  data: 0.3690  max mem: 18965
Global Epoch: [4]  [2/5]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 215.2302 (215.2216)  ce_loss: 215.2302 (215.2216)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8176  data: 0.2460  max mem: 18965
Global Epoch: [4]  [3/5]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 215.2302 (215.2778)  ce_loss: 215.2302 (215.2778)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.7553  data: 0.1845  max mem: 18965
Global Epoch: [4]  [4/5]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 215.4464 (215.3375)  ce_loss: 215.4464 (215.3375)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.7179  data: 0.1476  max mem: 18965
Global Epoch: [4] Total time: 0:00:03 (0.7667 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 215.4464 (215.2650)  ce_loss: 215.4464 (215.2650)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
Weights of BEiT3ForVisualQuestionAnswering not initialized from pretrained model: ['fusion.img_norm.weight', 'fusion.img_norm.bias', 'fusion.text_norm.weight', 'fusion.text_norm.bias', 'fusion.dense.weight', 'fusion.dense.bias', 'head.0.weight', 'head.0.bias', 'head.1.weight', 'head.1.bias', 'head.3.weight', 'head.3.bias']
Weights from pretrained model not used in BEiT3ForVisualQuestionAnswering: ['mlm_head.weight', 'mlm_head.bias', 'mim_head.weight', 'mim_head.bias', 'beit3.encoder.layer_norm.A.weight', 'beit3.encoder.layer_norm.A.bias', 'beit3.encoder.layer_norm.B.weight', 'beit3.encoder.layer_norm.B.bias']
Load 365 image-text pairs from ./data/modal-missing-non-iid/client1-img-text.jsonl. 
Load 5000 image-text pairs from ./data/vqa.rest_val.jsonl. 
Load 365 image-text pairs from ./data/modal-missing-non-iid/client1-img-text.jsonl. 
Global Epoch: [0]  [0/7]  eta: 0:00:14  lr: 0.000600  min_lr: 0.000030  loss: 221.6281 (221.6281)  ce_loss: 221.6281 (221.6281)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 2.1297  data: 1.4842  max mem: 18965
Global Epoch: [0]  [1/7]  eta: 0:00:08  lr: 0.000600  min_lr: 0.000030  loss: 221.6281 (221.7444)  ce_loss: 221.6281 (221.7444)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.3509  data: 0.7421  max mem: 18965
Global Epoch: [0]  [2/7]  eta: 0:00:05  lr: 0.000600  min_lr: 0.000030  loss: 221.6281 (221.6842)  ce_loss: 221.6281 (221.6842)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0941  data: 0.4948  max mem: 18965
Global Epoch: [0]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 221.5824 (221.6588)  ce_loss: 221.5824 (221.6588)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9633  data: 0.3711  max mem: 18965
Global Epoch: [0]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 221.5824 (221.6075)  ce_loss: 221.5824 (221.6075)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8838  data: 0.2969  max mem: 18965
Global Epoch: [0]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 221.5638 (221.5924)  ce_loss: 221.5638 (221.5924)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 4096.0000 (10922.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8485  data: 0.2474  max mem: 18965
Global Epoch: [0]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 221.5638 (211.1339)  ce_loss: 221.5638 (211.1339)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 4096.0000 (9654.8571)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8108  data: 0.2121  max mem: 20688
Global Epoch: [0] Total time: 0:00:05 (0.8460 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 221.5638 (211.0493)  ce_loss: 221.5638 (211.0493)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 4096.0000 (9654.8571)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [1]  [0/7]  eta: 0:00:10  lr: 0.000600  min_lr: 0.000030  loss: 97.1743 (97.1743)  ce_loss: 97.1743 (97.1743)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 1.5370  data: 0.9114  max mem: 20688
Global Epoch: [1]  [1/7]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 97.1743 (97.6943)  ce_loss: 97.1743 (97.6943)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 1.0528  data: 0.4558  max mem: 20688
Global Epoch: [1]  [2/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 97.9733 (97.7873)  ce_loss: 97.9733 (97.7873)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 0.8931  data: 0.3039  max mem: 20688
Global Epoch: [1]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 97.9733 (97.8713)  ce_loss: 97.9733 (97.8713)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (16384.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 0.8184  data: 0.2279  max mem: 20688
Global Epoch: [1]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 97.9733 (91.0852)  ce_loss: 97.9733 (91.0852)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (14745.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 0.7727  data: 0.1824  max mem: 20688
Global Epoch: [1]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 97.1743 (82.7252)  ce_loss: 97.1743 (82.7252)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (13653.3333)  weight_decay: 0.0001 (0.0001)  grad_norm: 147.4982 (inf)  time: 0.7426  data: 0.1520  max mem: 20688
Global Epoch: [1]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 97.1743 (74.9574)  ce_loss: 97.1743 (74.9574)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12873.1429)  weight_decay: 0.0001 (0.0001)  grad_norm: 147.4982 (inf)  time: 0.7216  data: 0.1303  max mem: 20688
Global Epoch: [1] Total time: 0:00:05 (0.7431 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 97.1743 (75.0668)  ce_loss: 97.1743 (75.0668)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12873.1429)  weight_decay: 0.0001 (0.0001)  grad_norm: 147.4982 (inf)
Global Epoch: [2]  [0/7]  eta: 0:00:10  lr: 0.000600  min_lr: 0.000030  loss: 19.9189 (19.9189)  ce_loss: 19.9189 (19.9189)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 29.5659 (29.5659)  time: 1.5648  data: 0.9713  max mem: 20688
Global Epoch: [2]  [1/7]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 15.1009 (17.5099)  ce_loss: 15.1009 (17.5099)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 21.9824 (25.7742)  time: 1.0788  data: 0.4857  max mem: 20688
Global Epoch: [2]  [2/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 15.1009 (15.5517)  ce_loss: 15.1009 (15.5517)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 21.9824 (21.2771)  time: 0.9159  data: 0.3238  max mem: 20688
Global Epoch: [2]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 11.6354 (13.8066)  ce_loss: 11.6354 (13.8066)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 12.2829 (18.1327)  time: 0.8343  data: 0.2429  max mem: 20688
Global Epoch: [2]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 11.6354 (12.4827)  ce_loss: 11.6354 (12.4827)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 12.2829 (15.7667)  time: 0.7859  data: 0.1943  max mem: 20688
Global Epoch: [2]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 8.5711 (11.3981)  ce_loss: 8.5711 (11.3981)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 8.6997 (14.0813)  time: 0.7550  data: 0.1620  max mem: 20688
Global Epoch: [2]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 8.5711 (10.7365)  ce_loss: 8.5711 (10.7365)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 8.6997 (13.1349)  time: 0.7315  data: 0.1388  max mem: 20688
Global Epoch: [2] Total time: 0:00:05 (0.7528 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 8.5711 (10.2501)  ce_loss: 8.5711 (10.2501)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 8.6997 (13.1349)
Global Epoch: [3]  [0/7]  eta: 0:00:12  lr: 0.000600  min_lr: 0.000030  loss: 5.4811 (5.4811)  ce_loss: 5.4811 (5.4811)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.6083 (5.6083)  time: 1.7163  data: 1.0286  max mem: 20688
Global Epoch: [3]  [1/7]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 5.4811 (5.7907)  ce_loss: 5.4811 (5.7907)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.6083 (6.8227)  time: 1.1543  data: 0.5144  max mem: 20688
Global Epoch: [3]  [2/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 6.1004 (6.3381)  ce_loss: 6.1004 (6.3381)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 8.0372 (7.2595)  time: 0.9676  data: 0.3429  max mem: 20688
Global Epoch: [3]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 5.4811 (5.8912)  ce_loss: 5.4811 (5.8912)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.6083 (6.6321)  time: 0.8742  data: 0.2572  max mem: 20688
Global Epoch: [3]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 5.4811 (5.7098)  ce_loss: 5.4811 (5.7098)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 7.8914 (6.8839)  time: 0.8178  data: 0.2058  max mem: 20688
Global Epoch: [3]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 4.9838 (5.4080)  ce_loss: 4.9838 (5.4080)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.6155 (6.6725)  time: 0.7801  data: 0.1715  max mem: 20688
Global Epoch: [3]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 4.9838 (5.2308)  ce_loss: 4.9838 (5.2308)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 6.4635 (6.6427)  time: 0.7534  data: 0.1470  max mem: 20688
Global Epoch: [3] Total time: 0:00:05 (0.7813 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 4.9838 (5.5885)  ce_loss: 4.9838 (5.5885)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 6.4635 (6.6427)
Global Epoch: [4]  [0/7]  eta: 0:00:10  lr: 0.000600  min_lr: 0.000030  loss: 6.1900 (6.1900)  ce_loss: 6.1900 (6.1900)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 4.2300 (4.2300)  time: 1.5441  data: 0.9371  max mem: 20688
Global Epoch: [4]  [1/7]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 4.1064 (5.1482)  ce_loss: 4.1064 (5.1482)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 4.2300 (5.6808)  time: 1.0679  data: 0.4686  max mem: 20688
Global Epoch: [4]  [2/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 4.2894 (4.8619)  ce_loss: 4.2894 (4.8619)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.3645 (5.5754)  time: 0.9091  data: 0.3124  max mem: 20688
Global Epoch: [4]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 4.1064 (4.5241)  ce_loss: 4.1064 (4.5241)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.3645 (5.5617)  time: 0.8305  data: 0.2344  max mem: 20688
Global Epoch: [4]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 4.2894 (4.7881)  ce_loss: 4.2894 (4.7881)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.5208 (5.7364)  time: 0.7836  data: 0.1875  max mem: 20688
Global Epoch: [4]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 4.2894 (4.7641)  ce_loss: 4.2894 (4.7641)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.5208 (5.7940)  time: 0.7507  data: 0.1563  max mem: 20688
Global Epoch: [4]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 4.2894 (4.6589)  ce_loss: 4.2894 (4.6589)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 6.0819 (5.8655)  time: 0.7280  data: 0.1340  max mem: 20688
Global Epoch: [4] Total time: 0:00:05 (0.7576 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 4.2894 (4.7009)  ce_loss: 4.2894 (4.7009)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 6.0819 (5.8655)
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
Weights of BEiT3ForVisualQuestionAnswering not initialized from pretrained model: ['fusion.img_norm.weight', 'fusion.img_norm.bias', 'fusion.text_norm.weight', 'fusion.text_norm.bias', 'fusion.dense.weight', 'fusion.dense.bias', 'head.0.weight', 'head.0.bias', 'head.1.weight', 'head.1.bias', 'head.3.weight', 'head.3.bias']
Weights from pretrained model not used in BEiT3ForVisualQuestionAnswering: ['mlm_head.weight', 'mlm_head.bias', 'mim_head.weight', 'mim_head.bias', 'beit3.encoder.layer_norm.A.weight', 'beit3.encoder.layer_norm.A.bias', 'beit3.encoder.layer_norm.B.weight', 'beit3.encoder.layer_norm.B.bias']
Load 359 image-text pairs from ./data/modal-missing-non-iid/client17-img-text.jsonl. 
Load 5000 image-text pairs from ./data/vqa.rest_val.jsonl. 
Load 359 image-text pairs from ./data/modal-missing-non-iid/client17-img-text.jsonl. 
Global Epoch: [0]  [0/7]  eta: 0:00:15  lr: 0.000600  min_lr: 0.000030  loss: 218.5943 (218.5943)  ce_loss: 218.5943 (218.5943)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 2.1652  data: 1.5474  max mem: 20688
Global Epoch: [0]  [1/7]  eta: 0:00:08  lr: 0.000600  min_lr: 0.000030  loss: 217.6906 (218.1424)  ce_loss: 217.6906 (218.1424)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.3674  data: 0.7738  max mem: 20688
Global Epoch: [0]  [2/7]  eta: 0:00:05  lr: 0.000600  min_lr: 0.000030  loss: 218.1424 (218.1424)  ce_loss: 218.1424 (218.1424)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.1032  data: 0.5159  max mem: 20688
Global Epoch: [0]  [3/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 218.1424 (218.4266)  ce_loss: 218.1424 (218.4266)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0150  data: 0.3869  max mem: 20688
Global Epoch: [0]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 218.5943 (218.4881)  ce_loss: 218.5943 (218.4881)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12697.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9285  data: 0.3096  max mem: 20688
Global Epoch: [0]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 218.1424 (218.3958)  ce_loss: 218.1424 (218.3958)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 4096.0000 (10922.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8734  data: 0.2580  max mem: 20688
Global Epoch: [0]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 218.1424 (208.0763)  ce_loss: 218.1424 (208.0763)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 4096.0000 (9654.8571)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8336  data: 0.2211  max mem: 20688
Global Epoch: [0] Total time: 0:00:06 (0.8575 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 218.1424 (208.0300)  ce_loss: 218.1424 (208.0300)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 4096.0000 (9654.8571)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [1]  [0/7]  eta: 0:00:11  lr: 0.000600  min_lr: 0.000030  loss: 95.0201 (95.0201)  ce_loss: 95.0201 (95.0201)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 1.6556  data: 1.0849  max mem: 20688
Global Epoch: [1]  [1/7]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 95.0201 (95.3561)  ce_loss: 95.0201 (95.3561)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 1.1113  data: 0.5425  max mem: 20688
Global Epoch: [1]  [2/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 95.6921 (95.4817)  ce_loss: 95.6921 (95.4817)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 0.9296  data: 0.3617  max mem: 20688
Global Epoch: [1]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 95.6921 (95.5851)  ce_loss: 95.6921 (95.5851)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (16384.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 0.8471  data: 0.2713  max mem: 20688
Global Epoch: [1]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 95.6921 (89.2722)  ce_loss: 95.6921 (89.2722)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (14745.6000)  weight_decay: 0.0001 (0.0001)  grad_norm: inf (inf)  time: 0.7955  data: 0.2170  max mem: 20688
Global Epoch: [1]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 95.0201 (81.2781)  ce_loss: 95.0201 (81.2781)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (13653.3333)  weight_decay: 0.0001 (0.0001)  grad_norm: 146.7662 (inf)  time: 0.7615  data: 0.1809  max mem: 20688
Global Epoch: [1]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 95.0201 (73.6258)  ce_loss: 95.0201 (73.6258)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12873.1429)  weight_decay: 0.0001 (0.0001)  grad_norm: 146.7662 (inf)  time: 0.7372  data: 0.1551  max mem: 20688
Global Epoch: [1] Total time: 0:00:05 (0.7715 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 95.0201 (73.7733)  ce_loss: 95.0201 (73.7733)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (12873.1429)  weight_decay: 0.0001 (0.0001)  grad_norm: 146.7662 (inf)
Global Epoch: [2]  [0/7]  eta: 0:00:11  lr: 0.000600  min_lr: 0.000030  loss: 19.7824 (19.7824)  ce_loss: 19.7824 (19.7824)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 30.3173 (30.3173)  time: 1.5812  data: 0.9808  max mem: 20688
Global Epoch: [2]  [1/7]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 14.5544 (17.1684)  ce_loss: 14.5544 (17.1684)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 21.4851 (25.9012)  time: 1.0861  data: 0.4905  max mem: 20688
Global Epoch: [2]  [2/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 14.5544 (14.8083)  ce_loss: 14.5544 (14.8083)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 21.4851 (21.5537)  time: 0.9214  data: 0.3270  max mem: 20688
Global Epoch: [2]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 10.0880 (12.9549)  ce_loss: 10.0880 (12.9549)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 12.8588 (18.1486)  time: 0.8395  data: 0.2453  max mem: 20688
Global Epoch: [2]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 10.0880 (12.0011)  ce_loss: 10.0880 (12.0011)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 12.8588 (15.6265)  time: 0.7896  data: 0.1962  max mem: 20688
Global Epoch: [2]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 8.1861 (11.2029)  ce_loss: 8.1861 (11.2029)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 7.9332 (13.8998)  time: 0.7575  data: 0.1635  max mem: 20688
Global Epoch: [2]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 8.1861 (10.5456)  ce_loss: 8.1861 (10.5456)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 7.9332 (12.5942)  time: 0.7338  data: 0.1402  max mem: 20688
Global Epoch: [2] Total time: 0:00:05 (0.7592 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 8.1861 (10.2791)  ce_loss: 8.1861 (10.2791)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 7.9332 (12.5942)
Global Epoch: [3]  [0/7]  eta: 0:00:11  lr: 0.000600  min_lr: 0.000030  loss: 7.3481 (7.3481)  ce_loss: 7.3481 (7.3481)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.0433 (5.0433)  time: 1.6394  data: 1.0412  max mem: 20688
Global Epoch: [3]  [1/7]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 5.6700 (6.5090)  ce_loss: 5.6700 (6.5090)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.0433 (7.0181)  time: 1.1179  data: 0.5206  max mem: 20688
Global Epoch: [3]  [2/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 5.6700 (5.7509)  ce_loss: 5.6700 (5.7509)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 8.9929 (7.7100)  time: 0.9430  data: 0.3471  max mem: 20688
Global Epoch: [3]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 4.6579 (5.4777)  ce_loss: 4.6579 (5.4777)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.2847 (7.1037)  time: 0.8564  data: 0.2604  max mem: 20688
Global Epoch: [3]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 4.6579 (5.3037)  ce_loss: 4.6579 (5.3037)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.2920 (6.7413)  time: 0.8021  data: 0.2083  max mem: 20688
Global Epoch: [3]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 4.6579 (5.2751)  ce_loss: 4.6579 (5.2751)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.2847 (6.4172)  time: 0.7672  data: 0.1736  max mem: 20688
Global Epoch: [3]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 5.1320 (5.3420)  ce_loss: 5.1320 (5.3420)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.2847 (6.1713)  time: 0.7422  data: 0.1488  max mem: 20688
Global Epoch: [3] Total time: 0:00:05 (0.7663 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 5.1320 (5.7851)  ce_loss: 5.1320 (5.7851)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.2847 (6.1713)
Global Epoch: [4]  [0/7]  eta: 0:00:11  lr: 0.000600  min_lr: 0.000030  loss: 5.5126 (5.5126)  ce_loss: 5.5126 (5.5126)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 4.7203 (4.7203)  time: 1.6391  data: 1.0418  max mem: 20688
Global Epoch: [4]  [1/7]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 4.6149 (5.0638)  ce_loss: 4.6149 (5.0638)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 4.7203 (5.9164)  time: 1.1152  data: 0.5210  max mem: 20688
Global Epoch: [4]  [2/7]  eta: 0:00:04  lr: 0.000600  min_lr: 0.000030  loss: 5.5126 (5.5190)  ce_loss: 5.5126 (5.5190)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 6.2344 (6.0224)  time: 0.9405  data: 0.3473  max mem: 20688
Global Epoch: [4]  [3/7]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 5.5126 (5.6324)  ce_loss: 5.5126 (5.6324)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.5637 (5.9077)  time: 0.8535  data: 0.2605  max mem: 20688
Global Epoch: [4]  [4/7]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 5.5126 (5.5544)  ce_loss: 5.5126 (5.5544)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.5637 (5.8137)  time: 0.8013  data: 0.2084  max mem: 20688
Global Epoch: [4]  [5/7]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 5.2424 (5.2477)  ce_loss: 5.2424 (5.2477)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.4376 (5.4850)  time: 0.7671  data: 0.1737  max mem: 20688
Global Epoch: [4]  [6/7]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 5.5126 (5.3062)  ce_loss: 5.5126 (5.3062)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.4376 (5.3677)  time: 0.7434  data: 0.1489  max mem: 20688
Global Epoch: [4] Total time: 0:00:05 (0.7799 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 5.5126 (4.9296)  ce_loss: 5.5126 (4.9296)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 65536.0000 (65536.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: 5.4376 (5.3677)
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
Weights of BEiT3ForVisualQuestionAnswering not initialized from pretrained model: ['fusion.img_norm.weight', 'fusion.img_norm.bias', 'fusion.text_norm.weight', 'fusion.text_norm.bias', 'fusion.dense.weight', 'fusion.dense.bias', 'head.0.weight', 'head.0.bias', 'head.1.weight', 'head.1.bias', 'head.3.weight', 'head.3.bias']
Weights from pretrained model not used in BEiT3ForVisualQuestionAnswering: ['mlm_head.weight', 'mlm_head.bias', 'mim_head.weight', 'mim_head.bias', 'beit3.encoder.layer_norm.A.weight', 'beit3.encoder.layer_norm.A.bias', 'beit3.encoder.layer_norm.B.weight', 'beit3.encoder.layer_norm.B.bias']
Load 203 image-text pairs from ./data/modal-missing-non-iid/client15-img-text.jsonl. 
Load 5000 image-text pairs from ./data/vqa.rest_val.jsonl. 
Load 203 image-text pairs from ./data/modal-missing-non-iid/client15-img-text.jsonl. 
Global Epoch: [0]  [0/4]  eta: 0:00:07  lr: 0.000600  min_lr: 0.000030  loss: 220.0424 (220.0424)  ce_loss: 220.0424 (220.0424)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.9733  data: 1.2767  max mem: 20688
Global Epoch: [0]  [1/4]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 219.9564 (219.9994)  ce_loss: 219.9564 (219.9994)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.2723  data: 0.6384  max mem: 20688
Global Epoch: [0]  [2/4]  eta: 0:00:02  lr: 0.000600  min_lr: 0.000030  loss: 219.9564 (219.7878)  ce_loss: 219.9564 (219.7878)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0441  data: 0.4256  max mem: 20688
Global Epoch: [0]  [3/4]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 219.8833 (219.8117)  ce_loss: 219.8833 (219.8117)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9313  data: 0.3192  max mem: 20688
Global Epoch: [0] Total time: 0:00:04 (1.0079 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 219.8833 (219.9186)  ce_loss: 219.8833 (219.9186)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [1]  [0/4]  eta: 0:00:07  lr: 0.000600  min_lr: 0.000030  loss: 219.3388 (219.3388)  ce_loss: 219.3388 (219.3388)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.7811  data: 0.9837  max mem: 20688
Global Epoch: [1]  [1/4]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 219.3388 (220.1800)  ce_loss: 219.3388 (220.1800)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.1746  data: 0.4919  max mem: 20688
Global Epoch: [1]  [2/4]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 219.5009 (219.9536)  ce_loss: 219.5009 (219.9536)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9735  data: 0.3280  max mem: 20688
Global Epoch: [1]  [3/4]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 219.3388 (219.7400)  ce_loss: 219.3388 (219.7400)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8729  data: 0.2460  max mem: 20688
Global Epoch: [1] Total time: 0:00:03 (0.9504 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 219.3388 (219.7564)  ce_loss: 219.3388 (219.7564)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [2]  [0/4]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 219.1128 (219.1128)  ce_loss: 219.1128 (219.1128)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.5478  data: 0.9429  max mem: 20688
Global Epoch: [2]  [1/4]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 219.1128 (219.8755)  ce_loss: 219.1128 (219.8755)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0600  data: 0.4715  max mem: 20688
Global Epoch: [2]  [2/4]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 219.1128 (219.4936)  ce_loss: 219.1128 (219.4936)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8964  data: 0.3144  max mem: 20688
Global Epoch: [2]  [3/4]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 219.1128 (219.7707)  ce_loss: 219.1128 (219.7707)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8148  data: 0.2358  max mem: 20688
Global Epoch: [2] Total time: 0:00:03 (0.8918 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 219.1128 (219.6505)  ce_loss: 219.1128 (219.6505)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [3]  [0/4]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 220.1984 (220.1984)  ce_loss: 220.1984 (220.1984)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.5722  data: 0.9653  max mem: 20688
Global Epoch: [3]  [1/4]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 220.1984 (220.3004)  ce_loss: 220.1984 (220.3004)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0714  data: 0.4827  max mem: 20688
Global Epoch: [3]  [2/4]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 220.1984 (220.1004)  ce_loss: 220.1984 (220.1004)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9056  data: 0.3218  max mem: 20688
Global Epoch: [3]  [3/4]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 220.0198 (220.0802)  ce_loss: 220.0198 (220.0802)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8223  data: 0.2414  max mem: 20688
Global Epoch: [3] Total time: 0:00:03 (0.8977 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 220.0198 (219.9062)  ce_loss: 220.0198 (219.9062)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
Global Epoch: [4]  [0/4]  eta: 0:00:06  lr: 0.000600  min_lr: 0.000030  loss: 219.9427 (219.9427)  ce_loss: 219.9427 (219.9427)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 32768.0000 (32768.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.5965  data: 1.0258  max mem: 20688
Global Epoch: [4]  [1/4]  eta: 0:00:03  lr: 0.000600  min_lr: 0.000030  loss: 218.9389 (219.4408)  ce_loss: 218.9389 (219.4408)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (24576.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 1.0849  data: 0.5130  max mem: 20688
Global Epoch: [4]  [2/4]  eta: 0:00:01  lr: 0.000600  min_lr: 0.000030  loss: 219.9427 (219.7398)  ce_loss: 219.9427 (219.7398)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 16384.0000 (19114.6667)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.9119  data: 0.3420  max mem: 20688
Global Epoch: [4]  [3/4]  eta: 0:00:00  lr: 0.000600  min_lr: 0.000030  loss: 219.5852 (219.7011)  ce_loss: 219.5852 (219.7011)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)  time: 0.8280  data: 0.2565  max mem: 20688
Global Epoch: [4] Total time: 0:00:03 (0.9037 s / it)
Averaged stats: lr: 0.000600  min_lr: 0.000030  loss: 219.5852 (219.7801)  ce_loss: 219.5852 (219.7801)  clip_loss_i: nan (nan)  clip_loss_t: nan (nan)  clip_loss_f: nan (nan)  loss_scale: 8192.0000 (15360.0000)  weight_decay: 0.0001 (0.0001)  grad_norm: nan (nan)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 197, in <module>
[rank3]:     main(opts)
[rank3]:   File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 186, in main
[rank3]:     algo.run(
[rank3]:   File "/data/home/yuqiwei/Multimodal_Federated/src/algorithm/MMFL.py", line 78, in run
[rank3]:     steps_per_epoch = self.sample_num_dict[client_id]["total"] // global_batch_size
[rank3]: KeyError: 34
[rank2]: Traceback (most recent call last):
[rank2]:   File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 197, in <module>
[rank2]:     main(opts)
[rank2]:   File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 186, in main
[rank2]:     algo.run(
[rank2]:   File "/data/home/yuqiwei/Multimodal_Federated/src/algorithm/MMFL.py", line 78, in run
[rank2]:     steps_per_epoch = self.sample_num_dict[client_id]["total"] // global_batch_size
[rank2]: KeyError: 34
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 197, in <module>
[rank1]:     main(opts)
[rank1]:   File "/data/home/yuqiwei/Multimodal_Federated/src/main.py", line 186, in main
[rank1]:     algo.run(
[rank1]:   File "/data/home/yuqiwei/Multimodal_Federated/src/algorithm/MMFL.py", line 78, in run
[rank1]:     steps_per_epoch = self.sample_num_dict[client_id]["total"] // global_batch_size
[rank1]: KeyError: 34
Load ckpt from ./init_weight/beit3_base_patch16_224.pth
Load state_dict by model_key = model
Position interpolate from 14x14 to 30x30
W1121 19:12:02.257825 856815 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 856891 closing signal SIGTERM
W1121 19:12:02.261298 856815 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 856892 closing signal SIGTERM
W1121 19:12:02.261759 856815 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 856893 closing signal SIGTERM
E1121 19:12:02.827662 856815 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 856896) of binary: /data/home/yuqiwei/.conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/home/yuqiwei/.conda/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_19:12:02
  host      : hello-DSS8440
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 856896)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
yongrenx commented 1 week ago

I tracked down the cause over the past couple of days, and everything is working now.
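
The thread does not say what the cause was. The traceback above shows only that `self.sample_num_dict[client_id]` in `MMFL.py` had no entry for client id 34 on ranks 1 to 3. For anyone hitting the same `KeyError`, here is a minimal sketch of a lookup that reports which client ids are actually present. The helper name `steps_per_epoch_for`, the shape of the dictionary, and the batch size of 48 are assumptions for illustration, not the repository's code:

```python
def steps_per_epoch_for(sample_num_dict, client_id, global_batch_size):
    """Return the number of full batches per epoch for one client.

    Hypothetical helper: assumes sample_num_dict maps
    client_id -> {"total": <number of samples>}.
    """
    if client_id not in sample_num_dict:
        # Report the known ids so a mismatch between the sampled clients
        # and the loaded data files is visible in the error message.
        known = sorted(sample_num_dict)
        raise KeyError(
            f"client {client_id} has no entry in sample_num_dict; "
            f"known client ids: {known}"
        )
    return sample_num_dict[client_id]["total"] // global_batch_size


# Illustrative values only: 203 samples matches the client15 line in the
# log above, while the global batch size of 48 is an assumption.
sample_num_dict = {15: {"total": 203}}
print(steps_per_epoch_for(sample_num_dict, 15, 48))

try:
    steps_per_epoch_for(sample_num_dict, 34, 48)
except KeyError as err:
    print(err)
```

If the missing id appears on only some ranks, as it does here, it may be worth checking that every rank builds the dictionary from the same set of client data files.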