YangSun22 / TC-MoA

Task-Customized Mixture of Adapters for General Image Fusion (CVPR 2024)

ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['python', '/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/main_train.py']' command failed. (See above for error) #1

Closed sove45 closed 7 months ago

sove45 commented 8 months ago

Excuse me, I ran into the following error while training this network.

Model setting:

    method_name: TC_MoA_Base            # Name given to the current model when saving the model
    batch_size: 3                       # Dataset batch size for each task
    use_ema: True                       # Whether to use EMA
    interval_tau: 4                     # tau hyperparameter: number of Blocks between two TC-MoA modules
    task_num: 1                         # Total number of tasks
    tau_shift_value: 2                  # Specific position of TC-MoA in each tau block
    shift_window_size: 14               # Size of window in windowsAttention (in patches)
    model_type: mae_vit_base_patch16    # mae_vit_large_patch16 or mae_vit_base_patch16

The error is:

    Traceback (most recent call last):
      File "/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/main_train.py", line 280, in <module>
        main(args,config)
      File "/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/main_train.py", line 230, in main
        param_new = ema(name, state_dict[name])
    KeyError: 'module.Alpha_encoder'
    ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['python', '/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/main_train.py']' command failed. (See above for error)

sove45 commented 8 months ago

When I deleted this code:

    for name in ema_name_list:
        param_new = ema(name, state_dict[name])

the remaining code raised an error at:

    [13:32:26.957193] Start training for 20 epochs
    [13:32:26.959012] log_dir: ./output/log/TC_MoA_Base
    /15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/util/TwoPath_transforms.py:133: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
      warnings.warn(
    /15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/util/TwoPath_transforms.py:133: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
      warnings.warn(
    /15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/util/TwoPath_transforms.py:133: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
      warnings.warn(
    Traceback (most recent call last):
      File "/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/main_train.py", line 279, in <module>
        main(args,config)
      File "/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/main_train.py", line 244, in main
        train_stats = train_one_epoch(
      File "/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/engine_train.py", line 132, in train_one_epoch
        loss_RGBT,pred= train_one_iter(model,task_dict["VIF"],samples_rgb,samples_t,device,global_rank,[epoch,data_iter_step],optimizer,loss_scaler,ema,rgb_train_info,config)
      File "/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/engine_train.py", line 38, in train_one_iter
        param_new = ema(name, state_dict[name])
    KeyError: 'module.Alpha_encoder'
    ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['python', '/15342518312/Image_Fusion_JP/TC-MOA/TC-MoA/main_train.py']' command failed. (See above for error)


YangSun22 commented 8 months ago

I think it's a question of whether or not distributed training is used. If the parameters are loaded in distributed mode, every key has a "module." prefix. Distributed training is used by default in the code. Please check main_train.py:

    if config["use_ema"]:
        state_dict = model.state_dict()
        print(state_dict)       # Add print here
        ema_name_list = ["module.Alpha_encoder",
                         "module.Alpha_decoder",
                         ]
        ema_name_list += ["module."+i for i in ema_MoA_list]
        ema_name_list += ["module."+i for i in ema_windows_list]
        print("EMA_LIST:", ema_name_list)
        ema = EMA(0.99, ema_name_list)
        for name in ema_name_list:
            param_new = ema(name, state_dict[name])
    else:
        ema = None

Then print the keys of all parameters and check whether they start with "module.". If they do not, remove every "module." prefix from this code; a prefix-agnostic sketch is shown below.
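For reference, here is a minimal sketch (not the repository's exact code) of building the list so that it works with or without the DistributedDataParallel "module." prefix, assuming the same variables as the snippet above (config, model, ema_MoA_list, ema_windows_list, EMA):

    if config["use_ema"]:
        state_dict = model.state_dict()
        # DistributedDataParallel wraps the model, so its parameter keys start with
        # "module."; a plain single-GPU model does not. Detect the prefix instead of
        # hard-coding it.
        prefix = "module." if any(k.startswith("module.") for k in state_dict) else ""
        ema_name_list = [prefix + "Alpha_encoder", prefix + "Alpha_decoder"]
        ema_name_list += [prefix + i for i in ema_MoA_list]
        ema_name_list += [prefix + i for i in ema_windows_list]
        ema = EMA(0.99, ema_name_list)
        for name in ema_name_list:
            param_new = ema(name, state_dict[name])   # register the initial EMA copy
    else:
        ema = None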

I'll try to add a parameter selection in the next update, thanks for the feedback!

sove45 commented 8 months ago


I ran the code successfully after removing all "module." prefixes from this code. Thank you very much!