YangSun22 / TC-MoA

Task-Customized Mixture of Adapters for General Image Fusion (CVPR 2024)
58 stars 4 forks

train failed when I trained the network #2

Closed sove45 closed 3 months ago

sove45 commented 3 months ago

Excuse me, I ran into some difficulties when training the network. My cfg settings were:

```yaml
# GPU setting
device: cuda

# Training setting
seed: 0
upsample: true        # Whether to upsample 2x in the network
lr: 1.5e-4            # Initial learning rate
min_lr: 1.0e-6        # Minimum value of the progressively decreasing learning rate
num_workers: 3
weight_decay: 0.05
epochs: 20            # Total number of training epochs required
warmup_epochs: 2      # Warm-up epochs
load_start_epoch: 0   # Only needed when retraining from a breakpoint: read parameters from
                      # the i-th epoch, to facilitate computing the learning rate and other
                      # hyperparameters
log_dir: ./output/log/           # Path to save the log
output_dir: /15342518312/Image_Fusion_JP/TC-MOA/output/all_in_one/   # Path to the output model
output_img_dir: ./output/img/    # Path to output intermediate fusion results
pretrain_weight_path: /15342518312/Image_Fusion_JP/TC-MOA/checkpoint/mae_visualize_vit_large_ganloss.pth  # Path to the parameters of the pre-trained base model
ckp_path: None        # Only required for breakpoint retraining: path of the TC-MoA model
                      # parameters to import
save_img_interval: 64 # Iteration interval for outputting intermediate fusion results

# Model setting
method_name: TC_MoA_Base  # Name given to the current model when saving it
batch_size: 5             # Dataset batch size for each task
use_ema: True             # Whether to use EMA
interval_tau: 4           # tau hyperparameter: number of Blocks between two TC-MoA modules
task_num: 1               # Total number of tasks
tau_shift_value: 2        # Specific position of TC-MoA within each tau block
shift_window_size: 14     # Window size in WindowAttention (in patches)
model_type: mae_vit_large_patch16  # mae_vit_large_patch16 or mae_vit_base_patch16

# Task setting
VIF: true             # Whether to train the VIF task
VIF_dataset_dict:     # Name and path of the datasets to be trained
  LLVIP: /15342518312/z_datas/train/train/LLVIP
arbitrary_input_size: false

MEF: true             # Whether to train the MEF task
MEF_dataset_dict:
  SCIE: /15342518312/z_datas/train/train/SCIE

MFF: true             # Whether to train the MFF task
MFF_dataset_dict:
  RealMFF: /15342518312/z_datas/train/train/RealMFF
  MFI-WHU: /15342518312/z_datas/train/train/MFI-WHU
```
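A minimal sketch of reading such a config with PyYAML (an assumption: the repo may load base.yaml differently). One gotcha worth noting: PyYAML only recognises scientific-notation floats like `1.5e-4` when they have both a decimal point and a signed exponent, so a bare `1e-4` would come back as a string.

```python
import yaml

# Trimmed excerpt of the config above, parsed from a string for illustration.
cfg_text = """
device: cuda
upsample: true
lr: 1.5e-4
epochs: 20
VIF_dataset_dict:
  LLVIP: /15342518312/z_datas/train/train/LLVIP
"""

cfg = yaml.safe_load(cfg_text)
```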

Parameters in main_train.py:

```python
import argparse

def get_args_parser():
    # config path
    parser = argparse.ArgumentParser('TC-MoA', add_help=False)
    parser.add_argument('--config_path', default='/15342518312/Image_Fusion_JP/TC-MOA/config/base.yaml', type=str,
                        help='config path to load')
    # Dataset parameters
    parser.add_argument('--pin_mem', action='store_true',
                        help='Pin CPU memory in DataLoader for more efficient (sometimes) transfer to GPU.')
    parser.add_argument('--no_pin_mem', action='store_false', dest='pin_mem')
    parser.set_defaults(pin_mem=False)
    # Distributed training parameters
    parser.add_argument('--world_size', default=8, type=int,
                        help='number of distributed processes')
    parser.add_argument('--local_rank', type=int)
    parser.add_argument('--dist_on_itp', action='store_true')
    parser.add_argument('--dist_url', default='env://',
                        help='url used to set up distributed training')
    parser.add_argument('--distributed', default=True,
                        help='whether to use distributed training')
    return parser
```
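For reference, `--pin_mem` and `--no_pin_mem` in the parser above both write to the same `pin_mem` destination, with the default forced to `False`. A self-contained sketch, trimmed to just that flag pair so the toggle behaviour can be checked in isolation:

```python
import argparse

# Stand-in mirroring only the pin_mem flags from get_args_parser().
def make_parser():
    parser = argparse.ArgumentParser('TC-MoA', add_help=False)
    parser.add_argument('--pin_mem', action='store_true',
                        help='Pin CPU memory in DataLoader.')
    parser.add_argument('--no_pin_mem', action='store_false', dest='pin_mem')
    parser.set_defaults(pin_mem=False)
    return parser

default_args = make_parser().parse_args([])            # pin_mem stays False
pinned_args = make_parser().parse_args(['--pin_mem'])  # pin_mem becomes True
```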

After 19 epochs of training I evaluated it with main_predict.py. However, for every image in the test dataset the output was an all-black picture. I can confirm that the training loss was not 0, and the output tensors are not all-zero either.

sove45 commented 3 months ago


sove45 commented 3 months ago

```
tensor([[[0.3049, 0.3169, 0.3132],
         [0.3053, 0.3120, 0.3109],
         [0.3041, 0.3120, 0.3122],
         ...,
         [0.4832, 0.4807, 0.4445],
         [0.5109, 0.5072, 0.4656],
         [0.5029, 0.5009, 0.4625]],

        [[0.3056, 0.3163, 0.3120],
         [0.3082, 0.3143, 0.3108],
         [0.3062, 0.3097, 0.3063],
         ...,
         [0.5232, 0.5202, 0.4902],
         [0.5045, 0.5032, 0.4752],
         [0.4963, 0.4953, 0.4654]],

        [[0.3008, 0.3087, 0.3026],
         [0.3032, 0.3059, 0.3039],
         [0.3125, 0.3127, 0.3114],
         ...,
         [0.5351, 0.5320, 0.5114],
         [0.5050, 0.5023, 0.4826],
         [0.4849, 0.4823, 0.4615]],

        ...,

        [[0.4359, 0.6987, 0.7405],
         [0.6346, 0.9053, 0.9439],
         [0.7060, 0.9509, 0.9678],
         ...,
         [0.3982, 0.4003, 0.4066],
         [0.4084, 0.4132, 0.4159],
         [0.4065, 0.4151, 0.4086]],

        [[0.2979, 0.5256, 0.5448],
         [0.5231, 0.7690, 0.7992],
         [0.5284, 0.7738, 0.8000],
         ...,
         [0.3249, 0.3323, 0.3354],
         [0.3436, 0.3517, 0.3460],
         [0.3441, 0.3567, 0.3392]],

        [[0.1649, 0.3592, 0.3805],
         [0.2035, 0.4144, 0.4569],
         [0.1690, 0.3919, 0.4409],
         ...,
         [0.0707, 0.0807, 0.0797],
         [0.0780, 0.0900, 0.0839],
         [0.0870, 0.1045, 0.0871]]])
```

The output image tensor.

sove45 commented 3 months ago

Sorry, it was my carelessness. I fixed the bug by replacing `img = img.permute(1,2,0)` with `img = img.permute(1,2,0) * 255`.
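That symptom is consistent with writing a float image in [0, 1] straight to disk: `cv2.imwrite` casts to uint8 without rescaling, so nearly every pixel truncates to 0 and the saved file looks black. A small numpy-only illustration of the effect (cv2 itself is not needed to see it):

```python
import numpy as np

# A fused image as produced by the network: float values in [0, 1).
img = np.random.rand(64, 64, 3).astype(np.float32)

# Without the *255 fix: casting truncates every value below 1.0 to 0 -> black.
dark = img.astype(np.uint8)

# With the fix: rescale to [0, 255] first, then cast.
bright = np.clip(img * 255.0, 0, 255).astype(np.uint8)
```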

YangSun22 commented 3 months ago

Firstly, I would like to confirm whether ./output/img/ contains any intermediate fusion results saved during training, and whether those are also solid black images. If the in-process results are black as well, my current guess is that it is an image-saving issue. The scipy version I use for `from scipy.misc import imsave` in the dataloader is scipy==1.2.1; that function saves a colour image directly from the normalised matrix.

sove45 commented 3 months ago

Yes. To solve the version problem, I replaced `imsave` with `cv2.imwrite`.
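Note that `scipy.misc.imsave` was removed after scipy 1.2.x, and `cv2.imwrite` differs from it in two ways: it does not rescale a normalised float array, and OpenCV expects BGR rather than RGB channel order. A hedged sketch of an equivalent save path (`to_uint8` is our name, not the repo's):

```python
import numpy as np

def to_uint8(img):
    """Rescale a normalised [0, 1] float image to uint8 in [0, 255]."""
    return np.clip(img * 255.0, 0.0, 255.0).astype(np.uint8)

# Usage with OpenCV (assumption: cv2 is installed; imwrite expects BGR order):
#   import cv2
#   cv2.imwrite("fused.png", cv2.cvtColor(to_uint8(img), cv2.COLOR_RGB2BGR))
```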

Amelia0109 commented 2 months ago

Excuse me, can you share a copy of the code you have modified? I am also facing this problem now. Thank you! @sove45