MingkunLishigure closed this issue 11 months ago
This is interesting behaviour. I tested this configuration with convnext_tiny.in12k_ft_in1k_384 as the model name and loaded "convnext_tiny_1k_224_ema.pth" as the checkpoint.
The result was: Recall@1: 90.3579 - Recall@5: 97.2131 - Recall@10: 98.0214 - Recall@top1: 98.1508 - AP: 91.9182
Can you send me the dataclass configuration (class TrainingConfiguration) you used?
Mine was:
@dataclass
class TrainingConfiguration:
    # Model
    model: str = 'convnext_tiny.in12k_ft_in1k_384'
    # Override model image size
    img_size: int = 384
    # Training
    mixed_precision: bool = True
    custom_sampling: bool = True  # use custom sampling instead of random
    seed = 1
    epochs: int = 1
    batch_size: int = 128  # keep in mind real_batch_size = 2 * batch_size
    verbose: bool = True
    gpu_ids: tuple = (0,1,2,3,4,5,6,7)  # GPU ids for training
    # Eval
    batch_size_eval: int = 128
    eval_every_n_epoch: int = 1  # eval every n Epoch
    normalize_features: bool = True
    eval_gallery_n: int = -1  # -1 for all or int
    # Optimizer
    clip_grad = 100.  # None | float
    decay_exclue_bias: bool = False
    grad_checkpointing: bool = False  # Gradient Checkpointing
    # Loss
    label_smoothing: float = 0.1
    # Learning Rate
    lr: float = 0.001  # 1 * 10^-4 for ViT | 1 * 10^-1 for CNN
    scheduler: str = "cosine"  # "polynomial" | "cosine" | "constant" | None
    warmup_epochs: float = 0.1
    lr_end: float = 0.0001  # only for "polynomial"
    gradient_accumulation: int = 1
    # Dataset
    dataset: str = 'U1652-D2S'  # 'U1652-D2S' | 'U1652-S2D'
    data_folder: str = "./data/U1652"
    single_sample: bool = False
    # Augment Images
    prob_flip: float = 0.5  # flipping the sat image and drone image simultaneously
    # Savepath for model checkpoints
    model_path: str = "./university_e40_eval4_384_aug_final"
    # Eval before training
    zero_shot: bool = False
    # Checkpoint to start from
    checkpoint_start = "convnext_tiny_1k_224_ema.pth"
    # set num_workers to 0 if on Windows
    num_workers: int = 0 if os.name == 'nt' else 4
    # train on GPU if available
    device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
    # for better performance
    cudnn_benchmark: bool = True
    # make cudnn deterministic
    cudnn_deterministic: bool = False
OK, this is the dataclass configuration we used in the training stage:
class Configuration:
    # Model
    # model: str = 'convnext_base.fb_in22k_ft_in1k_384'
    model: str = 'convnext_tiny.fb_in22k_ft_in1k_384'
    # Override model image size
    img_size: int = 384
    # Training
    mixed_precision: bool = True
    custom_sampling: bool = True  # use custom sampling instead of random
    seed = 1
    epochs: int = 1
    batch_size: int = 128  # keep in mind real_batch_size = 2 * batch_size
    verbose: bool = True
    gpu_ids: tuple = (0,1,2,3)  # GPU ids for training
    # Eval
    batch_size_eval: int = 128
    eval_every_n_epoch: int = 1  # eval every n Epoch
    normalize_features: bool = True
    eval_gallery_n: int = -1  # -1 for all or int
    # Optimizer
    clip_grad = 100.  # None | float
    decay_exclue_bias: bool = False
    grad_checkpointing: bool = False  # Gradient Checkpointing
    # Loss
    label_smoothing: float = 0.1
    # Learning Rate
    lr: float = 0.001  # 1 * 10^-4 for ViT | 1 * 10^-1 for CNN
    scheduler: str = "cosine"  # "polynomial" | "cosine" | "constant" | None
    warmup_epochs: float = 0.1
    lr_end: float = 0.0001  # only for "polynomial"
    # Dataset
    dataset: str = 'U1652-D2S'  # 'U1652-D2S' | 'U1652-S2D'
    data_folder: str = "/root/datasets/University1652"
    # Augment Images
    prob_flip: float = 0.5  # flipping the sat image and drone image simultaneously
    # Savepath for model checkpoints
    model_path: str = "./university"
    # Eval before training
    zero_shot: bool = False
    # Checkpoint to start from
    checkpoint_start = None
    # set num_workers to 0 if on Windows
    num_workers: int = 0 if os.name == 'nt' else 4
    # train on GPU if available
    device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
    # for better performance
    cudnn_benchmark: bool = True
    # make cudnn deterministic
    cudnn_deterministic: bool = False
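Since the two configurations are long and nearly identical, a small helper can list only the fields that differ. The following is a minimal sketch, not part of the repository: `ConfigA` and `ConfigB` are hypothetical stand-ins that carry just a few of the fields posted above, and `config_diff` is an illustrative name.

```python
ABSENT = "<absent>"  # marker for a field that only one config defines


def public_fields(cls: type) -> dict:
    """Collect the plain class-level settings of a config class."""
    return {
        name: value
        for name, value in vars(cls).items()
        if not name.startswith("_") and not callable(value)
    }


def config_diff(a: type, b: type) -> dict:
    """Map each differing field name to its (value in a, value in b) pair."""
    fa, fb = public_fields(a), public_fields(b)
    return {
        name: (fa.get(name, ABSENT), fb.get(name, ABSENT))
        for name in sorted(fa.keys() | fb.keys())
        if fa.get(name, ABSENT) != fb.get(name, ABSENT)
    }


# Hypothetical excerpts of the two configurations posted in this thread.
class ConfigA:
    model = "convnext_tiny.in12k_ft_in1k_384"
    gpu_ids = (0, 1, 2, 3, 4, 5, 6, 7)
    gradient_accumulation = 1
    checkpoint_start = "convnext_tiny_1k_224_ema.pth"
    lr = 0.001


class ConfigB:
    model = "convnext_tiny.fb_in22k_ft_in1k_384"
    gpu_ids = (0, 1, 2, 3)
    checkpoint_start = None
    lr = 0.001


for name, (left, right) in config_diff(ConfigA, ConfigB).items():
    print(f"{name}: {left!r} != {right!r}")
```

The helper reads class-level defaults, so it works for both a plain class and a `@dataclass` whose fields all have defaults.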
This looks good. What does the output of the training script look like?
The output of the training script:
Model: convnext_tiny.fb_in22k_ft_in1k_384
{'input_size': (3, 384, 384), 'interpolation': 'bicubic', 'mean': (0.485, 0.456, 0.406), 'std': (0.229, 0.224, 0.225), 'crop_pct': 1.0, 'crop_mode': 'squash'}
GPUs available: 4
Image Size Query: (384, 384)
Image Size Ground: (384, 384)
Mean: (0.485, 0.456, 0.406)
Std: (0.229, 0.224, 0.225)
Query Images Test: 37855
Gallery Images Test: 951
Scheduler: cosine - max LR: 0.001
Warmup Epochs: 0.1 - Warmup Steps: 29.6
Train Epochs: 1 - Train Steps: 296
Shuffle Dataset:
Original Length: 37854 - Length after Shuffle: 37760
Break Counter: 512
Pairs left out of last batch to avoid creating noise: 94
First Element ID: 1094 - Last Element ID: 1508
------------------------------[Epoch: 1]------------------------------
Epoch: 1, Train Loss = 4.487, Lr = 0.000000
------------------------------[Evaluate]------------------------------
Extract Features:
Compute Scores:
Recall@1: 1.7726 - Recall@5: 5.7192 - Recall@10: 9.4439 - Recall@top1: 10.0991 - AP: 3.3876
Shuffle Dataset:
Original Length: 37854 - Length after Shuffle: 37760
Break Counter: 512
Pairs left out of last batch to avoid creating noise: 94
First Element ID: 0945 - Last Element ID: 1646
I tried it on another machine: I cloned the repo and downloaded the U1652 data. I am very sorry, but I cannot reproduce the issue.
Model: convnext_tiny.fb_in22k_ft_in1k_384
{'input_size': (3, 384, 384), 'interpolation': 'bicubic', 'mean': (0.485, 0.456, 0.406), 'std': (0.229, 0.224, 0.225), 'crop_pct': 1.0, 'crop_mode': 'squash'}
GPUs available: 1
Image Size Query: (384, 384)
Image Size Ground: (384, 384)
Mean: (0.485, 0.456, 0.406)
Std: (0.229, 0.224, 0.225)
Query Images Test: 37855
Gallery Images Test: 951
Scheduler: cosine - max LR: 0.001
Warmup Epochs: 0.1 - Warmup Steps: 59.2
Train Epochs: 1 - Train Steps: 592
Shuffle Dataset:
40185it [00:00, 397875.71it/s]
Original Length: 37854 - Length after Shuffle: 37824
Break Counter: 512
Pairs left out of last batch to avoid creating noise: 30
First Element ID: 1094 - Last Element ID: 1144
------------------------------[Epoch: 1]------------------------------
100%|██████████| 591/591 [05:12<00:00, 1.89it/s, loss=0.7923, loss_avg=0.9036, lr=0.000000]
Epoch: 1, Train Loss = 0.904, Lr = 0.000000
------------------------------[Evaluate]------------------------------
Extract Features:
100%|██████████| 296/296 [01:15<00:00, 3.92it/s]
100%|██████████| 8/8 [00:02<00:00, 3.02it/s]
Compute Scores:
100%|██████████| 37855/37855 [00:13<00:00, 2869.80it/s]
Recall@1: 92.4950 - Recall@5: 98.0584 - Recall@10: 98.7822 - Recall@top1: 98.8799 - AP: 93.7657
Shuffle Dataset:
40271it [00:00, 390630.03it/s]
Original Length: 37854 - Length after Shuffle: 37824
Break Counter: 512
Pairs left out of last batch to avoid creating noise: 30
First Element ID: 0945 - Last Element ID: 1119
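The step counts in the two logs above can be checked with a little arithmetic. The sketch below assumes that the train steps are `ceil(pairs / batch_size) * epochs` and that the warmup steps are the steps per epoch times `warmup_epochs`; this formula is an assumption that happens to reproduce both logs, not something confirmed in the thread. Under that assumption, 296 steps matches a batch size of 128, while 592 steps (and the 591 iterations in the progress bar) would match a batch size of 64, which is an inference rather than a stated setting.

```python
import math


def schedule_steps(num_pairs: int, batch_size: int, epochs: int, warmup_epochs: float):
    """Return (train_steps, warmup_steps) under the assumed formula."""
    steps_per_epoch = math.ceil(num_pairs / batch_size)
    train_steps = steps_per_epoch * epochs
    warmup_steps = steps_per_epoch * warmup_epochs
    return train_steps, warmup_steps


# 37854 is the "Original Length" reported in both logs.
for batch_size in (128, 64):
    train_steps, warmup_steps = schedule_steps(37854, batch_size, epochs=1, warmup_epochs=0.1)
    print(f"batch_size={batch_size}: train steps={train_steps}, warmup steps={warmup_steps:.1f}")
```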
My output seems fine. Did you change anything else in the code?
In the training script, we only changed this part:
config = Configuration()

if config.dataset == 'U1652-D2S':
    config.query_folder_train = '/root/data1/datasets/University1652/train/satellite'
    config.gallery_folder_train = '/root/data1/datasets/University1652/train/drone'
    config.query_folder_test = '/root/data1/datasets/University1652/test/query_drone'
    config.gallery_folder_test = '/root/data1/datasets/University1652/test/gallery_satellite'
elif config.dataset == 'U1652-S2D':
    config.query_folder_train = '/root/data1/datasets/University1652/train/satellite'
    config.gallery_folder_train = '/root/data1/datasets/University1652/train/drone'
    config.query_folder_test = '/root/data1/datasets/University1652/test/query_satellite'
    config.gallery_folder_test = '/root/data1/datasets/University1652/test/gallery_drone'

model = TimmModel(config.model,
                  pretrained=False,
                  img_size=config.img_size)
model_state_dict = torch.load('{}.pth'.format(config.model))
model.load_state_dict(model_state_dict, strict=False)
And is loading this checkpoint required? We set it to None in the training phase:
checkpoint_start = "convnext_tiny_1k_224_ema.pth"
Thank you, I think I found the culprit. When you set pretrained=False, the pretrained weights are not downloaded from pytorch-image-models. The issue is that the pytorch-image-models implementation and the original ConvNeXt implementation differ slightly: the 1x1 convolutions can be implemented either with a linear layer or with PyTorch's Conv2d, which causes problems when loading the weights. Because strict=False skips parameters whose names do not match instead of raising an error, the model can end up training from random initialization without any warning.
So the solution here would be either to set pretrained=True in model.py with timm.create_model(), or to create the model with the code provided in the original repo; loading the original ConvNeXt weights should then work fine.
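One way to confirm this kind of mismatch before training is to count how many parameters of the model would actually be filled from the checkpoint. The sketch below is a hypothetical helper, not code from the repository: `compare_state_dicts` is an illustrative name, the parameter names in the toy example are made up for demonstration, and the unwrapping of a nested "model" entry is an assumption about how some checkpoints are stored.

```python
from collections.abc import Mapping
from typing import Any, Dict, List, Tuple


def _shape(value: Any) -> Tuple[int, ...]:
    """Shape of a tensor-like object, or the value itself if it is already a shape."""
    return tuple(getattr(value, "shape", value))


def compare_state_dicts(model_state: Mapping, checkpoint: Mapping) -> Dict[str, List[str]]:
    """Report which model parameters a checkpoint could fill."""
    # Assumption: some checkpoints nest the weights under a "model" key.
    if "model" in checkpoint and isinstance(checkpoint["model"], Mapping):
        checkpoint = checkpoint["model"]

    report: Dict[str, List[str]] = {
        "loadable": [], "shape_mismatch": [], "missing": [], "unexpected": []
    }
    for name, value in model_state.items():
        if name not in checkpoint:
            report["missing"].append(name)          # no weight with this name
        elif _shape(value) != _shape(checkpoint[name]):
            report["shape_mismatch"].append(name)   # same name, different layout
        else:
            report["loadable"].append(name)
    report["unexpected"] = [k for k in checkpoint if k not in model_state]
    return report


# Toy example with made-up parameter names and shapes given as plain tuples.
model_state = {"stem.0.weight": (96, 3, 4, 4), "head.fc.weight": (1000, 768)}
checkpoint = {"model": {"downsample_layers.0.0.weight": (96, 3, 4, 4),
                        "head.weight": (1000, 768)}}
report = compare_state_dicts(model_state, checkpoint)
print(len(report["loadable"]), "of", len(model_state), "parameters would be loaded")
```

With PyTorch, the same function can be called with `model.state_dict()` and the object returned by `torch.load(path, map_location="cpu")`, since tensors expose a `.shape`. If almost nothing is reported as loadable, the run is effectively starting from scratch.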
Thanks for your response! However, when I train with the default parameters you suggested, the network performs poorly on the University-1652 dataset after the first epoch when using 4x 3090 GPUs.
The problem we notice is that the loss does not decrease throughout the training process. The only difference from the default training code is that we downloaded the ConvNeXt-T model fine-tuned on the ImageNet-1k dataset from https://github.com/facebookresearch/ConvNeXt and load it locally via:
model_state_dict = torch.load('./pretrained/university/{}.pth'.format(config.model))
model.load_state_dict(model_state_dict, strict=False)
Originally posted by @MingkunLishigure in https://github.com/Skyy93/Sample4Geo/issues/1#issuecomment-1761810227