IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0
1.9k stars, 199 forks

Very noisy loss of Dino #304

Closed hg6185 closed 8 months ago

hg6185 commented 10 months ago

Hello,

I am fine-tuning DINO (Swin-B) on a small, imbalanced custom dataset. I initially used a learning rate of 0.0002 with the LR multiplier, but then reduced it to 0.0001, as you did in your paper. You also write that DINO is very sensitive to the learning-rate hyperparameter.

My loss is very noisy. On average it decreases quickly, but there are also high spikes. Did you observe something similar, or could this be related to the imbalance of the dataset?

Thank you so much in advance!
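When reading a curve like the one described above, it can help to smooth it so the downward trend and the spikes can be looked at separately. This is only a minimal sketch in plain Python, assuming the per-iteration loss values are available as a list of floats; `ema_smooth`, `alpha` and the sample values are hypothetical and not part of detrex or detectron2.

```python
def ema_smooth(values, alpha=0.1):
    """Exponential moving average of a sequence of loss values.

    alpha is the weight of the newest value; smaller alpha smooths more.
    """
    smoothed = []
    current = None
    for value in values:
        # The first value initialises the average; later values are blended in.
        current = value if current is None else alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Made-up loss values with a single spike, for illustration only.
raw_loss = [1.0, 1.0, 10.0, 1.0]
print(ema_smooth(raw_loss))
```

If the smoothed curve still trends down while the raw curve spikes, the spikes are more likely caused by individual hard batches than by divergence, although that is a heuristic rather than a guarantee.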

rentainhe commented 10 months ago

Would you like to share your training log with us? It may be very helpful!

hg6185 commented 10 months ago

Hey, here is the log for a Swin-S run. I switched to Swin-S because of its smaller attention window, since my APs (AP on small objects) was always very bad. In general, the model converges towards something that works well for large objects but rather poorly for small ones, and I assume that is why the loss is so noisy. Have you encountered this before, and do you know whether a non-optimal learning rate and small batch sizes can trap DINO in a poor optimum? Are there any known tricks to improve convergence on small objects with DINO?

The two images show training loss and validation loss:

[image: training loss] [image: validation loss]

So far, I have also set the image size to 640+. Thanks in advance!

Training ran for about 3000 iterations on a training set of only roughly 1400 samples. I scaled the learning rate by 0.5 after 1k iterations and by 0.1 after 2k iterations.
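The step schedule described above can be written down as a small function. This is a sketch in plain Python, not the detrex scheduler API: `lr_at` is a hypothetical helper, the base learning rate of 1e-4 is taken from the first post, and it assumes both factors are relative to the base learning rate rather than cumulative.

```python
def lr_at(iteration, base_lr=1e-4, milestones=(1000, 2000), factors=(0.5, 0.1)):
    """Learning rate at a given iteration for a two-step schedule.

    Assumption: each factor scales base_lr directly (0.1 means
    base_lr * 0.1, not 0.5 * 0.1 * base_lr).
    """
    lr = base_lr
    for milestone, factor in zip(milestones, factors):
        if iteration >= milestone:
            lr = base_lr * factor
    return lr


for it in (0, 999, 1000, 1999, 2000, 2999):
    print(it, lr_at(it))
```

Under the cumulative reading the last stage would instead be `base_lr * 0.5 * 0.1`, so it is worth checking which of the two the training run actually used.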

sh: readlink: command not found
[INFO] Module CUDA/11.3.1 loaded.
sh: readlink: command not found
[INFO] Module GCC/9.4.0 loaded.
[INFO] Module numactl/2.0.14 loaded.

Inactive Modules:
  1) UCX/1.12.1

Due to MODULEPATH changes, the following have been reloaded:
  1) numactl/2.0.14

The following have been reloaded with a version change:
  1) GCCcore/.11.3.0 => GCCcore/.9.4.0
  2) binutils/2.38 => binutils/2.36.1
  3) zlib/1.2.12 => zlib/1.2.11

[09/05 00:05:09 detectron2]: Rank of current process: 0. World size: 2
[09/05 00:05:11 detectron2]: Environment info:


sys.platform            linux
Python                  3.8.17 (default, Jul 5 2023, 20:41:08) [GCC 11.2.0]
numpy                   1.22.3
detectron2              0.6 @ detectron2/detectron2
Compiler                GCC 9.4
CUDA compiler           CUDA 11.3
detectron2 arch flags   7.0
DETECTRON2_ENVMODULE
PyTorch                 1.12.1+cu113 @/home//miniconda3/envs/fps/lib/python3.8/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1                 Tesla V100-SXM2-16GB (arch=7.0)
Driver version
CUDA_HOME               /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/CUDA/11.3.1
Pillow                  9.4.0
torchvision             0.13.1+cu113 @/home/cg072483/miniconda3/envs/fps/lib/python3.8/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20221221
iopath                  0.1.9
cv2                     4.8.0


PyTorch built with:

[09/05 00:05:11 detectron2]: Command line arguments: Namespace(confidence_threshold=0.5, config_file='./detrex/projects/dino/configs/dino-swin/dino_swin_small_224_4scale_12ep.py', img_format='RGB', input=None, max_size_test=1333, metadata_dataset='coco_2017_val', min_size_test=800, num_gpus=1, opts=['train.init_checkpoint=./models/dino_swin_small_224_4scale_12ep.pth'], output=None, resume=False, video_input=None, webcam=False)
[09/05 00:05:11 detectron2]: Contents of args.config_file=./detrex/projects/dino/configs/dino-swin/dino_swin_small_224_4scale_12ep.py:
from detrex.config import get_config
from ..models.dino_swin_small_224 import model

# get default config
dataloader = get_config("common/data/coco_detr.py").dataloader
optimizer = get_config("common/optim.py").AdamW
lr_multiplier = get_config("common/coco_schedule.py").lr_multiplier_12ep
train = get_config("common/train.py").train

# modify training config
train.init_checkpoint = "/path/to/swin_small_patch4_window7_224.pth"
train.output_dir = "./output/dino_swin_small_224_4scale_12ep"

# max training iterations
train.max_iter = 90000
train.eval_period = 5000
train.log_period = 20
train.checkpointer.period = 5000

# gradient clipping for training
train.clip_grad.enabled = True
train.clip_grad.params.max_norm = 0.1
train.clip_grad.params.norm_type = 2

# set training devices
train.device = "cuda"
model.device = train.device

# modify optimizer config
optimizer.lr = 1e-4
optimizer.betas = (0.9, 0.999)
optimizer.weight_decay = 1e-4
optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# modify dataloader config
dataloader.train.num_workers = 16

# please notice that this is total batch size.
# surpose you're using 4 gpus for training and the batch size for
# each gpu is 16/4 = 4
dataloader.train.total_batch_size = 16
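To make the optimizer and batch-size settings in the config above concrete, here is a small sketch in plain Python. The `lr_factor_func` lambda and the numbers (1e-4, 16, world size 2) are copied from the log. `effective_lr`, `per_gpu_batch_size` and the two module names are hypothetical and only for illustration, and the divisibility check is an assumption of this sketch, not something the log confirms.

```python
base_lr = 1e-4  # optimizer.lr from the config above

# Copied from the config: backbone parameters get a 10x smaller learning rate.
lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1


def effective_lr(module_name):
    """Learning rate a module would receive under lr_factor_func."""
    return base_lr * lr_factor_func(module_name)


def per_gpu_batch_size(total_batch_size, world_size):
    """Split the total batch size evenly across processes.

    Assumption of this sketch: the total must be divisible by world_size.
    """
    if total_batch_size % world_size != 0:
        raise ValueError("total batch size must be divisible by world size")
    return total_batch_size // world_size


# Illustrative module names, not read from a real model.
print(effective_lr("backbone.layers.0.blocks.0.attn.qkv"))
print(effective_lr("transformer.encoder.layers.0"))
# The log reports "World size: 2", so if total_batch_size stayed at 16:
print(per_gpu_batch_size(16, 2))
```

Note that the printed file is the stock config. If values such as `total_batch_size` or `optimizer.lr` were overridden elsewhere for the fine-tuning run, the numbers above change accordingly.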

WARNING [09/05 00:05:11 d2.config.lazy]: The config contains objects that cannot serialize to a valid yaml. ./output/dino_swin_small_224_4scale_12ep/config.yaml is human-readable but cannot be loaded.
WARNING [09/05 00:05:11 d2.config.lazy]: Config is saved using cloudpickle at ./output/dino_swin_small_224_4scale_12ep/config.yaml.pkl.
[09/05 00:05:11 detectron2]: Full config saved to ./output/dino_swin_small_224_4scale_12ep/config.yaml
[09/05 00:05:11 d2.utils.env]: Using a generated random seed 11479259

Backbone was sucessfully frozen! Trainable params: 20518719

DINO( (backbone): SwinTransformer( (patch_embed): PatchEmbed( (proj): Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4)) (norm): LayerNorm((96,), eps=1e-05, elementwise_affine=True) ) (pos_drop): Dropout(p=0.0, inplace=False) (layers): ModuleList( (0): BasicLayer( (blocks): ModuleList( (0): SwinTransformerBlock( (norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=96, out_features=288, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=96, out_features=96, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): Identity() (norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=96, out_features=384, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=384, out_features=96, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (1): SwinTransformerBlock( (norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=96, out_features=288, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=96, out_features=96, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.009) (norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=96, out_features=384, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=384, out_features=96, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) ) (downsample): PatchMerging( (reduction): Linear(in_features=384, out_features=192, bias=False) (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True) ) ) (1): BasicLayer( (blocks): ModuleList( (0): SwinTransformerBlock( (norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=192, out_features=576, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): 
Linear(in_features=192, out_features=192, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.017) (norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=192, out_features=768, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=768, out_features=192, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (1): SwinTransformerBlock( (norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=192, out_features=576, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=192, out_features=192, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.026) (norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=192, out_features=768, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=768, out_features=192, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) ) (downsample): PatchMerging( (reduction): Linear(in_features=768, out_features=384, bias=False) (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) (2): BasicLayer( (blocks): ModuleList( (0): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.035) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (1): SwinTransformerBlock( (norm1): 
LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.043) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (2): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.052) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (3): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.061) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (4): SwinTransformerBlock( (norm1): LayerNorm((384,), 
eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.070) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (5): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.078) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (6): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.087) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (7): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, 
elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.096) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (8): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.104) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (9): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.113) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (10): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, 
elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.122) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (11): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.130) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (12): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.139) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (13): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, 
elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.148) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (14): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.157) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (15): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.165) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (16): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, 
elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.174) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (17): SwinTransformerBlock( (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=384, out_features=1152, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=384, out_features=384, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.183) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=384, out_features=1536, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=1536, out_features=384, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) ) (downsample): PatchMerging( (reduction): Linear(in_features=1536, out_features=768, bias=False) (norm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True) ) ) (3): BasicLayer( (blocks): ModuleList( (0): SwinTransformerBlock( (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=768, out_features=2304, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=768, out_features=768, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.191) (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=768, out_features=3072, bias=True) (act): 
GELU(approximate=none) (fc2): Linear(in_features=3072, out_features=768, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) (1): SwinTransformerBlock( (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (attn): WindowAttention( (qkv): Linear(in_features=768, out_features=2304, bias=True) (attn_drop): Dropout(p=0.0, inplace=False) (proj): Linear(in_features=768, out_features=768, bias=True) (proj_drop): Dropout(p=0.0, inplace=False) (softmax): Softmax(dim=-1) ) (drop_path): DropPath(drop_prob=0.200) (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (mlp): Mlp( (fc1): Linear(in_features=768, out_features=3072, bias=True) (act): GELU(approximate=none) (fc2): Linear(in_features=3072, out_features=768, bias=True) (drop): Dropout(p=0.0, inplace=False) ) ) ) ) ) (norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True) (norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (position_embedding): PositionEmbeddingSine() (neck): ChannelMapper( (convs): ModuleList( (0): ConvNormAct( (conv): Conv2d(192, 256, kernel_size=(1, 1), stride=(1, 1)) (norm): GroupNorm(32, 256, eps=1e-05, affine=True) ) (1): ConvNormAct( (conv): Conv2d(384, 256, kernel_size=(1, 1), stride=(1, 1)) (norm): GroupNorm(32, 256, eps=1e-05, affine=True) ) (2): ConvNormAct( (conv): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1)) (norm): GroupNorm(32, 256, eps=1e-05, affine=True) ) ) (extra_convs): ModuleList( (0): ConvNormAct( (conv): Conv2d(768, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)) (norm): GroupNorm(32, 256, eps=1e-05, affine=True) ) ) ) (transformer): DINOTransformer( (encoder): DINOTransformerEncoder( (layers): ModuleList( (0): BaseTransformerLayer( (attentions): ModuleList( (0): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, 
out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (1): BaseTransformerLayer( (attentions): ModuleList( (0): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (2): BaseTransformerLayer( (attentions): ModuleList( (0): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) 
(layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (3): BaseTransformerLayer( (attentions): ModuleList( (0): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (4): BaseTransformerLayer( (attentions): ModuleList( (0): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, 
inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (5): BaseTransformerLayer( (attentions): ModuleList( (0): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) ) ) (decoder): DINOTransformerDecoder( (layers): ModuleList( (0): BaseTransformerLayer( (attentions): ModuleList( (0): MultiheadAttention( (attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (proj_drop): Dropout(p=0.0, inplace=False) ) (1): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): 
Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (1): BaseTransformerLayer( (attentions): ModuleList( (0): MultiheadAttention( (attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (proj_drop): Dropout(p=0.0, inplace=False) ) (1): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (2): BaseTransformerLayer( (attentions): ModuleList( (0): MultiheadAttention( (attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (proj_drop): Dropout(p=0.0, inplace=False) ) (1): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): 
FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (3): BaseTransformerLayer( (attentions): ModuleList( (0): MultiheadAttention( (attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (proj_drop): Dropout(p=0.0, inplace=False) ) (1): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (4): BaseTransformerLayer( (attentions): ModuleList( (0): MultiheadAttention( (attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (proj_drop): Dropout(p=0.0, inplace=False) ) (1): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, 
out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (5): BaseTransformerLayer( (attentions): ModuleList( (0): MultiheadAttention( (attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (proj_drop): Dropout(p=0.0, inplace=False) ) (1): MultiScaleDeformableAttention( (dropout): Dropout(p=0.0, inplace=False) (sampling_offsets): Linear(in_features=256, out_features=256, bias=True) (attention_weights): Linear(in_features=256, out_features=128, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) ) (ffns): ModuleList( (0): FFN( (activation): ReLU(inplace=True) (layers): Sequential( (0): Sequential( (0): Linear(in_features=256, out_features=2048, bias=True) (1): ReLU(inplace=True) (2): Dropout(p=0.0, inplace=False) ) (1): Linear(in_features=2048, out_features=256, bias=True) (2): Dropout(p=0.0, inplace=False) ) ) ) (norms): ModuleList( (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) ) (ref_point_head): MLP( (layers): ModuleList( (0): Linear(in_features=512, out_features=256, 
bias=True) (1): Linear(in_features=256, out_features=256, bias=True) ) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (class_embed): ModuleList( (0): Linear(in_features=256, out_features=5, bias=True) (1): Linear(in_features=256, out_features=5, bias=True) (2): Linear(in_features=256, out_features=5, bias=True) (3): Linear(in_features=256, out_features=5, bias=True) (4): Linear(in_features=256, out_features=5, bias=True) (5): Linear(in_features=256, out_features=5, bias=True) (6): Linear(in_features=256, out_features=5, bias=True) ) (bbox_embed): ModuleList( (0): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (1): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (2): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (3): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (4): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (5): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (6): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): 
Linear(in_features=256, out_features=4, bias=True) ) ) ) ) (tgt_embed): Embedding(900, 256) (enc_output): Linear(in_features=256, out_features=256, bias=True) (enc_output_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (class_embed): ModuleList( (0): Linear(in_features=256, out_features=5, bias=True) (1): Linear(in_features=256, out_features=5, bias=True) (2): Linear(in_features=256, out_features=5, bias=True) (3): Linear(in_features=256, out_features=5, bias=True) (4): Linear(in_features=256, out_features=5, bias=True) (5): Linear(in_features=256, out_features=5, bias=True) (6): Linear(in_features=256, out_features=5, bias=True) ) (bbox_embed): ModuleList( (0): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (1): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (2): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (3): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (4): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (5): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) (6): MLP( (layers): ModuleList( (0): Linear(in_features=256, 
out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=4, bias=True) ) ) ) (criterion): Criterion DINOCriterion matcher: Matcher HungarianMatcher cost_class: 2.0 cost_bbox: 5.0 cost_giou: 2.0 cost_class_type: focal_loss_cost focal cost alpha: 0.25 focal cost gamma: 2.0 losses: ['class', 'boxes'] loss_class_type: focal_loss weight_dict: {'loss_class': 1, 'loss_bbox': 5.0, 'loss_giou': 2.0, 'loss_class_dn': 1, 'loss_bbox_dn': 5.0, 'loss_giou_dn': 2.0, 'loss_class_enc': 1, 'loss_bbox_enc': 5.0, 'loss_giou_enc': 2.0, 'loss_class_dn_enc': 1, 'loss_bbox_dn_enc': 5.0, 'loss_giou_dn_enc': 2.0, 'loss_class_0': 1, 'loss_bbox_0': 5.0, 'loss_giou_0': 2.0, 'loss_class_dn_0': 1, 'loss_bbox_dn_0': 5.0, 'loss_giou_dn_0': 2.0, 'loss_class_1': 1, 'loss_bbox_1': 5.0, 'loss_giou_1': 2.0, 'loss_class_dn_1': 1, 'loss_bbox_dn_1': 5.0, 'loss_giou_dn_1': 2.0, 'loss_class_2': 1, 'loss_bbox_2': 5.0, 'loss_giou_2': 2.0, 'loss_class_dn_2': 1, 'loss_bbox_dn_2': 5.0, 'loss_giou_dn_2': 2.0, 'loss_class_3': 1, 'loss_bbox_3': 5.0, 'loss_giou_3': 2.0, 'loss_class_dn_3': 1, 'loss_bbox_dn_3': 5.0, 'loss_giou_dn_3': 2.0, 'loss_class_4': 1, 'loss_bbox_4': 5.0, 'loss_giou_4': 2.0, 'loss_class_dn_4': 1, 'loss_bbox_dn_4': 5.0, 'loss_giou_dn_4': 2.0} num_classes: 5 eos_coef: None focal loss alpha: 0.25 focal loss gamma: 2.0 (label_enc): Embedding(5, 256) )
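
For anyone reading the `weight_dict` above: as far as I understand it, the logged total loss is a weighted sum over all of these terms (matching, denoising `_dn`, encoder `_enc`, and one auxiliary copy per intermediate decoder layer `_0` to `_4`), so a spike in any single term shows up in the total. Below is a minimal sketch that rebuilds the same key layout and sums a loss dict with it. `build_weight_dict` and `weighted_total` are hypothetical helper names for illustration only, not detrex functions.

```python
def build_weight_dict(num_aux_layers=5):
    """Rebuild the key layout of the logged weight_dict (illustrative helper, not detrex API)."""
    base = {"loss_class": 1, "loss_bbox": 5.0, "loss_giou": 2.0}
    # Denoising ("_dn") copies use the same weights as the matching losses.
    base.update({key + "_dn": weight for key, weight in list(base.items())})
    weights = dict(base)
    # One copy for the encoder output and one per intermediate decoder layer.
    for suffix in ["enc"] + [str(i) for i in range(num_aux_layers)]:
        weights.update({f"{key}_{suffix}": weight for key, weight in base.items()})
    return weights


def weighted_total(loss_dict, weight_dict):
    """Sum every loss term that has a weight, scaled by that weight."""
    return sum(value * weight_dict[key] for key, value in loss_dict.items() if key in weight_dict)
```

Logging the individual terms next to the total makes it easier to see whether the spikes come from the box terms (weights 5.0 and 2.0) or from classification.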

rentainhe commented 9 months ago

Sorry for the late reply. I think this is normal loss fluctuation. Once the learning rate decays, the fluctuation should ease and the model will gradually converge. @hg6185
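
To make the point about the learning rate concrete, here is a rough sketch, assuming a step-decay schedule with a single milestone and a decay factor of 0.1 (both are assumptions, check your own config). It also includes a bias-corrected exponential moving average, which is one common way to look at the trend of a noisy loss curve. `multistep_lr` and `ema` are made-up helper names, not part of detrex.

```python
def multistep_lr(base_lr, step, milestones, gamma=0.1):
    """Learning rate at `step`: multiplied by `gamma` once per milestone already reached."""
    drops = sum(1 for milestone in milestones if step >= milestone)
    return base_lr * gamma ** drops


def ema(values, beta=0.98):
    """Bias-corrected exponential moving average, for reading the trend of a noisy curve."""
    smoothed, running = [], 0.0
    for t, value in enumerate(values, start=1):
        running = beta * running + (1 - beta) * value
        # Divide by (1 - beta**t) so the first few points are not biased toward zero.
        smoothed.append(running / (1 - beta ** t))
    return smoothed
```

Plotting `ema(losses)` next to the raw values, with the milestone steps marked, should show whether the spikes shrink after the decay.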

hg6185 commented 8 months ago

Sorry, I meant to thank you :)