ashkamath / mdetr

Logs for the fine tuning on LVIS detection #55

Open Flaick opened 3 years ago

Flaick commented 3 years ago

Hello, I am wondering if there is a log file available for the fine-tuning on 1% LVIS few-shot detection.

Flaick commented 3 years ago

Also, I would be grateful if you could provide the hyperparameter settings for the 1% experiment.

TopCoder2K commented 2 years ago

@Flaick, have you managed to run the fine-tuning?

I have a strange error. When I run python main.py --dataset_config configs/lvis.json --load pretrained_resnet101_checkpoint.pth --ema --epochs 150 --lr_drop 120 --eval_skip 5 on the GPU, I get:

Epoch: [0]  [    0/73902]  eta: 1 day, 13:20:57  lr: 0.000100  lr_backbone: 0.000010  lr_text_encoder: 0.000000  loss: 14.1489 (14.1489)  loss_ce: 2.3089 (2.3089)  loss_bbox: 0.0000 (0.0000)  loss_giou: 0.0000 (0.0000)  loss_contrastive_align: 0.0000 (0.0000)  loss_ce_0: 2.2728 (2.2728)  loss_bbox_0: 0.0000 (0.0000)  loss_giou_0: 0.0000 (0.0000)  loss_contrastive_align_0: 0.0000 (0.0000)  loss_ce_1: 2.1969 (2.1969)  loss_bbox_1: 0.0000 (0.0000)  loss_giou_1: 0.0000 (0.0000)  loss_contrastive_align_1: 0.0000 (0.0000)  loss_ce_2: 2.4855 (2.4855)  loss_bbox_2: 0.0000 (0.0000)  loss_giou_2: 0.0000 (0.0000)  loss_contrastive_align_2: 0.0000 (0.0000)  loss_ce_3: 2.5023 (2.5023)  loss_bbox_3: 0.0000 (0.0000)  loss_giou_3: 0.0000 (0.0000)  loss_contrastive_align_3: 0.0000 (0.0000)  loss_ce_4: 2.3826 (2.3826)  loss_bbox_4: 0.0000 (0.0000)  loss_giou_4: 0.0000 (0.0000)  loss_contrastive_align_4: 0.0000 (0.0000)  loss_ce_unscaled: 2.3089 (2.3089)  loss_bbox_unscaled: 0.0000 (0.0000)  loss_giou_unscaled: 0.0000 (0.0000)  cardinality_error_unscaled: 2.0000 (2.0000)  loss_contrastive_align_unscaled: 0.0000 (0.0000)  loss_ce_0_unscaled: 2.2728 (2.2728)  loss_bbox_0_unscaled: 0.0000 (0.0000)  loss_giou_0_unscaled: 0.0000 (0.0000)  cardinality_error_0_unscaled: 3.0000 (3.0000)  loss_contrastive_align_0_unscaled: 0.0000 (0.0000)  loss_ce_1_unscaled: 2.1969 (2.1969)  loss_bbox_1_unscaled: 0.0000 (0.0000)  loss_giou_1_unscaled: 0.0000 (0.0000)  cardinality_error_1_unscaled: 3.0000 (3.0000)  loss_contrastive_align_1_unscaled: 0.0000 (0.0000)  loss_ce_2_unscaled: 2.4855 (2.4855)  loss_bbox_2_unscaled: 0.0000 (0.0000)  loss_giou_2_unscaled: 0.0000 (0.0000)  cardinality_error_2_unscaled: 2.0000 (2.0000)  loss_contrastive_align_2_unscaled: 0.0000 (0.0000)  loss_ce_3_unscaled: 2.5023 (2.5023)  loss_bbox_3_unscaled: 0.0000 (0.0000)  loss_giou_3_unscaled: 0.0000 (0.0000)  cardinality_error_3_unscaled: 2.0000 (2.0000)  loss_contrastive_align_3_unscaled: 0.0000 (0.0000)  loss_ce_4_unscaled: 2.3826 (2.3826)  loss_bbox_4_unscaled: 0.0000 (0.0000)  loss_giou_4_unscaled: 0.0000 (0.0000)  cardinality_error_4_unscaled: 2.0000 (2.0000)  loss_contrastive_align_4_unscaled: 0.0000 (0.0000)  time: 1.8194  data: 1.2582  max mem: 4014
Traceback (most recent call last):
  File "main.py", line 643, in <module>
    main(args)
  File "main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/home/pchelintsev/MDETR_untouched/mdetr/engine.py", line 73, in train_one_epoch
    loss_dict.update(criterion(outputs, targets, positive_map))
  File "/home/pchelintsev/.conda/envs/mdetr_env3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pchelintsev/MDETR_untouched/mdetr/models/mdetr.py", line 679, in forward
    losses.update(self.get_loss(loss, outputs, targets, positive_map, indices, num_boxes))
  File "/home/pchelintsev/MDETR_untouched/mdetr/models/mdetr.py", line 655, in get_loss
    return loss_map[loss](outputs, targets, positive_map, indices, num_boxes, **kwargs)
  File "/home/pchelintsev/MDETR_untouched/mdetr/models/mdetr.py", line 487, in loss_labels
    eos_coef[src_idx] = 1
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor
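
For context, here is a minimal sketch of the pattern that fails at models/mdetr.py:487. The shapes and indices are illustrative, not taken from the actual run, and the workaround at the end is only a guess I have not verified:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

eos_coef = torch.full((2, 100), 0.1, device=device)  # (batch_size, num_queries)
idx = (torch.tensor([0, 0, 1], device=device),       # batch index of each matched query
       torch.tensor([3, 7, 42], device=device))      # query index of each matched query

eos_coef[idx] = 1  # scalar assignment through advanced indexing: the line that asserts

# Possible workaround (unverified): pass an explicit value tensor via index_put_,
# which avoids the scalar-expansion path in Indexing.cu.
eos_coef.index_put_(idx, torch.ones(3, device=device, dtype=eos_coef.dtype))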

So, as suggested in the other issue, I ran it on the CPU, and it worked!

Starting epoch 0
/home/pchelintsev/.conda/envs/mdetr_env3/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
Epoch: [0]  [    0/73902]  eta: 25 days, 11:34:41  lr: 0.000100  lr_backbone: 0.000010  lr_text_encoder: 0.000000  loss: 14.3907 (14.3907)  loss_ce: 2.4076 (2.4076)  loss_bbox: 0.0000 (0.0000)  loss_giou: 0.0000 (0.0000)  loss_contrastive_align: 0.0000 (0.0000)  loss_ce_0: 2.4669 (2.4669)  loss_bbox_0: 0.0000 (0.0000)  loss_giou_0: 0.0000 (0.0000)  loss_contrastive_align_0: 0.0000 (0.0000)  loss_ce_1: 2.2301 (2.2301)  loss_bbox_1: 0.0000 (0.0000)  loss_giou_1: 0.0000 (0.0000)  loss_contrastive_align_1: 0.0000 (0.0000)  loss_ce_2: 2.5516 (2.5516)  loss_bbox_2: 0.0000 (0.0000)  loss_giou_2: 0.0000 (0.0000)  loss_contrastive_align_2: 0.0000 (0.0000)  loss_ce_3: 2.3101 (2.3101)  loss_bbox_3: 0.0000 (0.0000)  loss_giou_3: 0.0000 (0.0000)  loss_contrastive_align_3: 0.0000 (0.0000)  loss_ce_4: 2.4244 (2.4244)  loss_bbox_4: 0.0000 (0.0000)  loss_giou_4: 0.0000 (0.0000)  loss_contrastive_align_4: 0.0000 (0.0000)  loss_ce_unscaled: 2.4076 (2.4076)  loss_bbox_unscaled: 0.0000 (0.0000)  loss_giou_unscaled: 0.0000 (0.0000)  cardinality_error_unscaled: 3.0000 (3.0000)  loss_contrastive_align_unscaled: 0.0000 (0.0000)  loss_ce_0_unscaled: 2.4669 (2.4669)  loss_bbox_0_unscaled: 0.0000 (0.0000)  loss_giou_0_unscaled: 0.0000 (0.0000)  cardinality_error_0_unscaled: 3.0000 (3.0000)  loss_contrastive_align_0_unscaled: 0.0000 (0.0000)  loss_ce_1_unscaled: 2.2301 (2.2301)  loss_bbox_1_unscaled: 0.0000 (0.0000)  loss_giou_1_unscaled: 0.0000 (0.0000)  cardinality_error_1_unscaled: 2.0000 (2.0000)  loss_contrastive_align_1_unscaled: 0.0000 (0.0000)  loss_ce_2_unscaled: 2.5516 (2.5516)  loss_bbox_2_unscaled: 0.0000 (0.0000)  loss_giou_2_unscaled: 0.0000 (0.0000)  cardinality_error_2_unscaled: 3.0000 (3.0000)  loss_contrastive_align_2_unscaled: 0.0000 (0.0000)  loss_ce_3_unscaled: 2.3101 (2.3101)  loss_bbox_3_unscaled: 0.0000 (0.0000)  loss_giou_3_unscaled: 0.0000 (0.0000)  cardinality_error_3_unscaled: 2.5000 (2.5000)  loss_contrastive_align_3_unscaled: 0.0000 (0.0000)  loss_ce_4_unscaled: 2.4244 (2.4244)  loss_bbox_4_unscaled: 0.0000 (0.0000)  loss_giou_4_unscaled: 0.0000 (0.0000)  cardinality_error_4_unscaled: 3.0000 (3.0000)  loss_contrastive_align_4_unscaled: 0.0000 (0.0000)  time: 29.7919  data: 1.1390  max mem: 0
Epoch: [0]  [   10/73902]  eta: 18 days, 6:11:53  lr: 0.000100  lr_backbone: 0.000010  lr_text_encoder: 0.000000  loss: 40.6416 (50.0278)  loss_ce: 3.9509 (4.6128)  loss_bbox: 0.2337 (0.4157)  loss_giou: 0.5140 (0.6735)  loss_contrastive_align: 1.1038 (2.0033)  loss_ce_0: 6.8205 (5.5707)  loss_bbox_0: 0.1225 (0.3284)  loss_giou_0: 0.3370 (0.6317)  loss_contrastive_align_0: 1.8220 (2.5607)  loss_ce_1: 5.9798 (5.2364)  loss_bbox_1: 0.2626 (0.3924)  loss_giou_1: 0.4373 (0.7185)  loss_contrastive_align_1: 1.6724 (2.5045)  loss_ce_2: 4.3847 (5.0343)  loss_bbox_2: 0.2473 (0.3728)  loss_giou_2: 0.5318 (0.6514)  loss_contrastive_align_2: 1.0731 (2.3479)  loss_ce_3: 4.0940 (4.8984)  loss_bbox_3: 0.2544 (0.4026)  loss_giou_3: 0.5044 (0.6831)  loss_contrastive_align_3: 1.0696 (2.1846)  loss_ce_4: 3.9194 (4.6899)  loss_bbox_4: 0.2297 (0.4037)  loss_giou_4: 0.4369 (0.6624)  loss_contrastive_align_4: 1.0977 (2.0480)  loss_ce_unscaled: 3.9509 (4.6128)  loss_bbox_unscaled: 0.0467 (0.0831)  loss_giou_unscaled: 0.2570 (0.3368)  cardinality_error_unscaled: 1.0000 (1.0909)  loss_contrastive_align_unscaled: 1.1038 (2.0033)  loss_ce_0_unscaled: 6.8205 (5.5707)  loss_bbox_0_unscaled: 0.0245 (0.0657)  loss_giou_0_unscaled: 0.1685 (0.3159)  cardinality_error_0_unscaled: 1.0000 (1.5000)  loss_contrastive_align_0_unscaled: 1.8220 (2.5607)  loss_ce_1_unscaled: 5.9798 (5.2364)  loss_bbox_1_unscaled: 0.0525 (0.0785)  loss_giou_1_unscaled: 0.2186 (0.3593)  cardinality_error_1_unscaled: 1.0000 (1.1818)  loss_contrastive_align_1_unscaled: 1.6724 (2.5045)  loss_ce_2_unscaled: 4.3847 (5.0343)  loss_bbox_2_unscaled: 0.0495 (0.0746)  loss_giou_2_unscaled: 0.2659 (0.3257)  cardinality_error_2_unscaled: 1.0000 (1.2273)  loss_contrastive_align_2_unscaled: 1.0731 (2.3479)  loss_ce_3_unscaled: 4.0940 (4.8984)  loss_bbox_3_unscaled: 0.0509 (0.0805)  loss_giou_3_unscaled: 0.2522 (0.3415)  cardinality_error_3_unscaled: 1.0000 (1.1818)  loss_contrastive_align_3_unscaled: 1.0696 (2.1846)  loss_ce_4_unscaled: 3.9194 (4.6899)  loss_bbox_4_unscaled: 0.0459 (0.0807)  loss_giou_4_unscaled: 0.2184 (0.3312)  cardinality_error_4_unscaled: 1.0000 (1.0909)  loss_contrastive_align_4_unscaled: 1.0977 (2.0480)  time: 21.3489  data: 0.1094  max mem: 0
Epoch: [0]  [   20/73902]  eta: 18 days, 16:13:48  lr: 0.000100  lr_backbone: 0.000010  lr_text_encoder: 0.000000  loss: 34.8511 (46.1291)  loss_ce: 2.4758 (5.0781)  loss_bbox: 0.2299 (0.3856)  loss_giou: 0.4293 (0.7006)  loss_contrastive_align: 0.4411 (1.4506)  loss_ce_0: 4.1652 (5.0743)  loss_bbox_0: 0.0811 (0.3611)  loss_giou_0: 0.1932 (0.6737)  loss_contrastive_align_0: 0.3715 (1.6258)  loss_ce_1: 2.6043 (5.1180)  loss_bbox_1: 0.1499 (0.3908)  loss_giou_1: 0.3773 (0.7349)  loss_contrastive_align_1: 0.3961 (1.6426)  loss_ce_2: 2.6675 (5.0676)  loss_bbox_2: 0.1963 (0.3785)  loss_giou_2: 0.3574 (0.6974)  loss_contrastive_align_2: 0.3626 (1.5843)  loss_ce_3: 2.6249 (5.0039)  loss_bbox_3: 0.1436 (0.3871)  loss_giou_3: 0.4561 (0.6985)  loss_contrastive_align_3: 0.3725 (1.5028)  loss_ce_4: 2.5412 (5.0421)  loss_bbox_4: 0.1969 (0.3801)  loss_giou_4: 0.4178 (0.6954)  loss_contrastive_align_4: 0.4074 (1.4550)  loss_ce_unscaled: 2.4758 (5.0781)  loss_bbox_unscaled: 0.0460 (0.0771)  loss_giou_unscaled: 0.2146 (0.3503)  cardinality_error_unscaled: 1.0000 (1.1429)  loss_contrastive_align_unscaled: 0.4411 (1.4506)  loss_ce_0_unscaled: 4.1652 (5.0743)  loss_bbox_0_unscaled: 0.0162 (0.0722)  loss_giou_0_unscaled: 0.0966 (0.3368)  cardinality_error_0_unscaled: 1.0000 (1.4286)  loss_contrastive_align_0_unscaled: 0.3715 (1.6258)  loss_ce_1_unscaled: 2.6043 (5.1180)  loss_bbox_1_unscaled: 0.0300 (0.0782)  loss_giou_1_unscaled: 0.1886 (0.3675)  cardinality_error_1_unscaled: 1.0000 (1.2143)  loss_contrastive_align_1_unscaled: 0.3961 (1.6426)  loss_ce_2_unscaled: 2.6675 (5.0676)  loss_bbox_2_unscaled: 0.0393 (0.0757)  loss_giou_2_unscaled: 0.1787 (0.3487)  cardinality_error_2_unscaled: 1.0000 (1.2857)  loss_contrastive_align_2_unscaled: 0.3626 (1.5843)  loss_ce_3_unscaled: 2.6249 (5.0039)  loss_bbox_3_unscaled: 0.0287 (0.0774)  loss_giou_3_unscaled: 0.2281 (0.3492)  cardinality_error_3_unscaled: 1.0000 (1.2143)  loss_contrastive_align_3_unscaled: 0.3725 (1.5028)  loss_ce_4_unscaled: 2.5412 (5.0421)  loss_bbox_4_unscaled: 0.0394 (0.0760)  loss_giou_4_unscaled: 0.2089 (0.3477)  cardinality_error_4_unscaled: 1.0000 (1.1905)  loss_contrastive_align_4_unscaled: 0.4074 (1.4550)  time: 21.4431  data: 0.0061  max mem: 0

What could be wrong? Also, I've made sure that the transformers version is 4.5.1.
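
As an aside, the floor_divide warning in the CPU log is only a deprecation notice, not an error. A tiny check of the replacement it suggests ('trunc' reproduces the current floor_divide behaviour, 'floor' is true floor division):

import torch

a, b = torch.tensor([-7]), torch.tensor([2])
print(torch.div(a, b, rounding_mode="trunc"))  # tensor([-3]): what floor_divide does today
print(torch.div(a, b, rounding_mode="floor"))  # tensor([-4]): true floor division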

Flaick commented 2 years ago

I did not encounter that error when running on the GPU. I deployed it with SLURM on eight 2080 Ti cards, and I am not sure what is happening here. Sorry that I cannot help with that.

TopCoder2K commented 2 years ago

Hmm, that's strange... We should have the same libraries. What CUDA version do you have?

I hope @alcinos can help! Here is the necessary info:

TopCoder2K commented 2 years ago

Also, I tried to run the fine-tuning in Docker with CUDA 10.2 and with CUDA 11.1. Again, it works on the CPU, but I still get the same error on the GPU :( Here is what I ran to set up the environments:

conda init
bash
conda create -n mdetr_env python=3.8
conda activate mdetr_env
pip install numpy
pip install -r requirements.txt

numpy is needed because pycocotools uses it (I got an error without numpy installed). It may also be worth pointing out that pycocotools "was installed using the legacy 'setup.py install' method, because a wheel could not be built for it". conda list gives:

_libgcc_mutex             0.1                        main  
_openmp_mutex             4.5                       1_gnu  
ca-certificates           2021.10.26           h06a4308_2  
certifi                   2021.10.8        py38h06a4308_2  
charset-normalizer        2.0.10                   pypi_0    pypi
click                     8.0.3                    pypi_0    pypi
cloudpickle               2.0.0                    pypi_0    pypi
cycler                    0.11.0                   pypi_0    pypi
cython                    0.29.26                  pypi_0    pypi
filelock                  3.4.2                    pypi_0    pypi
flatbuffers               2.0                      pypi_0    pypi
fonttools                 4.29.0                   pypi_0    pypi
idna                      3.3                      pypi_0    pypi
joblib                    1.1.0                    pypi_0    pypi
kiwisolver                1.3.2                    pypi_0    pypi
ld_impl_linux-64          2.35.1               h7274673_9  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.3.0               h5101ec6_17  
libgomp                   9.3.0               h5101ec6_17  
libstdcxx-ng              9.3.0               hd4cf53a_17  
matplotlib                3.5.1                    pypi_0    pypi
ncurses                   6.3                  h7f8727e_2  
numpy                     1.22.1                   pypi_0    pypi
onnx                      1.10.2                   pypi_0    pypi
onnxruntime               1.10.0                   pypi_0    pypi
openssl                   1.1.1m               h7f8727e_0  
packaging                 21.3                     pypi_0    pypi
panopticapi               0.1                      pypi_0    pypi
pillow                    9.0.0                    pypi_0    pypi
pip                       21.2.4           py38h06a4308_0  
prettytable               3.0.0                    pypi_0    pypi
protobuf                  3.19.3                   pypi_0    pypi
pycocotools               2.0                      pypi_0    pypi
pyparsing                 3.0.7                    pypi_0    pypi
python                    3.8.12               h12debd9_0  
python-dateutil           2.8.2                    pypi_0    pypi
readline                  8.1.2                h7f8727e_1  
regex                     2022.1.18                pypi_0    pypi
requests                  2.27.1                   pypi_0    pypi
sacremoses                0.0.47                   pypi_0    pypi
scipy                     1.7.3                    pypi_0    pypi
setuptools                58.0.4           py38h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.37.0               hc218d9a_0  
submitit                  1.4.1                    pypi_0    pypi
timm                      0.5.4                    pypi_0    pypi
tk                        8.6.11               h1ccaba5_0  
tokenizers                0.10.3                   pypi_0    pypi
torch                     1.9.1                    pypi_0    pypi
torchvision               0.10.1                   pypi_0    pypi
tqdm                      4.62.3                   pypi_0    pypi
transformers              4.5.1                    pypi_0    pypi
typing-extensions         4.0.1                    pypi_0    pypi
urllib3                   1.26.8                   pypi_0    pypi
wcwidth                   0.2.5                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0  
xmltodict                 0.12.0                   pypi_0    pypi
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7f8727e_4

transformers is 4.5.1, so I have no idea why the error occurs. Maybe I should try the good old 'print' method and print all the tensor sizes in the hope of noticing something wrong.
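
Roughly what I have in mind: a small helper to call right before the failing assignment. The names follow what the traceback shows for loss_labels; treat this as a sketch, not the exact MDETR code.

import torch

def dump_index_state(eos_coef: torch.Tensor, idx: tuple) -> None:
    # idx is the (batch_idx, query_idx) tuple that loss_labels indexes with.
    batch_idx, query_idx = idx
    print("eos_coef:", tuple(eos_coef.shape), eos_coef.device, eos_coef.dtype)
    print("matched queries:", batch_idx.numel())
    if batch_idx.numel() > 0:
        # Out-of-range indices here would explain a CUDA-side indexing assert.
        print("max batch idx:", int(batch_idx.max()), "max query idx:", int(query_idx.max()))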

TopCoder2K commented 2 years ago

Oh, there is no error with python=3.7.10, torch=1.8.1, torchvision=0.9.1, CUDA=11.1, transformers=4.5.1! With the recommended python=3.8 it also works (I'm using python=3.8.12).
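
For anyone who lands here later, a tiny startup sanity check based on the combination that worked for me (the exact pins come from my runs above, not from a tested compatibility matrix):

import torch
import torchvision
import transformers

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)
print("CUDA:", torch.version.cuda)
# 4.5.1 is the transformers version used throughout this thread.
assert transformers.__version__ == "4.5.1"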