boostcampaitech3 / level2-data-annotation_cv-level2-cv-09

level2-data-annotation_cv-level2-cv-09 created by GitHub Classroom

[Feat] Sweep #7

Open km9mn opened 2 years ago

km9mn commented 2 years ago

What?

Why?

Todo

km9mn commented 2 years ago

https://wandb.ai/level2-cv-09/data-annotation/sweeps/30czr85t?workspace=user-km9mn

km9mn commented 2 years ago

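During a sweep, latest.pth in trained_models keeps being overwritten, so only the most recently trained model is kept. This needs to change so that checkpoints are saved per model, or so that only the best model is saved.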
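
One way to get both behaviours (a separate checkpoint per sweep run, and only the best one kept) is sketched below. This is an illustration, not the code in this repository: `BestCheckpointSaver`, `run_name`, and the `best.pth` file name are hypothetical, and the save function is passed in so the sketch does not depend on PyTorch directly.

```python
import os
from typing import Any, Callable, Optional


class BestCheckpointSaver:
    """Keep one checkpoint per run and overwrite it only when the loss improves."""

    def __init__(self, model_dir: str, run_name: str,
                 save_fn: Callable[[Any, str], None]) -> None:
        # One subdirectory per sweep run, so runs never overwrite each other.
        self.run_dir = os.path.join(model_dir, run_name)
        os.makedirs(self.run_dir, exist_ok=True)
        self.save_fn = save_fn  # e.g. torch.save (assumed by this sketch)
        self.best_loss: Optional[float] = None

    def step(self, state: Any, epoch_loss: float) -> Optional[str]:
        """Save `state` as best.pth if `epoch_loss` is the lowest seen so far.

        Returns the checkpoint path when a save happened, otherwise None.
        """
        if self.best_loss is not None and epoch_loss >= self.best_loss:
            return None
        self.best_loss = epoch_loss
        path = os.path.join(self.run_dir, "best.pth")
        self.save_fn(state, path)
        return path
```

In the training loop this could be used as `saver = BestCheckpointSaver(model_dir, wandb.run.id, torch.save)` followed by `saver.step(model.state_dict(), epoch_loss)` at the end of each epoch, assuming the run was started with `wandb.init()` and that a lower epoch loss means a better model.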

km9mn commented 2 years ago

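OOM occurs when batch_size and num_workers get large, so both need to be tuned. Sampling image size from a uniform distribution caused many errors, so it is currently restricted to a fixed set of values: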

image_size:
    values:
    - 128
    - 256
    - 512
    - 1024
    - 2048
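
For reference, the same restriction expressed as a full W&B sweep configuration in Python might look like the sketch below. Only the `image_size` values come from this thread; the search method, the metric name, and the `batch_size` / `num_workers` choices are assumptions picked for illustration.

```python
# Sketch of a W&B sweep configuration as a plain Python dict.
sweep_config = {
    "method": "random",                              # assumed search method
    "metric": {"name": "loss", "goal": "minimize"},  # metric name is assumed
    "parameters": {
        # Discrete choices instead of a uniform distribution.
        "image_size": {"values": [128, 256, 512, 1024, 2048]},
        # Small explicit choices to stay within memory (values are assumed).
        "batch_size": {"values": [8, 12, 16]},
        "num_workers": {"values": [2, 4]},
    },
}

# Every swept parameter here is a discrete list, so no fractional sizes
# can be sampled.
all_discrete = all("values" in p for p in sweep_config["parameters"].values())
```

The dict can be registered with `wandb.sweep(sweep_config, project="data-annotation")`. A continuous `uniform` distribution samples floats, which is one plausible (unconfirmed) reason the uniform image size produced errors.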

km9mn commented 2 years ago

        batch_size: 13
        data_dir: ../input/data/ICDAR17_Korean
        device: cuda
        expr_name: sweep
        image_size: 512
        input_size: 256
        learning_rate: 0.0003715485009341774
        max_epoch: 200
        model_dir: trained_models
        num_workers: 13
        optimizer: Adam
        save_interval: 5

With the settings above, training ran until epoch 89 and then failed with the following error:

Traceback (most recent call last):
  File "train2.py", line 118, in <module>
    main(args)
  File "train2.py", line 111, in main
    do_training(**args.__dict__)
  File "train2.py", line 82, in do_training
    loss, extra_info = model.train_step(img, gt_score_map, gt_geo_map, roi_mask)
  File "/opt/ml/code/model.py", line 181, in train_step
    loss, values_dict = self.criterion(score_map, pred_score_map, geo_map, pred_geo_map,
  File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 588, in __iter__
    raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor

yoonghee commented 2 years ago

--batch_size=13 --data_dir=../input/data/ICDAR17_Korean --device=cuda --expr_name=sweep_yoonghee --image_size=512 --input_size=256 --learning_rate=0.0003715485009341774 --max_epoch=200 --model_dir=pths --num_workers=13 --optimizer=Adam --save_interval=5

With the settings above, the same error occurred at epoch 76 and the run terminated. -- Submission result (latest.pth) --

image

yoonghee commented 2 years ago


After adding exception handling at model.py line 182 and train.py line 83 for the case where extra_info / values_dict are not available, training runs fine. Should I commit the exception-handling code to GitHub?
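
The TypeError in the traceback is raised when `loss, values_dict = self.criterion(...)` tries to unpack a single 0-d tensor instead of a two-element tuple. The exact fix is not shown in this thread. A minimal sketch of that kind of guard, with hypothetical helper names `split_criterion_output` and `format_extra_info`, could look like this:

```python
from typing import Any, Dict, Optional, Tuple


def split_criterion_output(out: Any) -> Tuple[Any, Optional[Dict[str, Any]]]:
    """Normalize a criterion result to (loss, values_dict).

    A two-element tuple or list is passed through. Anything else (for
    example a bare 0-d loss tensor) is treated as a loss with no values.
    """
    if isinstance(out, (tuple, list)) and len(out) == 2:
        return out[0], out[1]
    return out, None


def format_extra_info(extra_info: Optional[Dict[str, Any]]) -> Dict[str, float]:
    """Return logging-ready values, or an empty dict when none are available."""
    if extra_info is None:
        return {}
    return {name: float(value) for name, value in extra_info.items()}
```

In `train_step` the call would become `loss, values_dict = split_criterion_output(self.criterion(...))`, and the training loop would log `format_extra_info(extra_info)` instead of indexing `extra_info` directly. An alternative is to make the criterion always return a tuple, which keeps the callers unchanged.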