Megvii-BaseDetection / YOLOX

YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
Apache License 2.0

Training seems to be stuck after annotations are loaded into memory #1120

Open 19Timotei97 opened 2 years ago

19Timotei97 commented 2 years ago

I am trying to train a YOLOX-L model on a smaller version of COCO (only images with people, bikes, cars, and trucks), i.e. the kinds of objects you would encounter with a driverless car. The problem is that even with a setup of 4 or 6 x V100 GPUs, which should be enough for this training, the process seems to hang for hours (never reaching the first epoch) after the annotations are loaded and the index is created, as seen below:

```
2022-02-11 10:22:32.743 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 5 initialization finished.
2022-02-11 10:22:32.760 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 1 initialization finished.
2022-02-11 10:22:32.764 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 0 initialization finished.
2022-02-11 10:22:32.792 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 2 initialization finished.
2022-02-11 10:22:32.796 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 4 initialization finished.
2022-02-11 10:22:32.797 | INFO | yolox.core.launch:_distributed_worker:116 - Rank 3 initialization finished.
2022-02-11 10:22:45.931 | INFO | yolox.utils.setup_env:configure_omp:46 -
We set OMP_NUM_THREADS for each process to 1 to speed up. please further tune the variable for optimal performance.
```


```
2022-02-11 10:22:45 | INFO | yolox.core.trainer:127 - args: Namespace(batch_size=16, cache=False, ckpt=None, devices=6, dist_backend='nccl', dist_url=None, exp_file='YOLOX/exps/example/custom/yolox_l.py', experiment_name='yolox_l', fp16=False, machine_rank=0, name='yolox-s', num_machines=1, occupy=False, opts=[], resume=False, start_epoch=None)
2022-02-11 10:22:45 | INFO | yolox.core.trainer:128 - exp value:
  seed             = None
  output_dir       = './YOLOX_outputs'
  print_interval   = 10
  eval_interval    = 1
  num_classes      = 80
  depth            = 1.0
  width            = 1.0
  act              = 'silu'
  data_num_workers = 8
  input_size       = (640, 640)
  multiscale_range = 5
  data_dir         = 'YOLOX/datasets/COCO_2017_obstacle_detection'
  train_ann        = 'instances_train2017.json'
  val_ann          = 'instances_val2017.json'
  mosaic_prob      = 1.0
  mixup_prob       = 1.0
  hsv_prob         = 1.0
  flip_prob        = 0.5
  degrees          = 10.0
  translate        = 0.1
  mosaic_scale     = (0.1, 2)
  mixup_scale      = (0.5, 1.5)
  shear            = 2.0
  enable_mixup     = True
  warmup_epochs    = 3
  max_epoch        = 300
  warmup_lr        = 0
  basic_lr_per_img = 0.0001
  scheduler        = 'yoloxwarmcos'
  no_aug_epochs    = 15
  min_lr_ratio     = 0.05
  ema              = True
  weight_decay     = 0.0005
  momentum         = 0.9
  exp_name         = 'yolox_l'
  test_size        = (640, 640)
  test_conf        = 0.01
  nmsthre          = 0.65
2022-02-11 10:23:29 | INFO | yolox.core.trainer:134 - Model Summary: Params: 54.21M, Gflops: 155.65
2022-02-11 10:23:29 | INFO | yolox.data.datasets.coco:66 - loading annotations into memory...
2022-02-11 10:23:37 | INFO | yolox.data.datasets.coco:66 - Done (t=7.81s)
2022-02-11 10:23:37 | INFO | pycocotools.coco:86 - creating index...
2022-02-11 10:23:38 | INFO | pycocotools.coco:86 - index created!
```

Do you have any idea? :) I tried changing num_workers to more than 4 (8 or 16), but that doesn't seem to change anything. Also, the dataset is in COCO format, containing only the images with the classes of interest and the annotations for those images.

One more important piece of information: I'm training from scratch, without using any ckpt file. Could the data augmentation slow down the training this much, or is it completely stuck?
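One way to tell a merely slow dataloader apart from a genuine hang is to time the first few batches directly. The helper below is a hypothetical sketch, not part of YOLOX; `loader` stands in for any iterable dataloader:

```python
import time

def time_first_batches(loader, n=3):
    """Return the wall-clock seconds spent fetching each of the first n batches."""
    durations = []
    it = iter(loader)
    for _ in range(n):
        t0 = time.perf_counter()
        try:
            next(it)  # if this call never returns, the pipeline is stuck, not slow
        except StopIteration:
            break
        durations.append(time.perf_counter() - t0)
    return durations
```

If each fetch takes tens of seconds but does return, heavy augmentation (mosaic + mixup) is the likely culprit; if the first `next()` never returns, the workers are deadlocked.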

FateScript commented 2 years ago

Please change your num_workers to 0 (or a value smaller than 4) and watch your memory usage (htop might help). Let us know if it still gets stuck.
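In YOLOX the worker count comes from the `data_num_workers` attribute of the experiment file, so a minimal sketch of that change (class layout assumed from the standard custom-exp examples shipped with the repo) looks like:

```python
# Sketch of a custom experiment file that disables dataloader workers,
# so all loading runs in the main process. This rules out worker
# deadlocks and shared-memory exhaustion as the cause of the hang.
from yolox.exp import Exp as BaseExp

class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.depth = 1.0           # YOLOX-L depth multiplier
        self.width = 1.0           # YOLOX-L width multiplier
        self.data_num_workers = 0  # was 8 in the log above
```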

19Timotei97 commented 2 years ago

Unfortunately, even after applying your suggestion, the training was still stuck, and it was eventually killed by the scheduler system that we use. Any ideas? :)

Joker316701882 commented 1 year ago

@19Timotei97 Hi. This is caused by a memory issue in the cached-data path. We recently fixed it in PR #1584. Feel free to try it!
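For context on why a RAM cache can exhaust memory and stall or kill a run, a back-of-the-envelope estimate helps (the image count is COCO train2017; the cached resolution is an assumption for illustration):

```python
# Rough RAM needed to cache a COCO-train-sized dataset as raw uint8 pixels.
num_images = 118_287   # COCO train2017 image count
h, w, c = 640, 640, 3  # assumed cached resolution and channel count
bytes_needed = num_images * h * w * c
print(f"~{bytes_needed / 1e9:.0f} GB")  # prints "~145 GB"
```

An estimate on this order easily exceeds the RAM of many training nodes, so if the cache grows unchecked the process can thrash or be OOM-killed before the first epoch starts.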

hoangdung3498 commented 1 year ago

> @19Timotei97 Hi. This is caused by a memory issue in the cached-data path. We recently fixed it in PR #1584. Feel free to try it!

I'm a newbie, so can you show more detail on how to fix that?
