NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

eval_cityscape stuck forever #132

Open · snowmint opened this issue 3 years ago

snowmint commented 3 years ago

I ran the command python -m runx.runx scripts/eval_cityscapes.yml -i. After 18 hours, one of my computer's CPU cores was still at 100% usage, and nothing had been produced under ./logs/eval_cityscapes.

Eventually I force-quit the command with Ctrl+C, but logging.log did not show any error message. Output of cat logging.log:

Torch version: 1.3, 1.3.0a0+24ae9b5
n scales [0.5, 1.0, 2.0]
dataset = cityscapes
ignore_label = 255
num_classes = 19
cv split val 0 ['val/munster', 'val/frankfurt', 'val/lindau']
mode val found 500 images
cn num_classes 19
Using Cross Entropy Loss
Using Cross Entropy Loss
Loading weights from: checkpoint=/home/large_asset_dir/seg_weights/cityscapes_ocrnet.HRNet_Mscale_outstanding-turtle.pth
=> init weights from normal distribution
=> loading pretrained model /home/large_asset_dir/seg_weights/hrnetv2_w48_imagenet_pretrained.pth
Trunk: hrnetv2
Model params = 72.1M

My Cityscapes dataset directory is laid out as shown below, and I have also modified the config file accordingly (see the sketch after the listing):

/home/large_asset_dir/data/Cityscapes# ls
    |--README
    |--gtFine_trainvaltest
    |--leftImg8bit_trainvaltest
    |--license.txt
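For reference, a minimal sketch of the path change in config.py, assuming the asset-root variable the repo's README points to (__C.ASSETS_PATH); the exact variable name is an assumption and may differ in your checkout:

    # config.py (sketch): point the asset root at the directory that contains
    # data/Cityscapes and seg_weights. The variable name is assumed from the
    # README; adapt it to your checkout.
    __C.ASSETS_PATH = '/home/large_asset_dir'
    # Cityscapes is then expected at <ASSETS_PATH>/data/Cityscapes with
    # gtFine_trainvaltest/ and leftImg8bit_trainvaltest/ inside it.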

I can't get even a single error message to determine what is going wrong. Has anyone faced the same problem, or does anyone have a suggestion to help me figure out the trouble?

ajtao commented 3 years ago

Hi @snowmint, this code is hardcoded to require cuda/GPU.
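For anyone hitting this, a minimal sanity check along these lines (a hypothetical helper, not part of the repo) fails fast with a message instead of hanging when no GPU is visible to PyTorch:

    import sys
    import torch

    # The eval scripts assume a CUDA device; abort early if PyTorch can't see one.
    if not torch.cuda.is_available():
        sys.exit("No CUDA device visible to PyTorch -- eval_cityscapes will not run on CPU.")
    print("Using GPU:", torch.cuda.get_device_name(0))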

snowmint commented 3 years ago

Hi, @ajtao, thank you for the quick reply.

I have set the CUDA environment and can run inference normally.

Sorry, I forgot to provide my system environment information; here it is:
Host OS: Ubuntu 18.04
Docker: 20.10.5
NVIDIA Docker: 2.5.0
Container: Ubuntu 18.04, using nvcr.io/nvidia/pytorch:19.10-py3 as the base image

/home/semantic-segmentation# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
/home/semantic-segmentation# nvidia-smi
Fri Mar 12 00:00:41 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 23%   35C    P8    12W / 250W |    363MiB / 11175MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I can run the code with dump_folder.yml and dump_cityscape.yml and get results. However, even though I get the result, it seems that dump_cityscape runs on the CPU and takes almost 4 hours to produce this IoU report, which is quite slow. Have I missed some setting? (A device-placement check is sketched after the log below.)

...
validating[Iter: 3441 / 3475]
validating[Iter: 3461 / 3475]
IoU:
  Id  label            iU_1.0     TP    FP    FN    Precision    Recall
----  -------------  --------  -----  ----  ----  -----------  --------
   0  road              99.28  36.83  0.00  0.00         1.00      1.00
   1  sidewalk          94.78   5.83  0.03  0.03         0.97      0.97
   2  building          96.35  22.26  0.02  0.02         0.98      0.98
   3  wall              87.33   0.62  0.07  0.08         0.94      0.93
   4  fence             88.41   0.82  0.06  0.07         0.94      0.94
   5  pole              77.92   1.10  0.15  0.13         0.87      0.89
   6  traffic light     82.02   0.19  0.10  0.12         0.91      0.89
   7  traffic sign      89.55   0.53  0.06  0.05         0.94      0.95
   8  vegetation        95.12  15.75  0.02  0.03         0.98      0.97
   9  terrain           87.67   1.04  0.07  0.07         0.93      0.94
  10  sky               96.65   3.86  0.02  0.02         0.98      0.98
  11  person            91.19   1.18  0.05  0.05         0.96      0.95
  12  rider             83.11   0.13  0.10  0.11         0.91      0.90
  13  car               97.44   6.84  0.01  0.01         0.99      0.99
  14  truck             96.21   0.27  0.02  0.02         0.98      0.98
  15  bus               96.72   0.25  0.02  0.02         0.98      0.98
  16  train             96.30   0.21  0.02  0.02         0.98      0.98
  17  motorcycle        87.38   0.09  0.07  0.08         0.94      0.93
  18  bicycle           86.49   0.43  0.07  0.09         0.93      0.92
Mean: 91.05
-----------------------------------------------------------------------------------------------------------
this : [epoch 0], [val loss 0.07421], [acc 0.98221], [acc_cls 0.95319], [mean_iu 0.91049], [fwavacc 0.96566]
best : [epoch 0], [val loss 0.07421], [acc 0.98221], [acc_cls 0.95319], [mean_iu 0.91049], [fwavacc 0.96566]
-----------------------------------------------------------------------------------------------------------
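To rule out a silent CPU fallback, a small check like this (a hypothetical helper, not part of the repo) can be dropped into the validation loop to confirm where the model and a batch actually live:

    import torch

    def report_devices(model, sample_batch):
        # Both should report cuda:0 (or another cuda index) if the GPU is being used.
        print("model parameters on:", next(model.parameters()).device)
        print("sample batch on:", sample_batch.device)
        print("cuDNN enabled:", torch.backends.cudnn.enabled)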
fjremnav commented 3 years ago

I have encountered the same hang when running eval_cityscapes.yml. Did you resolve this issue?

Thanks,

snowmint commented 3 years ago

@fjremnav

I changed the batch size to 1 and modified the command in eval_cityscapes.yml to CMD: "python -m torch.distributed.launch --nproc_per_node=1 train.py"; after that it runs normally.
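For reference, a sketch of the CMD change described above (runx-style yml; only that line is shown, and the batch-size change goes wherever your copy of the file sets it):

    # scripts/eval_cityscapes.yml (sketch): only the launch command is shown.
    CMD: "python -m torch.distributed.launch --nproc_per_node=1 train.py"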

But I still have a question about evaluation: CPU usage stays very high (I have turned off everything except the evaluation program), and evaluation takes a very long time.

fjremnav commented 3 years ago

@snowmint

Thanks for the info and I will give it a try.

FuzhiYang commented 2 years ago

I trained the model on my own dataset. The eval passes after epoch 0 and epoch 2 ran normally, but training got stuck at epoch 4.

I observed that CPU memory usage slowly increases during training. Maybe this is what causes evaluation to get stuck after a certain epoch.
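One way to confirm that is a small logging sketch like the following (using psutil; hypothetical, not part of the repo), which records the host RSS so any growth across epochs shows up in the log:

    import os
    import psutil

    def log_cpu_rss(tag=""):
        # Resident set size of the current process, in MiB.
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
        print(f"[{tag}] host RSS: {rss_mb:.0f} MiB")

    # e.g. call log_cpu_rss(f"epoch {epoch}") at the start of each validation pass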