Open snowmint opened 3 years ago
Hi @snowmint, this code is hardcoded to require cuda/GPU.
Hi, @ajtao, thank you for the quick reply.
I have set the CUDA environment and can run inference normally.
Sorry about that, I forgot to provide my system environment information. I'll show it below:
Host OS: Ubuntu 18.04
Docker: 20.10.5
NVIDIA Docker: 2.5.0
Container: Ubuntu 18.04, using nvcr.io/nvidia/pytorch:19.10-py3 as the base image
/home/semantic-segmentation# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
/home/semantic-segmentation# nvidia-smi
Fri Mar 12 00:00:41 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 On | N/A |
| 23% 35C P8 12W / 250W | 363MiB / 11175MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I can run the code with dump_folder.yml and dump_cityscape.yml, and get the result:
Although I got the result, it seems that dump_cityscape runs on the CPU and takes almost 4 hours to produce this IoU report, which is quite slow. Am I missing some setting?
...
validating[Iter: 3441 / 3475]
validating[Iter: 3461 / 3475]
IoU:
Id label iU_1.0 TP FP FN Precision Recall
---- ------------- -------- ----- ---- ---- ----------- --------
0 road 99.28 36.83 0.00 0.00 1.00 1.00
1 sidewalk 94.78 5.83 0.03 0.03 0.97 0.97
2 building 96.35 22.26 0.02 0.02 0.98 0.98
3 wall 87.33 0.62 0.07 0.08 0.94 0.93
4 fence 88.41 0.82 0.06 0.07 0.94 0.94
5 pole 77.92 1.10 0.15 0.13 0.87 0.89
6 traffic light 82.02 0.19 0.10 0.12 0.91 0.89
7 traffic sign 89.55 0.53 0.06 0.05 0.94 0.95
8 vegetation 95.12 15.75 0.02 0.03 0.98 0.97
9 terrain 87.67 1.04 0.07 0.07 0.93 0.94
10 sky 96.65 3.86 0.02 0.02 0.98 0.98
11 person 91.19 1.18 0.05 0.05 0.96 0.95
12 rider 83.11 0.13 0.10 0.11 0.91 0.90
13 car 97.44 6.84 0.01 0.01 0.99 0.99
14 truck 96.21 0.27 0.02 0.02 0.98 0.98
15 bus 96.72 0.25 0.02 0.02 0.98 0.98
16 train 96.30 0.21 0.02 0.02 0.98 0.98
17 motorcycle 87.38 0.09 0.07 0.08 0.94 0.93
18 bicycle 86.49 0.43 0.07 0.09 0.93 0.92
Mean: 91.05
-----------------------------------------------------------------------------------------------------------
this : [epoch 0], [val loss 0.07421], [acc 0.98221], [acc_cls 0.95319], [mean_iu 0.91049], [fwavacc 0.96566]
best : [epoch 0], [val loss 0.07421], [acc 0.98221], [acc_cls 0.95319], [mean_iu 0.91049], [fwavacc 0.96566]
-----------------------------------------------------------------------------------------------------------
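If it helps to narrow down whether the slow eval above is CPU-bound, here is a minimal, framework-agnostic sketch (the `cpu_fraction` helper and the stand-in workload are hypothetical, not from this repo). A step that pegs a CPU core shows a CPU-time/wall-time ratio near 1.0, while a GPU-bound step mostly waits and shows a ratio near 0:

```python
import time

def cpu_fraction(work):
    """Ratio of process CPU time to wall time while `work` runs.

    Near 1.0 (or higher, with threads) means the step is CPU-bound;
    near 0 means the process mostly waits (e.g. on the GPU or on I/O).
    """
    t0, c0 = time.perf_counter(), time.process_time()
    work()
    wall = time.perf_counter() - t0
    cpu = time.process_time() - c0
    return cpu / wall if wall > 0 else 0.0

# Stand-in for one validation batch (hypothetical); a pure-Python loop
# is fully CPU-bound, so the ratio should come out close to 1.0.
frac = cpu_fraction(lambda: sum(i * i for i in range(500_000)))
print(f"CPU-bound fraction: {frac:.2f}")
```

Wrapping a single real validation batch this way would distinguish "inference silently fell back to CPU" from "the GPU is used but the pipeline stalls elsewhere".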
I have encountered the same stuck issue when running eval_cityscapes.yml. Did you resolve it?
Thanks,
@fjremnav
I changed the batch size to 1 and modified the command in eval_cityscapes.yml to CMD: "python -m torch.distributed.launch --nproc_per_node=1 train.py", and then it runs normally.
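For reference, that change corresponds to a fragment like this in eval_cityscapes.yml (the CMD line is quoted from this thread; the surrounding keys of your runx config may differ and are not shown):

```yaml
# eval_cityscapes.yml (fragment)
# --nproc_per_node=1 launches a single process, so one GPU is enough
CMD: "python -m torch.distributed.launch --nproc_per_node=1 train.py"
```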
But I still have a question about evaluation: my CPU usage is always very high (I have turned off everything except the evaluation program), and evaluation takes a very long time.
@snowmint
Thanks for the info and I will give it a try.
I trained the model with my own dataset. The eval passes after epoch 0 and epoch 2 run normally, but training gets stuck at epoch 4.
I observed that during training, CPU memory usage slowly increases. Maybe this is what causes the evaluation to get stuck after a certain epoch.
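To check whether host memory really grows epoch after epoch, a stdlib-only sketch like this can be dropped around the training loop (the `train_one_epoch` stand-in is hypothetical; note that on Linux `ru_maxrss` is reported in KiB and is a monotonic peak, so it only ever stays flat or rises):

```python
import resource

def peak_rss_mb() -> float:
    """Peak resident set size of this process so far, in MiB (Linux: KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def train_one_epoch():
    # Stand-in for a real training epoch (hypothetical workload).
    _ = [bytes(1024) for _ in range(100)]

growth = []
for epoch in range(3):
    train_one_epoch()
    growth.append(peak_rss_mb())
    print(f"epoch {epoch}: peak RSS {growth[-1]:.1f} MiB")

# If the peak keeps climbing every epoch, something is accumulating
# references (cached tensors, ever-growing loss lists, etc.).
```

Logging this once per epoch makes a slow leak visible long before the machine starts swapping and the eval appears "stuck".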
I ran the command python -m runx.runx scripts/eval_cityscapes.yml -i. After 18 hours, one of my computer's CPU cores still had 100% usage, and ./logs/eval_cityscapes didn't contain any result. Finally, I force-quit the command with Ctrl+C, but logging.log didn't show any error message (checked with cat logging.log).
My directory for the Cityscapes dataset was set as below, and I have also modified the config file:
I can't even get a single word of error output to determine what is going wrong. Has anyone faced the same problem, or does anyone have a suggestion to help me figure out the trouble?
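When a run hangs with nothing in logging.log, one low-effort diagnostic (standard library only, not part of this repo) is to enable faulthandler near the top of the entry script; you can then get a Python stack trace of the stuck process from another shell without killing it:

```python
import faulthandler
import signal
import sys

# Dump every thread's Python stack to stderr when the process receives
# SIGUSR1, e.g. `kill -USR1 <pid>` from another shell (Unix only).
faulthandler.register(signal.SIGUSR1)

# Dump the current stack once now, to verify the setup actually works.
faulthandler.dump_traceback(file=sys.stderr)
```

The resulting traceback shows exactly which line each thread is blocked on (a DataLoader worker, a distributed barrier, a dataset scan, ...), which is usually enough to locate a silent hang.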