Open rgkannan676 opened 2 years ago
OS: Windows 10, GPU: 2080 Ti, CUDA: 11.2, cuDNN: 8.2.1
Hi,
It seems to be failing in the for loop https://github.com/PaddlePaddle/PaddleDetection/blob/19478edce0d2b940fdb333159ee9e0f42a1d44bf/ppdet/engine/trainer.py#L520 after completing one iteration without any error.
Execution reaches https://github.com/PaddlePaddle/PaddleDetection/blob/19478edce0d2b940fdb333159ee9e0f42a1d44bf/ppdet/engine/trainer.py#L530 without any issue but never gets to the next if condition https://github.com/PaddlePaddle/PaddleDetection/blob/19478edce0d2b940fdb333159ee9e0f42a1d44bf/ppdet/engine/trainer.py#L532
I checked len(loader) and it is 1, so the for loop should complete and execution should move on to the next line. I'm not sure why it fails.
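A note on where a loop like this can stall: a Python `for` loop calls `next()` on the iterator one more time after the last item, and only ends when that call raises `StopIteration`. With `len(loader) == 1`, a hang after the first iteration would be consistent with that second `next()` call never returning. The sketch below drives an iterator by hand to make the extra call visible. `trace_iteration` is a hypothetical helper, and the plain list is only a stand-in for the real loader.

```python
def trace_iteration(iterable):
    """Iterate by hand and record each next() call, including the final one.

    Hypothetical debugging helper: `iterable` stands in for the data loader.
    """
    events = []
    it = iter(iterable)
    step_id = 0
    while True:
        events.append(f"calling next() for step {step_id}")
        try:
            item = next(it)  # a stalled loader would block on this call
        except StopIteration:
            events.append("StopIteration raised, loop ends")
            break
        events.append(f"got item {item!r}")
        step_id += 1
    return events


if __name__ == "__main__":
    for line in trace_iteration(["batch0"]):
        print(line)
```

Running the same kind of trace against the real loader, with a print before and after each `next()`, would show whether the stall is in the loop body or in the request for a batch that does not exist.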
Hi, it seems your CUDA version is 11.6, but we haven't tested the PaddlePaddle .whl built for CUDA 11.2 running on 11.6. Could you please check whether the forward step executed correctly?
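One way to see which CUDA and cuDNN versions the installed wheel was built against is to ask PaddlePaddle itself. The sketch below is a minimal example, assuming `paddle.version.cuda()` and `paddle.version.cudnn()` are available in the installed build. The `getattr` guards cover builds where they are not, and `describe_paddle_build` is a hypothetical helper name.

```python
def describe_paddle_build():
    """Collect version details of the installed PaddlePaddle build, if any."""
    info = {}
    try:
        import paddle
    except ImportError:
        info["paddle"] = None  # PaddlePaddle is not installed in this environment
        return info
    info["paddle"] = paddle.__version__
    # These helpers report what the wheel was compiled against (assumption:
    # they exist in this build; otherwise "unknown" is recorded instead).
    version_mod = getattr(paddle, "version", None)
    for name in ("cuda", "cudnn"):
        fn = getattr(version_mod, name, None)
        info[name] = fn() if callable(fn) else "unknown"
    return info


if __name__ == "__main__":
    print(describe_paddle_build())
```

If the 11.6 figure came from `nvidia-smi`, note that this tool reports the highest CUDA version the driver supports rather than the installed toolkit version. `paddle.utils.run_check()` can additionally confirm that the installation can run on the GPU.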
Hi @zoooo0820, thank you for your reply. I am currently using CUDA Toolkit 11.2 and installed PaddlePaddle for "CUDA Toolkit 11.2 with cuDNN v8.2.1".
The forward step executes successfully; see below for the value of the outs variable from 'outs = self.model(data)'.
{'bbox': Tensor(shape=[100, 6], dtype=float32, place=CPUPlace, stop_gradient=False,
[[0. , 0.95238709 , 163.01693726, 81.64559937 , 198.24432373,
167.39353943],
[0. , 0.94864202 , 103.69308472, 45.28234863 , 126.47123718,
92.34967041 ],
[0. , 0.93114990 , 410.77569580, 84.69867706 , 502.04968262,
286.09646606],
[0. , 0.86811846 , 580.38507080, 113.47499084, 611.40753174,
200.90867615],
[56. , 0.86211467 , 74.17538452 , 121.53815460, 102.48110962,
153.12097168],
[33. , 0.85878396 , 158.66378784, 99.47405243 , 617.07495117,
340.79278564],
[0. , 0.82214695 , 267.17294312, 84.55513763 , 292.15731812,
167.21755981],
[0. , 0.78982729 , 348.59542847, 43.86344147 , 364.35952759,
97.94625092 ],
[0. , 0.76746190 , 506.55392456, 115.50983429, 595.37500000,
272.53643799],
[0. , 0.76479059 , 328.97283936, 38.41905975 , 345.94335938,
79.57720184 ],
[0. , 0.66306233 , 169.35107422, 47.09164047 , 177.86236572,
60.08278275 ],
[0. , 0.63529837 , 364.05261230, 57.54597092 , 380.31927490,
108.11462402],
[0. , 0.63385540 , 27.25297356 , 117.72573853, 59.92661285 ,
152.63989258],
[0. , 0.61775613 , 378.28048706, 39.14801407 , 394.20516968,
83.73554993 ],
[0. , 0.52536190 , 186.37814331, 44.48184204 , 198.98437500,
59.78129578 ],
[56. , 0.42762586 , 98.57673645 , 130.52426147, 115.63314819,
154.52008057],
[24. , 0.34061965 , 1.91031837 , 150.37460327, 36.59494781 ,
172.01226807],
[24. , 0.28557616 , 100.22119141, 153.69656372, 117.62846375,
167.25775146],
[0. , 0.27708492 , 464.59777832, 15.31473064 , 470.84313965,
32.88559341 ],
[24. , 0.27335307 , 65.50455475 , 135.86271667, 83.28864288 ,
153.69624329],
[0. , 0.25723633 , 278.11318970, 79.93159485 , 297.24136353,
167.95016479],
[24. , 0.24083048 , 55.61202240 , 152.50000000, 100.01781464,
172.92410278],
[0. , 0.21805932 , 279.19711304, 80.64604187 , 295.67074585,
107.86468506],
[0. , 0.19797659 , 504.30233765, 114.20022583, 553.80889893,
177.36087036],
[0. , 0.18388060 , 265.62255859, 82.04562378 , 298.82159424,
169.31526184],
[24. , 0.17146595 , 84.91261292 , 155.86296082, 102.34503174,
171.35951233],
[41. , 0.16824859 , 508.83779907, 274.14990234, 519.37933350,
282.82373047],
[56. , 0.15387134 , 27.25297356 , 117.72573853, 59.92661285 ,
152.63989258],
[2. , 0.15260814 , 621.40032959, 5.16224861 , 637.05194092,
9.21322823 ],
[0. , 0.12946725 , 239.86201477, 95.60108948 , 252.12431335,
115.53451538],
[0. , 0.12095736 , 267.54342651, 88.84803772 , 295.95217896,
168.18635559],
[0. , 0.11267159 , 265.10934448, 86.20228577 , 298.71273804,
170.46595764],
[56. , 0.11178289 , 96.55578613 , 128.70202637, 118.62936401,
162.42553711],
[0. , 0.10880505 , 582.94323730, 113.26557922, 611.49951172,
175.53730774],
[2. , 0.10859525 , 0. , 16.53050613 , 14.52909851 ,
23.79916000 ],
[33. , 0.10563771 , 368.06707764, 165.56274414, 611.18322754,
204.03463745],
[0. , 0.09884537 , 364.63833618, 57.85504532 , 380.69235229,
114.52586365],
[0. , 0.09510860 , 57.85502625 , 137.07394409, 98.92730713 ,
172.89193726],
[56. , 0.09373777 , 74.34246063 , 122.50324249, 104.22733307,
155.38175964],
[29. , 0.09371884 , 348.40737915, 62.41240311 , 355.60354614,
67.46076965 ],
[33. , 0.09282129 , 157.13171387, 87.86817932 , 619.11584473,
326.56201172],
[0. , 0.09127463 , 505.52392578, 115.86607361, 562.31152344,
271.79815674],
[56. , 0.08903752 , 20.60574722 , 135.21688843, 37.54605865 ,
150.46453857],
[24. , 0.08678816 , 4.23665810 , 135.34336853, 36.20613861 ,
152.94285583],
[0. , 0.08356526 , 379.86148071, 40.56462860 , 393.62051392,
81.53936005 ],
[26. , 0.08326706 , 84.91261292 , 155.86296082, 102.34503174,
171.35951233],
[0. , 0.08207747 , 582.83227539, 111.97151184, 610.94653320,
201.02879333],
[0. , 0.08070017 , 410.27426147, 18.80368042 , 416.80923462,
32.93857193 ],
[33. , 0.08046593 , 348.40737915, 62.41240311 , 355.60354614,
67.46076965 ],
[2. , 0.07974923 , 491.34378052, 0.67531943 , 506.34744263,
7.13704491 ],
[0. , 0.07541234 , 155.00167847, 121.06098938, 175.50897217,
162.40031433],
[45. , 0.06950981 , 508.83779907, 274.14990234, 519.37933350,
282.82373047],
[0. , 0.06857380 , 279.91683960, 80.59072113 , 296.86422729,
123.46126556],
[0. , 0.06644288 , 411.89114380, 88.08791351 , 496.98086548,
298.68270874],
[24. , 0.06579210 , 57.38172913 , 152.87834167, 91.19117737 ,
172.34333801],
[13. , 0.06562617 , 541.50823975, 16.37550926 , 567.86968994,
29.11607170 ],
[0. , 0.06278227 , 417.29626465, 87.95059967 , 483.93341064,
285.20736694],
[0. , 0.06231792 , 187.15985107, 44.91228104 , 197.96987915,
59.38371658 ],
[33. , 0.06120147 , 159.93621826, 87.80218506 , 620.00366211,
281.19601440],
[0. , 0.06084029 , 166.17408752, 82.07063293 , 195.87413025,
164.11593628],
[2. , 0.05905440 , 605.65051270, 5.36504650 , 619.39282227,
9.07478714 ],
[0. , 0.05722712 , 0. , 126.04899597, 8.50353718 ,
147.27384949],
[0. , 0.05692595 , 508.29989624, 118.50185394, 583.18402100,
270.97988892],
[13. , 0.05576024 , 497.66845703, 25.97649765 , 518.17468262,
30.66893387 ],
[0. , 0.05560281 , 279.95925903, 79.89441681 , 295.87161255,
114.19608307],
[0. , 0.05517075 , 280.04949951, 79.77035522 , 296.54510498,
145.41760254],
[24. , 0.05394541 , 20.91853714 , 135.88543701, 37.92250061 ,
152.18301392],
[2. , 0.05263206 , 424.52667236, 0. , 439.58990479,
4.05065966 ],
[0. , 0.05238270 , 352.33425903, 44.02976990 , 365.10067749,
79.64344025 ],
[0. , 0.05208762 , 506.73104858, 107.90383911, 598.93139648,
271.44659424],
[0. , 0.05184478 , 379.86734009, 39.31773758 , 395.26382446,
83.82485962 ],
[56. , 0.05011687 , 156.88832092, 123.81497192, 171.12413025,
157.19927979],
[0. , 0.05004956 , 505.50582886, 110.95537567, 559.44219971,
268.08361816],
[24. , 0.04983799 , 2.79379940 , 151.22576904, 25.45175552 ,
171.65582275],
[26. , 0.04907774 , 1.91031837 , 150.37460327, 36.59494781 ,
172.01226807],
[0. , 0.04811454 , 349.57623291, 51.46279907 , 367.45806885,
99.80699158 ],
[0. , 0.04789471 , 327.77239990, 38.27670288 , 345.53192139,
79.52001953 ],
[32. , 0.04785154 , 325.94613647, 168.70161438, 331.42800903,
172.15748596],
[0. , 0.04734908 , 352.08676147, 50.12598419 , 365.45071411,
92.97180939 ],
[24. , 0.04690786 , 81.91660309 , 152.35939026, 102.24404144,
171.32673645],
[56. , 0.04667836 , 37.21086121 , 131.02789307, 57.78130341 ,
152.23123169],
[24. , 0.04491711 , 58.02764511 , 153.03874207, 80.26491547 ,
172.24598694],
[0. , 0.04453458 , 478.98855591, 23.08139801 , 484.91421509,
32.63216782 ],
[0. , 0.04230860 , 422.13385010, 88.64218903 , 492.15478516,
282.76568604],
[26. , 0.04149212 , 101.58711243, 153.90850830, 118.32226562,
167.19729614],
[0. , 0.04106046 , 265.23681641, 87.43470764 , 285.05200195,
137.10748291],
[0. , 0.03906933 , 226.17948914, 83.91767883 , 251.87916565,
117.66175842],
[67. , 0.03879960 , 504.02099609, 144.37136841, 508.29077148,
148.25106812],
[0. , 0.03787024 , 195.74223328, 124.55506134, 211.21348572,
156.26644897],
[56. , 0.03746197 , 1.59692574 , 130.36569214, 37.51303864 ,
156.09533691],
[33. , 0.03709068 , 361.32479858, 95.98898315 , 506.31704712,
272.42892456],
[0. , 0.03638148 , 398.39410400, 25.05980873 , 404.40588379,
32.92620087 ],
[0. , 0.03631815 , 279.50268555, 81.48350525 , 287.15435791,
93.49650574 ],
[56. , 0.03562423 , 38.01062393 , 137.32843018, 57.58405685 ,
151.95956421],
[56. , 0.03515101 , 76.47907257 , 122.70375824, 97.49521637 ,
150.59005737],
[13. , 0.03484610 , 305.01419067, 26.07754517 , 326.15213013,
30.91025925 ],
[56. , 0.03460176 , 98.63943481 , 131.12774658, 116.00555420,
157.22882080],
[24. , 0.03453739 , 21.94862366 , 152.27066040, 37.23377609 ,
170.60235596],
[0. , 0.03412202 , 328.51235962, 38.36921692 , 344.26052856,
76.72988892 ],
[33. , 0.03346637 , 415.97900391, 87.17357635 , 543.17413330,
281.36849976]]), 'bbox_num': Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=False,
[100])}
@rgkannan676 I just checked this in a Windows environment similar to yours but failed to reproduce the error. It looks similar to the known Windows bugs #33341 and #3558. There may be errors in the dataloader that are not caught by pybind. Sorry for the trouble. We hope to locate and fix this bug as soon as possible.
Hi @zoooo0820 . Thank you for your reply.
I was able to complete the inference by breaking out of the loop as shown below.
for step_id, data in enumerate(loader):
    self.status['step_id'] = step_id
    # forward
    outs = self.model(data)
    for key in ['im_shape', 'scale_factor', 'im_id']:
        outs[key] = data[key]
    for key, value in outs.items():
        if hasattr(value, 'numpy'):
            outs[key] = value.numpy()
    results.append(outs)
    if (step_id + 1) == len(loader):
        break
Can you please confirm whether this is OK? I assume 'len(loader)' equals the number of images, or the number of iterations the loop should execute. Am I right? Will this work for a batch of images?
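On the `len(loader)` question: for a loader that groups samples into batches, the length is normally the number of batches per pass, not the number of images. The two only coincide when the batch size is 1. The sketch below works through the arithmetic and mimics the patched loop on plain lists. It assumes the reader behaves like a standard batched loader, and `num_batches` and `run_with_break` are hypothetical names used only for illustration.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of iterations a batched loader yields per pass over the data."""
    if drop_last:
        return num_samples // batch_size  # incomplete final batch is discarded
    return math.ceil(num_samples / batch_size)


def run_with_break(batches):
    """Mimic the patched loop: stop right after the last batch is handled."""
    seen = []
    for step_id, data in enumerate(batches):
        seen.append(data)
        if (step_id + 1) == len(batches):
            break
    return seen


if __name__ == "__main__":
    images = list(range(10))
    batch_size = 4
    batches = [images[i:i + batch_size] for i in range(0, len(images), batch_size)]
    assert len(batches) == num_batches(len(images), batch_size)
    print(run_with_break(batches))
```

Because the `break` fires only after the final batch has been processed, every batch is still handled regardless of batch size.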
I made the following changes to the for loop https://github.com/PaddlePaddle/PaddleDetection/blob/19478edce0d2b940fdb333159ee9e0f42a1d44bf/ppdet/engine/trainer.py#L374
1) Reload self.loader on each iteration of the epoch loop. Otherwise it was failing on the 2nd epoch at "for step_id, data in enumerate(self.loader):".
2) Added a break statement as before.
for epoch_id in range(self.start_epoch, self.cfg.epoch):
    self.status['mode'] = 'train'
    self.status['epoch_id'] = epoch_id
    self._compose_callback.on_epoch_begin(self.status)
    self.loader = create('{}Reader'.format(self.mode.capitalize()))(
        self.dataset, self.cfg.worker_num)
    self.loader.dataset.set_epoch(epoch_id)
    model.train()
    iter_tic = time.time()
    for step_id, data in enumerate(self.loader):
        self.status['data_time'].update(time.time() - iter_tic)
        self.status['step_id'] = step_id
        profiler.add_profiler_step(profiler_options)
        self._compose_callback.on_step_begin(self.status)
        data['epoch_id'] = epoch_id
        if self.cfg.get('fp16', False):
            with amp.auto_cast(enable=self.cfg.use_gpu):
                # model forward
                outputs = model(data)
                loss = outputs['loss']
            # model backward
            scaled_loss = scaler.scale(loss)
            scaled_loss.backward()
            # in dygraph mode, optimizer.minimize is equal to optimizer.step
            scaler.minimize(self.optimizer, scaled_loss)
        else:
            # model forward
            outputs = model(data)
            loss = outputs['loss']
            # model backward
            loss.backward()
            self.optimizer.step()
        curr_lr = self.optimizer.get_lr()
        self.lr.step()
        if self.cfg.get('unstructured_prune'):
            self.pruner.step()
        self.optimizer.clear_grad()
        self.status['learning_rate'] = curr_lr
        if self._nranks < 2 or self._local_rank == 0:
            self.status['training_staus'].update(outputs)
        self.status['batch_time'].update(time.time() - iter_tic)
        self._compose_callback.on_step_end(self.status)
        if self.use_ema:
            self.ema.update(self.model)
        iter_tic = time.time()
        if (step_id + 1) == len(self.loader):
            break
Is this OK?
for step_id, data in enumerate(loader):
    self.status['step_id'] = step_id
    self._compose_callback.on_step_begin(self.status)
    # forward
    outs = self.model(data)
    # update metrics
    for metric in self._metrics:
        metric.update(data, outs)
    sample_num += data['im_id'].numpy().shape[0]
    self._compose_callback.on_step_end(self.status)
    if (step_id + 1) == len(loader):
        break
Please advise whether I can use these changes as a workaround for this issue.
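The `break` pattern above works by never asking the loader for a batch beyond the last one. The same effect can be had without editing each loop body by wrapping the loader in `itertools.islice`, which stops after a fixed number of items without requesting another. The sketch below is only an illustration under that assumption: `bounded` and `StallingLoader` are hypothetical names, not part of PaddleDetection, and the stand-in loader raises an error where the real one would hang so that the demo terminates.

```python
from itertools import islice


def bounded(loader):
    """Yield exactly len(loader) items, never asking the loader for more.

    Hypothetical helper, not part of PaddleDetection.
    """
    return islice(loader, len(loader))


class StallingLoader:
    """Stand-in loader whose iterator fails if asked past its last batch."""

    def __init__(self, batches):
        self._batches = batches

    def __len__(self):
        return len(self._batches)

    def __iter__(self):
        for batch in self._batches:
            yield batch
        # Simulates the reported stall with an error so the demo terminates.
        raise RuntimeError("iterator was asked for a batch past the end")


if __name__ == "__main__":
    loader = StallingLoader(["batch0", "batch1"])
    for step_id, data in enumerate(bounded(loader)):
        print(step_id, data)
```

With this wrapper a loop would read `for step_id, data in enumerate(bounded(loader)):`. It has not been tested against the affected Windows setup, and it would only help if the stall really is in the request for the batch after the last one.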
Hi, @rgkannan676
Sorry for the late reply. It looks like there is no problem. If you want to check a batch of images, you can use --infer_dir. It works in my environment.
Btw, does this change work for inference and training on Windows?
Hi @zoooo0820 ,
Yes, this works in my Windows 10 environment for both inference and training.
Hi all, Thank you for your work.
I installed PaddlePaddle and PaddleDetection using the steps below:
-- python -m pip install paddlepaddle-gpu==2.2.2.post112 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
-- git clone https://github.com/PaddlePaddle/PaddleDetection.git
-- cd PaddleDetection
-- cython-bbox - pip install -e git+https://github.com/samson-wang/cython_bbox.git#egg=cython-bbox
-- pycocotools - pip install git+https://github.com/philferriere/cocoapi.git#subdirectory=PythonAPI
-- pip install -r requirements.txt
-- python setup.py install
python ppdet/modeling/tests/test_architectures.py
OK
(py_env) C:\Users\xjera\PaddleDetection>