PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

Windows 10 : Inference demo fails without an error #5254

Open rgkannan676 opened 2 years ago

rgkannan676 commented 2 years ago

Hi all, Thank you for your work.

OK

-  Tested the import

        import paddle
        paddle.utils.run_check()
        Running verify PaddlePaddle program ...
        PaddlePaddle works well on 1 GPU.
        PaddlePaddle works well on 1 GPUs.
        PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
        print(paddle.__version__)
        2.2.2

- Ran the inference demo check, but it fails without any error message. The weights were downloaded successfully. Please see the output below.

        (py_env) C:\Users\PaddleDetection>python tools/infer.py -c configs/ppyolo/ppyolo_r50vd_dcn_1x_coco.yml -o use_gpu=true weights=https://paddledet.bj.bcebos.com/models/ppyolo_r50vd_dcn_1x_coco.pdparams --infer_img=demo/000000014439.jpg
        C:\Users\anaconda3\envs\py_env\lib\site-packages\win32\lib\pywintypes.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
          import imp, sys, os
        W0224 15:53:00.975919 16344 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 11.2
        W0224 15:53:00.991541 16344 device_context.cc:465] device: 0, cuDNN Version: 8.2.
        [02/24 15:53:04] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users/.cache/paddle/weights\ppyolo_r50vd_dcn_1x_coco.pdparams

        (py_env) C:\Users\xjera\PaddleDetection>



Please advise what the issue could be. Thanks.
rgkannan676 commented 2 years ago

OS: Windows 10, GPU: 2080 Ti, CUDA: 11.2, cuDNN: 8.2.1

rgkannan676 commented 2 years ago

Hi,

It seems to be failing in the for loop at https://github.com/PaddlePaddle/PaddleDetection/blob/19478edce0d2b940fdb333159ee9e0f42a1d44bf/ppdet/engine/trainer.py#L520 after completing one iteration, without any error.

Execution reaches https://github.com/PaddlePaddle/PaddleDetection/blob/19478edce0d2b940fdb333159ee9e0f42a1d44bf/ppdet/engine/trainer.py#L530 without any issue, but it never reaches the next if condition at https://github.com/PaddlePaddle/PaddleDetection/blob/19478edce0d2b940fdb333159ee9e0f42a1d44bf/ppdet/engine/trainer.py#L532

I checked the len(loader) and it is 1. So ideally it should complete the for loop and move to next line. Not sure why it fails.
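
One way to narrow this down is to drive the iterator by hand so that every next() call is logged before and after it runs. The sketch below is generic Python, not PaddleDetection code: trace_iteration is a hypothetical helper, and a plain list stands in for the real loader. If the process exits right after the "requesting" line for the item that follows the last one, the failure is probably inside the loader's next() call rather than in the loop body.

```python
import sys


def trace_iteration(loader, log=sys.stderr):
    """Iterate over `loader` by hand, logging around every next() call.

    Hypothetical debugging helper: returns the number of items consumed.
    """
    iterator = iter(loader)
    step_id = 0
    while True:
        # Flush immediately so the message survives a hard process exit.
        print(f"requesting item {step_id}", file=log, flush=True)
        try:
            next(iterator)
        except StopIteration:
            print("loader exhausted normally", file=log, flush=True)
            return step_id
        print(f"received item {step_id}", file=log, flush=True)
        step_id += 1


consumed = trace_iteration(["batch-0"])  # a list stands in for the real loader
```

With a healthy loader the last log line is "loader exhausted normally"; a missing final line would point at the extra next() call.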

zoooo0820 commented 2 years ago

Hi, it seems your CUDA version is 11.6, but we haven't tested the PaddlePaddle .whl built for CUDA 11.2 running on 11.6. Could you please check whether the forward step executed correctly?

rgkannan676 commented 2 years ago

Hi @zoooo0820, Thank you for your reply. I am currently using Cuda Toolkit 11.2 and installed PaddlePaddle for "CUDA Toolkit 11.2 with cuDNN v8.2.1" .

The forward step executes successfully; see below for the value of the outs variable in 'outs = self.model(data)'.

{'bbox': Tensor(shape=[100, 6], dtype=float32, place=CPUPlace, stop_gradient=False,
       [[0.          , 0.95238709  , 163.01693726, 81.64559937 , 198.24432373,
         167.39353943],
        [0.          , 0.94864202  , 103.69308472, 45.28234863 , 126.47123718,
         92.34967041 ],
        [0.          , 0.93114990  , 410.77569580, 84.69867706 , 502.04968262,
         286.09646606],
        [0.          , 0.86811846  , 580.38507080, 113.47499084, 611.40753174,
         200.90867615],
        [56.         , 0.86211467  , 74.17538452 , 121.53815460, 102.48110962,
         153.12097168],
        [33.         , 0.85878396  , 158.66378784, 99.47405243 , 617.07495117,
         340.79278564],
        [0.          , 0.82214695  , 267.17294312, 84.55513763 , 292.15731812,
         167.21755981],
        [0.          , 0.78982729  , 348.59542847, 43.86344147 , 364.35952759,
         97.94625092 ],
        [0.          , 0.76746190  , 506.55392456, 115.50983429, 595.37500000,
         272.53643799],
        [0.          , 0.76479059  , 328.97283936, 38.41905975 , 345.94335938,
         79.57720184 ],
        [0.          , 0.66306233  , 169.35107422, 47.09164047 , 177.86236572,
         60.08278275 ],
        [0.          , 0.63529837  , 364.05261230, 57.54597092 , 380.31927490,
         108.11462402],
        [0.          , 0.63385540  , 27.25297356 , 117.72573853, 59.92661285 ,
         152.63989258],
        [0.          , 0.61775613  , 378.28048706, 39.14801407 , 394.20516968,
         83.73554993 ],
        [0.          , 0.52536190  , 186.37814331, 44.48184204 , 198.98437500,
         59.78129578 ],
        [56.         , 0.42762586  , 98.57673645 , 130.52426147, 115.63314819,
         154.52008057],
        [24.         , 0.34061965  , 1.91031837  , 150.37460327, 36.59494781 ,
         172.01226807],
        [24.         , 0.28557616  , 100.22119141, 153.69656372, 117.62846375,
         167.25775146],
        [0.          , 0.27708492  , 464.59777832, 15.31473064 , 470.84313965,
         32.88559341 ],
        [24.         , 0.27335307  , 65.50455475 , 135.86271667, 83.28864288 ,
         153.69624329],
        [0.          , 0.25723633  , 278.11318970, 79.93159485 , 297.24136353,
         167.95016479],
        [24.         , 0.24083048  , 55.61202240 , 152.50000000, 100.01781464,
         172.92410278],
        [0.          , 0.21805932  , 279.19711304, 80.64604187 , 295.67074585,
         107.86468506],
        [0.          , 0.19797659  , 504.30233765, 114.20022583, 553.80889893,
         177.36087036],
        [0.          , 0.18388060  , 265.62255859, 82.04562378 , 298.82159424,
         169.31526184],
        [24.         , 0.17146595  , 84.91261292 , 155.86296082, 102.34503174,
         171.35951233],
        [41.         , 0.16824859  , 508.83779907, 274.14990234, 519.37933350,
         282.82373047],
        [56.         , 0.15387134  , 27.25297356 , 117.72573853, 59.92661285 ,
         152.63989258],
        [2.          , 0.15260814  , 621.40032959, 5.16224861  , 637.05194092,
         9.21322823  ],
        [0.          , 0.12946725  , 239.86201477, 95.60108948 , 252.12431335,
         115.53451538],
        [0.          , 0.12095736  , 267.54342651, 88.84803772 , 295.95217896,
         168.18635559],
        [0.          , 0.11267159  , 265.10934448, 86.20228577 , 298.71273804,
         170.46595764],
        [56.         , 0.11178289  , 96.55578613 , 128.70202637, 118.62936401,
         162.42553711],
        [0.          , 0.10880505  , 582.94323730, 113.26557922, 611.49951172,
         175.53730774],
        [2.          , 0.10859525  , 0.          , 16.53050613 , 14.52909851 ,
         23.79916000 ],
        [33.         , 0.10563771  , 368.06707764, 165.56274414, 611.18322754,
         204.03463745],
        [0.          , 0.09884537  , 364.63833618, 57.85504532 , 380.69235229,
         114.52586365],
        [0.          , 0.09510860  , 57.85502625 , 137.07394409, 98.92730713 ,
         172.89193726],
        [56.         , 0.09373777  , 74.34246063 , 122.50324249, 104.22733307,
         155.38175964],
        [29.         , 0.09371884  , 348.40737915, 62.41240311 , 355.60354614,
         67.46076965 ],
        [33.         , 0.09282129  , 157.13171387, 87.86817932 , 619.11584473,
         326.56201172],
        [0.          , 0.09127463  , 505.52392578, 115.86607361, 562.31152344,
         271.79815674],
        [56.         , 0.08903752  , 20.60574722 , 135.21688843, 37.54605865 ,
         150.46453857],
        [24.         , 0.08678816  , 4.23665810  , 135.34336853, 36.20613861 ,
         152.94285583],
        [0.          , 0.08356526  , 379.86148071, 40.56462860 , 393.62051392,
         81.53936005 ],
        [26.         , 0.08326706  , 84.91261292 , 155.86296082, 102.34503174,
         171.35951233],
        [0.          , 0.08207747  , 582.83227539, 111.97151184, 610.94653320,
         201.02879333],
        [0.          , 0.08070017  , 410.27426147, 18.80368042 , 416.80923462,
         32.93857193 ],
        [33.         , 0.08046593  , 348.40737915, 62.41240311 , 355.60354614,
         67.46076965 ],
        [2.          , 0.07974923  , 491.34378052, 0.67531943  , 506.34744263,
         7.13704491  ],
        [0.          , 0.07541234  , 155.00167847, 121.06098938, 175.50897217,
         162.40031433],
        [45.         , 0.06950981  , 508.83779907, 274.14990234, 519.37933350,
         282.82373047],
        [0.          , 0.06857380  , 279.91683960, 80.59072113 , 296.86422729,
         123.46126556],
        [0.          , 0.06644288  , 411.89114380, 88.08791351 , 496.98086548,
         298.68270874],
        [24.         , 0.06579210  , 57.38172913 , 152.87834167, 91.19117737 ,
         172.34333801],
        [13.         , 0.06562617  , 541.50823975, 16.37550926 , 567.86968994,
         29.11607170 ],
        [0.          , 0.06278227  , 417.29626465, 87.95059967 , 483.93341064,
         285.20736694],
        [0.          , 0.06231792  , 187.15985107, 44.91228104 , 197.96987915,
         59.38371658 ],
        [33.         , 0.06120147  , 159.93621826, 87.80218506 , 620.00366211,
         281.19601440],
        [0.          , 0.06084029  , 166.17408752, 82.07063293 , 195.87413025,
         164.11593628],
        [2.          , 0.05905440  , 605.65051270, 5.36504650  , 619.39282227,
         9.07478714  ],
        [0.          , 0.05722712  , 0.          , 126.04899597, 8.50353718  ,
         147.27384949],
        [0.          , 0.05692595  , 508.29989624, 118.50185394, 583.18402100,
         270.97988892],
        [13.         , 0.05576024  , 497.66845703, 25.97649765 , 518.17468262,
         30.66893387 ],
        [0.          , 0.05560281  , 279.95925903, 79.89441681 , 295.87161255,
         114.19608307],
        [0.          , 0.05517075  , 280.04949951, 79.77035522 , 296.54510498,
         145.41760254],
        [24.         , 0.05394541  , 20.91853714 , 135.88543701, 37.92250061 ,
         152.18301392],
        [2.          , 0.05263206  , 424.52667236, 0.          , 439.58990479,
         4.05065966  ],
        [0.          , 0.05238270  , 352.33425903, 44.02976990 , 365.10067749,
         79.64344025 ],
        [0.          , 0.05208762  , 506.73104858, 107.90383911, 598.93139648,
         271.44659424],
        [0.          , 0.05184478  , 379.86734009, 39.31773758 , 395.26382446,
         83.82485962 ],
        [56.         , 0.05011687  , 156.88832092, 123.81497192, 171.12413025,
         157.19927979],
        [0.          , 0.05004956  , 505.50582886, 110.95537567, 559.44219971,
         268.08361816],
        [24.         , 0.04983799  , 2.79379940  , 151.22576904, 25.45175552 ,
         171.65582275],
        [26.         , 0.04907774  , 1.91031837  , 150.37460327, 36.59494781 ,
         172.01226807],
        [0.          , 0.04811454  , 349.57623291, 51.46279907 , 367.45806885,
         99.80699158 ],
        [0.          , 0.04789471  , 327.77239990, 38.27670288 , 345.53192139,
         79.52001953 ],
        [32.         , 0.04785154  , 325.94613647, 168.70161438, 331.42800903,
         172.15748596],
        [0.          , 0.04734908  , 352.08676147, 50.12598419 , 365.45071411,
         92.97180939 ],
        [24.         , 0.04690786  , 81.91660309 , 152.35939026, 102.24404144,
         171.32673645],
        [56.         , 0.04667836  , 37.21086121 , 131.02789307, 57.78130341 ,
         152.23123169],
        [24.         , 0.04491711  , 58.02764511 , 153.03874207, 80.26491547 ,
         172.24598694],
        [0.          , 0.04453458  , 478.98855591, 23.08139801 , 484.91421509,
         32.63216782 ],
        [0.          , 0.04230860  , 422.13385010, 88.64218903 , 492.15478516,
         282.76568604],
        [26.         , 0.04149212  , 101.58711243, 153.90850830, 118.32226562,
         167.19729614],
        [0.          , 0.04106046  , 265.23681641, 87.43470764 , 285.05200195,
         137.10748291],
        [0.          , 0.03906933  , 226.17948914, 83.91767883 , 251.87916565,
         117.66175842],
        [67.         , 0.03879960  , 504.02099609, 144.37136841, 508.29077148,
         148.25106812],
        [0.          , 0.03787024  , 195.74223328, 124.55506134, 211.21348572,
         156.26644897],
        [56.         , 0.03746197  , 1.59692574  , 130.36569214, 37.51303864 ,
         156.09533691],
        [33.         , 0.03709068  , 361.32479858, 95.98898315 , 506.31704712,
         272.42892456],
        [0.          , 0.03638148  , 398.39410400, 25.05980873 , 404.40588379,
         32.92620087 ],
        [0.          , 0.03631815  , 279.50268555, 81.48350525 , 287.15435791,
         93.49650574 ],
        [56.         , 0.03562423  , 38.01062393 , 137.32843018, 57.58405685 ,
         151.95956421],
        [56.         , 0.03515101  , 76.47907257 , 122.70375824, 97.49521637 ,
         150.59005737],
        [13.         , 0.03484610  , 305.01419067, 26.07754517 , 326.15213013,
         30.91025925 ],
        [56.         , 0.03460176  , 98.63943481 , 131.12774658, 116.00555420,
         157.22882080],
        [24.         , 0.03453739  , 21.94862366 , 152.27066040, 37.23377609 ,
         170.60235596],
        [0.          , 0.03412202  , 328.51235962, 38.36921692 , 344.26052856,
         76.72988892 ],
        [33.         , 0.03346637  , 415.97900391, 87.17357635 , 543.17413330,
         281.36849976]]), 'bbox_num': Tensor(shape=[1], dtype=int32, place=CPUPlace, stop_gradient=False,
       [100])}
zoooo0820 commented 2 years ago

@rgkannan676 I just checked this in a Windows environment similar to yours, but failed to reproduce the error. It looks similar to a known bug on Windows, #33341 and #3558. Maybe some error happened in the dataloader but was not caught by pybind. Sorry for this problem. We hope to locate and fix this bug as soon as possible.

rgkannan676 commented 2 years ago

Hi @zoooo0820 . Thank you for your reply.

I was able to complete the inference by breaking out of the loop as shown below.

        for step_id, data in enumerate(loader):
            self.status['step_id'] = step_id
            # forward
            outs = self.model(data)

            for key in ['im_shape', 'scale_factor', 'im_id']:
                outs[key] = data[key]
            for key, value in outs.items():
                if hasattr(value, 'numpy'):
                    outs[key] = value.numpy()
            results.append(outs)

            if (step_id + 1) ==  len(loader):
                break

Can you please confirm whether this is OK? I guess 'len(loader)' should be equal to the number of images, or the number of iterations the loop should execute. Am I right? Will this work for a batch of images?
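
On the batch question: for a batched loader, len(loader) is normally the number of batches rather than the number of images, so the break fires after the last batch either way. A minimal sketch with a hypothetical ToyLoader class (not the Paddle DataLoader), assuming the last partial batch is kept:

```python
import math


class ToyLoader:
    """Hypothetical stand-in for a batched data loader."""

    def __init__(self, items, batch_size):
        self.items = list(items)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches, keeping the last partial batch.
        return math.ceil(len(self.items) / self.batch_size)

    def __iter__(self):
        for start in range(0, len(self.items), self.batch_size):
            yield self.items[start:start + self.batch_size]


loader = ToyLoader(range(5), batch_size=2)
results = []
for step_id, data in enumerate(loader):
    results.append(data)
    if (step_id + 1) == len(loader):
        break  # leave after the final batch without asking for another one

print(len(loader), results)
```

Here five items with a batch size of two give three batches, and the loop exits right after the third.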

rgkannan676 commented 2 years ago

I made the following changes to the for loop at https://github.com/PaddlePaddle/PaddleDetection/blob/19478edce0d2b940fdb333159ee9e0f42a1d44bf/ppdet/engine/trainer.py#L374

1) Reload self.loader at the start of each epoch. Otherwise it was failing on the 2nd epoch at "for step_id, data in enumerate(self.loader):".
2) Added the break statement as before.

        for epoch_id in range(self.start_epoch, self.cfg.epoch):
            self.status['mode'] = 'train'
            self.status['epoch_id'] = epoch_id
            self._compose_callback.on_epoch_begin(self.status)

            self.loader = create('{}Reader'.format(self.mode.capitalize()))(
                self.dataset, self.cfg.worker_num)
            self.loader.dataset.set_epoch(epoch_id)

            model.train()
            iter_tic = time.time()

            for step_id, data in enumerate(self.loader):
                self.status['data_time'].update(time.time() - iter_tic)
                self.status['step_id'] = step_id
                profiler.add_profiler_step(profiler_options)
                self._compose_callback.on_step_begin(self.status)
                data['epoch_id'] = epoch_id

                if self.cfg.get('fp16', False):
                    with amp.auto_cast(enable=self.cfg.use_gpu):
                        # model forward
                        outputs = model(data)
                        loss = outputs['loss']

                    # model backward
                    scaled_loss = scaler.scale(loss)
                    scaled_loss.backward()
                    # in dygraph mode, optimizer.minimize is equal to optimizer.step
                    scaler.minimize(self.optimizer, scaled_loss)
                else:
                    # model forward
                    outputs = model(data)
                    loss = outputs['loss']
                    # model backward
                    loss.backward()
                    self.optimizer.step()
                curr_lr = self.optimizer.get_lr()
                self.lr.step()
                if self.cfg.get('unstructured_prune'):
                    self.pruner.step()
                self.optimizer.clear_grad()
                self.status['learning_rate'] = curr_lr

                if self._nranks < 2 or self._local_rank == 0:
                    self.status['training_staus'].update(outputs)

                self.status['batch_time'].update(time.time() - iter_tic)
                self._compose_callback.on_step_end(self.status)
                if self.use_ema:
                    self.ema.update(self.model)
                iter_tic = time.time()

                if (step_id + 1) == len(self.loader):
                    break

Is this OK?

        for step_id, data in enumerate(loader):
            self.status['step_id'] = step_id
            self._compose_callback.on_step_begin(self.status)
            # forward
            outs = self.model(data)

            # update metrics
            for metric in self._metrics:
                metric.update(data, outs)

            sample_num += data['im_id'].numpy().shape[0]
            self._compose_callback.on_step_end(self.status)

            if (step_id + 1) ==  len(loader):
                break

Please advise whether I can use these changes as a workaround for this issue.
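
The same early exit can also be written once with itertools.islice, which stops after a fixed number of items and does not request another one from the underlying iterator. This is only a sketch of an alternative and has not been tested against this bug; bounded is a hypothetical helper, and a list stands in for the real loader.

```python
from itertools import islice


def bounded(loader):
    """Yield exactly len(loader) items, then stop without a further next() call."""
    return islice(loader, len(loader))


loader = ["batch-0", "batch-1"]  # stand-in for the real loader
seen = [(step_id, data) for step_id, data in enumerate(bounded(loader))]
```

Each loop would then read "for step_id, data in enumerate(bounded(loader)):" and would not need its own break.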

zoooo0820 commented 2 years ago

Hi @rgkannan676, sorry for the late reply. It looks like there is no problem with this. If you want to check a batch of images, you can use --infer_dir. It works in my environment. By the way, does this change work for both inference and training on Windows?

rgkannan676 commented 2 years ago

Hi @zoooo0820 ,

Yes, this works in my Windows 10 environment for both inference and training.