ApolloAuto / apollo

An open autonomous driving platform
Apache License 2.0

Failure in point_pillars_test for DoInference #13123

Open PGLF-EAP opened 3 years ago

PGLF-EAP commented 3 years ago

System information

Linux Ubuntu 18.04, Apollo 6.0, GTX 1060 (3 GB)

Steps to reproduce the issue:

bazel test --config=gpu --test_size_filters=large //modules/perception/lidar/lib/detection/lidar_point_pillars:point_pillars_test

I am attempting to run the newer TensorRT version of the PointPillars algorithm and am getting CUDA out-of-memory errors with the current pfe and rpn files (Nov 12: #12974), which I tracked down to the RPN with some print statements.

WARNING: Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
ERROR: FAILED_ALLOCATION: std::exception
I1203 15:47:08.481616 15483 point_pillars.cc:471] []RPN_CONTEXT FAILED
E1203 15:47:08.481621 15483 point_pillars.cc:472] []Failed to create TensorRT Execution Context.

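For context, the "Failed to create TensorRT Execution Context" line corresponds to the execution-context creation step failing. Below is a minimal, hypothetical sketch of that kind of check (not the exact Apollo source; names are illustrative), using TensorRT's createExecutionContext(), which returns a null pointer when the engine cannot allocate its per-context GPU activation memory:

#include <iostream>
#include <NvInfer.h>

// Hypothetical helper, not the actual point_pillars.cc code.
// rpn_engine is assumed to be an already-deserialized nvinfer1::ICudaEngine*.
bool CreateRpnContext(nvinfer1::ICudaEngine* rpn_engine,
                      nvinfer1::IExecutionContext** rpn_context_out) {
  // createExecutionContext() allocates the context's GPU activation memory;
  // on a card with little free memory this is where "Cuda Error in allocate: 2
  // (out of memory)" shows up and a null pointer is returned.
  nvinfer1::IExecutionContext* context = rpn_engine->createExecutionContext();
  if (context == nullptr) {
    std::cerr << "Failed to create TensorRT Execution Context." << std::endl;
    return false;
  }
  *rpn_context_out = context;
  return true;
}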

When I substitute an older version of the RPN, I get a completed test run, but it fails by a large margin (Sept 17: #12561):

modules/perception/lidar/lib/detection/lidar_point_pillars/point_pillars_test.cc:654: Failure
Expected: (num_objects) >= (10), actual: 0 vs 10
[  FAILED  ] TestSuite.CheckDoInference (10329 ms)
[----------] 3 tests from TestSuite (33528 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (33528 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestSuite.CheckDoInference
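
For reference, the "Expected: (num_objects) >= (10), actual: 0 vs 10" line is a GoogleTest assertion failure. A simplified, hypothetical reconstruction of that check follows (in the real point_pillars_test.cc, num_objects comes from DoInference on a sample point cloud):

#include <gtest/gtest.h>

// Hypothetical, simplified stand-in for the real test body.
TEST(TestSuite, CheckDoInference) {
  int num_objects = 0;  // in the real test: the detection count returned by DoInference()
  // This assertion is what produces
  // "Expected: (num_objects) >= (10), actual: 0 vs 10"
  // when fewer than 10 objects are detected.
  EXPECT_GE(num_objects, 10);
}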

Is the current point_pillars_test.cc up to date and expected to pass, or is it deprecated? Also, are the current pfe and rpn files the correct ones to pass this test?

jeroldchen commented 3 years ago

@PGLF-EAP Your GTX 1060 does not have enough CUDA memory for this test. The test has been verified on a GTX 1080.

storypku commented 3 years ago

Hi @PGLF-EAP, you can try running the bazel test command with --config=opt to see whether point_pillars_test passes.

I have a GTX 1070 machine, and it fails when running

bazel test --config=gpu --test_size_filters=large //modules/perception/lidar/lib/detection/lidar_point_pillars:point_pillars_test

with errors similar to yours, and when I run

bazel test --config=opt --config=gpu --test_size_filters=large //modules/perception/lidar/lib/detection/lidar_point_pillars:point_pillars_test

the test passes.

PGLF-EAP commented 3 years ago

@storypku When you say the test passes, do you mean you get to the point where the test produces output, or that it actually passes with more than 10 objects detected? I still get "GPUassert: out of memory modules/perception/lidar/lib/detection/lidar_point_pillars/point_pillars_test.cc 437" when trying --config=opt with the current master pfe and rpn.
Though that could be because of what jeroldchen said about not having enough memory. How much memory does your GTX 1070 have?
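
For context, the "GPUassert: out of memory ... point_pillars_test.cc 437" message matches the common CUDA error-check pattern reporting a failed allocation. A minimal, hypothetical sketch of that pattern (not the exact test code):

#include <cstdio>
#include <cuda_runtime.h>

// Prints "GPUassert: <error string> <file> <line>" when a CUDA call fails,
// e.g. cudaMalloc returning cudaErrorMemoryAllocation (out of memory).
inline void GpuAssert(cudaError_t code, const char* file, int line) {
  if (code != cudaSuccess) {
    std::fprintf(stderr, "GPUassert: %s %s %d\n",
                 cudaGetErrorString(code), file, line);
  }
}

// Hypothetical macro wrapping each CUDA call with the check above.
#define GPU_CHECK(call) GpuAssert((call), __FILE__, __LINE__)

// Example: an allocation larger than the free GPU memory triggers the message.
// float* dev_buf = nullptr;
// GPU_CHECK(cudaMalloc(&dev_buf, huge_size_in_bytes));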

@jeroldchen When you say the test has been verified on a GTX 1080, do you mean it should be expected that running the current point_pillars_test.cc yields 10+ objects detected?

storypku commented 3 years ago

@storypku When you say the test passes, do you mean you get to the point where the test produces output, or that it actually passes with more than 10 objects detected?

Yep.

Though that could be because of what jeroldchen said about not having enough memory. How much memory does your GTX 1070 have?

So it seems that --config=opt doesn't help much in your case; 3 GB of GPU memory is not enough to run point_pillars_test.

FYI, the GTX 1070 also has 8 GB of memory, as can be seen from the following log on my host:

$ nvidia-smi 
Sat Dec  5 11:08:57 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0 Off |                  N/A |
| 27%   28C    P8     6W / 151W |    547MiB /  8114MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1020      G   /usr/lib/xorg/Xorg                101MiB |
|    0   N/A  N/A      1885      G   /usr/lib/xorg/Xorg                374MiB |
|    0   N/A  N/A      2039      G   /usr/bin/gnome-shell               15MiB |
|    0   N/A  N/A     12520      G   ...AAAAAAAA== --shared-files       40MiB |
+-----------------------------------------------------------------------------+

PGLF-EAP commented 3 years ago

@storypku Can you post the test output for the passed test with 10+ objects detected? It would be nice to have something to compare against when I can get my hands on a different graphics card.

storypku commented 3 years ago

@storypku Can you post the test output for the passed test with 10+ objects detected? It would be nice to have something to compare against when I can get my hands on a different graphics card.

@jeroldchen, could you please help @PGLF-EAP with your test output? This is your area of expertise.

jeroldchen commented 3 years ago

@PGLF-EAP Your test clearly shows a failure, given the "GPUassert: out of memory" message. That happens because your free GPU memory is too limited to run the DoInference function. If the test passed, it would simply report "PASSED", like the other tests you have seen pass, without any additional messages. You can run nvidia-smi to check your GPU memory usage and kill other processes that are occupying GPU memory, in order to free more memory for the PointPillars test. If that still doesn't work, I am sorry, but you may have to use a GPU with more memory.
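
If it helps, free GPU memory can also be checked programmatically. A small illustrative sketch using the CUDA runtime API (not part of Apollo, just an example):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  size_t free_bytes = 0;
  size_t total_bytes = 0;
  // cudaMemGetInfo reports the device's free and total memory in bytes.
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("GPU memory: %zu MiB free / %zu MiB total\n",
              free_bytes >> 20, total_bytes >> 20);
  return 0;
}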

PGLF-EAP commented 3 years ago

@jeroldchen So I was able to get access to a more capable GPU (GTX 2070, 8 GB) and could run the tests without any memory errors. However, I did not pass the DoInference test using either the current Apollo master branch or the Nov 12 Apollo master (commit 295e13e9681c2a1776a60c6206437c61e2a176c8), when the change to TensorRT took place.
I did pass the test using the Apollo 6.0.0 release, which uses Torch inference.

So my question is: should the test be expected to fail for the current TensorRT-based PointPillars?