[Open] PGLF-EAP opened this issue 3 years ago
@PGLF-EAP That is because your GPU, a GTX 1060, does not have enough CUDA memory for this test. The test has been verified on a GTX 1080.
Hi @PGLF-EAP, you can try running the bazel test command with --config=opt to see whether point_pillars_test passes at this point.
I have a GTX 1070 machine, and it fails when running
bazel test --config=gpu --test_size_filters=large //modules/perception/lidar/lib/detection/lidar_point_pillars:point_pillars_test
with errors similar to yours. When I run
bazel test --config=opt --config=gpu --test_size_filters=large //modules/perception/lidar/lib/detection/lidar_point_pillars:point_pillars_test
the test passes.
@storypku When you say the test passes, do you mean you get to the point where you get an output from it, or that it actually passes with more than 10 objects detected? I still get "GPUassert: out of memory modules/perception/lidar/lib/detection/lidar_point_pillars/point_pillars_test.cc 437" when trying --config=opt with the current master pfe and rpn.
Though that could be because of what jeroldchen said about not having enough memory. How much memory does your GTX 1070 have?
@jeroldchen When you say the test has been verified on a 1080, do you mean that running the current point_pillars_test.cc should be expected to yield 10+ detected objects?
@storypku When you say the test passes, do you mean you get to the point where you get an output from it, or that it actually passes with more than 10 objects detected?
Yep.
Though that could be because of what jeroldchen said about not having enough memory. How much memory does your GTX 1070 have?
So it seems that --config=opt doesn't help much in your case; 3 GB of GPU memory is not enough to run point_pillars_test.
FYI, the GTX 1070 also has 8 GB of memory, as can be seen from the following log on my host:
$ nvidia-smi
Sat Dec 5 11:08:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    On   | 00000000:01:00.0 Off |                  N/A |
| 27%   28C    P8     6W / 151W |    547MiB /  8114MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1020      G   /usr/lib/xorg/Xorg                101MiB |
|    0   N/A  N/A      1885      G   /usr/lib/xorg/Xorg                374MiB |
|    0   N/A  N/A      2039      G   /usr/bin/gnome-shell               15MiB |
|    0   N/A  N/A     12520      G   ...AAAAAAAA== --shared-files       40MiB |
+-----------------------------------------------------------------------------+
@storypku Can you post the test output for the passed test with 10+ objects detected? It would be nice to compare against once I can get my hands on a different graphics card.
@storypku Can you post the test output for the passed test with 10+ objects detected? It would be nice to compare against once I can get my hands on a different graphics card.
@jeroldchen Will you please help @PGLF-EAP with your test output? It's your area of expertise.
@PGLF-EAP Obviously, your test failed, with the message "GPUassert: out of memory". That is because your free GPU memory is too limited to run the DoInference function. If the test passed, it would simply report a "PASSED" status like the other passing tests you have seen, without any additional output. You can run nvidia-smi to check your GPU memory usage and kill other processes that are using GPU memory, in order to release more memory for testing PointPillars. If that still doesn't work, I am sorry, but you might have to use a GPU with a larger memory size.
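To make the memory check described above concrete, here is a minimal Python sketch that computes free GPU memory from nvidia-smi's CSV query output. The query flags are standard nvidia-smi options; the hard-coded sample figures are taken from the GTX 1070 log earlier in this thread, and the 3 GiB headroom threshold is only an assumption based on the GTX 1060 3 GB failures reported here, not a documented requirement:

```python
# Sketch: estimate free GPU memory from nvidia-smi's CSV output.
# In practice `raw` would come from running:
#   nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits
# Here it is hard-coded with the figures from the log above (8114 MiB total,
# 547 MiB used on the GTX 1070).

raw = "8114, 547"  # "total_mib, used_mib" for one GPU

total_mib, used_mib = (int(field) for field in raw.split(","))
free_mib = total_mib - used_mib

print(f"free GPU memory: {free_mib} MiB")

# Rough go/no-go check before running the PointPillars test; the 3 GiB
# threshold is an assumption inferred from this thread, not an official figure.
enough = free_mib > 3 * 1024
print("likely enough memory" if enough else "likely not enough memory")
```

Killing the Xorg and gnome-shell processes listed in the log (or running headless) is the usual way to reclaim most of the "used" figure on a desktop machine.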
@jeroldchen So I was able to get access to a GPU with more memory (an RTX 2070 with 8 GB) and ran the tests without any memory errors. However, I did not pass the DoInference test using either the current Apollo master branch or the Nov 12 Apollo master branch (295e13e9681c2a1776a60c6206437c61e2a176c8), when the change to TensorRT took place.
I did pass the test using the Apollo 6.0.0 release which uses the torch inference.
So my question is: should the test be expected to fail for the current TensorRT-based PointPillars?
System information:
OS: Linux Ubuntu 18.04
Apollo version: 6.0
GPU: GTX 1060, 3 GB
Steps to reproduce the issue:
bazel test --config=gpu --test_size_filters=large //modules/perception/lidar/lib/detection/lidar_point_pillars:point_pillars_test
I am attempting to run the newer TensorRT version of the PointPillars algorithm and am getting CUDA out-of-memory errors with the current pfe and rpn files (Nov 12: #12974), which I tracked down to the RPN with some print statements.
WARNING: Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
ERROR: FAILED_ALLOCATION: std::exception
I1203 15:47:08.481616 15483 point_pillars.cc:471] []RPN_CONTEXT FAILED
E1203 15:47:08.481621 15483 point_pillars.cc:472] []Failed to create TensorRT Execution Context.
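The immediate failure in the log above is TensorRT returning a null execution context when the allocation fails, which point_pillars.cc detects and logs before giving up. A minimal Python sketch of that guard pattern follows; the creation function and its memory figures are illustrative stand-ins, not the Apollo or TensorRT API:

```python
# Stand-in for TensorRT context creation: TensorRT returns a null context
# when it cannot allocate GPU memory, and point_pillars.cc logs
# "Failed to create TensorRT Execution Context." in that case.
def create_execution_context(free_mib, required_mib):
    """Illustrative only: returns None on simulated allocation failure."""
    if free_mib < required_mib:
        return None  # mirrors createExecutionContext() returning nullptr
    return object()  # placeholder for a real execution context

# Simulated 3 GB card with a hypothetical requirement larger than that.
rpn_context = create_execution_context(free_mib=3 * 1024, required_mib=4 * 1024)
if rpn_context is None:
    print("[]RPN_CONTEXT FAILED")
    print("[]Failed to create TensorRT Execution Context.")
```

The takeaway is that the "out of memory" error surfaces at context-creation time, before any inference runs, which is why the test never reaches the detection stage on the 3 GB card.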
When I substitute an older version of the rpn (Sept 17: #12561), the test completes but fails by a large margin:
modules/perception/lidar/lib/detection/lidar_point_pillars/point_pillars_test.cc:654: Failure
Expected: (num_objects) >= (10), actual: 0 vs 10
[  FAILED  ] TestSuite.CheckDoInference (10329 ms)
[----------] 3 tests from TestSuite (33528 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (33528 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestSuite.CheckDoInference
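For reference, the expectation failing at line 654 boils down to a simple threshold on the number of detected objects (the "Expected: (num_objects) >= (10)" message is gtest's standard output for a greater-or-equal assertion). A sketch of that check in plain Python, with the object count taken from the run above:

```python
# Equivalent of the failing gtest expectation: the test passes only if the
# detector reports at least 10 objects. Here num_objects = 0 reproduces the
# result from the run above with the older rpn file.
num_objects = 0
threshold = 10

if num_objects >= threshold:
    print("[       OK ] TestSuite.CheckDoInference")
else:
    print(f"Expected: (num_objects) >= ({threshold}), actual: {num_objects} vs {threshold}")
```

So the older rpn does not merely miss the threshold narrowly; it detects nothing at all, which suggests a model/weights mismatch rather than a borderline accuracy regression.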
Is the current point_pillars_test.cc up to date and expected to pass, or is it deprecated? Also, are the current pfe and rpn files the correct ones to pass this test?