gazebosim / gazebo-classic

Gazebo classic. For the latest version, see https://github.com/gazebosim/gz-sim
http://classic.gazebosim.org/
Other
1.17k stars 477 forks source link

PERFORMANCE_transport_stress test times out on gazebo8/default on xenial #2320

Open osrf-migration opened 7 years ago

osrf-migration commented 7 years ago

Original report (archived issue) by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


Until recently, I was building the gazebo7 branch every night on my trusty machine and running make test in a loop. I recently switched to running gazebo8 on xenial and have noticed that the PERFORMANCE_transport_stress test fails to complete with the gazebo8 / xenial combination.

Furthermore, I have noticed that this test also fails in jenkins with default / xenial. The current oldest build is from Feb 11, 2017, and the most recent build is from July 18, 2017, and the transport stress test failed in both cases.

The last time I ran gazebo7 / trusty was July 11, 2017, and the test passed in 8.667 seconds with the following JUnit result file:

<?xml version="1.0" encoding="UTF-8"?>
<testsuites tests="2" failures="0" disabled="0" errors="0" timestamp="2017-07-12T05:32:52" time="8.667" name="AllTests">
  <testsuite name="TransportStressTest" tests="2" failures="0" disabled="0" errors="0" time="8.667">
    <testcase name="LocalPublish" status="run" time="3.302" classname="TransportStressTest" />
    <testcase name="ManyNodes" status="run" time="5.364" classname="TransportStressTest" />
  </testsuite>
</testsuites>

I just ran the test manually without the make test invocation so that it isn't subject to the timeout, and it took over 30 minutes. I'll try to do some profiling to identify what is causing the slowdown.

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


On a second try, it took 26 minutes, about to try profiling.

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


Here's the console output from that 26 minute run:

[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from TransportStressTest
[ RUN      ] TransportStressTest.LocalPublish
[Msg] Waiting for master.
[Msg] Connected to gazebo master @ http://127.0.0.1:11345
[Msg] Publicized address: 172.23.2.90
[Dbg] [ServerFixture.cc:209] ServerFixture load in 1.1 seconds, timeout after 600 seconds
[Msg] Publish Complete
/data_fast/scpeters/ws/tmp/gazebo_build/src/gazebo/test/performance/transport_stress.cc:109: Failure
Expected: (diff.sec) < (g_localPublishMessageCount * 0.0008), actual: 1640 vs 800
[Msg] Time to publish 1000000 messages = 1640 518320016
[Dbg] [ServerFixture.cc:135] ServerFixture::Unload
AL lib: (WW) alcCloseDevice: Releasing context 0x7f115152d0d0
[  FAILED  ] TransportStressTest.LocalPublish (1641800 ms)
[ RUN      ] TransportStressTest.ManyNodes
[Msg] Waiting for master.
[Msg] Connected to gazebo master @ http://127.0.0.1:11345
[Msg] Publicized address: 172.23.2.90
[Dbg] [ServerFixture.cc:209] ServerFixture load in 1.2 seconds, timeout after 600 seconds
[Msg] Publish complete
/data_fast/scpeters/ws/tmp/gazebo_build/src/gazebo/test/performance/transport_stress.cc:224: Failure
Expected: (pubDiff.sec) <= (2), actual: 12 vs 2
[Msg] Time to publish 10000 = 12 312717664
[Msg] Time to receive 10000000 = 12 314618631
[Dbg] [ServerFixture.cc:135] ServerFixture::Unload
AL lib: (WW) alcCloseDevice: Releasing context 0x7f1150206140
[  FAILED  ] TransportStressTest.ManyNodes (15332 ms)
[----------] 2 tests from TransportStressTest (1657132 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (1657132 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] TransportStressTest.LocalPublish
[  FAILED  ] TransportStressTest.ManyNodes

The bulk of the time was spent in the LocalPublish test, and profiling showed that it spent more than 50% of its time allocating and freeing memory.

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


I just compiled gazebo7 from source on my xenial machine, and it has the same failure. I wonder if it is a difference between the protobuf versions?

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


Raising priority since this seems like a significant performance impact

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


Confirmed that gazebo7 in a trusty container doesn't have this issue. Weird!

osrf-migration commented 6 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


It could be the difference in protobuf version. I did some profiling, and it seems to be spending most of its time copying the image message in order to call Publish