HorizonRobotics / alf

Agent Learning Framework https://alf.readthedocs.io
Apache License 2.0

Speeding up the loading of SocialRobot environment #83

Open emailweixu opened 5 years ago

emailweixu commented 5 years ago

When using 30 or 60 parallel environments, loading the environments takes quite a long time. It seems that the environments are loaded sequentially. We might be able to speed this up by loading the environments in parallel.

witwolf commented 5 years ago

It's configurable: ParallelPyEnvironment.start_serially=False
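
To illustrate what this option changes, here is a minimal sketch, not the tf_agents implementation. FakeEnvWorker and start_all are hypothetical names, threads stand in for the real worker processes, and the sleep simulates a slow environment load. With start_serially=True each worker must report ready before the next one is launched; with start_serially=False all workers are launched first and waited on afterwards.

```python
import threading
import time


class FakeEnvWorker:
    """Hypothetical stand-in for one environment worker (a thread, not a process)."""

    def __init__(self, load_seconds):
        self._load_seconds = load_seconds
        self._ready = threading.Event()
        self._thread = threading.Thread(target=self._load)

    def _load(self):
        time.sleep(self._load_seconds)  # simulated slow environment load
        self._ready.set()

    def start(self, wait_to_start):
        self._thread.start()
        if wait_to_start:
            self.wait_start()

    def wait_start(self):
        self._ready.wait()
        self._thread.join()


def start_all(num_envs, load_seconds, start_serially):
    """Start all workers and return the elapsed wall-clock time in seconds."""
    workers = [FakeEnvWorker(load_seconds) for _ in range(num_envs)]
    begin = time.monotonic()
    for worker in workers:
        # Serial mode blocks here until this worker is ready.
        worker.start(wait_to_start=start_serially)
    if not start_serially:
        # Parallel mode waits only after every worker has been launched.
        for worker in workers:
            worker.wait_start()
    return time.monotonic() - begin


if __name__ == "__main__":
    print("serial:   %.2fs" % start_all(5, 0.1, start_serially=True))
    print("parallel: %.2fs" % start_all(5, 0.1, start_serially=False))
```

In this toy setup the serial time grows with the number of workers while the parallel time stays close to a single load. Real environments may not scale this well if they contend for shared resources during initialization.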

emailweixu commented 5 years ago

It's good to have this option. Unfortunately, it doesn't seem to make SimpleNavigation load faster.

witwolf commented 5 years ago

It actually speeds up the process, but it still has a problem with social_bot: the spawned gazebo processes seem to share some resources and can interfere with each other at the initialization stage.

I tried the test below:

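# Import paths are inferred from the file paths in the traceback below.
import time

import gin

from alf.environments import suite_socialbot
from alf.environments.utils import create_environment

if __name__ == '__main__':

    # Request 30 environments and start them without waiting on each one.
    gin.parse_config([
        'create_environment.num_parallel_environments=30',
        'ParallelPyEnvironment.start_serially=False'
    ])
    env = create_environment(
        env_name="SocialBot-Pr2Gripper-v0",
        env_load_fn=suite_socialbot.load)

    print('all env created')

    # Keep the parent process alive so the workers can be inspected.
    while True:
        time.sleep(5)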

and got this error, which shows that the error occurs at social_bot::Initialize:

Traceback (most recent call last):
  File "test.py", line 30, in <module>
    env_load_fn=suite_socialbot.load)
  File "/usr/local/lib/python3.5/dist-packages/gin/config.py", line 1032, in wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.5/dist-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.5/dist-packages/gin/config.py", line 1009, in wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/hongyingxiang/FLA/alf/environments/utils.py", line 44, in create_environment
    [lambda: env_load_fn(env_name)] * num_parallel_environments)
  File "/usr/local/lib/python3.5/dist-packages/gin/config.py", line 1032, in wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.5/dist-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.5/dist-packages/gin/config.py", line 1009, in wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/hongyingxiang/FLA/tf_agents/tf_agents/environments/parallel_py_environment.py", line 70, in __init__
    self.start()
  File "/home/hongyingxiang/FLA/tf_agents/tf_agents/environments/parallel_py_environment.py", line 87, in start
    env.wait_start()
  File "/home/hongyingxiang/FLA/tf_agents/tf_agents/environments/parallel_py_environment.py", line 224, in wait_start
    assert result is self._READY, result
AssertionError: (5, 'Traceback (most recent call last):\n  File "/home/hongyingxiang/FLA/tf_agents/tf_agents/environments/parallel_py_environment.py", line 346, in _worker\n    env = env_constructor()\n  File "/home/hongyingxiang/FLA/alf/environments/utils.py", line 44, in <lambda>\n    [lambda: env_load_fn(env_name)] * num_parallel_environments)\n  File "/home/hongyingxiang/FLA/alf/environments/utils.py", line 42, in <lambda>\n    env_name, wrap_with_process=False)\n  File "/usr/local/lib/python3.5/dist-packages/gin/config.py", line 1032, in wrapper\n    utils.augment_exception_message_and_reraise(e, err_str)\n  File "/usr/local/lib/python3.5/dist-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise\n    six.raise_from(proxy.with_traceback(exception.__traceback__), None)\n  File "<string>", line 3, in raise_from\n  File "/usr/local/lib/python3.5/dist-packages/gin/config.py", line 1009, in wrapper\n    return fn(*new_args, **new_kwargs)\n  File "/home/hongyingxiang/FLA/alf/environments/suite_socialbot.py", line 120, in load\n    py_env.reset()\n  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__\n    self.gen.throw(type, value, traceback)\n  File "/home/hongyingxiang/FLA/alf/environments/suite_socialbot.py", line 148, in _get_unused_port\n    yield unused_port\n  File "/home/hongyingxiang/FLA/alf/environments/suite_socialbot.py", line 119, in load\n    py_env = env_ctor(port)\n  File "/home/hongyingxiang/FLA/alf/environments/suite_socialbot.py", line 103, in env_ctor\n    gym_env = gym_spec.make(port=port)\n  File "/usr/local/lib/python3.5/dist-packages/gym/envs/registration.py", line 87, in make\n    env = cls(**_kwargs)\n  File "/opt/workspace/yingxiang.hong/SocialRobot/python/social_bot/envs/pr2.py", line 88, in __init__\n    super(Pr2Gripper, self).__init__(port=port)\n  File "/opt/workspace/yingxiang.hong/SocialRobot/python/social_bot/envs/gazebo_base.py", line 31, in __init__\n    gazebo.initialize(port=port)\nRuntimeError: 
Caught an unknown exception!\n  In call to configurable \'load\' (<function load at 0x7f3e72b6f2f0>)\n')
  In call to configurable 'ParallelPyEnvironment' (<function ParallelPyEnvironment.__init__ at 0x7f3e72ce20d0>)
  In call to configurable 'create_environment' (<function create_environment at 0x7f3f0dc1fd90>)

But when I ran the test below with parallel_num=100, it worked:

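#include <unistd.h>

#include <cstdlib>
#include <string>
#include <vector>

// social_bot's initialization routine; its definition appears in a later comment.
void Initialize(const std::vector<std::string> &args, int port);

int main(int argc, char **argv) {
  int parallel_num = atoi(argv[1]);
  int port_start = atoi(argv[2]);

  int pid;

  // Fork one child per environment; each child initializes gazebo on its own port.
  for (int i = 0; i < parallel_num; i++) {
    pid = fork();
    if (pid == 0) {
      std::vector<std::string> args;
      Initialize(args, port_start + i);
      break;  // the child must not fork further children
    }
  }

  // Parent and children all stay alive so they can be inspected.
  while (true) {
    sleep(1000);
  }
}

witwolf commented 5 years ago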

There was a bug in suite_socialbot._get_unused_port; it is now fixed.
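
The thread does not say what the bug was, so the following is not the actual fix. It is only a sketch of one way a port-picking context manager can be made safe when many workers start at once, assuming the workers share a lock file. The names get_unused_port, _is_free and _LOCK_PATH are hypothetical, and fcntl makes the sketch POSIX-only. The idea is to hold an inter-process lock from the moment a free port is found until the caller has finished starting its server on it, so that two workers cannot pick the same port.

```python
import contextlib
import fcntl
import os
import socket
import tempfile

# Hypothetical lock-file location; any path shared by all workers would do.
_LOCK_PATH = os.path.join(tempfile.gettempdir(), "unused_port_sketch.lock")


def _is_free(port):
    """Return True if a TCP socket can currently bind to `port` on 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind(("127.0.0.1", port))
        except OSError:
            return False
    return True


@contextlib.contextmanager
def get_unused_port(start, end=65536):
    """Yield a free port in [start, end) while holding an inter-process lock.

    The lock is released only when the `with` body exits, so the caller
    should start whatever binds the port inside the `with` block.
    """
    with open(_LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            for port in range(start, end):
                if _is_free(port):
                    yield port
                    return
            raise RuntimeError("no unused port in [%d, %d)" % (start, end))
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    with get_unused_port(11345) as port:
        print("would start the simulator on port", port)
```

Because the lock is exclusive, this serializes the part of startup that runs inside the `with` block, so only the port selection and server bind should go there. The `with` blocks must not be nested within one process, since the second flock call would wait on the first.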

witwolf commented 4 years ago

SocialRobot cannot be loaded in parallel at present, due to an unknown problem with loading worlds in parallel in gazebo.

When I ran the code below as: test 60 11345

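  #include <unistd.h>

  #include <cstdlib>
  #include <mutex>
  #include <string>
  #include <vector>

  #include <gazebo/gazebo.hh>
  #include <gazebo/common/Console.hh>
  #include <gazebo/physics/physics.hh>
  #include <gazebo/util/LogRecord.hh>

  // String literals are const in C++, so the pointer must be const as well.
  const char *world_file = "/home/hongyingxiang/SocialRobot/python/social_bot/worlds/pr2.world";

  void Initialize(const std::vector<std::string> &args, int port = 0) {
    static std::once_flag flag;  // not used in this test
    if (port != 0) {
      // Give each process its own gazebo master address.
      std::string uri = "localhost:" + std::to_string(port);
      setenv("GAZEBO_MASTER_URI", uri.c_str(), 1);
    }

    gazebo::common::Console::SetQuiet(false);
    gazebo::setupServer(args);
    gazebo::util::LogRecordParams params;
    params.period = 1e300;  // In fact, we don't need to do logging.
    gazebo::util::LogRecord::Instance()->Start(params);
  }

  int main(int argc, char **argv) {
    int parallel_num = atoi(argv[1]);
    int port_start = atoi(argv[2]);

    int pid;

    // Fork one child per environment; each child initializes gazebo and loads the world.
    for (int i = 0; i < parallel_num; i++) {
      pid = fork();
      if (pid == 0) {
        std::vector<std::string> args;
        int port = port_start + i;
        Initialize(args, port);
        gazebo::physics::WorldPtr world = gazebo::loadWorld(world_file);
        break;  // the child must not fork further children
      }
    }

    // Parent and children all stay alive so they can be inspected.
    while (true) {
      sleep(1000);
    }
  }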

I found that some subprocesses were blocked (gazebo initialized, but world loading unfinished). I got the PIDs of these subprocesses as follows: first find the port with find ~/.gazebo/ -type f|xargs ls -l (processes with small log files are likely to be the blocked ones), then run netstat -nap |grep ${port} to get the PID. Debugging with gdb then gives the trace below:
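
The first step of that procedure, spotting the smallest log files, can be sketched as follows. smallest_files is a hypothetical helper, not part of gazebo or SocialRobot, and it only ranks files by size. Mapping a log file to a port and then to a PID still has to be done by hand with netstat as described above.

```python
import os


def smallest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, smallest first."""
    sized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sized.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    sized.sort()
    return sized[:limit]


if __name__ == "__main__":
    # ~/.gazebo is the log directory mentioned in the comment above.
    for size, path in smallest_files(os.path.expanduser("~/.gazebo")):
        print(size, path)
```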

    #0  0x00007f91d232a26d in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
    #1  0x00007f91d2323dbd in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
    #2  0x00007f91cc4fbc36 in ignition::transport::Node::PublisherPrivate::~PublisherPrivate() ()
       from /usr/lib/x86_64-linux-gnu/libignition-transport4.so.4
    #3  0x00007f91d20b0dbc in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x6f097a0)
        at /usr/include/c++/5/bits/shared_ptr_base.h:150
    #4  0x00007f91d1d37a78 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::operator= (this=0x6ed17c8, __r=...)
        at /usr/include/c++/5/bits/shared_ptr_base.h:678
    #5  0x00007f91d1d378d9 in std::__shared_ptr<ignition::transport::Node::PublisherPrivate, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x6ed17c0)
        at /usr/include/c++/5/bits/shared_ptr_base.h:867
    #6  0x00007f91d1d37903 in std::shared_ptr<ignition::transport::Node::PublisherPrivate>::operator= (this=0x6ed17c0)
        at /usr/include/c++/5/bits/shared_ptr.h:93
    #7  0x00007f91d1d37935 in ignition::transport::Node::Publisher::operator= (this=0x6ed17b8)
        at /usr/include/ignition/transport4/ignition/transport/Node.hh:84
    #8  0x00007f91d1d33a1c in gazebo::sensors::CameraSensor::Load (this=0x6ed1620, _worldName=...)
        at /home/hongyingxiang/gazebo/gazebo/sensors/CameraSensor.cc:103
    #9  0x00007f91d1d9038f in gazebo::sensors::Sensor::Load (this=0x6ed1620, _worldName=..., _sdf=...)
        at /home/hongyingxiang/gazebo/gazebo/sensors/Sensor.cc:84
    #10 0x00007f91d1d332e8 in gazebo::sensors::CameraSensor::Load (this=0x6ed1620, _worldName=..., _sdf=...)
        at /home/hongyingxiang/gazebo/gazebo/sensors/CameraSensor.cc:71
    #11 0x00007f91d1d9fe52 in gazebo::sensors::SensorManager::CreateSensor (this=0x619140 <SingletonT<gazebo::sensors::SensorManager>::GetInstance()::t>,
        _elem=..., _worldName=..., _parentName=..., _parentId=147) at /home/hongyingxiang/gazebo/gazebo/sensors/SensorManager.cc:304
    #12 0x00007f91d1d9fc25 in gazebo::sensors::SensorManager::OnCreateSensor (
        this=0x619140 <SingletonT<gazebo::sensors::SensorManager>::GetInstance()::t>, _elem=..., _worldName=..., _parentName=..., _parentId=147)
        at /home/hongyingxiang/gazebo/gazebo/sensors/SensorManager.cc:282
    #13 0x00007f91d1db1a20 in std::_Mem_fn_base<void (gazebo::sensors::SensorManager::*)(std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int), true>::operator()<std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, void>(gazebo::sensors::SensorManager*, std::shared_ptr<sdf::Element>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int&&) const (this=0x20974d0,
        __object=0x619140 <SingletonT<gazebo::sensors::SensorManager>::GetInstance()::t>) at /usr/include/c++/5/functional:600
    #14 0x00007f91d1dafec9 in std::_Bind<std::_Mem_fn<void (gazebo::sensors::SensorManager::*)(std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int)> (gazebo::sensors::SensorManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>, std::_Placeholder<4>)>::__call<void, std::shared_ptr<sdf::Element>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int&&, 0ul, 1ul, 2ul, 3ul, 4ul>(std::tuple<std::shared_ptr<sdf::Element>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int&&>&&, std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) (this=0x20974d0,
        __args=<unknown type in /home/hongyingxiang/usr/lib/libgazebo_sensors.so.11, CU 0x5a356c, DIE 0x607d7e>) at /usr/include/c++/5/functional:1074
    #15 0x00007f91d1dada34 in std::_Bind<std::_Mem_fn<void (gazebo::sensors::SensorManager::*)(std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int)> (gazebo::sensors::SensorManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>, std::_Placeholder<4>)>::operator()<std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, void>(std::shared_ptr<sdf::Element>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int&&) (
        this=0x20974d0) at /usr/include/c++/5/functional:1133
    #16 0x00007f91d1dab54f in std::_Function_handler<void (std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int), std::_Bind<std::_Mem_fn<void (gazebo::sensors::SensorManager::*)(std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int)> (gazebo::sensors::SensorManager*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>, std::_Placeholder<4>)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<sdf::Element>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int&&) (__functor=...,
        __args#0=<unknown type in /home/hongyingxiang/usr/lib/libgazebo_sensors.so.11, CU 0x5a356c, DIE 0x60011b>, __args#1=..., __args#2=...,
        __args#3=<unknown type in /home/hongyingxiang/usr/lib/libgazebo_sensors.so.11, CU 0x5a356c, DIE 0x60012a>) at /usr/include/c++/5/functional:1871
    #17 0x00007f91cef009e8 in std::function<void (std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int)>::operator()(std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int) const (this=0x2083678, __args#0=..., __args#1=..., __args#2=..., __args#3=147)
        at /usr/include/c++/5/functional:2267
    #18 0x00007f91ceeff955 in gazebo::event::EventT<void (std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int)>::Signal<std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int>(std::shared_ptr<sdf::Element> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int const&) (
        this=0x7f91d1629d00 <gazebo::event::Events::createSensor[abi:cxx11]>, _p1=..., _p2=..., _p3=..., _p4=@0x7ffc7399fa80: 147)
        at /home/hongyingxiang/gazebo/gazebo/common/Event.hh:344
    #19 0x00007f91ceefef79 in gazebo::event::EventT<void (std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int)>::operator()<std::shared_ptr<sdf::Element>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int>(std::shared_ptr<sdf::Element> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int const&) (
        this=0x7f91d1629d00 <gazebo::event::Events::createSensor[abi:cxx11]>, _p1=..., _p2=..., _p3=..., _p4=@0x7ffc7399fa80: 147)
        at /home/hongyingxiang/gazebo/gazebo/common/Event.hh:159
    #20 0x00007f91cef29195 in gazebo::physics::Link::Load (this=0x6e7ca40, _sdf=...) at /home/hongyingxiang/gazebo/gazebo/physics/Link.cc:197
    #21 0x00007f91ced90041 in gazebo::physics::ODELink::Load (this=0x6e7ca40, _sdf=...) at /home/hongyingxiang/gazebo/gazebo/physics/ode/ODELink.cc:58
    #22 0x00007f91cef5adb8 in gazebo::physics::Model::LoadLinks (this=0x335d550) at /home/hongyingxiang/gazebo/gazebo/physics/Model.cc:176
    #23 0x00007f91cef5a5bb in gazebo::physics::Model::Load (this=0x335d550, _sdf=...) at /home/hongyingxiang/gazebo/gazebo/physics/Model.cc:103
    #24 0x00007f91cefd18fb in gazebo::physics::World::LoadModel (this=0x2485040, _sdf=..., _parent=...)
        at /home/hongyingxiang/gazebo/gazebo/physics/World.cc:1083
    #25 0x00007f91cefd2a29 in gazebo::physics::World::LoadEntities (this=0x2485040, _sdf=..., _parent=...)
        at /home/hongyingxiang/gazebo/gazebo/physics/World.cc:1187
    #26 0x00007f91cefcc297 in gazebo::physics::World::Load (this=0x2485040, _sdf=...) at /home/hongyingxiang/gazebo/gazebo/physics/World.cc:329
    #27 0x00007f91cef880a6 in gazebo::physics::load_world (_world=..., _sdf=...) at /home/hongyingxiang/gazebo/gazebo/physics/PhysicsIface.cc:143
    #28 0x00007f91d20e056f in gazebo::loadWorld (_worldFile=...) at /home/hongyingxiang/gazebo/gazebo/gazebo.cc:183
    #29 0x000000000040aa85 in main (argc=3, argv=0x7ffc739a0d68) at /home/hongyingxiang/test_gazebo/test.cpp:73

emailweixu commented 4 years ago

It seems to block while trying to acquire a mutex in the ignition::transport::Node::PublisherPrivate destructor.