MIT-TESSE / tesse-gym

OpenAI Gym interface for training RL agents in TESSE
GNU General Public License v2.0

Simulation requires and blocks local network #3

Closed JD-ETH closed 4 years ago

JD-ETH commented 4 years ago

Good job at making the interface! It has been fun looking for fruits. I have some feedback on the simulator, though:

  1. Sometimes the simulated robot leaves the indoor space and drops off a cliff, eventually getting stuck outside.
  2. I wonder if you can run the training without rendering? So far I haven't found out how to turn off env.render().
  3. Are there plans to share the trained weights for the baseline?
ZacRavichandran commented 4 years ago

Glad you're enjoying it and thanks for the feedback!

  1. We put safeguards to prevent that, but it sounds like there are still some cases that need to be addressed. Do you have a screenshot and scene number of where that occurred so we can replicate the issue on our end?

  2. Because the agent uses visual input, rendering is required for training. If you'd like to train headless on a remote server, we have instructions for setting up a virtual display here. There's also a blog post detailing how to set up a virtual display on your local machine, but we haven't tried that ourselves.

  3. We do plan to share the trained weights, please stay tuned!

JD-ETH commented 4 years ago
  1. This happens rather often in scene_id: 2 if you observe the training for a bit. I was using the IPython notebook example for training; not sure if the seed is deterministic.

2, 3: Thanks!

  4. I'm not sure anyone has reported it, but it appears that the network for asynchronous training can break sometimes. I will upload a detailed error trace when it happens again, but it's inside the multiprocessing from tesse-gym. This can happen several iterations into the training.
JD-ETH commented 4 years ago
Process ForkServerProcess-2:
Traceback (most recent call last):
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 18, in _worker
    observation, reward, done, info = env.step(data)
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 201, in step
    reward, reward_info = self.compute_reward(response, action)
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/tasks/goseek/goseek.py", line 200, in compute_reward
    targets.metadata
AttributeError: 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
  File "train-agent.py", line 102, in <module>
    main()
  File "train-agent.py", line 96, in main
    trainer.train(params['total_steps'])
  File "train-agent.py", line 31, in train
    self.rl_trainer.train(steps)
  File "/home/jd/competition/goseek/tesse-gym/src/agents/ppo_agent.py", line 84, in train
    self.model.learn(total_timesteps=steps, callback=self.callback)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/ppo2/ppo2.py", line 336, in learn
    rollout = self.runner.run(callback)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/runners.py", line 48, in run
    return self._run()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/ppo2/ppo2.py", line 482, in _run
    self.obs[:], rewards, self.dones, infos = self.env.step(clipped_actions)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/base_vec_env.py", line 150, in step
    return self.step_wait()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 107, in step_wait
    results = [remote.recv() for remote in self.remotes]
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 107, in <listcomp>
    results = [remote.recv() for remote in self.remotes]
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
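
A note on reading the two tracebacks above: the worker process dies with the AttributeError, and the parent then fails with EOFError because its pipe to that worker has been closed. Here is a minimal sketch of that second effect, using only the standard library. No TESSE or stable_baselines code is involved, and `recv_or_report` is a hypothetical helper name chosen for illustration.

```python
from multiprocessing import Pipe


def recv_or_report(conn):
    """Return ("ok", payload), or ("closed", None) if the peer end is gone."""
    try:
        return "ok", conn.recv()
    except EOFError:
        # Raised when the other end of the pipe was closed and nothing is
        # left to read, which is what a parent sees after a worker crashed.
        return "closed", None


parent_end, worker_end = Pipe()

# A healthy worker sends its result back.
worker_end.send({"reward": 1.0})
first = recv_or_report(parent_end)

# Simulate a crashed worker: its end is closed without sending anything.
worker_end.close()
second = recv_or_report(parent_end)

print(first, second)
```

So the EOFError is a symptom; the AttributeError in the worker is the error to chase.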
JD-ETH commented 4 years ago

Upon further investigation I have made the following observation:

  1. During training, if I'm connected to the Wifi, it severely slows down the local network's internet access. Removing the PC from the local network fixes the issue.
  2. The above bug happens when I unplug the WIFI stick I use for my training PC.

I am not a networking expert but I would say there is a link between the training environment and this bug :smile:

JD-ETH commented 4 years ago

At the same time, without an active WIFI connection I am getting the same error. It seems like you need an active network connection for this?

Process ForkServerProcess-4:
Traceback (most recent call last):
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 13, in _worker
    env = env_fn_wrapper.var()
  File "/home/jd/competition/goseek/tesse-gym/src/envs/utils.py", line 19, in _thunk
    target_found_reward=target_found_reward,
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/tasks/goseek/goseek.py", line 97, in __init__
    ground_truth_mode=ground_truth_mode,
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 152, in __init__
    self._init_pose()
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 343, in _init_pose
    metadata = self.env.request(MetadataRequest()).metadata
AttributeError: 'NoneType' object has no attribute 'metadata'
Process ForkServerProcess-3:
ZacRavichandran commented 4 years ago
  1. That's very helpful, we tracked down the issue in the scene and will publish a fix

  2. Yes, the network is involved 😃, but you don't need an active network connection. The simulator uses TCP/UDP protocol to communicate with the agent. It looks like the error is coming from a stressed network. Though there is a resource limit that will govern the number of simulation instances a machine can run, we're adding some error handling to account for occasional timeouts (which is where the AttributeError: 'NoneType' object has no attribute 'metadata' error comes from).

How many environments are you using for training?
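
To illustrate the failure mode described above: a request that times out comes back as None, and reading `.metadata` from it raises the AttributeError. Below is a rough sketch of the kind of guard that avoids this. It is not the actual tesse-gym fix; `request_with_retry`, `SimulatorUnavailable`, and the fake request function are hypothetical names made up for this example.

```python
import time


class SimulatorUnavailable(RuntimeError):
    """Hypothetical error raised when no response arrives after all attempts."""


def request_with_retry(send_request, attempts=3, delay_s=0.0):
    """Call send_request() until it returns something other than None."""
    for attempt in range(1, attempts + 1):
        response = send_request()
        if response is not None:
            return response
        if attempt < attempts:
            time.sleep(delay_s)  # give a busy simulator a moment before retrying
    raise SimulatorUnavailable(f"no response after {attempts} attempts")


# Stand-in for a request that times out twice and then succeeds.
_replies = iter([None, None, {"metadata": "<placeholder/>"}])


def fake_request():
    return next(_replies)


result = request_with_retry(fake_request, attempts=3)
print(result["metadata"])
```

Raising a dedicated error after the last attempt gives a clearer message than an AttributeError on None.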

JD-ETH commented 4 years ago

4 environments as suggested. I have a RTX 2070 and 32 GB RAM, must be enough for that purpose?

It still doesn't make sense to me that the env returns NoneType if there is no active internet connection. Without an active WIFI network, the training doesn't run at all for me. If I understand correctly, only the protocol is used, and the lo internal network should be active independent of local networking. I'm eagerly waiting for a fix. I had trained with the suggested parameters for 1 day, 1.6 million steps, but then it crashed :(

ZacRavichandran commented 4 years ago

Yep, that's definitely sufficient firepower.

The env returning NoneType means that the client can't communicate with the simulator over localhost, but we haven't observed this being dependent on an internet connection. Without the WIFI connection, can you ping localhost?
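
Besides `ping localhost`, a small standard-library script can check whether UDP datagrams actually travel over the loopback interface. This is only a sketch: it does not talk to the simulator, and `udp_loopback_ok` is a hypothetical helper name.

```python
import socket


def udp_loopback_ok(payload=b"ping", timeout_s=1.0):
    """Send one UDP datagram to ourselves over 127.0.0.1 and check it arrives."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(timeout_s)
        sender.sendto(payload, receiver.getsockname())
        data, _ = receiver.recvfrom(1024)
        return data == payload
    except OSError:
        # Covers a receive timeout as well as errors raised while sending.
        return False
    finally:
        sender.close()
        receiver.close()


print("UDP over loopback works:", udp_loopback_ok())
```

If this reports True with the WIFI off, loopback itself is healthy and the problem is more likely in where the simulator sends its packets.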

The fix should be out by tomorrow, thanks for your patience!

JD-ETH commented 4 years ago

I will test once the fix is pushed then, thanks a lot! On another note, due to the virus situation it's becoming unlikely that ICRA still takes place in the intended format. Please give a heads-up as soon as there is an update regarding the workshop!

ZacRavichandran commented 4 years ago

We just pushed an update that will handle occasional timeout errors. Please re-clone master and let us know what you think!

We've been following announcements from ICRA and will give an update regarding the workshop as soon as we have more information!

JD-ETH commented 4 years ago

I am still unable to train here when WIFI is off:

  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 13, in _worker
    env = env_fn_wrapper.var()
  File "/home/jd/competition/goseek/tesse-gym/src/envs/utils.py", line 19, in _thunk
    target_found_reward=target_found_reward,
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/tasks/goseek/goseek.py", line 97, in __init__
    ground_truth_mode=ground_truth_mode,
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 158, in __init__
    self._init_pose()
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 376, in _init_pose
    metadata_response = self._data_request(MetadataRequest())
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 372, in _data_request
    raise TesseConnectionError()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 13, in _worker
    env = env_fn_wrapper.var()
  File "/home/jd/competition/goseek/tesse-gym/src/envs/utils.py", line 19, in _thunk
    target_found_reward=target_found_reward,
tesse_gym.core.utils.TesseConnectionError: Cannot receive data from the simulator. The connection is blocked or the simulator is not running. 
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/tasks/goseek/goseek.py", line 97, in __init__
    ground_truth_mode=ground_truth_mode,
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 158, in __init__
    self._init_pose()
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 376, in _init_pose
    metadata_response = self._data_request(MetadataRequest())
  File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 372, in _data_request
    raise TesseConnectionError()
tesse_gym.core.utils.TesseConnectionError: Cannot receive data from the simulator. The connection is blocked or the simulator is not running. 
Traceback (most recent call last):
  File "train-agent.py", line 102, in <module>
    main()
  File "train-agent.py", line 89, in main
    env = make_unity_env(simfile, range(params['env_num']), params['targets_num'], params['ep_len'], params['reward'])
  File "/home/jd/competition/goseek/tesse-gym/src/envs/utils.py", line 25, in make_unity_env
    return SubprocVecEnv([make_env(i) for i in num_env])
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 98, in __init__
    observation_space, action_space = self.remotes[0].recv()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

with my WIFI turned off. I could reach my localhost when pinged. Maybe you have an idea? Or what protocol are you using so that I can look into and debug? The wifi slows down significantly when it's training.

JD-ETH commented 4 years ago

Can you confirm first that training works without an active Wifi connection for you?

JD-ETH commented 4 years ago

When running the training and trying to intercept wifi traffic over tcpdump -ni wlan0 I could see the following:

18:56:50.019812 IP 192.168.1.215.37128 > 255.255.255.255.9004: UDP, length 565
18:56:50.020323 IP 192.168.1.215.37128 > 255.255.255.255.9004: UDP, length 559
18:56:50.022466 IP 192.168.1.215.55402 > 255.255.255.255.9010: UDP, length 579

From the port numbers I would say this belongs to the simulator. Interestingly, it is routed through wifi.
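
For longer captures, a short script can tally where the UDP packets are going. The sketch below parses lines in the format shown above; the regular expression is written against these three sample lines only, so other tcpdump output formats may need adjustments, and `summarize_udp` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches tcpdump lines such as:
# 18:56:50.019812 IP 192.168.1.215.37128 > 255.255.255.255.9004: UDP, length 565
LINE_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): UDP, length (?P<length>\d+)"
)


def summarize_udp(lines):
    """Count UDP packets per (destination address, destination port)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["dst"], int(match["dport"]))] += 1
    return counts


captured = [
    "18:56:50.019812 IP 192.168.1.215.37128 > 255.255.255.255.9004: UDP, length 565",
    "18:56:50.020323 IP 192.168.1.215.37128 > 255.255.255.255.9004: UDP, length 559",
    "18:56:50.022466 IP 192.168.1.215.55402 > 255.255.255.255.9010: UDP, length 579",
]

summary = summarize_udp(captured)
for (dst, port), n in sorted(summary.items()):
    print(f"{dst}:{port} -> {n} packet(s)")
```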

JD-ETH commented 4 years ago

This happens already if you call the bare metal simulator ./goseek-v0.1.0.x86_64 --listen_port 9000 --send_port 9000 --set_resolution 320 240 when called in tesse_gym.py.

I don't have access to the source code of your environment, and the executable doesn't seem to provide a --help. I will leave you to figure this out, then.

JD-ETH commented 4 years ago

I tried blocking the relevant UDP ports but then the simulation fails to connect too.

griffith826 commented 4 years ago

Hello, we primarily developed this simulator using desktops without wifi. Luckily, I just received a new laptop last week. It has a fresh install of Ubuntu 16.04. Other than a few basic programs that I installed (e.g., slack, zoom), setting up the challenge software is the only thing that I've used it for.

Unfortunately, I cannot recreate what you are observing. I ran python eval.py --agent-config baselines/config/random-agent.yaml and tried various network configurations:

I did not observe any errors.

It does sound convincing that the error you are seeing is network-related. Is it possible you have some system configuration for a different application that might be interacting with this simulator? You also mentioned a USB wifi. I wonder if it is doing something to your network.

Anyway, I wanted to share a few other things to help you debug further.

  1. We have source code for the Unity simulator at https://github.com/MIT-TESSE/tesse-core. This is an Asset that you would import into a Unity project.
  2. With the simulator running, my computer prints the following when I run sudo netstat -tulp | grep gos. The simulator is using 0.0.0.0.
    tcp        0      0 0.0.0.0:9005            0.0.0.0:*               LISTEN      12306/simulator/gos 
    udp        0      0 0.0.0.0:55372           0.0.0.0:*                           12306/simulator/gos 
    udp        0      0 0.0.0.0:9000            0.0.0.0:*                           12306/simulator/gos 
    udp        0      0 0.0.0.0:9001            0.0.0.0:*                           12306/simulator/gos 
    udp        0      0 0.0.0.0:9002            0.0.0.0:*                           12306/simulator/gos 
    udp        0      0 0.0.0.0:9003            0.0.0.0:*                           12306/simulator/gos
  3. I ran nload -m with the simulator running. It shows that nearly all my network traffic is on lo. See image below. Screenshot from 2020-03-29 18-14-56
JD-ETH commented 4 years ago

Thanks for the explicit explanation and tool set you suggested. I'm also working from a work station, unfortunately working from home these days means only wifi connection is available to me.

I can confirm that I have exact same output when calling the eval.py script. No wifi traffic is in use and all is routed through the local host: nice!

However, when calling the Unity executable directly or calling GoSeekFullPerception (this is the way it gets called in the baseline IPython notebook training example), wifi traffic appears: traffic

while the traffic in wifi previously was 0 with eval.py. I think we are very close to the issue, and when I find time I will check later today how the simulation is called differently from eval.py.

JD-ETH commented 4 years ago

In addition, citing from a colleague of mine:

The problem is probably with the 255.255.255.255 address which is a magic value broadcasting to the "current network", which I have no clue how it is selected. Apparently to be the wifi in your case.

This is what I observed previously.
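
To make the distinction concrete, the standard-library `ipaddress` module can tell a loopback destination apart from the limited broadcast address seen in the capture. This is just a sketch with a hypothetical helper name, `classify_destination`. Which interface ends up carrying a broadcast depends on the host's routing configuration, which this snippet does not inspect.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.IPv4Address("255.255.255.255")


def classify_destination(address):
    """Label an IPv4 destination as loopback, limited broadcast, or other."""
    ip = ipaddress.IPv4Address(address)
    if ip.is_loopback:
        return "loopback"  # 127.0.0.0/8, handled by the lo interface
    if ip == LIMITED_BROADCAST:
        return "limited broadcast"  # addressed to every host on the local network
    return "other"


for destination in ("127.0.0.1", "255.255.255.255", "192.168.1.215"):
    print(destination, "->", classify_destination(destination))
```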

JD-ETH commented 4 years ago

There is a subtle difference in how you call the executable. In my script (as well as in the Python notebook), GoSeekFullPerception was imported like this: from tesse_gym.tasks.goseek.goseek import GoSeekFullPerception, whereas in the evaluation script it is imported like this: from tesse_gym.tasks.goseek.goseek_full_perception import GoSeekFullPerception

Now the internal communication does indeed go through the loopback device; however, it's still polluting the network, and it won't run without an active WIFI connection.

In summary, can you provide:

a stable way to execute the simulation environment with no active internet connection?

JD-ETH commented 4 years ago

@griffith826 I know the competition is delayed, but I have already spent one week trying to debug the training environment without success. Please give some instructions and try to reproduce my issue.

griffith826 commented 4 years ago

We're hoping to get this working for you. I've tried to recreate your issue based on what I've gleaned from your previous posts, but haven't been able to.

Could you post a minimal set of step-by-step instructions to recreate the error? I'll run them on my end. If I can recreate it, great and I'm hopeful we can find a fix; if not, we might have to get more data/information about your particular setup to determine the root cause.

If feasible, we suggest trying to recreate the problem using tesse-interface, which provides a lower level of interaction with the simulator compared to tesse-gym. This would help us really narrow our focus on the network usage. You can look at the example notebook for example usage. Requesting images would be a good way to stress the network usage.

JD-ETH commented 4 years ago

Thanks a lot for your response. Several ways to recreate the issue:

  1. run the executable ./goseek-v0.1.0.x86_64 and observe all traffic is routed through local wifi network and won't run without active network.

  2. run the provided python notebook example and observe the same behavior after env is created.

  3. Run my fork with python train-agent.py -exp test_name and see the same. This is basically a python version of the previous notebook. Note the lo traffic here is 0. However, if importing GoSeekFullPerception from tesse_gym.tasks.goseek.goseek_full_perception the local network will be used and the frame rate increases a lot, but it still requires active network.

I will take a look at the tesse-interface and see if it helps. thanks!

JD-ETH commented 4 years ago

I have tried the said Python notebook script, but I'm not sure what to make of this. There is no output anywhere (likely because response is None), also no rendering, and it raises an error at

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-fa128e94ffe9> in <module>
----> 1 unity_start = unity_time(env)
      2 
      3 for _ in range(10):
      4     env.send(StepWithForce())
      5     print("Elapsed time is ", unity_time(env) - unity_start, " seconds.")

<ipython-input-15-c4ca710925c8> in unity_time(env)
      1 def unity_time(env):
      2     response = env.request(MetadataRequest())
----> 3     root = ET.fromstring(response.metadata)
      4     return float(root.find('time').text)

AttributeError: 'NoneType' object has no attribute 'metadata'

I have been using the goseek env, python 3.7, and ran python setup.py develop beforehand.

Dongbox commented 4 years ago

I found that I had the same problem. When the program starts, "Broken Pipe", "no attribute metadata", "Connection reset by peer", "EOFError" and other errors appear after several rounds of socket communication, which is indeed the socket communication timing out and returning None. Errors always occur, whether you run the emulator alone and interact with it or go through the notebook. I always test this on WIFI.

JD-ETH commented 4 years ago

@topbobo thanks for proving I'm not crazy. Can you confirm that the env doesn't communicate with the notebook at all if no WIFI is present?

Dongbox commented 4 years ago

1. With WIFI on: no error is reported until the function "model.learn()".

Screenshot from 2020-04-01 15-31-29

2. Without WIFI: an error is reported during environment initialization.

Screenshot from 2020-04-01 15-34-04

All of the above was tested with the latest version of the git repository: Ubuntu 16.04, CUDA 10.2, tensorflow 1.14. Because the wireless on my workstation was down, I had to use WIFI for the time being.

Dongbox commented 4 years ago

Its port runs when WIFI open

Screenshot from 2020-04-01 15-52-09

Dongbox commented 4 years ago

When I connected to the network using a USB Ethernet Adapter without WIFI, I was very glad that I had succeeded!!! @JD-ETH @griffith826 @ZacRavichandran

JD-ETH commented 4 years ago

Nice, so this issue only appears when connected via Wifi, good to know.

May I still get a fix for the WIFI case though? I don't have an adapter at home.

griffith826 commented 4 years ago

Thanks for sharing the ways to recreate this. I hope we have some good news; apologies it took a while to recreate on our end. I think we have a temporary solution and will come up with a change in the simulator, too, that provides a better user experience with wifi.

I was able to recreate your observation running the jupyter notebook (number 2 in your list above). Once I had two simulators running, it started to bring my wifi down, too. Apparently, running one simulator wasn't quite enough for me.

Anyway, the following temporary solution worked for me.

sudo ufw enable
sudo ufw deny out on wlo1 to any port 9000:9020 proto udp

This will enable your firewall, then add a rule to disable UDP on your wifi device for the ports that the simulator uses. The name of my wifi device is wlo1, so please replace that with the name of your device. I chose 9000:9020 to cover the ports that would get used if you ran our demonstration notebook with four simulators running simultaneously. You can determine which ports are being used by running sudo netstat -tulpn | grep goseek, if you need to modify that.
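
If you want to derive the port range from the netstat output rather than reading it by eye, something like the following sketch could help. It assumes the column layout of the netstat output posted earlier in this thread, and `ports_in_use` is a hypothetical helper name. Lines for other protocol labels (for example tcp6) are ignored.

```python
def ports_in_use(netstat_output, process_hint):
    """Return {"tcp": [...], "udp": [...]} local ports for matching lines."""
    ports = {"tcp": [], "udp": []}
    for line in netstat_output.splitlines():
        fields = line.split()
        # The last column holds "PID/program name"; skip unrelated lines.
        if len(fields) < 4 or process_hint not in fields[-1]:
            continue
        proto, local_address = fields[0], fields[3]
        if proto in ports:
            ports[proto].append(int(local_address.rsplit(":", 1)[1]))
    for proto in ports:
        ports[proto].sort()
    return ports


# Sample taken from the netstat output posted earlier in this thread.
sample = """\
tcp        0      0 0.0.0.0:9005            0.0.0.0:*               LISTEN      12306/simulator/gos
udp        0      0 0.0.0.0:55372           0.0.0.0:*                           12306/simulator/gos
udp        0      0 0.0.0.0:9000            0.0.0.0:*                           12306/simulator/gos
udp        0      0 0.0.0.0:9001            0.0.0.0:*                           12306/simulator/gos
udp        0      0 0.0.0.0:9002            0.0.0.0:*                           12306/simulator/gos
udp        0      0 0.0.0.0:9003            0.0.0.0:*                           12306/simulator/gos
"""

found = ports_in_use(sample, "gos")
print(found)
```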

Please give it a try and do let us know if this helps.

JD-ETH commented 4 years ago

Did it work for you? I tried before already with iptables. Now I don't get response. This is the output applying those:

tesse_gym.core.utils.TesseConnectionError: Cannot receive data from the simulator. The connection is blocked or the simulator is not running. 
  File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/envs/utils.py", line 19, in _thunk
    target_found_reward=target_found_reward,
  File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/tesse_gym/tasks/goseek/goseek.py", line 97, in __init__
    ground_truth_mode=ground_truth_mode,
  File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/tesse_gym/core/tesse_gym.py", line 158, in __init__
    self._init_pose()
  File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/tesse_gym/core/tesse_gym.py", line 376, in _init_pose
    metadata_response = self._data_request(MetadataRequest())
  File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/tesse_gym/core/tesse_gym.py", line 372, in _data_request
    raise TesseConnectionError()
tesse_gym.core.utils.TesseConnectionError: Cannot receive data from the simulator. The connection is blocked or the simulator is not running. 
Traceback (most recent call last):
  File "train-agent.py", line 105, in <module>
    main()
  File "train-agent.py", line 92, in main
    env = make_unity_env(simfile, range(params['env_num']), params['targets_num'], params['ep_len'], params['reward'])
  File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/envs/utils.py", line 25, in make_unity_env
    return SubprocVecEnv([make_env(i) for i in num_env])
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 98, in __init__
    observation_space, action_space = self.remotes[0].recv()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
(goseek) jd@trumptower:~/competition/goseek/Fruit-Ninja-Camp/src

@griffith826

griffith826 commented 4 years ago

This turned out to be complex. In your training scripts, please try adding a step_rate argument to the constructor for GoSeekFullPerception like below. I'm even more hopeful this will help.

def make_unity_env(filename, num_env):
    """ Create a wrapped Unity environment. """

    def make_env(rank):
        def _thunk():
            env = GoSeekFullPerception(
                str(filename),
                network_config=get_network_config(worker_id=rank),
                n_targets=n_targets,
                episode_length=episode_length,
                scene_id=scene_id[rank],
                target_found_reward=target_found_reward,
                step_rate=20  # add this argument
            )
            return env

        return _thunk

    return SubprocVecEnv([make_env(i) for i in range(num_env)])

The default value for step_rate is None, which ends up causing a problem in the simulator when we have our network devices disabled or are blocking certain ports. However, we can set a value here. eval.py actually sets it, which is why that script was still working.
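
If several scripts build environments, one way to avoid forgetting the argument is to fill it in at a single place. The sketch below shows the idea with a stand-in class instead of the real environment, so it runs without the simulator. `StandInEnv`, `with_step_rate`, and the `worker_id` keyword are hypothetical names used only for illustration; the value 20 is the one from the snippet above.

```python
class StandInEnv:
    """Placeholder for the real environment class; it only records its arguments."""

    def __init__(self, build_path, step_rate=None, **kwargs):
        self.build_path = build_path
        self.step_rate = step_rate
        self.extra = kwargs


def with_step_rate(env_kwargs, default_rate=20):
    """Return a copy of env_kwargs in which step_rate is never None."""
    merged = dict(env_kwargs)
    if merged.get("step_rate") is None:
        merged["step_rate"] = default_rate
    return merged


def make_env(env_cls, build_path, rank, **env_kwargs):
    """Build a zero-argument factory, as vectorized env wrappers expect."""

    def _thunk():
        return env_cls(build_path, worker_id=rank, **with_step_rate(env_kwargs))

    return _thunk


envs = [make_env(StandInEnv, "simulator-build", rank)() for rank in range(4)]
print([env.step_rate for env in envs])
```

An explicit `step_rate` passed by the caller is left untouched; only a missing or None value is replaced.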

Fingers crossed!

JD-ETH commented 4 years ago

Yes, that seems to work. Hard to see how the rate can be related to the simulation, though :nerd_face: Thanks!