Closed JD-ETH closed 4 years ago
Glad you're enjoying it and thanks for the feedback!
We put safeguards to prevent that, but it sounds like there are still some cases that need to be addressed. Do you have a screenshot and scene number of where that occurred so we can replicate the issue on our end?
Because the agent uses visual input, rendering is required for training. If you'd like to train headless on a remote server, we have instructions for setting up a virtual display here. There's also a blog post detailing how to set up a virtual display on your local machine, but we haven't tried that ourselves.
We do plan to share the trained weights, please stay tuned!
2,3: Thanks !
Process ForkServerProcess-2:
Traceback (most recent call last):
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 18, in _worker
observation, reward, done, info = env.step(data)
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 201, in step
reward, reward_info = self.compute_reward(response, action)
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/tasks/goseek/goseek.py", line 200, in compute_reward
targets.metadata
AttributeError: 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
File "train-agent.py", line 102, in <module>
main()
File "train-agent.py", line 96, in main
trainer.train(params['total_steps'])
File "train-agent.py", line 31, in train
self.rl_trainer.train(steps)
File "/home/jd/competition/goseek/tesse-gym/src/agents/ppo_agent.py", line 84, in train
self.model.learn(total_timesteps=steps, callback=self.callback)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/ppo2/ppo2.py", line 336, in learn
rollout = self.runner.run(callback)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/runners.py", line 48, in run
return self._run()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/ppo2/ppo2.py", line 482, in _run
self.obs[:], rewards, self.dones, infos = self.env.step(clipped_actions)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/base_vec_env.py", line 150, in step
return self.step_wait()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 107, in step_wait
results = [remote.recv() for remote in self.remotes]
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 107, in <listcomp>
results = [remote.recv() for remote in self.remotes]
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Upon further investigation I have made the following observation:
I am not a networking expert but I would say there is a link between the training environment and this bug :smile:
At the same time, without an active WIFI connection I am getting the same error. It seems like you need an active network connection for this?
Process ForkServerProcess-4:
Traceback (most recent call last):
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 13, in _worker
env = env_fn_wrapper.var()
File "/home/jd/competition/goseek/tesse-gym/src/envs/utils.py", line 19, in _thunk
target_found_reward=target_found_reward,
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/tasks/goseek/goseek.py", line 97, in __init__
ground_truth_mode=ground_truth_mode,
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 152, in __init__
self._init_pose()
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 343, in _init_pose
metadata = self.env.request(MetadataRequest()).metadata
AttributeError: 'NoneType' object has no attribute 'metadata'
Process ForkServerProcess-3:
That's very helpful, we tracked down the issue in the scene and will publish a fix
Yes, the network is involved 😃, but you don't need an active network connection. The simulator uses TCP and UDP to communicate with the agent. It looks like the error is coming from a stressed network. Though there is a resource limit that governs the number of simulation instances a machine can run, we're adding some error handling to account for occasional timeouts (which is where the AttributeError: 'NoneType' object has no attribute 'metadata' error comes from).
How many environments are you using for training?
4 environments as suggested. I have an RTX 2070 and 32 GB RAM, which should be enough for that purpose?
It still doesn't make sense to me that the env returns NoneType if there is no active internet connection. Without an active WIFI network, the training doesn't run at all for me. If I understand correctly, only the protocol is used, and the lo internal network should be active independent of local networking. Quite eager for a fix. I had trained with the suggested parameters for 1 day (1.6 million steps), but then it crashed :(
Yep, that's definitely sufficient firepower.
The env returning NoneType means that the client can't communicate with the simulator over localhost, but we haven't observed this being dependent on an internet connection. Without the WIFI connection, can you ping localhost?
The fix should be out by tomorrow, thanks for your patience!
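A ping only exercises ICMP, so a UDP round trip over the loopback interface can be a useful complementary check. The sketch below is not part of tesse-gym: the function name loopback_udp_roundtrip is made up for illustration, and it only shows that a datagram sent to 127.0.0.1 comes back, not that the simulator itself is reachable.

```python
import socket


def loopback_udp_roundtrip(payload: bytes = b"ping", timeout: float = 1.0) -> bool:
    """Send a UDP datagram to a socket bound on 127.0.0.1 and read it back."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the OS pick a free port, so this does not collide
        # with whatever ports the simulator is using.
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(timeout)
        sender.sendto(payload, receiver.getsockname())
        data, _ = receiver.recvfrom(len(payload) + 1)
        return data == payload
    except OSError:
        # socket.timeout is a subclass of OSError, so a timeout lands here too.
        return False
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print("loopback UDP ok:", loopback_udp_roundtrip())
```

If this returns False with WIFI off, the problem is likely in the system's network configuration rather than in the simulator.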
I will test once the fix is pushed then, thanks a lot! On another note, due to the virus situation it's becoming unlikely that ICRA still takes place in the intended format. Please give a heads-up as soon as there is an update regarding the workshop!
We just pushed an update that will handle occasional timeout errors. Please re-clone master and let us know what you think!
We've been following announcements from ICRA and will give an update regarding the workshop as soon as we have more information!
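For readers wondering what handling an occasional timeout can look like on the client side, here is a minimal sketch. It is an assumption about the general pattern, not the code that was pushed: request_with_retry and SimulatorTimeout are hypothetical names, and the real library raises its own TesseConnectionError.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class SimulatorTimeout(RuntimeError):
    """Hypothetical error type used only in this sketch."""


def request_with_retry(
    send_request: Callable[[], Optional[T]],
    attempts: int = 3,
    delay_s: float = 0.1,
) -> T:
    """Call send_request until it returns something other than None."""
    for attempt in range(attempts):
        response = send_request()
        if response is not None:
            return response
        if attempt < attempts - 1:
            time.sleep(delay_s)  # brief pause before asking again
    # Fail with an explicit error instead of letting None propagate and
    # surface later as "'NoneType' object has no attribute 'metadata'".
    raise SimulatorTimeout(f"no response after {attempts} attempts")
```

With the calls that appear in the tracebacks above, usage would look like request_with_retry(lambda: env.request(MetadataRequest())).metadata.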
I am still unable to train here when WIFI is off:
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 13, in _worker
env = env_fn_wrapper.var()
File "/home/jd/competition/goseek/tesse-gym/src/envs/utils.py", line 19, in _thunk
target_found_reward=target_found_reward,
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/tasks/goseek/goseek.py", line 97, in __init__
ground_truth_mode=ground_truth_mode,
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 158, in __init__
self._init_pose()
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 376, in _init_pose
metadata_response = self._data_request(MetadataRequest())
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 372, in _data_request
raise TesseConnectionError()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 13, in _worker
env = env_fn_wrapper.var()
File "/home/jd/competition/goseek/tesse-gym/src/envs/utils.py", line 19, in _thunk
target_found_reward=target_found_reward,
tesse_gym.core.utils.TesseConnectionError: Cannot receive data from the simulator. The connection is blocked or the simulator is not running.
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/tasks/goseek/goseek.py", line 97, in __init__
ground_truth_mode=ground_truth_mode,
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 158, in __init__
self._init_pose()
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 376, in _init_pose
metadata_response = self._data_request(MetadataRequest())
File "/home/jd/competition/goseek/tesse-gym/src/tesse_gym/core/tesse_gym.py", line 372, in _data_request
raise TesseConnectionError()
tesse_gym.core.utils.TesseConnectionError: Cannot receive data from the simulator. The connection is blocked or the simulator is not running.
Traceback (most recent call last):
File "train-agent.py", line 102, in <module>
main()
File "train-agent.py", line 89, in main
env = make_unity_env(simfile, range(params['env_num']), params['targets_num'], params['ep_len'], params['reward'])
File "/home/jd/competition/goseek/tesse-gym/src/envs/utils.py", line 25, in make_unity_env
return SubprocVecEnv([make_env(i) for i in num_env])
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 98, in __init__
observation_space, action_space = self.remotes[0].recv()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
with my WIFI turned off. I could reach my localhost when I pinged it. Maybe you have an idea? Or which protocol are you using, so that I can look into it and debug? The wifi slows down significantly while it's training.
Can you confirm first that training works without an active Wifi connection for you?
When running the training and trying to intercept wifi traffic with tcpdump -ni wlan0, I could see the following:
18:56:50.019812 IP 192.168.1.215.37128 > 255.255.255.255.9004: UDP, length 565
18:56:50.020323 IP 192.168.1.215.37128 > 255.255.255.255.9004: UDP, length 559
18:56:50.022466 IP 192.168.1.215.55402 > 255.255.255.255.9010: UDP, length 579
From the port numbers I would say this belongs to the simulator. Interestingly, it is routed through wifi.
This already happens if you call the bare-metal simulator with ./goseek-v0.1.0.x86_64 --listen_port 9000 --send_port 9000 --set_resolution 320 240, the way it is called in tesse_gym.py.
I don't have access to the source code of your environment, and the executable doesn't seem to produce a --help. I will leave you to figure this out then.
I tried blocking the relevant UDP ports but then the simulation fails to connect too.
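To make the tcpdump observation easier to repeat, the lines can be parsed and checked for the limited-broadcast destination 255.255.255.255. This is a small sketch that assumes the exact line layout shown above; parse_tcpdump_udp and UdpPacket are illustrative names, not part of any tool.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as:
# 18:56:50.019812 IP 192.168.1.215.37128 > 255.255.255.255.9004: UDP, length 565
LINE_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): UDP, length (?P<length>\d+)"
)


class UdpPacket(NamedTuple):
    src: str
    src_port: int
    dst: str
    dst_port: int
    length: int

    @property
    def is_limited_broadcast(self) -> bool:
        # 255.255.255.255 is the IPv4 limited-broadcast address.
        return self.dst == "255.255.255.255"


def parse_tcpdump_udp(line: str) -> Optional[UdpPacket]:
    """Return a UdpPacket for a matching tcpdump line, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return UdpPacket(
        src=match["src"],
        src_port=int(match["sport"]),
        dst=match["dst"],
        dst_port=int(match["dport"]),
        length=int(match["length"]),
    )
```

Feeding it the three captured lines gives destination ports 9004, 9004 and 9010, all sent to the broadcast address.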
Hello, we primarily developed this simulator using desktops without wifi. Luckily, I just received a new laptop last week. It has a fresh install of Ubuntu 16.04. Other than a few basic programs that I installed (e.g., slack, zoom), setting up the challenge software is the only thing that I've used it for.
Unfortunately, I cannot recreate what you are observing. I ran python eval.py --agent-config baselines/config/random-agent.yaml and tried various network configurations:
lo enabled,
lo enabled, then enabled my other devices.
I did not observe any errors.
It does sound convincing that the error you are seeing is network-related. Is it possible you have some system configuration for a different application that might be interacting with this simulator? You also mentioned a USB wifi. I wonder if it is doing something to your network.
Anyway, I wanted to share a few other things to help you debug further.
sudo netstat -tulp | grep gos. The simulator is using 0.0.0.0.
tcp 0 0 0.0.0.0:9005 0.0.0.0:* LISTEN 12306/simulator/gos
udp 0 0 0.0.0.0:55372 0.0.0.0:* 12306/simulator/gos
udp 0 0 0.0.0.0:9000 0.0.0.0:* 12306/simulator/gos
udp 0 0 0.0.0.0:9001 0.0.0.0:* 12306/simulator/gos
udp 0 0 0.0.0.0:9002 0.0.0.0:* 12306/simulator/gos
udp 0 0 0.0.0.0:9003 0.0.0.0:* 12306/simulator/gos
nload -m with the simulator running. It shows that nearly all my network traffic is on lo. See image below.
Thanks for the explicit explanation and the tool set you suggested. I'm also working from a workstation; unfortunately, working from home these days means only a wifi connection is available to me.
I can confirm that I get the exact same output when calling the eval.py script. No wifi traffic is in use and everything is routed through localhost: nice!
However, when calling the Unity executable directly or calling GoSeekFullPerception (this is the way it gets called in the baseline IPython notebook training example), wifi traffic appears:
while the wifi traffic was previously 0 with eval.py. I think we are very close to the issue, and when I find time later today I will check how the simulation is called differently from eval.py.
In addition, citing from a colleague of mine:
The problem is probably with the 255.255.255.255 address which is a magic value broadcasting to the "current network", which I have no clue how it is selected. Apparently to be the wifi in your case.
This is what I observed previously.
There is a subtle difference in how you call the executable. In my script (as well as in the Python notebook), GoSeekFullPerception was imported like this:
from tesse_gym.tasks.goseek.goseek import GoSeekFullPerception
whereas in the evaluation script like:
from tesse_gym.tasks.goseek.goseek_full_perception import GoSeekFullPerception
Now the internal communication does indeed go through the loopback device; however, it still pollutes the network and won't run without an active WIFI connection.
In summary, can you provide:
a stable way to execute the simulation environment with no active internet connection?
@griffith826 I know the competition is delayed, but I have already spent one week trying to debug the training environment without success. Please give some instructions and try to reproduce my issue.
We're hoping to get this working for you. I've tried to recreate your issue based on what I've gleaned from your previous posts, but haven't been able to.
Could you post a minimal set of step-by-step instructions to recreate the error? I'll run them on my end. If I can recreate it, great and I'm hopeful we can find a fix; if not, we might have to get more data/information about your particular setup to determine the root cause.
If feasible, we suggest trying to recreate the problem using tesse-interface, which provides a lower level of interaction with the simulator compared to tesse-gym. This would help us really narrow our focus on the network usage. You can look at the example notebook for example usage. Requesting images would be a good way to stress the network usage.
Thanks a lot for your response. Several ways to recreate the issue:
1. Run the executable ./goseek-v0.1.0.x86_64 and observe that all traffic is routed through the local wifi network and that it won't run without an active network.
2. Run the provided Python notebook example and observe the same behavior after env is created.
3. Run my fork with python train-agent.py -exp test_name and see the same. This is basically a Python version of the previous notebook. Note that the lo traffic here is 0. However, if GoSeekFullPerception is imported from tesse_gym.tasks.goseek.goseek_full_perception, the local network will be used and the frame rate increases a lot, but it still requires an active network.
I will take a look at the tesse-interface and see if it helps. thanks!
I have tried the said Python notebook script, but I'm not sure what to make of this. There is no output anywhere (likely because response is None), also no rendering, and it raises an error at
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-16-fa128e94ffe9> in <module>
----> 1 unity_start = unity_time(env)
2
3 for _ in range(10):
4 env.send(StepWithForce())
5 print("Elapsed time is ", unity_time(env) - unity_start, " seconds.")
<ipython-input-15-c4ca710925c8> in unity_time(env)
1 def unity_time(env):
2 response = env.request(MetadataRequest())
----> 3 root = ET.fromstring(response.metadata)
4 return float(root.find('time').text)
AttributeError: 'NoneType' object has no attribute 'metadata'
I have been using the goseek env, Python 3.7, and ran python setup.py develop beforehand.
I found that I have the same problem. After the program starts, "Broken Pipe", "no attribute metadata", "Connection reset by peer", "EOFError" and other errors appear after several rounds of socket communication, which is indeed the socket communication timing out and returning None. The errors always occur, whether I run the simulator alone and interact with it or go through the notebook, and I always test this on WIFI.
@topbobo thanks for proving I'm not crazy. Can you confirm that the env doesn't communicate with the notebook at all if no WIFI is present?
1. With WIFI on: it does not report an error until the function model.learn() is called.
2. Without WIFI: it reports an error during environment initialization.
All of the above was tested with the latest version of the git repository, on Ubuntu 16.04, CUDA 10.2 and TensorFlow 1.14. Because the wireless on my workstation was down, I had to use WIFI for the time being.
Its port runs when WIFI open
When I connected to the network using a USB Ethernet Adapter without WIFI, I was very glad that I had succeeded!!! @JD-ETH @griffith826 @ZacRavichandran
Nice, so this issue only appears when connected via Wifi, good to know.
May I still get a fix for the WIFI case though? I don't have an adapter at home.
Thanks for sharing the ways to recreate this. I hope we have some good news; apologies that it took a while to recreate on our end. I think we have a temporary solution, and we will also come up with a change in the simulator that provides a better user experience with wifi.
I was able to recreate your observation running the jupyter notebook (number 2 in your list above). Once I had two simulators running, it started to bring my wifi down, too. Apparently, running one simulator wasn't quite enough for me.
Anyway, the following temporary solution worked for me.
sudo ufw enable
sudo ufw deny out on wlo1 to any port 9000:9020 proto udp
This will enable your firewall, then add a rule to disable UDP on your wifi device for the ports that the simulator uses. The name of my wifi device is wlo1, so please replace that with the name of your device. I chose 9000:9020 to cover the ports that would get used if you ran our demonstration notebook with four simulators running simultaneously. If you need to modify that, you can determine which ports are being used by running sudo netstat -tulpn | grep goseek.
Please give it a try and do let us know if this helps.
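If the port range needs adjusting, the netstat output can be checked against the range in the rule. The sketch below assumes rows in the same layout as the netstat listing earlier in this thread; listening_ports and udp_ports_outside are illustrative helper names, not part of ufw or netstat.

```python
from typing import Iterable, List, Tuple


def listening_ports(netstat_lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (protocol, local_port) pairs from netstat -tulpn style rows."""
    ports = []
    for line in netstat_lines:
        fields = line.split()
        # Assumed layout: proto, recv-q, send-q, local address, foreign address, ...
        if len(fields) < 4 or not fields[0].startswith(("tcp", "udp")):
            continue
        _, _, port = fields[3].rpartition(":")
        if port.isdigit():
            ports.append((fields[0], int(port)))
    return ports


def udp_ports_outside(ports: List[Tuple[str, int]], low: int, high: int) -> List[int]:
    """List UDP ports that a rule covering low:high would not match."""
    return sorted(
        port
        for proto, port in ports
        if proto.startswith("udp") and not low <= port <= high
    )
```

Applied to the listing posted earlier, the UDP ports 9000 to 9003 fall inside 9000:9020, while the high-numbered UDP port 55372 falls outside it.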
Did it work for you? I already tried this before with iptables. Now I don't get a response. This is the output after applying those rules:
tesse_gym.core.utils.TesseConnectionError: Cannot receive data from the simulator. The connection is blocked or the simulator is not running.
File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/envs/utils.py", line 19, in _thunk
target_found_reward=target_found_reward,
File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/tesse_gym/tasks/goseek/goseek.py", line 97, in __init__
ground_truth_mode=ground_truth_mode,
File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/tesse_gym/core/tesse_gym.py", line 158, in __init__
self._init_pose()
File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/tesse_gym/core/tesse_gym.py", line 376, in _init_pose
metadata_response = self._data_request(MetadataRequest())
File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/tesse_gym/core/tesse_gym.py", line 372, in _data_request
raise TesseConnectionError()
tesse_gym.core.utils.TesseConnectionError: Cannot receive data from the simulator. The connection is blocked or the simulator is not running.
Traceback (most recent call last):
File "train-agent.py", line 105, in <module>
main()
File "train-agent.py", line 92, in main
env = make_unity_env(simfile, range(params['env_num']), params['targets_num'], params['ep_len'], params['reward'])
File "/home/jd/competition/goseek/Fruit-Ninja-Camp/src/envs/utils.py", line 25, in make_unity_env
return SubprocVecEnv([make_env(i) for i in num_env])
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/site-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 98, in __init__
observation_space, action_space = self.remotes[0].recv()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/jd/anaconda3/envs/goseek/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
(goseek) jd@trumptower:~/competition/goseek/Fruit-Ninja-Camp/src
@griffith826
This turned out to be complex. In your training scripts, please try adding a step_rate argument to the constructor for GoSeekFullPerception like below. I'm even more hopeful this will help.
def make_unity_env(filename, num_env):
    """ Create a wrapped Unity environment. """
    # n_targets, episode_length, scene_id and target_found_reward are
    # expected to be defined in the enclosing scope.
    def make_env(rank):
        def _thunk():
            env = GoSeekFullPerception(
                str(filename),
                network_config=get_network_config(worker_id=rank),
                n_targets=n_targets,
                episode_length=episode_length,
                scene_id=scene_id[rank],
                target_found_reward=target_found_reward,
                step_rate=20  # add this argument
            )
            return env
        return _thunk
    return SubprocVecEnv([make_env(i) for i in range(num_env)])
The default value for step_rate is None, which ends up causing a problem in the simulator when we have our network devices disabled or are blocking certain ports. However, we can set a value here. eval.py actually sets it, which is why that script was still working.
Fingers crossed!
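To avoid forgetting the argument in scripts that build environments in several places, the value can be pinned once with functools.partial. This is only a sketch: FakeEnv is a stand-in class written for this example, not the real GoSeekFullPerception, and pin_step_rate is a hypothetical helper name.

```python
from functools import partial


class FakeEnv:
    """Stand-in for an environment class; NOT the real GoSeekFullPerception."""

    def __init__(self, build_path, step_rate=None, **kwargs):
        self.build_path = build_path
        self.step_rate = step_rate
        self.kwargs = kwargs


def pin_step_rate(env_cls, step_rate=20):
    """Return a constructor that passes step_rate unless the caller overrides it."""
    return partial(env_cls, step_rate=step_rate)


# Every environment built through make_env now gets step_rate=20 by default.
make_env = pin_step_rate(FakeEnv)
env = make_env("goseek-v0.1.0.x86_64", episode_length=400)
```

A keyword passed at call time still wins over the pinned value, so individual experiments can use a different rate.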
Yes, that seems to work. Hard to see how the rate can be related to the simulation though :nerd_face: Thanks!
Good job at making the interface! It has been fun looking for fruits. I have some feedback to the simulator though:
env.render()
.