carla-simulator / carla

Open-source simulator for autonomous driving research.
http://carla.org
MIT License

TCP Connection Error Always Breaks (Closed or Refused) #263

Closed bhairavmehta95 closed 6 years ago

bhairavmehta95 commented 6 years ago

I am running CARLA across 4 GPUs on a server, following the setup documentation, and using the instances to generate experience for a reinforcement learning agent.

My main issue is that during training, the server seems to close the connection (not at the beginning of training, but after approximately 12K timesteps), despite both the client and server timeouts being set to extremely high values. The interesting thing is that if I don't run this across multiple GPUs, the connection never seems to close.

My code used to look like this:

     self.client.start_episode(self.start_idx)
     measurements, sensor_data = self.client.read_data()

But I would always get a TCP error on the start_episode line. Based on the work done by NervanaSystems in their CARLA wrapper, I changed my code to look similar to theirs (i.e. reconnect if you get a TCP error on start_episode), but since the connection is either closed or refused, this also times out and then my environment crashes, which stops training for all of my agents.

    # Block until the episode is ready
    while True:
        try:
            self.client.start_episode(self.start_idx)
            break
        except TCPConnectionError:
            self.client.connect()
            print('TCP Connection error: {}'.format(self.port))
            time.sleep(1)

I am using 8 PPO workers, a synchronous RL algorithm. I know A3C as described in the paper could get around this problem by restarting the server and reconnecting the client without interrupting the training of the other agents, thanks to its asynchrony. Is there anything that can be done about this? I am not sure what else I could be doing to help with this problem, so I wanted to post it and see if anyone can spot incorrect logic in what I am doing. (This code lives in the reset function of my agent environment.)
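
For reference, a bounded variant of that retry loop might look like this (a minimal sketch; start_episode_with_retry is just an illustrative helper name, reusing the client, start index, and TCPConnectionError from the 0.8.x Python client above). It gives up after a few attempts so the caller can decide to restart the server instead of spinning forever:

    import time

    from carla.tcp import TCPConnectionError

    def start_episode_with_retry(client, start_idx, retries=5, wait=1.0):
        # Retry a few times, reconnecting between attempts, and give up after
        # `retries` failures so the caller can restart the server instead of
        # blocking the whole synchronous training loop forever.
        for attempt in range(retries):
            try:
                client.start_episode(start_idx)
                return
            except TCPConnectionError as error:
                print('TCP connection error on attempt {}: {}'.format(attempt + 1, error))
                time.sleep(wait)
                try:
                    client.connect()
                except TCPConnectionError:
                    pass  # server still unreachable; retry or give up on the next iteration
        raise TCPConnectionError('start_episode failed after {} attempts'.format(retries))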

YashBansod commented 6 years ago

@bhairavmehta95 Can you provide the exact error message that you are getting? Also, are you sure that it is an algorithmic problem and not just an issue of TCP/IP port allocation? Did you check whether simply changing the port number solves the problem? It may be that the default port (2000) set in CARLA is occupied by another program on the server.
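
A quick way to check whether a port is already occupied is a plain socket probe, for example (a minimal sketch using only the Python standard library; the port numbers listed are just examples):

    import socket

    def port_is_free(port, host='127.0.0.1'):
        # connect_ex returns 0 when something is already listening on the port,
        # i.e. the port is taken by another server or program.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            return s.connect_ex((host, port)) != 0
        finally:
            s.close()

    for p in (2000, 9000):
        print(p, 'free' if port_is_free(p) else 'in use')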

bhairavmehta95 commented 6 years ago

Unfortunately, I don't believe it is just a port allocation problem, as I am only running the servers on free ports in the 9000+ range (not the default 2000).

nsubiron commented 6 years ago

@bhairavmehta95 Could you post the server log when a disconnect occurs?

bhairavmehta95 commented 6 years ago

[2018.03.06-15.55.43:219][  0]LogCarla: Player Vehicle        = Default
[2018.03.06-15.55.43:219][  0]LogCarla: Number Of Vehicles    = 0
[2018.03.06-15.55.43:219][  0]LogCarla: Number Of Pedestrians = 0
[2018.03.06-15.55.43:219][  0]LogCarla: Weather Id = 1
[2018.03.06-15.55.43:219][  0]LogCarla: Seed Vehicle Spawner = 30200
[2018.03.06-15.55.43:219][  0]LogCarla: Seed Pedestrian Spawner = 13085
[2018.03.06-15.55.43:219][  0]LogCarla: Found 15 available weather settings.
[2018.03.06-15.55.43:219][  0]LogCarla:   * 0 - Default
[2018.03.06-15.55.43:219][  0]LogCarla:   * 1 - ClearNoon
[2018.03.06-15.55.43:219][  0]LogCarla:   * 2 - CloudyNoon
[2018.03.06-15.55.43:219][  0]LogCarla:   * 3 - WetNoon
[2018.03.06-15.55.43:219][  0]LogCarla:   * 4 - WetCloudyNoon
[2018.03.06-15.55.43:219][  0]LogCarla:   * 5 - MidRainyNoon
[2018.03.06-15.55.43:219][  0]LogCarla:   * 6 - HardRainNoon
[2018.03.06-15.55.43:219][  0]LogCarla:   * 7 - SoftRainNoon
[2018.03.06-15.55.43:219][  0]LogCarla:   * 8 - ClearSunset
[2018.03.06-15.55.43:219][  0]LogCarla:   * 9 - CloudySunset
[2018.03.06-15.55.43:219][  0]LogCarla:   * 10 - WetSunset
[2018.03.06-15.55.43:219][  0]LogCarla:   * 11 - WetCloudySunset
[2018.03.06-15.55.43:219][  0]LogCarla:   * 12 - MidRainSunset
[2018.03.06-15.55.43:219][  0]LogCarla:   * 13 - HardRainSunset
[2018.03.06-15.55.43:219][  0]LogCarla:   * 14 - SoftRainSunset
[2018.03.06-15.55.43:219][  0]LogCarla: [CARLA/SceneCapture]
[2018.03.06-15.55.43:219][  0]LogCarla: Added 1 cameras.
[2018.03.06-15.55.43:219][  0]LogCarla: Semantic Segmentation = Disabled
[2018.03.06-15.55.43:219][  0]LogCarla: [CARLA/SceneCapture/CameraRGB]
[2018.03.06-15.55.43:219][  0]LogCarla: Image Size = 800x600
[2018.03.06-15.55.43:219][  0]LogCarla: FOV = 90.000000
[2018.03.06-15.55.43:219][  0]LogCarla: Camera Position = (X=30.000 Y=0.000 Z=130.000)
[2018.03.06-15.55.43:219][  0]LogCarla: Camera Rotation = (P=0.000000 Y=0.000000 R=0.000000)
[2018.03.06-15.55.43:219][  0]LogCarla: Post-Processing = SceneFinal
[2018.03.06-15.55.43:219][  0]LogCarla: ================================================================================
[2018.03.06-15.55.43:242][  0]LogWorld: Bringing up level for play took: 1.894903
[2018.03.06-15.55.43:243][  0]LogCarlaServer: Sending 83 available start positions
[2018.03.06-15.55.43:244][  0]LogCarlaServer: Episode start received: { StartIndex = 0 }
[2018.03.06-15.55.43:247][  0]LogCarla: Created capture camera 0 with postprocess "SceneFinal"
[2018.03.06-15.55.43:247][  0]LogCarla: Changing weather settings to "ClearNoon"
[2018.03.06-15.55.43:248][  0]LogCarlaServer: Ready to play, notifying client
[2018.03.06-15.55.43:253][  0]LogCarla: Found 83 positions for spawning vehicles
[2018.03.06-15.55.43:253][  0]LogCarla: Found 110 positions for spawning walkers at begin play.
[2018.03.06-15.55.43:253][  0]LogCarla: Found 42 positions for spawning walkers during game play.
[2018.03.06-15.55.43:254][  0]LogLoad: Took 7.264515 seconds to LoadMap(/Game/Maps/Town02)
[2018.03.06-15.55.44:241][  0]LogLoad: (Engine Initialization) Total time: 12.54 seconds
[2018.03.06-15.55.44:242][  0]LogExternalProfiler: No external profilers were discovered.  External profiling features will not be available.
[2018.03.06-15.55.44:262][  0]LogInit: Adding external input plugin.
[2018.03.06-15.55.44:439][  1]LogRenderer: Reallocating scene render targets to support 800x600 Format 10 NumSamples 1 (Frame:1).
[2018.03.06-15.55.46:189][  1]LogRenderer: Reallocating scene render targets to support 400x300 Format 10 NumSamples 1 (Frame:2).
[2018.03.06-15.55.47:258][  1]LogLinux: Setting swap interval to 'Immediate'

These are the last few lines; nothing out of the ordinary, it just doesn't move forward from here.

nsubiron commented 6 years ago

There is no disconnection in that fragment; if it stops printing output there, it usually means it's running normally. Are you sure there is no other client running on that same port?

Maybe you can find somewhere in the log a message similar to this:

ERROR: tcpserver 2002 : error reading message: End of file 
ERROR: tcpserver 2000 : error reading message: End of file 
[2018.03.06-16.38.00:413][501]LogCarlaServer: Warning: Client disconnected, server needs restart
[2018.03.06-16.38.00:413][501]LogCarlaServer: Restarting the level...

bhairavmehta95 commented 6 years ago

I could not, unfortunately that is the end of the log.

The other thing is that on the client side, I get errors that look like this:

Starting new episode...
Error during step, terminating episode early
Traceback (most recent call last):
  File "/home/bhairavm/ray/examples/carla/env.py", line 309, in step
    obs = self._step(action)
  File "/home/bhairavm/ray/examples/carla/env.py", line 344, in _step
    image, py_measurements = self._read_observation()
  File "/home/bhairavm/ray/examples/carla/env.py", line 429, in _read_observation
    measurements, sensor_data = self.client.read_data()
  File "/home/bhairavm/carla/PythonClient/carla/client.py", line 117, in read_data
    data = self._stream_client.read()
  File "/home/bhairavm/carla/PythonClient/carla/tcp.py", line 73, in read
    header = self._read_n(4)
  File "/home/bhairavm/carla/PythonClient/carla/tcp.py", line 91, in _read_n
    raise TCPConnectionError(self._logprefix + 'connection closed')
carla.tcp.TCPConnectionError: (localhost:58212) connection closed

which makes me believe that the connection was actually closed.

bhairavmehta95 commented 6 years ago

For now, I have just been restarting the server asynchronously so that my training does not crash the moment one TCP connection is closed / broken / refused, but I would also like to know if there is a way to keep the connection open more reliably.
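
Roughly, the per-worker restart looks like this (a minimal sketch; restart_server_and_reconnect is an illustrative helper, launch_cmd is a placeholder for whatever command already starts one server, and the CarlaClient(host, port, timeout=...) constructor is assumed from the 0.8.x Python client):

    import os
    import signal
    import subprocess
    import time

    from carla.client import CarlaClient

    def restart_server_and_reconnect(server_proc, launch_cmd, host, port, timeout=30):
        # Kill the hung / crashed server for this worker only, relaunch it, and
        # return a fresh (process, client) pair; the other workers keep training.
        # killpg assumes the server was started in its own process group (os.setsid).
        if server_proc is not None and server_proc.poll() is None:
            os.killpg(os.getpgid(server_proc.pid), signal.SIGKILL)
        # launch_cmd is whatever you already use to start one server; the exact
        # binary path and the flag that sets the world port depend on your setup.
        server_proc = subprocess.Popen(launch_cmd, preexec_fn=os.setsid)
        time.sleep(10)  # give UE4 time to load the map before connecting
        client = CarlaClient(host, port, timeout=timeout)  # timeout kwarg assumed from 0.8.x
        client.connect()
        return server_proc, client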

nsubiron commented 6 years ago

Can you see the server window or are you running off-screen?

Could it be that the server hangs? That would explain why it works when running a single instance of CARLA but crashes/hangs with multiple instances (running out of GPU memory or some other issue).

bhairavmehta95 commented 6 years ago

I'm running off-screen. I thought of that, but I'm running on a server with 256 GB of GPU memory, and I've tried to distribute the load equally across all four GPUs. I don't believe that is the case, as each card has 64 GB and each render process seems to take less than 1.5 GB.

Did your team ever have these issues when running the A3C experiments in the paper?

nsubiron commented 6 years ago

We saw crashes when several instances of CARLA were running on the same GPU (more than 4 or so), even with enough memory available, but we haven't observed issues like this one.

Ajk4 commented 6 years ago

I have similar problems.

I noticed that before the timeout exception is thrown, all the simulators freeze as if they were waiting for the next action in synchronous mode.

@nsubiron Could it be that the send-action request gets lost over the network and the Python client never attempts to send a new request? And that the simulator then freezes forever waiting for the next action because of synchronous mode?

nsubiron commented 6 years ago

I've seen this issue when there is some error at the GPU level; suddenly all the GPU-dependent apps freeze. If the simulators freeze for longer than the timeout (10s by default), the connection is closed. It's going to be very difficult to track down this bug, and it only seems to happen on some GPUs, usually when running more than one simulator.

Could it be that the send-action request gets lost over the network and the Python client never attempts to send a new request?

The client connects through TCP; a packet should not be lost without raising an error [1]. Though if there were very high latency, the error would be "connection timed out".

TairuiWang commented 6 years ago

Hi guys,

I had the same problem. I think it happens because the client is trying to read data from the server while the server is not ready yet. I solved this by adding a sleep before the client reads data from a new episode; 2-3 seconds works perfectly for me.
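
In code, the workaround is just something like this (a sketch, assuming the same client object and start index as above):

    import time

    client.start_episode(start_idx)
    time.sleep(3)  # give the server a couple of seconds to finish setting up the episode
    measurements, sensor_data = client.read_data()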

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

nsubiron commented 6 years ago

This should not be an issue in 0.9.0; the protocol should now be able to recover from a broken connection without restarting the whole simulation.