Synchronous mode - tick hangs

Vaan5 commented 4 years ago

Intro

I am running a simple script in synchronous mode which does the following:

spawn an ego vehicle
in a loop do:
- set velocity
- initiate tick
- wait to receive data
- store it.

Here is the code (i've simplified the synchronous mode example):

import carla
import queue

snapshots = queue.Queue()

def push_snapshot(snapshot):
    snapshots.put(snapshot)

def get_position(frame_id, vehicle_id):
    snapshot = snapshots.get(timeout=2)
    assert(frame_id == snapshot.frame)
    transform = snapshot.find(vehicle_id).get_transform()
    return transform.location.x, transform.location.y, transform.location.z

def main():
    sim_time = 0.0
    sim_time_end = 20.0
    delta_time = 0.05
    results = []

    client = carla.Client("127.0.0.1", 2000)
    world = client.get_world()

    # set mode to synchronous
    frame_id = world.apply_settings(carla.WorldSettings(True, False, delta_time))
    print("Synchronous mode will be applied in frame: {}".format(frame_id))

    # spawn car
    spectator = world.get_spectator()
    blueprint_library = world.get_blueprint_library()
    map_ = world.get_map()
    vehicles = blueprint_library.filter("vehicle")
    bp = vehicles[0]
    vehicle = world.spawn_actor(bp, map_.get_spawn_points()[0])
    world.tick()    # just so that I see it in the simulator

    world.on_tick(push_snapshot)

    cmd = [carla.command.ApplyVelocity(vehicle.id, carla.Vector3D(y=2.0))]    # constant for the minimal example
    while sim_time <= sim_time_end:
        client.apply_batch_sync(cmd)
        print("before tick")
        frame_id = world.tick()   # on newer versions raises an Exception from which you can't rly recover
        print("after tick {}".format(frame_id))
        sim_time += 0.05
        results.append((sim_time, get_position(frame_id, vehicle.id)))

    world.apply_settings(carla.WorldSettings(False, False, 0))
    vehicle.destroy()

if __name__ == '__main__':
    main()

Problem

The problem happens when running this multiple times, sometimes the tick call will just hang. I originally encountered this with the C++ API (0.9.7). After that, I was able to reproduce it with the PythonAPI on 0.9.7, 0.9.8, 0.9.9 as well.

In case of 0.9.7, the script would just hang forever. With the newer versions, an exception would be thrown because of the recently added tick timeout (#2556). But in my opinion this just hides the root problem.

In my opinion this is really a big shortcoming of Carla and the synchronous mode, and it basically makes the synchronous mode not that usable. If you want to have reproducible results, it is not possible as the frame (corresponding to the "hanging" tick call) would be dropped. The exception being thrown in newer versions due to a timeout is also not useful as you can't really recover from it.

I've already found a few related issues:

2556
2070 - sadly closed, but I can see that the author and @nsubiron also commented about the same problem. It seems like the issue was forgotten.
1803
2581
2038

So here are my questions:

Do you agree with me that this is a problem (and can be reproduced)?
Is anything already planned regarding the fix (as I can see in the linked issue that @nsubiron also suggests getting rid of the cause of the problem instead of applying workarounds?

Additional info

IMPORTANT: The described problem doesn't occur all the time. You might have to run the script multiple times to see it. I typically run it like:

FOR /L %v IN (1,1,50) DO python tick_hangs.py

Environment

OS: Win 10 Python: 3.7 Carla versions:

0.9.7 (locally built from 1f7669e4b308aa8e8f7612dde6666ba6eda14e9f) - tried both the python and C++ API
0.9.8 (official package) - tried only the python API
0.9.9 (official package) - tried only the python API

nsubiron commented 4 years ago

I totally agree, the synchronous mode should not drop images, that breaks significantly the functionality. I've been working on a fix cause this has been bothering me a lot lately, but so far I couldn't get anything that works well (doing it on my free time so couldn't put a lot of effort). The problem is that needs to work both for sync and async modes, and async mode needs to drop images, so it's not so easy to find a solution that works well for both.

hh0rva1h commented 3 years ago

@nsubiron Would be really great if you could escalate the priority of this issue among devs as this is a major problem for reinforcement learning research with Carla.

germanros1987 commented 3 years ago

@doterop This issue has high priority, could you and @bernatx take a look at this?

hh0rva1h commented 3 years ago

@germanros1987 Thanks for triaging this. We looked again into it last night and found a non documented way to handle it: @doterop added a timeout for ticks in march: https://github.com/carla-simulator/carla/pull/2556/commits/5d238b2921a9af2568f43a488e90eb7bb3451b21 so the following code works around the problem:

try:
    world.tick(0.1)
except:
    print("WARNING: tick not received")

It seems the tick is done in the simulator just that the client never gets the signal.

Maybe for the time being until this is fixed

In the API it should be documented that world.tick() takes a timeout argument
This issue / workaround should be documented in the sync mode documentation

Note that we also tested to recompile Carla with the workaround proposed in https://github.com/carla-simulator/carla/pull/2070 and it did not reliably workaround the issue.

nsubiron commented 3 years ago

I can comment on what I've found checking this not long ago.

The origin of the frame drops is here https://github.com/carla-simulator/carla/blob/baf43b07a607e20bc338ff86d8829aae49f7478c/LibCarla/source/carla/streaming/detail/tcp/ServerSession.cpp#L80-L83

That boolean is set to false when the completion callback is called and the socket starts accepting messages again. In async mode this is fine cause we cannot keep an infinite queue anyway (though it would be nice if the number of drops could be reduced), but in sync mode this can cause a deadlock. So I tried adding a double-buffer for messages, basically a queue of 2 elements, that should have solved the issue in sync mode since the server would hang until the client received the message and you have a buffer of 1 frame that compensates (since render is async you could still overlap), but to my surprise the frame drops still happened. I think one possible cause is that the callback is not called immediately after sending the message and somehow can be postponed, this is handled by the ASIO context, so even though the client already detected and forwarded the message to the API the server session is still marked as "_is_writing" when reading the next message.

doterop commented 3 years ago

Hi @Vaan5,

I've been struggling the last few days trying to reproduce this error. Although all points out in the direction that @nsubiron mentioned before, it would very handy for me to have a scenario where I can reproduce the error and therefore create a proper solution for both sync and async mode.

If you don't have this case, no worries. :)

Thanks!

Vaan5 commented 3 years ago

Hi, sry for the late reply.

Sadly i don't have a different scenario. The script i wrote in the first post is the smallest I could think of.

I tried it again yesterday with:

Carla 0.9.9.4 (downloaded from the github link) Python 3.7.7 Windows 10 (sry I don't have an ubuntu so I can't try it on one)

Here is exactly what i did:

Start Carla Server
Open console; add the carla egg to the PYTHONPATH variable
Run python tick_hangs.py // I named the script tick_hangs.py (i just copied the code from my first post into it - without any modifications) 100 times in a loop (FOR /L %v IN (1,1,100) DO python tick_hangs.py)

Out of the 100 runs, it happened once (0 times on the second try, 2 times on the third). So it still happens. Here is an output file of the 100 runs: out.txt (just search for RuntimeError or Error).

Edit: It also occurrs if you those setter-like functions are used instead of commands (like in #2812) - might come in handy for async mode checks.

hh0rva1h commented 3 years ago

@doterop Just run the following script:

import carla

# connect to Carla
client = carla.Client("localhost", 2000)
world = client.load_world("Town04")

# set fixed time step length
settings = world.get_settings()
settings.fixed_delta_seconds = 0.05
settings.synchronous_mode = True
settings.no_rendering_mode = True
world.apply_settings(settings)

# do ticks, will deadlock eventually (will timeout after 10 seconds)
while True:
    print("Issuing tick ...")
    world.tick()
    print("Tick successful")

For me it takes less than a minute to trigger the issue:

Issuing tick ...
Tick successful
Issuing tick ...
Tick successful
Issuing tick ...
Traceback (most recent call last):
  File "tick_hang.py", line 17, in <module>
    world.tick()
RuntimeError: time-out of 10000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:2000

How easy it is to trigger that issue might depend on the hardware: on my laptop with integrated graphics only this is very easy trigger, on very fast machines with dedicated gpu it can take long to trigger.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

hh0rva1h commented 3 years ago

@bernatx Where did this end up in? I failed to find the commit in master / dev, but the pull request ist closed?

hh0rva1h commented 3 years ago

@germanros1987 @bernatx I tested release 0.9.10 for the issue and it's still happening. Did the linked commit not make it into the release?

bernatx commented 3 years ago

@hh0rva1h I have been running the script several times, each time for 50 times running around 20 seconds, and never happened the problem. I have tried in Town03 and also in Town04. I'm using version 0.9.10.1 where we add a fix for a camera freeze we detected, but I don't think it is related to this. I will do more tests in Ubuntu also and let you know. In the meantime, can you check also that 0.9.10.1 version?

hh0rva1h commented 3 years ago

@bernatx Thanks for letting me know, I fetched 0.9.10.1 now, but it's still happening to me: https://youtu.be/X3WZ-tmltWo I guess 20 seconds are not long enough, it can take time until the problem occurs, also keep in mind that the problem is harder to trigger on beefy machines. I am reproducing this on an XPS 13 9360 from 2016, a dual core i5 machine with no dedicated graphics.

bernatx commented 3 years ago

ok, I have been able to reproduce using a package a let the script a bit more of time to run. It happens at random, but now I can try to check further. I will tell you what I find, thank you.

bernatx commented 3 years ago

Hi, I have opened a PR fixing this problem (https://github.com/carla-simulator/carla/pull/3394). It seems it was a race condition and a wrong management of a handler, that was queued instead of executed in real time.

hh0rva1h commented 3 years ago

@bernatx Tried the nightly build, looks good so far. Thanks very much!

nsubiron commented 3 years ago

Left some comments on #3394, it fixes this issue, but I think introduces other issues.

Vaan5 commented 3 years ago

Probably isn't of much help (as I'm quite late): tried it with 0.9.10 and I couldn't reproduce it anymore.

varunjammula commented 3 years ago

This behavior is not deterministic. The tick hang occurs most of the times but the intended behavior magically works only some times

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

egeonat commented 3 years ago

I'm running CARLA version 0.9.11 on Ubuntu and I am still experiencing this issue in synchronous mode with 1/20 fixed_delta_seconds. Is there any progress? Were the issues specified in the comments by @nsubiron fixed?

varunjammula commented 3 years ago

@egeonat Did you try the synchronous_mode in the PythonAPI examples folder? They changed the way the synchronous mode works, I think. If you follow the example, it should work and this issue will not happen again. (https://github.com/carla-simulator/carla/blob/master/PythonAPI/examples/synchronous_mode.py)

egeonat commented 3 years ago

@varunjammula I am trying to use CARLA for RL training so my code has some differences, but I have mostly followed the synchronous_mode.py example. I believe my problem might be slightly different than the original poster, as when I call world.tick() the call doesn't timeout. It simply hangs forever. That is why I was wondering if perhaps new issues were introduced to the simulator with the PR, as suggested by nsublron.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

hh0rva1h commented 2 years ago

@egeonat You can specify a timeout for world.tick, e.g. world.tick(1), then it wont simpy hang forever. You can catch that with a try block.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Julbak commented 1 year ago

Jumped to the version 0.9.14 and managed to hang the server, task manager and closing the process works. Exactly the same code worked with 0.9.13. I'm using the quick start package installation as the other option was painful beyond anything.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

SExpert12 commented 5 months ago

Intro

I am running a simple script in synchronous mode which does the following:

spawn an ego vehicle

in a loop do:

set velocity

initiate tick

wait to receive data

store it.

Here is the code (i've simplified the synchronous mode example):
import carla
import queue

snapshots = queue.Queue()

def push_snapshot(snapshot):
    snapshots.put(snapshot)

def get_position(frame_id, vehicle_id):
    snapshot = snapshots.get(timeout=2)
    assert(frame_id == snapshot.frame)
    transform = snapshot.find(vehicle_id).get_transform()
    return transform.location.x, transform.location.y, transform.location.z

def main():
    sim_time = 0.0
    sim_time_end = 20.0
    delta_time = 0.05
    results = []

    client = carla.Client("127.0.0.1", 2000)
    world = client.get_world()

    # set mode to synchronous
    frame_id = world.apply_settings(carla.WorldSettings(True, False, delta_time))
    print("Synchronous mode will be applied in frame: {}".format(frame_id))

    # spawn car
    spectator = world.get_spectator()
    blueprint_library = world.get_blueprint_library()
    map_ = world.get_map()
    vehicles = blueprint_library.filter("vehicle")
    bp = vehicles[0]
    vehicle = world.spawn_actor(bp, map_.get_spawn_points()[0])
    world.tick()    # just so that I see it in the simulator

    world.on_tick(push_snapshot)

    cmd = [carla.command.ApplyVelocity(vehicle.id, carla.Vector3D(y=2.0))]    # constant for the minimal example
    while sim_time <= sim_time_end:
        client.apply_batch_sync(cmd)
        print("before tick")
        frame_id = world.tick()   # on newer versions raises an Exception from which you can't rly recover
        print("after tick {}".format(frame_id))
        sim_time += 0.05
        results.append((sim_time, get_position(frame_id, vehicle.id)))

    world.apply_settings(carla.WorldSettings(False, False, 0))
    vehicle.destroy()

if __name__ == '__main__':
    main()
Problem

The problem happens when running this multiple times, sometimes the tick call will just hang. I originally encountered this with the C++ API (0.9.7). After that, I was able to reproduce it with the PythonAPI on 0.9.7, 0.9.8, 0.9.9 as well.

In case of 0.9.7, the script would just hang forever. With the newer versions, an exception would be thrown because of the recently added tick timeout (#2556). But in my opinion this just hides the root problem.

In my opinion this is really a big shortcoming of Carla and the synchronous mode, and it basically makes the synchronous mode not that usable. If you want to have reproducible results, it is not possible as the frame (corresponding to the "hanging" tick call) would be dropped. The exception being thrown in newer versions due to a timeout is also not useful as you can't really recover from it.

I've already found a few related issues:

Added timeout to World Tick #2556

Improve sync mode performance #2070 - sadly closed, but I can see that the author and @nsubiron also commented about the same problem. It seems like the issue was forgotten.

Fixes for synchronous mode #1803

Synchronous_mode gets stuck in RL training #2581

Synchronous_mode gets stuck in endless loop #2038

So here are my questions:

Do you agree with me that this is a problem (and can be reproduced)?

Is anything already planned regarding the fix (as I can see in the linked issue that @nsubiron also suggests getting rid of the cause of the problem instead of applying workarounds?

Additional info

IMPORTANT: The described problem doesn't occur all the time. You might have to run the script multiple times to see it. I typically run it like:
FOR /L %v IN (1,1,50) DO python tick_hangs.py
Environment

OS: Win 10 Python: 3.7 Carla versions:

0.9.7 (locally built from 1f7669e) - tried both the python and C++ API

0.9.8 (official package) - tried only the python API

0.9.9 (official package) - tried only the python API

Hi,

I want to retrieve data after every 10ms in my code. In this code I am not able to understand that why there is world_tick() and world_on_tick() after the vehicle spawn statement is written? Can any one help me to understand?

How to write code if I need to update vehicle location after every 10 ms, so I can change the car crossing order?

Thanks in advance

carla-simulator / carla