colobot / colobot

Source code of open-source Colobot: Gold Edition project developed by Epsitec and TerranovaTeam
http://colobot.info/
GNU General Public License v3.0
1.12k stars 191 forks source link

goto() doesn't report reaching to destination point in 32-bit binaries. #844

Closed ghost closed 7 years ago

ghost commented 8 years ago

Possibly recent changes in CBOT engine broke something. The bot just grabs the BlackBox and fly to first checkpoint. Then just flies in circle instead flying through next checkpoints until his PowerCell run out of energy.

His script (data/ai/run1.txt) has never changed through GOLD development.

melex750 commented 8 years ago

I can't reproduce this, whether compiling in Release mode or Debug. I think it's more likely agoto()bug.

If I could reproduce it, I'd edit mission 4.3 scene.txt to make the rogue bot selectable and check if the program gets stuck on thegoto()command.

This seems to reproduce it on any winged bot.

extern void object::New()
{
    point pos = this.position;

    goto(pos, pos.z, 1, 1);
}
ghost commented 8 years ago

Indeed it stuck on the goto() command:

schowek01

tomangelo2 commented 8 years ago

Happens every time for me, and it's clearly goto() bug, it stuck on first checkpoint. It looks like changing

goto(circuit[i], circuit[i].z, 1,1);

to

goto(circuit[i], circuit[i].z, 0,1);

which cause bot to stop at every checkpoint fixes goto behavior. edit: I assume that this and issue #845 are both caused by broken checking if bot already reached destination point.

tomangelo2 commented 8 years ago

This and #845 are caused by goto doesn't reporting that it already reached destination. It looks like it occur only on 32-bit compilations, as every 64-bit compiler (Windows & Linux, Release & Debug) works fine.

tomangelo2 commented 7 years ago

I've done some investigation in there. It fails in this function, as long as dist > 10 everything is ok, but when we come close enough m_lastDistance variable suddenly reset itself to 1000 every frame, making one of conditions of if statement false, thus not aborting it. When robot finally gets far enough, everything goes back to normal, m_lastDistance is calculated correctly.. And again, when robot is close enough it suddenly reset. This variable only resets itself at line 745, but as far as I understand function Start should be called only at start of goto() function, but it's called every time we're close to goal. There is one more thing, it fix itself when we print dist variable to console. Other variables doesn't fix that. Does it prevent some optimizations by compiler?

melex750 commented 7 years ago

I forgot I had this on my old laptop, seems to work ok.

--- a/src/object/task/taskgoto.cpp
+++ b/src/object/task/taskgoto.cpp
@@ -916,7 +916,10 @@ Error CTaskGoto::IsEnded()
     if ( m_goalMode == TGG_EXPRESS )
     {
         dist = Math::DistanceProjected(m_goal, pos);
-        if ( dist < 10.0f && dist > m_lastDistance )
+        float margin = 10.0f;
+        if ( m_object->Implements(ObjectInterfaceType::Flying) ) margin = 20.0f;
+
+        if ( dist < margin && dist > m_lastDistance )
         {
             return ERR_STOP;
         }

Maybemargincould be adjusted depending on current speed of the bot.

tomangelo2 commented 7 years ago

Well, that doesn't work for me. Maybe it fix other compilers behavior.

I've figured out marking dist variable as volatile seems to fix that, at least in my case.

tomangelo2 commented 7 years ago

I've applied both fixes, it shouldn't happen anymore.

immibis commented 5 years ago

Sounds like a memory corruption bug. I wouldn't trust this fix.

hexagonrecursion commented 3 months ago

Technical info backfill: I was able to reproduce this issue when running the mission 4.3 in colobot-gold-0.1.10 downloaded from https://colobot.info/files/releases/0.1.10-alpha/colobot-gold-0.1.10-alpha-win32.zip under wine

TODO: find the real root cause of the bug

hexagonrecursion commented 3 months ago

Progress: I figured out how to compile the game in a way that reproduces this bug. It turned out that I needed to compile it in release mode

melex750 commented 3 months ago

Have you reproduced on more recent versions of the game?

hexagonrecursion commented 3 months ago

More recent releases contain the "fix" that hides the problem by adding volatile. I want to figure out the actual root cause of this because the bug may still be present. Ultimately I want to be able to remove the unnecessary volatile

melex750 commented 3 months ago

There is a clue about the root cause in the diff I posted above (7 years ago). It was usually more reproducible on a very slow computer because the bot would pass the destination before checking if it was near the destination.

hexagonrecursion commented 3 months ago

There is a clue about the root cause in the diff I posted above (7 years ago). It was usually more reproducible on a very slow computer because the bot would pass the destination before checking if it was near the destination.

Slow computer? It looks like there were two bugs.

  1. One was fixed by you (goto() overshooting if the speed is high and frames per second is low) (the slow computer case you described)
  2. The other was described by tomangelo2 "when we come close enough m_lastDistance variable suddenly reset itself to 1000"

I am troubleshooting the second bug

hexagonrecursion commented 3 months ago

I've done some investigation in there. It fails in this function, as long as dist > 10 everything is ok, but when we come close enough m_lastDistance variable suddenly reset itself to 1000 every frame, making one of conditions of if statement false, thus not aborting it. When robot finally gets far enough, everything goes back to normal, m_lastDistance is calculated correctly.. And again, when robot is close enough it suddenly reset. This variable only resets itself at line 745, but as far as I understand function Start should be called only at start of goto() function, but it's called every time we're close to goal. There is one more thing, it fix itself when we print dist variable to console. Other variables doesn't fix that. Does it prevent some optimizations by compiler?

  1. I have set a breakpoint on the first line of CTaskGoto::Start(). Once I get close enough to the destination I start hitting the breakpoint every couple of frames. After each hit m_lastDistance becomes 1000. It appears CTaskGoto::Start() is re-executed for some reason
  2. CTaskGoto is a foreground task started by CBot. It is supposed to be terminated by WaitForForegroundTask(). When we get close to the destination, CTaskExecutorObjectImpl::EndedTask() calls CTaskGoto::IsEnded() and gets ERR_STOP. This causes the task to end before CScriptFunctions::rGoto() detects that it ended. CScriptFunctions::rGoto() gets confused and starts the task again.
hexagonrecursion commented 3 months ago

I have stared into assembly and assembly stared back into me :squid: X__X :octopus:

What is supposed to happen:

  1. The bot moves
  2. CScriptFunctions::rGoto() calls WaitForForegroundTask() which calls CTaskGoto::IsEnded()
    • If CTaskGoto::IsEnded() returns ERR_STOP, we remove the foreground task and return from goto()
    • Otherwise we go to step 3
  3. CTaskExecutorObjectImpl::EndedTask() calls CTaskGoto::IsEnded() and gets ERR_CONTINUE because the position of the bot stayed the same since the step 2
  4. Go back to 1

What actually happens:

  1. The bot moves
  2. CScriptFunctions::rGoto() calls WaitForForegroundTask() which calls CTaskGoto::IsEnded()
    • dist > m_lastDistance is false because we moved closer to the destination
    • CTaskGoto::IsEnded() stores m_lastDistance = dist
    • CTaskGoto::IsEnded() returns ERR_CONTINUE
  3. CTaskExecutorObjectImpl::EndedTask() calls CTaskGoto::IsEnded()
    • We now expect dist > m_lastDistance to evaluate to false because we expect m_lastDistance == dist
    • Unfortunately m_lastDistance is a 32 bit float, but dist is an 80 bit long double due to compiler optimizations:

      For legacy x86 processors without SSE2 support, and for m68080 processors, GCC is only able to fully comply with IEEE 754 semantics for the IEEE double extended (long double) type. Operations on IEEE double precision and IEEE single precision values are performed using double extended precision.

    • dist > m_lastDistance to evaluates to true because dist is very slightly bigger than m_lastDistance
    • CTaskGoto::IsEnded() returns ERR_STOP
  4. CTaskExecutorObjectImpl::EndedTask() ends the task
  5. CScriptFunctions::rGoto() thinks the task has not been started yet and starts a new task
hexagonrecursion commented 3 months ago

Here is a godbolt link that demonstrates that sometimes a * a > a * a can evaluate to true,l:'5',n:'1',o:'C%2B%2B+source+%231',t:'0')),k:35.758060860248435,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:executor,i:(argsPanelShown:'1',compilationPanelShown:'0',compiler:g141,compilerName:'',compilerOutShown:'0',execArgs:'',execStdin:'',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O2+-m32',overrides:!(),runtimeTools:!(),source:1,stdinPanelShown:'1',tree:0,wrap:'1'),l:'5',n:'0',o:'Executor+x86-64+gcc+14.1+(C%2B%2B,+Editor+%231)',t:'0')),header:(),k:31.30793615582513,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:g141,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'1',intel:'1',libraryCode:'0',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!(),options:'-O2+-m32',overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+x86-64+gcc+14.1+(Editor+%231)',t:'0')),header:(),k:32.93400298392647,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4)

hexagonrecursion commented 3 months ago

This only affects 32 bit targets because on 64 bit targets gcc defaults to -mfpmath=sse

hexagonrecursion commented 3 months ago

It was usually more reproducible on a very slow computer because the bot would pass the destination before checking if it was near the destination.

@melex750 Do you recall what observations led you to believe this? I don't think a slow computer can explain the problem.

  1. By using the two scripts below I have measured the maximum horizontal speed of a flying bot: 12.72 meters per second
  2. The margin used to be 10 meters for all bots. If the bot passed the destination before checking, it had to have moved more than 6.6666... meters in a single frame (see math below). For the bot to overshoot the 10 meter radius the game would have to be running at less than 2 frames per second (nearly unplayable). Is this the "slow computer" scenario you experienced?

In my observations increasing the margin to 20 meters hides the bug rather than fixing it: CTaskGoto is stopped and restarted over and over until it gets lucky and survives long enough for CScriptFunctions::rGoto() to detect that the task has ended

extern void object::SpeedDisplay()
{
    object o = radar(WingedGrabber);
    float maxv = 0;
    while(true)
    {
        point p1 = o.position;
        float t1 = abstime();
        wait(1);
        point p2 = o.position;
        float t2 = abstime();
        float v = distance2d(p1, p2) / (t2 - t1);
        if(v > maxv) maxv = v;
        message(maxv);
    }
}
extern void object::WingedBot()
{
    point p = position;
    p.x += 300;
    goto(p);
}

Math

  1. Let the x axis pass through the starting point and the destination
  2. Let the 0 be the destination
  3. Let the direction of the axis be from the source to the destination
  4. Margin is 10 meters for all bots
  5. Let C be the point where the bot is when it overshoots the margin; C >= 10
  6. Let B be the point where the bot was one frame ago
  7. Let A be the point where the bot was two frames ago

What is the closest B can be to C? I.e. what is the smallest distance that the bot has to fly in a single tick to overshoot the margin?

Let's consider three scenarios

  1. B <= 0
  2. B > 0 and A <= -10
  3. B > 0 and A > -10

1) B <= 0; C >= 10; the shortest possible distance between B and C is 10 meters. 2) A <= -10; C >= 10; given a constant-ish speed, B is about half way between the two; the shortest possible distance between B and C is about 10 meters. 3) A is inside the margin and B is too

6.6666... < 10 therefore the answer is 6.6666...

melex750 commented 3 months ago

I recall believing it might two different bugs like you mentioned previously. I actually found the overshooting behavior on a different level and reproduced it with a minimal CBot program (see second comment near the top from 8 years ago). I found when goto tested the distance to destination when it was just over the 10 meter margin it would overshoot seemingly before it checked again. I didn't really believe it happened because of a slow computer, but I couldn't reproduce it on a faster one.

hexagonrecursion commented 3 months ago

Was it running at 2 frames per second? If not, I do not see how it could have possibly overshot the 10 meter margin

melex750 commented 3 months ago

I'm not sure about the FPS, it was almost unplayable except with the graphics settings set to minimum and 800x600 resolution. It was my nephew's first computer and we were creating a userlevel together. I have it still in storage, maybe I'll dust it off and check it out.

hexagonrecursion commented 3 months ago

Once I write a numerically stable fix you can test it.

melex750 commented 3 months ago

Looking forward to it.

hexagonrecursion commented 3 months ago

Math

I was confused. The margin was not 10 meters. It was 2.5 meters. I forgot that we use g_unit=4 to convert between in-game coordinates and in-game coordinates.

  1. 2.5 / 1.5 = 1.6666... meters in a single frame
  2. At fps below about 7.632 there is a risk of overshooting the margin of 2.5 meters. Raising the margin to 5 meters means we can guarantee goto will not overshhot unless fps is below 3
hexagonrecursion commented 3 months ago

This seems to reproduce it on any winged bot.

extern void object::New()
{
    point pos = this.position;

    goto(pos, pos.z, 1, 1);
}

This is actually a different bug entirely and it was never fixed at all (even though this specific test case no longer reproduces it). The bug is caused by goto() telling the winged bot to fly forward and turn (instead of just turning) when it is facing more than 90 degrees away from the destination. This results in the bot orbiting the destination.

This test case reproduces the orbiting in gold-0.2.1:

extern void object::WingedBot()
{
    point pos = position;
    pos.x += 10;
    turn(direction(pos));
    pos = position;
    pos.x -= 6;
    message("goto(");
    goto(pos, pos.z, 1, 1);
    message(")");
}

The reason your test case no longer reproduces the bug is subtle:

  1. Telling bot A to go to a position that is occupied by bot B tells bot A to go to a point 2.675 meters behind bot B. This happens even A and B are the same bot. Note: the position is not updated if bot B moves
  2. 2.675 is outside the old margin of 2.5 meters, but is inside the new margin of 5 meters. The bot starts to move forward, detects that it is inside the 5 meter margin and moving away from the destination and stops.
  3. Using an offset bigger than 5 (e.g. 6 in my test case) still reproduces the bug
hexagonrecursion commented 3 months ago

In my observations increasing the margin to 20 meters hides the bug rather than fixing it: CTaskGoto is stopped and restarted over and over until it gets lucky and survives long enough for CScriptFunctions::rGoto() to detect that the task has ended

This is incorrect. Here is the real cause and effect chain:

  1. Once the bot is inside the margin, it essentially flips a coin each frame (because m_lastDistance = dist rounds to nearest). 50% of the time it restarts CTaskGoto.
  2. If CTaskGoto is restarted while we are wary close to the destination, the bot thinks we went it to go to its own power slot. This causes orbiting (see above). Note: the destination point is only calculated when CTaskGoto is started and is not updated when the bot moves

1) The use of volatile hides the bug by forcing dist to be a 32 bit float instead of 80 bit long double 2) Increasing the margin to 5 meters hides the bug because 2.675 < 5 3) 64 bit platforms are immune because there gcc defaults to -mfpmath=sse