Step over misbehaves with multiple threads on Linux

Quuxplusone commented 4 years ago


Bugzilla Link	PR45642
Status	NEW
Importance	P normal
Reported by	Jaroslav Sevcik (jarin@google.com)
Reported on	2020-04-22 16:07:59 -0700
Last modified on	2020-05-03 14:01:24 -0700
Version	10.0
Hardware	PC Linux
CC	jdevlieghere@apple.com, jingham@apple.com, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments	`lldb-log.txt` (125318 bytes, text/plain) `lldb-log.txt` (125318 bytes, text/plain) `lldb-log-sigtrap2.txt` (9096 bytes, text/plain) `dont-select-trace-on-stop.diff` (1708 bytes, text/plain) `file_45642.txt` (79160 bytes, text/plain)
Blocks
Blocked by
See also

Stepping over a line can report a spurious SIGTRAP when the thread hits a real
(public) breakpoint while another thread hits the step-over plan's private
breakpoint.

A repro is shown below (it is not 100% deterministic, but seems to repro in
more than 50% of cases). We have two threads; with one thread, we step over a
function call and hits a public breakpoint there, the other thread hits a
private breakpoint that was placed by the first thread's step-over plan (more
precisely, by its step-out child plan). When the debugger stops, it selects
that thread that hit the private breakpoint and confusingly reports SIGTRAP.

The repro console session:

$ cat u.cc
#include <thread>

void g() {
  printf(".");
}

void f() {
  while (true) {
    g();  // Break here, continue, then step over twice.
  }
}

int main() {
  std::thread t1(f);
  std::thread t2(f);

  t1.join();
  t2.join();
  return 0;
}

$ clang++ -O0 -g -pthread u.cc
$ lldb --version
lldb version 11.0.0
  clang revision cfd235e6fc7a2306644ee63bfea310d79084ef66
  llvm revision cfd235e6fc7a2306644ee63bfea310d79084ef66
$ lldb a.out
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) b u.cc:9
Breakpoint 1: where = a.out`f() + 9 at u.cc:9:5, address = 0x00000000004011f9
(lldb) r
Process 229470 launched: 'a.out' (x86_64)
Process 229470 stopped
* thread #2, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x00000000004011f9 a.out`f() at u.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();  // Break here, then step over.
   10     }
   11   }
   12
(lldb) c
Process 229470 resuming
Process 229470 stopped
* thread #3, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x00000000004011f9 a.out`f() at u.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();  // Break here, then step over.
   10     }
   11   }
   12
(lldb) n
Process 229470 stopped
* thread #3, name = 'a.out', stop reason = step over
    frame #0: 0x00000000004011fe a.out`f() at u.cc:8:3
   5    }
   6
   7    void f() {
-> 8      while (true) {
   9        g();  // Break here, then step over.
   10     }
   11   }
(lldb) n
Process 229470 stopped
* thread #2, name = 'a.out', stop reason = signal SIGTRAP
    frame #0: 0x00000000004011fe a.out`f() at u.cc:8:3
   5    }
   6
   7    void f() {
-> 8      while (true) {
   9        g();  // Break here, then step over.
   10     }
   11   }
  thread #3, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x00000000004011f9 a.out`f() at u.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();  // Break here, then step over.
   10     }
   11   }
   12

There is also a related problem where step-over seemingly steps into a
function. This happens when another thread hits a public breakpoint at the same
time as the step over plan single-steps.

$ lldb a.out
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) b t.cc:9
Breakpoint 1: where = a.out`f() + 9 at t.cc:9:5, address = 0x0000000000401209
(lldb) r
Process 193890 launched: 'jarin/a.out' (x86_64)
Process 193890 stopped
* thread #2, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x0000000000401209 a.out`f() at t.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();
   10     }
   11   }
   12
(lldb) n
Process 193890 stopped
* thread #2, name = 'a.out', stop reason = step over
    frame #0: 0x000000000040120e a.out`f() at t.cc:8:3
   5    }
   6
   7    void f() {
-> 8      while (true) {
   9        g();
   10     }
   11   }
  thread #3, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x0000000000401209 a.out`f() at t.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();
   10     }
   11   }
   12
(lldb) n
Process 193890 stopped
* thread #2, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x0000000000401209 a.out`f() at t.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();
   10     }
   11   }
   12
  thread #3, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x0000000000401209 a.out`f() at t.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();
   10     }
   11   }
   12
(lldb)
Process 193890 stopped
* thread #2, name = 'a.out', stop reason = step over
    frame #0: 0x000000000040120e a.out`f() at t.cc:8:3
   5    }
   6
   7    void f() {
-> 8      while (true) {
   9        g();
   10     }
   11   }
(lldb) n
Process 193890 stopped
* thread #2, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x0000000000401209 a.out`f() at t.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();
   10     }
   11   }
   12
  thread #3, name = 'a.out', stop reason = signal SIGTRAP
    frame #0: 0x000000000040120e a.out`f() at t.cc:8:3
   5    }
   6
   7    void f() {
-> 8      while (true) {
   9        g();
   10     }
   11   }
(lldb) n
Process 193890 stopped
* thread #2, name = 'a.out', stop reason = trace
    frame #0: 0x00000000004011d0 a.out`g() at t.cc:3
   1    #include <thread>
   2
-> 3    void g() {
   4      printf(".");
   5    }
   6
   7    void f() {
  thread #3, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x0000000000401209 a.out`f() at t.cc:9:5
   6
   7    void f() {
   8      while (true) {
-> 9        g();
   10     }
   11   }
   12

Quuxplusone commented 4 years ago

I can't reproduce any of this behavior with macOS.  I ran a little script that
kept stepping the currently selected thread after each stop, did that 100000 or
so times and the stop reason was always either none, breakpoint or plan
complete.

If you can run this example with:

log enable -f /tmp/lldb-log.txt lldb step
log enable -f /tmp/lldb-log.txt gdb-remote packets

and attach that file, that might show what's going wrong here.

SIGTRAP is the raw stop reason you get for stopping at a breakpoint.  We
convert that to an eStopReasonBreakpoint when we recognize the stop instruction
as one where we've inserted a breakpoint.  So it is curious we aren't
recognizing the breakpoint location we set.  But the logs might clear that up.

The fact that we print the "trace" in your second example is not, I think, a
functional problem.  When a process stops for a "real" reason, for instance
thread A completes a step over, then if you ask thread B why it stopped, we
just tell you the reason.  In your case it looks like the single-step over a
breakpoint and the hit of another breakpoint were reported at the same time, so
you saw both.

There isn't a mechanism to have threads suppress their stop reasons because
they are for internal purposes.  We could add a filter like that, but we'd have
to be careful not to hide useful information from people.

Quuxplusone commented 4 years ago

Attached lldb-log.txt (125318 bytes, text/plain): Lldb session log of the SIGTRAP problem

Quuxplusone commented 4 years ago

Attached lldb-log.txt (125318 bytes, text/plain): Lldb session log of the SIGTRAP problem

Quuxplusone commented 4 years ago

Thank you for looking at this, Jim!

While the stop reason indeed seems wrong, I was mostly objecting to the selection of the current thread after the stop - users expect the current thread to be the one that caused the stop (either by finishing its plan or by triggering a "real" breakpoint). In both cases I reported, LLDB chooses the thread that stopped on an internal breakpoint to be the current thread after the stop. We had several users reporting that step-over does not work in the debugger when this happens to them.

(I am sorry for double-posting the log - it is identical.)

Quuxplusone commented 4 years ago

Regarding the suppression of internal breakpoint stops - I do not think it is necessary as long as the user has a way to distinguish between a real stop reason and an internal stop reason.

Quuxplusone commented 4 years ago

I don't think the prioritization is actually wrong. lldb has to pick one "currently selected thread" on stop, I'm not sure what it would look like to have two "currently selected threads"... When two threads stop for a reason, we show both, but we have to applying some heuristics as to which one to select.

Signals are "more extraordinary" than breakpoint hits, and are unexpected, so lldb prioritizes a thread that stops with a signal over one that hits a breakpoint. The fact that the SIGTRAP wasn't really an independent signal in this sense but was a misidentified breakpoint hit makes the data the decision was made on bad, not the decision. If it HAD been a real SIGTRAP - e.g. one of your threads hit a __builtin_trap - that would be the thread you wanted to see first.

We don't currently classify stop reasons as internal or external. So there's no way to show that currently. If we do the work to add this bit, then it wouldn't make any sense not to go all the way and suppress the internal ones. That should be a separate enhancement request, however. If it gets folded into this report it will get lost.

Thanks for the log, I'll take a look and see if I can see what's up.

Quuxplusone commented 4 years ago

> I don't think the prioritization is actually wrong.  lldb has to pick one
> "currently selected thread" on stop, I'm not sure what it would look like to
> have two "currently selected threads"...  When two threads stop for a
> reason, we show both, but we have to  applying some heuristics as to which
> one to select.

I have not suggested to select two threads, I just said lldb should avoid
selecting threads that stopped on internal breakpoints. Our users coming from
Visual Studio consider our stepping-over logic broken because of this behavior.

> Signals are "more extraordinary" than breakpoint hits, and are unexpected,
> so lldb prioritizes a thread that stops with a signal over one that hits a
> breakpoint.  The fact that the SIGTRAP wasn't really an independent signal
> in this sense but was a misidentified breakpoint hit makes the data the
> decision was made on bad, not the decision.  If it HAD been a real SIGTRAP -
> e.g. one of your threads hit a __builtin_trap - that would be the thread you
> wanted to see first.

I agree that between a real signal and a breakpoint, lldb should select the
signal. What you seem to be saying is that we should not have classified the
stop on other thread's internal breakpoint as a signal. If that's the case then
we are in agreement.

Strangely enough, if you look at the second-to-last step in the second example
from my report (where step-over seemingly steps-in), lldb stopped on a
SIGTRAP/misinterpreted-internal-breakpoint and on a real breakpoint and chose
the breakpoint rather than the SIGTRAP. From my experiments, it looks like when
given a choice between a breakpoint hit and a signal, lldb will simply choose
the one with the lower thread index. This seems to be what
https://github.com/llvm/llvm-
project/blob/3d178581ac7f5336b1ac75e31001de074ecca937/lldb/source/Target/Process.cpp#L892
is doing.

> We don't currently classify stop reasons as internal or external.  So
> there's no way to show that currently.  If we do the work to add this bit,
> then it wouldn't make any sense not to go all the way and suppress the
> internal ones.  That should be a separate enhancement request, however.  If
> it gets folded into this report it will get lost.

As already noted above, this seems to be an important problem to me - if lldb
stops on an internal breakpoint and selects the thread that stopped on that
breakpoint (without finishing any plan), it is very confusing for users. It
just looks like a spurious stop. To understand the current behavior, the users
would have to know how lldb implements stepping.

Quuxplusone commented 4 years ago

Humm. That does look wrong.

The eStopReasonSignal case should NOT check whether other_thread is set. That way it would be prioritized over any of the other "other reasons". I could swear I wrote it that way originally, but this sort of thing is really hard to write tests for, so it's not entirely surprising that it isn't quite right...

BTW we could also add another bucket for signal threads specifically. That would make the logic clearer.

It also looks like I misremembered, and we prioritize a plan complete thread over a signal thread. I'm not sure I agree with older myself about that.

But note, this is all about what thread we select, not what we show in the stop notification. We still list all the threads that stopped for a reason in the stop notification in index order. The code you point to is just about what we set the "selected thread" to, not what we print in the stop notification. It seems to me you are more concerned about the thread listing and not the selected thread.

We won't ever show the user a stop where all the threads that stopped for a reason are stopping for internal plan related jobs. There will always be one "real" stop. But we could do a better job of hiding threads that aren't interesting to users when we stop.

None of this would be terribly hard to do. Are you interested in working on this, or just on having it fixed? If the former, I'd be happy to help you out.

Quuxplusone commented 4 years ago

> BTW we could also add another bucket for signal threads specifically.  That
> would make the logic clearer.

Right, although that feels like another feature request.

> It also looks like I misremembered, and we prioritize a plan complete thread
> over a signal thread.  I'm not sure I agree with older myself about that.

FWIW, I tend to agree with the new you.

> But note, this is all about what thread we select, not what we show in the
> stop notification.  We still list all the threads that stopped for a reason
> in the stop notification in index order.  The code you point to is just
> about what we set the "selected thread" to, not what we print in the stop
> notification.  It seems to me you are more concerned about the thread
> listing and not the selected thread.

I actually thought the problem is just the selected thread, but you (and a
couple of days staring at the debugger) convinced me that we are looking at two
different problems.

> We won't ever show the user a stop where all the threads that stopped for a
> reason are stopping for internal plan related jobs.  There will always be
> one "real" stop.  But we could do a better job of hiding threads that aren't
> interesting to users when we stop.

Better thread selection would solve the step-over-step-in problem. I am still
wondering why lldb considers threads with the "trace" stop reason to be
selected even if they do not complete a plan. I tried to ignore such threads,
the test suite still passes and I do not see the spurious step-ins anymore. I
will attach the trivial patch with that change.

The situation with SIGTRAP is interesting - after some debugging, I understand
a bit better what is going on. The problem happens when one thread is stopped
on a breakpoint, then we remove that breakpoint and do a single step on some
other thread. The single step does not change the stop info of the first thread
- it still believes it is parked on a breakpoint. However, since the breakpoint
has been removed in the meantime, lldb will be confused about that situation
(https://github.com/llvm/llvm-
project/blob/f30416fdde922eaa655934e050026930fefbd260/lldb/source/Plugins/Process/gdb-
remote/ProcessGDBRemote.cpp#L1830) and will report SIGTRAP. I am attaching a
log that illustrates this on a simple example.

It is not quite clear to me how to handle the situation of removing a
breakpoint while a thread is standing on it. I could imagine changing the stop
info in lldb-server to start reporting some other stop info when the breakpoint
gets removed (not sure which one), but there might be other solutions. Thoughts?

> None of this would be terribly hard to do.  Are you interested in working on
> this, or just on having it fixed?  If the former, I'd be happy to help you
> out.

It would be great if somebody else fixed the issue; however, if no one can pick
it up, I can (slowly) work on this.

Quuxplusone commented 4 years ago

Attached lldb-log-sigtrap2.txt (9096 bytes, text/plain): LLDB log showing how a deleted breakpoint becomes SIGTRAP

Quuxplusone commented 4 years ago

Attached dont-select-trace-on-stop.diff (1708 bytes, text/plain): Patch for ignoring threads with stop reason "trace"

Quuxplusone commented 4 years ago

For this to be right you have to convince yourself that a process will never be able to stop with a trace reason that wasn't caused by the debugger.

What you are really trying to do is more general, namely to make sure that if a thread stops for a reason, but the currently active thread plans explain that reason, then show eStopReasonNone for that thread instead of the actual stop reason. Of course, if the plan was completed by this stop, you'll show the eStopReasonPlanCompleted instead.

The thread should be able to figure all this out during the ShouldStop negotiation, so then it can just use what it has computed when you call Thread::GetStopInfo.

Quuxplusone commented 4 years ago

Note also, the fact that we aren't properly handling the "hanging stop reason"
seems to be specific to Linux.  Using the same example on macOS, I see:

(lldb) run
Process 62410 launched: '/tmp/threadit' (x86_64)
Process 62410 stopped
* thread #2, stop reason = breakpoint 1.1
    frame #0: 0x0000000100000ebb threadit`f() at threadit.cpp:6
   3    void f() {
   4      int i = 0;
   5      while (true) {
-> 6        i++;  // Break both threads here, delete the BP,
             ^
   7              // single step.
   8      }
   9    }
  thread #3, stop reason = breakpoint 1.1
    frame #0: 0x0000000100000ebb threadit`f() at threadit.cpp:6
   3    void f() {
   4      int i = 0;
   5      while (true) {
-> 6        i++;  // Break both threads here, delete the BP,
             ^
   7              // single step.
   8      }
   9    }
Target 0: (threadit) stopped.
(lldb) thread list
Process 62410 stopped
  thread #1: tid = 0xaf85e2, function: __ulock_wait
* thread #2: tid = 0xaf8600, function: f() , stop reason = breakpoint 1.1
  thread #3: tid = 0xaf8601, function: f() , stop reason = breakpoint 1.1
(lldb) break del 1
1 breakpoints deleted; 0 breakpoint locations disabled.
(lldb) thread list
Process 62410 stopped
  thread #1: tid = 0xaf85e2, function: __ulock_wait
* thread #2: tid = 0xaf8600, function: f() , stop reason = breakpoint 1.1
  thread #3: tid = 0xaf8601, function: f() , stop reason = breakpoint 1.1
(lldb) si
Process 62410 stopped
* thread #2, stop reason = instruction step into
    frame #0: 0x0000000100000ebe threadit`f() at threadit.cpp:6
   3    void f() {
   4      int i = 0;
   5      while (true) {
-> 6        i++;  // Break both threads here, delete the BP,
             ^
   7              // single step.
   8      }
   9    }
Target 0: (threadit) stopped.
(lldb) thread list
Process 62410 stopped
  thread #1: tid = 0xaf85e2, function: __ulock_wait
* thread #2: tid = 0xaf8600, function: f() , stop reason = instruction step into
  thread #3: tid = 0xaf8601, function: f()

Which appears right to me.

Quuxplusone commented 4 years ago

> But I don't think that explains what was happening in your original
> example.
>
> In that case, say two threads stopped at the step out breakpoint
> address.  So we need to step them one by one over the breakpoint trap.
> To do that, we would not use a "step instruction" thread plan, we
> would use a "step over breakpoint" thread plan.  That plan will
> disable the breakpoint, do the single step with all the other threads
> stopped, and re-enable the breakpoint after stopping.  So by the time
> control passes back to the user level, where we query about the stop
> status, the breakpoint will already be back in place.

I still do not understand how it is different. Note that in my
original SIGTRAP example, the last step is not "step over breakpoint",
it is "Step range stepping over". Before the last step is started, the
step-out plan is already finished -- thread 3 just finished the
step-out and step-over plans, thread 2 stumbled upon the step-out
breakpoint, but correctly figured out it was not relevant, the
step-out breakpoint was removed after the stop infos are computed; all
that before the last step.

> Note also, the fact that we aren't properly handling the "hanging
> stop reason" seems to be specific to Linux.

This is very interesting! Could you possibly share the log
of that macOS session? (lldb step and gdb-remote packets).

Regarding the step-out-step-in-problem:
> What you are really trying to do is more general, namely to make sure
> that if a thread stops for a reason, but the currently active thread
> plans explain that reason, then show eStopReasonNone for that thread
> instead of the actual stop reason.  Of course, if the plan was
> completed by this stop, you'll show the eStopReasonPlanCompleted
> instead.
>
> The thread should be able to figure all this out during the ShouldStop
> negotiation, so then it can just use what it has computed when you
> call Thread::GetStopInfo.

Yes, this sounds like a better fix, although I am still not quite sure
how to implement that. I guess what you are suggesting is to introduce
some flag in Thread and set the flag when ShouldStop can explain the
stop reason but the plan is not completed. Then we would use the flag
during GetStopInfo to pretend there was no reason to stop.

However, it is less clear when to reset this flag. Also, ShouldStop is
not called for all the threads. In fact, in the step-over-steps-in
example, the last step is "step over trap" on a different thread than
the one that has the "trace" stop info. So the "trace" is actually
left-over from the previous (private) stop. To figure out if the
"trace" was explained we would would have to ask the thread plan
in addition to the existing ShouldStop queries (or somehow remember
this info from the previous thread).

Any guidance on this?

Quuxplusone commented 4 years ago

Attached file_45642.txt (79160 bytes, text/plain): MacOS version of the stop-remove breakpoint-stepi showing correct stop reason

Quuxplusone commented 4 years ago

I attached the MacOS log.

The situations are different but in all these cases it looks like the problem
is what happens with multiple threads stop "for a reason" and then you resume
but suspend all the threads but one.  The stop reasons for the threads that
don't get to run seem to get messed up.

One thing that seems different.  In the Linux gdb-remote trace for the first
example, we stop after the step-over hits the breakpoint, and both pthread
threads have a stop reason of breakpoint and signal: 5.  They are both at
breakpoints.  Then we delete the step breakpoint, and only step one thread,
leaving the others suspended.  When we next call jthreadsinfo, the thread that
wasn't allowed to run comes back from lldb-server with a stop info of signal: 5:

lldb             WillResume Thread #1 (0x0x55c3687acf20): tid = 0xf54e, pc =
0x7ffff7c574b8, sp = 0x7fffffffd9e0, fp = 0x7ffff7a87700, plan = 'base plan',
state = suspended, stop others = 0
lldb             WillResume Thread #2 (0x0x7fb6e6143110): tid = 0xf782, pc =
0x004011fe, sp = 0x7ffff7a86e30, fp = 0x7ffff7a86e30, plan = 'base plan', state
= suspended, stop others = 0
lldb             WillResume Thread #3 (0x0x7fb6e6011310): tid = 0xf783, pc =
0x004011fe, sp = 0x7ffff7285e30, fp = 0x7ffff7285e30, plan = 'Step range
stepping over', state = stepping, stop others = 1
intern-state     Current Plan for thread 1(0x55c3687acf20) (0xf54e, suspended):
base plan being asked whether we should report run.
intern-state     Current Plan for thread 2(0x7fb6e6143110) (0xf782, suspended):
base plan being asked whether we should report run.
intern-state     Current Plan for thread 3(0x7fb6e6011310) (0xf783, stepping):
Step range stepping over being asked whether we should report run.
b-remote.async>  <  16> send packet: $vCont;s:f783#fa
lldb             Process thinks the process has resumed.
b-remote.async>  < 917> read packet:
$T05thread:f783;name:a.out;threads:f54e,f782,f783;jstopinfo:5b7b226e616d65223a22612e6f7574222c22746964223a36323739387d2c7b226e616d65223a22612e6f7574222c22726561736f6e223a22627265616b706f696e74222c227369676e616c223a352c22746964223a36333336327d2c7b226e616d65223a22612e6f7574222c22726561736f6e223a227472616365222c227369676e616c223a352c22746964223a36333336337d5d;thread-pcs:00007ffff7c574b8,00000000004011fe,00000000004011f9;00:0100000000000000;01:0000000000000000;02:0000000000000000;03:006728f7ff7f0000;04:c05828f7ff7f0000;05:0000000000000000;06:305e28f7ff7f0000;07:305e28f7ff7f0000;08:006728f7ff7f0000;09:0100000000000000;0a:0420400000000000;0b:0602000000000000;0c:6ed9ffffff7f0000;0d:6fd9ffffff7f0000;0e:006728f7ff7f0000;0f:c05f28f7ff7f0000;10:f911400000000000;11:0602000000000000;12:3300000000000000;13:0000000000000000;14:0000000000000000;15:2b00000000000000;16:0000000000000000;17:0000000000000000;reason:trace;#9e
intern-state
intern-state     ThreadList::ShouldStop: 3 threads, 1 unsuspended threads
intern-state     Thread::ShouldStop(0x7fb6e6011310) for tid = 0xf783 0xf783, pc
= 0x00000000004011f9
intern-state     ^^^^^^^^ Thread::ShouldStop Begin ^^^^^^^^
intern-state     Plan stack initial state:
  thread #3: tid = 0xf783:
    Active plan stack:
      Element 0: Base thread plan.
      Element 1: Stepping over line u.cc:8:3 using ranges: [0x00000000004011fe-0x0000000000401210).

intern-state     <  21> send packet: $x7ffff7285e00,200#a0
intern-state     < 516> read packet:
$00000000000000000870410000000000305e28f7ff7f0000e511400000000000305e28f7ff7f0000fe11400000000000505e28f7ff7f0000071940000000000008704100000000000870410000000000705e28f7ff7f00009d1840000000000000000000000000000870410000000000905e28f7ff7f0000751840000000000008704100000000000000000000000000b05e28f7ff7f0000451840000000000000000000000000000870410000000000d05e28f7ff7f00000e1740000000000000000000000000000070410000000000007041000000000070b9e9f7ff7f0000000000000000000000000000000000006ed9ffffff7f0000b75fc5f7ff7f00000000000000000000006728f7ff7f0000006728f7ff7f00001dc7dc9f1968fc0e6ed9ffffff7f00006fd9ffffff7f0000006728f7ff7f0000c05f28f7ff7f00001dc7dc21498603f11dc76621938703f1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007c9288164421a10000000000000000006728f7ff7f000000000000000000009f71b8f7ff7f000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000#51
intern-state     Base plan discarding thread plans for thread tid = 0xf783
(breakpoint hit.)
intern-state     Discarding thread plans for thread (tid = 0xf783, force 0)
intern-state     Step range plan out of range to 0x4011f9
intern-state     Plan Step range stepping over being discarded in cleanup, it
says it is already done.
intern-state     Discarding plan: "Step range stepping over", tid = 0xf783.
intern-state     Plan stack final state:
  thread #3: tid = 0xf783:
    Active plan stack:
      Element 0: Base thread plan.
    Completed plan stack:
      Element 0: Stepping over line u.cc:8:3 using ranges: [0x00000000004011fe-0x0000000000401210).

intern-state     vvvvvvvv Thread::ShouldStop End (returning 1) vvvvvvvv
intern-state     ThreadList::ShouldStop overall should_stop = 1
lldb             <  16> send packet: $jThreadsInfo#c1
lldb             < 411> read packet:
$[{"name":"a.out","registers":{"16":"b874c5f7ff7f0000","6":"0077a8f7ff7f0000","7":"e0d9ffffff7f0000"}],"tid":62798}],{"name":"a.out","reason":"breakpoint","registers":{"16":"fe11400000000000","6":"306ea8f7ff7f0000","7":"306ea8f7ff7f0000"}],"signal":5,"tid":63362}],{"name":"a.out","reason":"trace","registers":{"16":"f911400000000000","6":"305e28f7ff7f0000","7":"305e28f7ff7f0000"}],"signal":5,"tid":63363}]]#d3

In the macOS trace the thread that was suspended doesn't have a stop reason of
signal.  Somehow this difference of behavior is confusing us.

Quuxplusone commented 4 years ago

(In reply to Jaroslav Sevcik from comment #15)
>
> Yes, this sounds like a better fix, although I am still not quite sure
> how to implement that. I guess what you are suggesting is to introduce
> some flag in Thread and set the flag when ShouldStop can explain the
> stop reason but the plan is not completed. Then we would use the flag
> during GetStopInfo to pretend there was no reason to stop.
>
> However, it is less clear when to reset this flag. Also, ShouldStop is
> not called for all the threads. In fact, in the step-over-steps-in
> example, the last step is "step over trap" on a different thread than
> the one that has the "trace" stop info. So the "trace" is actually
> left-over from the previous (private) stop. To figure out if the
> "trace" was explained we would would have to ask the thread plan
> in addition to the existing ShouldStop queries (or somehow remember
> this info from the previous thread).
>
> Any guidance on this?

ShouldStop is only skipped for threads that were suspended on the previous run.
Since we know we didn't let them run, we can also assume that the same decision
about whether to publish their stop reason pertains.  So I don't think this
will be a problem.

I think you can just do this by asking a thread if one of its plans explained
the stop, and if so whether the plan was complete.  If the plan that explains
the stop
is still active, then its stop reason shouldn't be shown.  It's in the middle
of its job and so the reason why it stops is not interesting to the user.  You
would have to exempt the base thread plan from this, since its job is to
forward stop reasons to the user.

Quuxplusone commented 4 years ago

Thank you for the log. It indeed looks like debugserver sends a cleaned
up stop infos and thus avoids the problem. It is not yet clear to me
whether the breakpoint stop info is removed because of the breakpoint
deletion or because of the (instruction) single stepping.

I would not be surprised if fixing the bogus stop info also fixes
(or at least greatly mitigates) the other problem because there we
see a step-over-trap single step preserve trace stop reason of another
thread, which ultimately confuses the current thread selection.

Unfortunately, I have some other work to do now, but I do intend
to get back to this later this week or next week.

Thank you for your help!

Quuxplusone commented 4 years ago

Jim, I cooked a simple patch that tries to replicate debugserver's behavior in lldb-server, see https://reviews.llvm.org/D79308. Debugserver seems to simply reset each thread's exception (which seems to basically map to stop reason) when resuming for stepping or for continuing all threads. The patch fixes both problems reported in this bug. Could you take a look at the patch or suggest the right person to discuss the patch?

Quuxplusone / LLVMBugzillaTest

Step over misbehaves with multiple threads on Linux #44612