Open Quuxplusone opened 3 years ago
I see three stages of the process that seem not to be working on your system.
1) When lldb is "nexting over a function" and sees that it is at a "call" instruction (i.e. one that is guaranteed to return to the next instruction), it won't try to step in and then step back out of the function, it will instead put a breakpoint on the instruction after the call and run to there.
We use the return from InstructionLLVMC::IsCall for the stub dispatch instruction to decide whether to take this shortcut or not. If that were returning true, we wouldn't have stepped into printf and had to step back out again. Of course if you are using a more general instruction to dispatch to the cross-library stub that isn't a call then this is not relevant. We will have to step in and then back out to follow the control flow.
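As a rough illustration of the shortcut in stage 1, here is a minimal C++ sketch. The `Insn` struct and the `stepOverTarget` function are hypothetical names invented for this example and are not LLDB types; the `is_call` flag merely stands in for whatever an IsCall-style query reports. The only idea taken from the description above is that a call returns to the next instruction, so a breakpoint there is enough.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified instruction record (not an LLDB type).
struct Insn {
  uint64_t address;
  uint32_t byte_size;
  bool is_call; // stand-in for the answer of an IsCall-style query
};

// If insn is a call, store the address of the instruction that follows it
// in run_to and return true: a breakpoint there lets us run over the call.
// Otherwise return false, meaning the stepper has to step in and then step
// back out to follow the control flow.
bool stepOverTarget(const Insn &insn, uint64_t &run_to) {
  assert(insn.byte_size > 0);
  if (!insn.is_call)
    return false;
  run_to = insn.address + insn.byte_size;
  return true;
}
```

For a 5-byte call at 0x100003f85, this yields 0x100003f8a as the address to run to. For a dispatch instruction that is not reported as a call, it yields nothing and the step-in/step-out path is the only option.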
2) When we do a step-in and land in shared library trampoline code, we usually step "through" the trampoline to the target function automatically. That doesn't seem to be working, since we stop in the stub in your a.out, not in the target of the stub. Stepping through cross-library stubs is handled by the GetStepThroughTrampolinePlan call in the DynamicLoader plugin for your system.
It looks like that isn't working in your case. Note, if your dynamic loader behavior is different from the standard Linux loader behavior, you will have to write a DynamicLoader plugin that describes how it works.
3) Finally, when we try to step out of a frame, we ask the unwinder for the return pc from the frame above us. It looks like either we don't know how to unwind from the cross-library stub, and so we can't find the previous frame or we can't find the return PC in that frame. There's something about being stopped at a cross-library stub that's fooling the unwinder.
These are all failures specific to support for your system. So far as I can see, it's the plugin points that are failing, not the basic algorithms. Somebody who has access to this system and can debug these problems will need to have a look.
When I use musl-gcc, LLDB isn't able to unwind either from the executable's
.plt or from musl's libc.so, whereas that does work if I use the gcc driver
(for using glibc). At least for the executable's .plt, LLDB identifies two ways
to unwind, "assembly insn profiling" and "eh_frame CFI", but of course it
doesn't trust eh_frame to be valid at every PC, so maybe that's affecting
something. I need to debug it some more.
Even on macOS, though, I see the same behavior where LLDB steps over a call by
stepping into it and back out. I'll attach an example.
I'm suspicious of a couple of comparisons in LLDB's
ThreadPlanStepRange::SetNextBranchBreakpoint that *seem* off-by-two and
off-by-one:
https://github.com/llvm/llvm-project/blob/llvmorg-13.0.0-rc1/lldb/source/Target/ThreadPlanStepRange.cpp#L334-L347
    // If we didn't find a branch, run to the end of the range.
    if (branch_index == UINT32_MAX) {
      uint32_t last_index = instructions->GetSize() - 1;
      if (last_index - pc_index > 1) {
        ...
      }
    } else if (branch_index - pc_index > 1) {
The (last_index - pc_index > 1) comparison requires that the list of
instructions have at least 3 remaining instructions to execute, whereas
intuitively I'd expect 1 instruction to be enough, i.e. the if statement can
be removed.
The (branch_index - pc_index > 1) comparison requires that there be at least 2
instructions between the PC and the branch instruction, but intuitively 1 ought
to be enough. The comparison could become (branch_index > pc_index).
e.g. From my attached apple-step-into-example.txt, the assembly is:
(lldb) disas
a.out`main:
0x100003f7a <+0>: pushq %rbp
0x100003f7b <+1>: movq %rsp, %rbp
-> 0x100003f7e <+4>: leaq 0x29(%rip), %rdi ; "hello!"
0x100003f85 <+11>: callq 0x100003f8e ; symbol stub for: puts
0x100003f8a <+16>: xorl %eax, %eax
0x100003f8c <+18>: popq %rbp
0x100003f8d <+19>: retq
I *think* the list of instructions would have the leaq and callq instructions,
pc_index would be 0, and there is no branch at the end. last_index is 1
(referring to the callq instruction). We want to set a breakpoint on the xorl
instruction and run to it, but ThreadPlanStepRange::SetNextBranchBreakpoint
doesn't set a breakpoint, and LLDB seems to single-step one instruction at a
time. (I'm not sure how that works yet...)
$ cat apple-step-into-example.txt | grep reached
ThreadPlanStepOverRange reached 0x0000000100003f85.
ThreadPlanStepOverRange reached 0x0000000100003f8e.
ThreadPlanStepOverRange reached 0x0000000100003f8a.
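To make the index arithmetic concrete, here is a small C++ model of the two comparisons. It is only a sketch: `setsBreakpointCurrent` and `setsBreakpointProposed` are hypothetical names, the functions reduce the quoted snippet to a yes/no answer ("would a run-to breakpoint be set?"), and the elided body of the inner if is not modeled at all.

```cpp
#include <cassert>
#include <cstdint>

// Model of the comparisons as quoted from ThreadPlanStepRange.cpp.
// size is the number of instructions in the range; branch_index is
// UINT32_MAX when no branch was found. Assumes branch_index >= pc_index
// whenever a branch was found. Returns true if a breakpoint would be set.
bool setsBreakpointCurrent(uint32_t pc_index, uint32_t branch_index,
                           uint32_t size) {
  assert(size > 0 && pc_index < size);
  if (branch_index == UINT32_MAX) {
    uint32_t last_index = size - 1;
    return last_index - pc_index > 1;
  }
  return branch_index - pc_index > 1;
}

// Model of the suggested change: drop the inner if when there is no
// branch, and accept any branch that lies after the PC.
bool setsBreakpointProposed(uint32_t pc_index, uint32_t branch_index,
                            uint32_t size) {
  assert(size > 0 && pc_index < size);
  if (branch_index == UINT32_MAX)
    return true;
  return branch_index > pc_index;
}
```

With the leaq/callq list guessed at above (size 2, pc_index 0, no branch), the first function returns false and the second returns true, which matches the observation that no breakpoint is set today.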
This commit is relevant:
https://github.com/llvm/llvm-project/commit/a3f466b9e785ca8f6712904e408bda31c79ca1b0
"Fix commit 252963 to work around a bug on some platforms where they don't
correctly handle stepping over one breakpoint directly onto another breakpoint.
This isn't fixing that bug, but rather just changing 252963 to not use
breakpoints if it is only stepping one instruction."
Stepping one instruction isn't generally sufficient to reach the branch (or the
end of instruction list), though:
- If branch_index is UINT32_MAX: pc_index could point to the last instruction, which is a call that we want to step over. In that case, the breakpoint would be one instruction further (but multiple instructions run).
- If branch_index is a real branch: pc_index could point to a call instruction just before the branch.
I experimented with the above changes to
ThreadPlanStepRange::SetNextBranchBreakpoint, and it fixes this bug, but
reveals another with musl-gcc:
- Set a breakpoint on a call instruction to printf (or puts).
- Run to the breakpoint.
- Try to step over (n) the call instruction.
- The program doesn't stop until it exits.
I think(?) LLDB is required to single-step off the breakpoint, which puts the
PC in the printf linker stub. From there, I guess LLDB can't unwind so it
doesn't know what to do? But in principle, it could still work, because it
could just run to the address past the call. That doesn't happen, though. I'll
attach it in case it's interesting, lldb-step-off-breakpoint.txt. That file has
a few interesting lines:
(lldb) disas
a.out`main:
0x555555555155 <+0>: pushq %rbp
0x555555555156 <+1>: movq %rsp, %rbp
0x555555555159 <+4>: leaq 0xea0(%rip), %rdi
-> 0x555555555160 <+11>: callq 0x555555555020 ; symbol stub for: ___lldb_unnamed_symbol61
0x555555555165 <+16>: movl $0x0, %eax
0x55555555516a <+21>: popq %rbp
0x55555555516b <+22>: retq
(lldb) log enable lldb step
(lldb) n
lldb Thread::PushPlan(0x0x1103bc0): "Stepping over line hello.c:4:3.", tid = 0x187bf.
lldb ThreadPlanStepRange::SetNextBranchBreakpoint - Setting breakpoint -2 (site 4) to run to address 0x555555555165
lldb Process::PrivateResume() m_stop_id = 5, public state: stopped private state: stopped
lldb Thread::PushPlan(0x0x1103bc0): "Single stepping past breakpoint site 3 at 0x555555555160", tid = 0x187bf.
...
intern-state ThreadPlanStepOverRange reached 0x0000555555555020.
intern-state Removing next branch breakpoint: -2.
intern-state Stepping out of frame with no debug info
...
Attached apple-step-into-example.txt
(7970 bytes, text/plain): apple-step-into-example.txt
Attached lldb-step-off-breakpoint.txt
(5303 bytes, text/plain): lldb-step-off-breakpoint.txt
If you have a fix for the logic to make us step past the call instruction rather than stepping in, please put that up for review. That seems useful, though TTTT the optimization of "not doing step-in & step-out" is a best effort thing. We still expect that as a last resort step-in and step-back-out should work.
In the "step-off-breakpoint" error, stepping here is not going to go well if we can't unwind from the stub. We can't really do anything sensible, so we should just stop. The bug here is that, after the step over line plan was popped off the stack, we end up saying:
intern-state ThreadList::ShouldStop overall should_stop = 0
and then continuing.
Since the ThreadPlanStepRange plan was done & popped off the stack, we would next ask the base thread plan whether to stop in: ThreadPlanBase::ShouldStop. That's a moderately complex function and there's not enough info in your log to guess why it is deciding to resume. Somebody will need to debug that live.
We've talked at times about having a different "I'm lost in a new stack frame" response than to just stop. We could, for instance, just keep single stepping and hope we get somewhere better. Maybe single step till the function above us shows up again in the backtrace, and then try to set our return breakpoint.
But for right now the simplest thing is "if confused, stop!"
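The two responses discussed here can be written down as a tiny decision function. This is only a sketch of the policy as described in this comment, not of ThreadPlanBase::ShouldStop or any other LLDB code; `LostAction`, `actionWhenInNewFrame` and both parameters are hypothetical names.

```cpp
#include <cassert>

enum class LostAction { SetReturnBreakpoint, SingleStep, Stop };

// caller_visible: the unwinder can see the frame we stepped in from, so a
// return breakpoint can be set.
// keep_stepping_when_lost: opt in to the "keep single stepping and hope"
// idea instead of the simple "if confused, stop" response.
LostAction actionWhenInNewFrame(bool caller_visible,
                                bool keep_stepping_when_lost) {
  if (caller_visible)
    return LostAction::SetReturnBreakpoint;
  if (keep_stepping_when_lost)
    return LostAction::SingleStep; // re-check the backtrace after each step
  return LostAction::Stop;
}
```

In this model the behavior seen in the log, resuming with no breakpoint to run to, corresponds to none of the three outcomes, which is one way of stating the bug.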
Yeah, I can upload my patch that allows for running to a breakpoint in more situations.
> In the "step-off-breakpoint" error, stepping here is not going to go well if we can't unwind from the stub.
In this case, LLDB sets a breakpoint after the call instruction, then steps into the called function. It seems(?) like it could simply leave the breakpoint in place and resume, rather than clear the post-call breakpoint and try to unwind (which if successful would simply put a breakpoint at the same post-call PC).
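Roughly, the idea is the following. This is a hypothetical C++ sketch of the suggestion, not how the thread plans are structured today; `StepState`, `Decision` and `decide` are names made up for the example.

```cpp
#include <cassert>
#include <cstdint>

enum class Decision { Done, ResumeToBreakpoint, UnwindAndStepOut };

// Hypothetical snapshot of a step-over in progress (not an LLDB type).
struct StepState {
  uint64_t pc;                // where the thread is stopped now
  bool has_run_to_breakpoint; // a post-call breakpoint is already in place
  uint64_t run_to_address;    // address of that breakpoint, if any
};

// If a post-call breakpoint already exists, keep it and resume rather than
// removing it and asking the unwinder for the same address again.
Decision decide(const StepState &s) {
  if (s.has_run_to_breakpoint && s.pc == s.run_to_address)
    return Decision::Done;
  if (s.has_run_to_breakpoint)
    return Decision::ResumeToBreakpoint;
  return Decision::UnwindAndStepOut;
}
```

With the values from the log (stopped at 0x555555555020, breakpoint at 0x555555555165), this returns ResumeToBreakpoint and never needs to unwind from the stub.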
I'm still unfamiliar with the "plan" stuff -- I'll probably study it a bit more.
I debugged unwinding a bit. LLDB is able to unwind from the executable .plt's back to main, and then from main into libc_start_main_stage2. However, I think it doesn't find unwind info for libc_start_main_stage2, so it can't establish a certain CFA value, then it discards both the frame for main and libc_start_main_stage2. I need to study it more closely, but I think LLDB could do better. Maybe musl needs an "end of the stack" annotation somehow?
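I also noticed that SectionLoadList::ResolveLoadAddress is unable to lookup an address in musl's libc.so (e.g. "disas -n printf" works but "disas -a <printf-addr>" doesn't). I don't think it's quite related, so I'll file a separate bug for it. I suspect the "TODO: remove this once we either fix library matching or avoid" in DynamicLoaderPOSIXDYLD.cpp special case is broken.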
(In reply to Ryan Prichard from comment #7)
> Yeah, I can upload my patch that allows for running to a breakpoint in more
> situations.
>
> > In the "step-off-breakpoint" error, stepping here is not going to go well
> > if we can't unwind from the stub.
>
> In this case, LLDB sets a breakpoint after the call instruction, then steps
> into the called function. It seems(?) like it could simply leave the
> breakpoint in place and resume, rather than clear the post-call breakpoint
> and try to unwind (which if successful would simply put a breakpoint at the
> same post-call PC).
That's quite likely. The optimization to treat calls specially is relatively
new. For the longest time, we always ran to the next "branch" instruction no
matter the kind, and then did the "step in/step out" trick. We made no
assumptions about the character of the branch we were at. The special handling
of "call" instructions was recent. BTW, by "call" instruction, we just mean an
instruction that is guaranteed (except for exceptions) to return to the next
instruction following it.
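The difference between the old and the newer behavior can be sketched as a scan for the next place to stop. This is a simplified C++ model with hypothetical names (`Kind`, `nextStopIndex`, `kNoBranch`), not the actual LLDB instruction-list code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical three-way classification of an instruction.
enum class Kind { Other, Call, Branch };

const uint32_t kNoBranch = UINT32_MAX;

// Return the index of the first instruction at or after start that the
// stepper has to stop at. With ignore_calls == false every call counts as
// a branch (the older behavior); with ignore_calls == true calls are run
// over because they come back to the instruction that follows them.
uint32_t nextStopIndex(const std::vector<Kind> &insns, uint32_t start,
                       bool ignore_calls) {
  assert(start <= insns.size());
  for (uint32_t i = start; i < insns.size(); ++i) {
    if (insns[i] == Kind::Branch)
      return i;
    if (insns[i] == Kind::Call && !ignore_calls)
      return i;
  }
  return kNoBranch;
}
```

For a function shaped like the main in this report (three plain instructions, a call, two more plain instructions, then a return treated as a branch), scanning from the instruction before the call stops at the call when calls are not ignored and at the return when they are.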
So it's not surprising that this opportunity to simplify the algorithm hasn't
been taken yet. Again, if you feel like digging in, feel encouraged to do so,
and I'll happily review any patches.
My plate is pretty full, so left to me, this may take a while.
>
> I'm still unfamiliar with the "plan" stuff -- I'll probably study it a bit
> more.
>
> I debugged unwinding a bit. LLDB is able to unwind from the executable
> .plt's back to main, and then from main into libc_start_main_stage2.
> However, I think it doesn't find unwind info for libc_start_main_stage2, so
> it can't establish a certain CFA value, then it discards both the frame for
> main and libc_start_main_stage2. I need to study it more closely, but I
> think LLDB could do better. Maybe musl needs an "end of the stack"
> annotation somehow?
Maybe a separate bug about this might be a good way to go. I'm not an expert
on the unwinder, and this bug is starting to become a portmanteau bug...
>
> I also noticed that SectionLoadList::ResolveLoadAddress is unable to lookup
> an address in musl's libc.so (e.g. "disas -n printf" works but "disas -a
> <printf-addr>" doesn't). I don't think it's quite related, so I'll file a
> separate bug for it. I suspect the "TODO: remove this once we either fix
> library matching or avoid" in DynamicLoaderPOSIXDYLD.cpp special case is
> broken.
Thanks!
Yeah, I may have some time to dig into this more.
> I also noticed that SectionLoadList::ResolveLoadAddress is unable to lookup
> an address in musl's libc.so [...]
Filed as llvm.org/PR51466.