1C4nfaN opened 1 year ago
The timestamps in your log show that py-spy was attached to 479 for more than 15 minutes which is way too much. This is probably a bug in py-spy.
As an immediate workaround I can suggest you use py-spy with the --nonblocking flag, which might make it a bit less accurate, but will prevent any side effects on the profiled application.
Hi @Jongy,
Thanks for your suggestion, but using the --nonblocking flag is not enough for what I want to get.
If this is a bug, do you have any suggestions for how we could try to fix it in this case?
If this is a bug, do you have any suggestions for how we could try to fix it in this case?
I spent a few minutes looking in the relevant code now.
impl ThreadLock {
    fn new(tid: nix::unistd::Pid) -> Result<ThreadLock, nix::Error> {
        ptrace::attach(tid)?;
        while let wait::WaitStatus::Stopped(_, sig) = wait::waitpid(tid, Some(wait::WaitPidFlag::WSTOPPED | wait::WaitPidFlag::__WALL))? {
            if sig == nix::sys::signal::Signal::SIGSTOP {
                break;
            }
            debug!("reinjecting non-SIGSTOP signal {} to {}", sig, tid);
            ptrace::cont(tid, sig)?;
        }
        debug!("attached to thread {}", tid);
        Ok(ThreadLock { tid })
    }
}
py-spy, per your stack trace, is blocked in this waitpid() call. I'm acquainted with this code because I actually wrote it 2 years back :sweat_smile: see here.
We are trying (in remoteprocess & py-spy) to PTRACE_ATTACH the process, then reinject non-SIGSTOP signals. The way I understand the semantics of PTRACE_ATTACH, a SIGSTOP must follow it. However, what we're seeing here is that thread 479 got numerous non-SIGSTOP signals over 17 minutes, but no SIGSTOP! I don't know how that can happen - that's the bug. What happens meanwhile is that some of the threads are already PTRACE_ATTACHed (in this case, 432 and 460), so the process is in a bad state: some of its threads are blocked unexpectedly.
I don't have any concrete idea how to solve it - I'll keep thinking and post here if I find anything. A workaround I can suggest is to add a basic timeout to the logic: if we fail to lock all threads in, say, 500us * Nthreads, py-spy could bail, detach all threads it has already attached, and consider this sample an error. I don't suppose this problem reproduces consistently (as you said @1C4nfaN, "sometimes", and this seems like a tough race condition), so perhaps this solution will be enough.
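Roughly, the bounded version could look something like this - just a sketch of the idea, not the actual remoteprocess API; the function name, the 50us polling interval and the Errno-based error type are placeholders of mine, assuming a recent nix:

use std::time::{Duration, Instant};
use nix::errno::Errno;
use nix::sys::{ptrace, signal::Signal, wait};
use nix::unistd::Pid;

// Hypothetical: attach to a single thread, but enforce a deadline instead of
// blocking in waitpid() forever.
fn attach_with_timeout(tid: Pid, deadline: Duration) -> Result<(), nix::Error> {
    ptrace::attach(tid)?;
    let start = Instant::now();
    let flags = wait::WaitPidFlag::WNOHANG | wait::WaitPidFlag::WSTOPPED | wait::WaitPidFlag::__WALL;
    loop {
        match wait::waitpid(tid, Some(flags))? {
            // The stop we are waiting for.
            wait::WaitStatus::Stopped(_, Signal::SIGSTOP) => return Ok(()),
            // Same reinjection as today, just bounded by the deadline below.
            wait::WaitStatus::Stopped(_, sig) => ptrace::cont(tid, sig)?,
            // StillAlive (WNOHANG, nothing reported yet) or anything else: keep polling.
            _ => {}
        }
        if start.elapsed() > deadline {
            // Bail out: detach this thread; the caller would also detach the
            // threads it already locked and count this sample as an error.
            let _ = ptrace::detach(tid, None);
            return Err(Errno::ETIMEDOUT);
        }
        std::thread::sleep(Duration::from_micros(50));
    }
}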
To debug this further, I'd run perf record -e signal:* alongside and see the interaction - is a SIGSTOP generated? What else happens around that time?
@1C4nfaN if you manage to reproduce it consistently, please post here instructions to do so, it will greatly help me (or anyone else debugging) since this is very esoteric.
A workaround I can suggest is to add a basic timeout to the logic: if we fail to lock all threads in, say, 500us * Nthreads, py-spy could bail, detach all threads it has already attached, and consider this sample an error.
Hi @Jongy, thanks for your help! I think it's good advice to consider timeouts in this case, but in another test, I see that:
[2023-04-11T04:02:05.106454671Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:05.267191159Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:05.427691376Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:05.588338448Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:05.748893669Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:05.909496956Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:06.070129688Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:06.230674664Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:06.391133380Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:06.551662220Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:06.712171494Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:06.872638318Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:07.033214994Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:07.193885198Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:07.354498283Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:07.515260753Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:07.675824288Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:07.836527054Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:07.997126241Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:08.157658180Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:08.318285370Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:08.478942352Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:08.639499228Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
[2023-04-11T04:02:08.800293246Z DEBUG remoteprocess::linux] reinjecting non-SIGSTOP signal SIGCONT to 509
It's not blocking, but it keeps triggering waitpid and ptrace::cont in the while loop, which really confuses me...
I checked the python process sampled by py-spy: the pid is 432, and 460/479/481/509 are its threads. What's interesting is that there is a process I didn't pay attention to before that monitors 432's CPU usage, and it may periodically call signal.kill(432, SIGSTOP)/signal.kill(432, SIGCONT) on 432. I think there might be a race condition here (although I really haven't figured out why).
I'd like to ask your opinion on using PTRACE_SEIZE instead of PTRACE_ATTACH. I saw some blogs describing this usage, but I'm not really sure if it's valid...
If there are any new discoveries, we can discuss them here. Thanks again ❤️
By the way, I have another simple error scenario here.
Below are the contents of /proc/{pid}/stat for two processes.
1981866 (python3) S 1981743 1981866 1981743 0 -1 1077936384 34709484 16140 105 0 72770 37613 11 9 20 0 17 0 372814758 4442140672 55046 18446744073709551615 94888328609792 94888328612516 140731140400576 0 0 0 0 16781312 17642 0 0 0 17 79 0 0 16 0 0 94888330710392 94888330711088 94888354955264 140731140409715 140731140409755 140731140409755 140731140415457 0
1984043 (probe(1981866)) S 1981866 1981866 1981743 0 -1 4194368 1324 0 0 0 169 99 0 0 20 0 3 0 372915682 4424044544 33799 18446744073709551615 94888328609792 94888328612516 140731140400576 0 0 0 0 16781312 17642 0 0 0 17 63 0 0 0 0 0 94888330710392 94888330711088 94888354955264 140731140409715 140731140409755 140731140409755 140731140415457 0
When we try to use py-spy dump --pid 1984043, we get the following error: Error: Failed to parse /proc/1984043/stat.
I think the problem lies in how this code (in remoteprocess) handles the '))' case, i.e. a comm that itself contains ')':
fn get_active_status(stat: &[u8]) -> Option<u8> {
    // find the first ')' character, and return the active status
    // field which comes after it
    let mut iter = stat.iter().skip_while(|x| **x != b')');
    match (iter.next(), iter.next(), iter.next()) {
        (Some(b')'), Some(b' '), ret) => ret.map(|x| *x),
        _ => None
    }
}
I thought I'd make a note of the issue here. I'm not very familiar with the code that actually executes in process 1984043, but I will try to check this, and if necessary I'd be willing to fix it when I have time.
It's not blocking, but it keeps triggering waitpid and ptrace::cont in the while loop, which really confuses me...
That's the same scenario. What you put here is just another manifestation of the bug.
I'll elaborate more on what I wrote in my previous comment - py-spy attempts to lock all, e.g., 6 threads of your Python process. It successfully locks 3, and on the 4th, due to a bug, it remains forever in the "reinjecting non-SIGSTOP" loop for thread 4. Meanwhile, threads 5 and 6 are not locked yet, so they are running "fine", but threads 1, 2 and 3 are already locked - PTRACE_ATTACHed.
What's interesting is that there is a process I didn't pay attention to before that monitors 432's CPU usage, and it may periodically call signal.kill(432, SIGSTOP)/signal.kill(432, SIGCONT) on 432. I think there might be a race condition here (although I really haven't figured out why).
This definitely might be related. I have some idea in mind... I'll need to try it out - perhaps, if the process got a SIGSTOP (when a process gets a signal, any thread in it might receive it randomly, but SIGSTOP behaves differently because it's handled in the kernel) and py-spy then tries to PTRACE_ATTACH while the thread is already in the stopped state, the ptrace-induced SIGSTOP is never generated and py-spy waits in vain.
I might be able to try it out later today, will write here if so.
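If anyone wants to poke at this theory before I do, a tiny harness along these lines could help - purely illustrative (the sleep duration and cleanup are arbitrary choices of mine), it just reports what waitpid() sees when we attach to an already-stopped target:

use std::time::Duration;
use nix::sys::{ptrace, signal, wait};
use nix::unistd::Pid;

// Hypothetical repro: put the target into group-stop first (like the external
// monitor does), then PTRACE_ATTACH and see whether the attach-induced SIGSTOP
// is ever reported.
fn try_repro(pid: Pid) -> Result<(), nix::Error> {
    signal::kill(pid, signal::Signal::SIGSTOP)?;
    std::thread::sleep(Duration::from_millis(10));

    ptrace::attach(pid)?;
    // If the theory is right, this waitpid() may block instead of reporting SIGSTOP.
    let status = wait::waitpid(pid, Some(wait::WaitPidFlag::WSTOPPED | wait::WaitPidFlag::__WALL))?;
    println!("first wait status after attach: {:?}", status);

    // Clean up so the target keeps running.
    let _ = ptrace::detach(pid, None);
    let _ = signal::kill(pid, signal::Signal::SIGCONT);
    Ok(())
}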
I'd like to ask your opinion on using PTRACE_SEIZE instead of PTRACE_ATTACH. I saw some blogs describing this usage, but I'm not really sure if it's valid...
PTRACE_SEIZE does not stop the process - this is the equivalent of passing --nonblocking.
When we try to use py-spy dump --pid 1984043, we get the following error: Error: Failed to parse /proc/1984043/stat.
This is a separate problem indeed; it seems that the function doesn't cope with process comms that have ) in them. Not a very common case for Python processes, but I suppose there's no reason not to support it properly. The logic needs to be "parse from the first ( until the matching )", or maybe just "parse from the last )" instead of "from the first )".
Can you create a separate issue for that?
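For reference, a minimal sketch of the "last )" variant - not the actual remoteprocess code, just an illustration of the suggested logic:

// Rough sketch: take the field after the *last* ')' so comms that themselves
// contain ')' (e.g. "probe(1981866)") don't confuse the parser.
fn get_active_status(stat: &[u8]) -> Option<u8> {
    let close = stat.iter().rposition(|x| *x == b')')?;
    match (stat.get(close + 1).copied(), stat.get(close + 2).copied()) {
        (Some(b' '), Some(status)) => Some(status),
        _ => None,
    }
}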
I might be able to try it out later today, will write here if so.
Thank you so much for your time~
PTRACE_SEIZE does not stop the process - this is the equivalent of passing --nonblocking.
That's right - I forgot to mention earlier that I was referencing the following statement from the ptrace man page:
"Since Linux 3.4, PTRACE_SEIZE can be used instead of PTRACE_ATTACH. PTRACE_SEIZE does not stop the attached process. If you need to stop it after attach (or at any other time) without sending it any signals, use PTRACE_INTERRUPT command."
so maybe PTRACE_SEIZE + PTRACE_INTERRUPT could be used instead of PTRACE_ATTACH?
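Something like this is what I had in mind - just a sketch based on my reading of the man page, assuming a nix version that exposes ptrace::seize and ptrace::interrupt; I haven't verified the exact stop bookkeeping:

use nix::sys::{ptrace, wait};
use nix::unistd::Pid;

// Hypothetical: attach without the implicit SIGSTOP, then stop explicitly.
fn seize_and_stop(tid: Pid) -> Result<(), nix::Error> {
    // PTRACE_SEIZE attaches without stopping the thread and without relying on
    // the SIGSTOP that PTRACE_ATTACH generates.
    ptrace::seize(tid, ptrace::Options::empty())?;
    // PTRACE_INTERRUPT then stops it; the stop is reported as a ptrace event
    // rather than a signal-delivery stop, so it shouldn't race with SIGSTOP/
    // SIGCONT sent by other processes (this is the part I'm unsure about).
    ptrace::interrupt(tid)?;
    loop {
        match wait::waitpid(tid, Some(wait::WaitPidFlag::__WALL))? {
            // The interrupt (PTRACE_EVENT_STOP) arrived.
            wait::WaitStatus::PtraceEvent(_, _, _) => return Ok(()),
            // Some other signal got there first; reinject and keep waiting.
            wait::WaitStatus::Stopped(_, sig) => ptrace::cont(tid, sig)?,
            _ => {}
        }
    }
}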
This is a separate problem indeed; it seems that the function doesn't cope with process comms that have ) in them. Not a very common case for Python processes, but I suppose there's no reason not to support it properly. The logic needs to be "parse from the first ( until the matching )", or maybe just "parse from the last )" instead of "from the first )". Can you create a separate issue for that?
Sure, I'll create a separate issue.
When py-spy runs with py-spy record --pid {pid} --native -r 97 -d 10 -f raw -o xxx.txt, I tried printing the stacks, as below:
Thread 3 (LWP 805052):
#0 sccp () at ../src_musl/src/thread/__syscall_cp.c:11
#1 0x00007f72dcd2359f in read () at ../src_musl/src/unistd/read.c:6
#2 0x00007f72dcc23552 in nix::unistd::read::h0aa062cf7d2053b6 ()
#3 0x00007f72dcafe652 in std::sys_common::backtrace::__rust_begin_short_backtrace::hcc9258a46d3c1d8f ()
#4 0x00007f72dcaef386 in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h991851875476ab0b ()
#5 0x00007f72dccf9533 in _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h0032718f1ee1e442 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872
#6 _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h5bc95b757ddfa29b () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872
#7 std::sys::unix::thread::Thread::new::thread_start::h069db4c5e748eedf () at library/std/src/sys/unix/thread.rs:108
#8 0x00007f72dcd221a3 in start () at ../src_musl/src/thread/pthread_create.c:192
#9 0x00007f72dcd23214 in __clone () at ../src_musl/src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC
Thread 2 (LWP 805051):
#0 sccp () at ../src_musl/src/thread/__syscall_cp.c:11
#1 0x00007f72dcd2359f in read () at ../src_musl/src/unistd/read.c:6
#2 0x00007f72dcb7b4b7 in maps_next () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#3 0x00007f72dcb7b81b in _Ux86_64_get_elf_image () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#4 0x00007f72dcb7a9a8 in get_unwind_info () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#5 0x00007f72dcb7ab00 in _UPT_find_proc_info () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#6 0x00007f72dcb80359 in fetch_proc_info () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#7 0x00007f72dcb81740 in find_reg_state () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#8 0x00007f72dcb819bc in _Ux86_64_dwarf_step () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#9 0x00007f72dcb7cf5f in _Ux86_64_step () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#10 0x00007f72dcb627ed in _$LT$remoteprocess..linux..libunwind..Cursor$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::next::h8d6ba6e958cd1c39 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#11 0x00007f72dca69779 in py_spy::python_spy::PythonSpy::_get_pthread_id::h58fdef68ab857262 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#12 0x00007f72dca6648b in py_spy::python_spy::PythonSpy::_get_os_thread_id::h807756e474dc43d7 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#13 0x00007f72dca603b4 in py_spy::python_spy::PythonSpy::_get_stack_traces::h6a885e6fcc0586bc () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#14 0x00007f72dca57ece in py_spy::python_spy::PythonSpy::get_stack_traces::h70c11f2d38dbd8c4 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#15 0x00007f72dcafd0d6 in std::sys_common::backtrace::__rust_begin_short_backtrace::h415d421592b1dec3 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#16 0x00007f72dcaef007 in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h1748460e0434bbc0 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#17 0x00007f72dccf9533 in _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h0032718f1ee1e442 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872
#18 _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..FnOnce$LT$Args$GT$$GT$::call_once::h5bc95b757ddfa29b () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/alloc/src/boxed.rs:1872
#19 std::sys::unix::thread::Thread::new::thread_start::h069db4c5e748eedf () at library/std/src/sys/unix/thread.rs:108
#20 0x00007f72dcd221a3 in start () at ../src_musl/src/thread/pthread_create.c:192
#21 0x00007f72dcd23214 in __clone () at ../src_musl/src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC
Thread 1 (LWP 805050):
#0 0x00007f72dcd1eb7e in __syscall6 () at ../src_musl/arch/x86_64/syscall_arch.h:59
#1 syscall () at ../src_musl/src/misc/syscall.c:20
#2 0x00007f72dccf6775 in std::sys::unix::futex::futex_wait::h65564c44ef71c5aa () at library/std/src/sys/unix/futex.rs:61
#3 0x00007f72dcce98b9 in std::sys_common::thread_parker::futex::Parker::park::h1857a2d51b91109b () at library/std/src/sys_common/thread_parker/futex.rs:52
#4 std::thread::park::he5cdd1e0067c0814 () at library/std/src/thread/mod.rs:929
#5 0x00007f72dccf22a2 in std::sync::mpsc::blocking::WaitToken::wait::h7da894c0826b7e66 () at library/std/src/sync/mpsc/blocking.rs:67
#6 0x00007f72dca6df78 in std::sync::mpsc::stream::Packet$LT$T$GT$::recv::h6d1e110c42068000 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#7 0x00007f72dcb014af in std::sync::mpsc::Receiver$LT$T$GT$::recv::hb203c62da94a10a9 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#8 0x00007f72dcb0d05f in py_spy::run_spy_command::h059726825bd45700 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#9 0x00007f72dcb0ed14 in py_spy::main::h523879457f122e9a () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#10 0x00007f72dcafe5d3 in std::sys_common::backtrace::__rust_begin_short_backtrace::h961c1afc5ca03793 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#11 0x00007f72dca7e209 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hd9822804426d1ce6 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
#12 0x00007f72dcce8fa1 in core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnOnce$LT$A$GT$$u20$for$u20$$RF$F$GT$::call_once::h18fd15a330c116b9 () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/ops/function.rs:280
#13 std::panicking::try::do_call::he918a328b783a420 () at library/std/src/panicking.rs:492
#14 std::panicking::try::h88ecc043ee71c652 () at library/std/src/panicking.rs:456
#15 std::panic::catch_unwind::h074c8fe23976f009 () at library/std/src/panic.rs:137
#16 std::rt::lang_start_internal::_$u7b$$u7b$closure$u7d$$u7d$::he26a080dffc4764a () at library/std/src/rt.rs:128
#17 std::panicking::try::do_call::hf9be9f327f382512 () at library/std/src/panicking.rs:492
#18 std::panicking::try::h163c4b2413cf9165 () at library/std/src/panicking.rs:456
#19 std::panic::catch_unwind::h58a40b305e945d4e () at library/std/src/panic.rs:137
#20 std::rt::lang_start_internal::h183e60933887f95a () at library/std/src/rt.rs:128
#21 0x00007f72dcb0fb02 in main () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181
In my understanding, Thread 1 waits on recv to get stacks from Thread 2, and Thread 2 is in maps_next getting the native stack, but what I don't quite understand is where Thread 3 is spawned and what it is doing - could you give me some details?
If the delay of record samples is pretty serious, could that come from Thread 2?
Thanks~
When trying to sample a python process with the following command
py-spy record --pid {pid} --native -r 97 -d 10 -f raw -o xxx.txt
the py-spy process sometimes blocks for a long time with no results. I tried to analyze it, and I found that the thread was blocked in waitpid while locking the python thread. When run with the environment variable RUST_LOG=debug, I got:
Looks like it's blocking while locking the python thread, but this doesn't always happen; sometimes I can successfully get the results.
I would like to know if there is a known reason or if there is any suggestion for me, thank you very much!