Trying to implement better "active" threads detection in nonblocking mode

Hi,

In '_get_os_thread_id' function there's a check that returns None when running in non-blocking mode. This results in a heuristic thread-activity test which might eventually lead to inaccurate results. As the comment states, this strategy is chosen due to the difficulty of matching the os thread ids and the pthreads ids reliably (can't use ptrace to do the unwinding and find rbx value in relevent stack frame). Recently I was thinking of a work around for this issue for linux environments. The idea is to find another way to the pthread_id to os_thread_id matching.

I've noticed that the '/proc/{TID}/syscall' path (available since Linux 2.6.27), exposes information about the syscall that is currently being executed. One of its values is the value of the usermode stack pointer when the syscall was initiated. For each TID having that value, we could use proc-maps to find complete stack range of that thread. Next thing is to iterate its stack from bottom up, and for each value check whether it is contained in the pthread_ids hashset. Since the pthread_id is pushed to the stack early in thread initialization, we can expect to find it before reaching the stack-top. To my understanding, we can assume that the first matching pthread_id will indeed be the one corresponding to the current os_tid, as finding a wrong pthread_id on the threads stack is extremely unlikely.

This method can match pthread_ids and os_tids as desired but would work only for threads during a syscall (in this case '/proc/{TID}/syscall' will return 'running'). However, notice that waiting threads will always fall into this category, so the only problematic threads are the ones that are currently running. A consequence of that is that an idle thread would always be identified as such, thus eliminating its stack from the trace (which is pretty much the problem I wanted to overcome when first started this). But we also may do a little better to identify running threads using the following approaches:

Try this method at each stack trace iteration for missing threads, hopefully catching them during a syscall at some point (meanwhile using the current thread-activity heuristic)
For the main thread we can also read '/proc/{TID}/stat' which contains startstack which is the stack bottom. so for the main thread we can use that if it is in 'running' state.
If there's only one pthread_id and one os_tid left without a match, we can match them. I'd guess that at many situations there's only one 'running' thread at a time.

Using this I think we can heavily reduce the amount of time the heuristic is applied, leading to more accurate results. I have implemented the following method, and have gotten good results. Consider the following code snippet:

def test(): def half_sleep(): i = 0 while True: sleep(3) for j in range(10000000): i *= 1

random_data = os.urandom(10000000)
def compress_data():
    while True:
        zlib.compress(random_data)

for thread in [Thread(target=compress_data), Thread(target=half_sleep)]:
    thread.start()

The program was sampled in 3 modes:

non-blocking mode: All threads are marked active at each sample as it is impossible to match the pthread_id to os_thread_id (os_thread_id is needed to check if active). As expected, results show that each of the threads takes about a third of the flamegraph.
non-blocking & gil mode: This could help for some programs who runs python-native code sololy, but in this case the results are actually less accurate as most of zlib.compress runs in native C code, which doesn't hold the gil. The distribution in this case was: ~87% for half_sleep thread, ~13.5% for compress_data thread.
Using the above method distribution was: ~91% for compress_data thread, ~9% for half_sleep thread.

I'd be happy to share\merge my code if you find it useful, and anyway, I would be happy to hear your opinion and thoughts on this. Thanks in advance

benfred / py-spy

Trying to implement better "active" threads detection in nonblocking mode #480