benfred / py-spy

Sampling profiler for Python programs
MIT License

Profiling native threads? #332

Open SimonSapin opened 3 years ago

SimonSapin commented 3 years ago

Does py-spy record ignore threads that don’t contain any Python stack frame by default?

I have a Python program with a native extension (that happens to be written in Rust). That extension starts a thread (with Rust’s std::thread::spawn) to do some CPU-intensive work in parallel with other work. The child thread never runs a Python interpreter. The SVG output of the profiler is missing everything in the second thread. --native does show Rust stack frames, but only in the parent thread. Adding --threads adds the ID of the parent thread to the output but nothing else. Adding --idle doesn’t seem to change anything for this program.

When using py-spy dump --pid (at the right time) however, the stack of both threads is printed correctly.

Can I use py-spy to profile both threads?

benfred commented 3 years ago

Not right now =( We merge the native stack traces into python frames - but not vice versa. You'll have to profile with other native profiling tools like perf etc. to profile the native thread
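
To picture why a purely native thread can drop out of the output, here is a toy model. It is not py-spy's actual code. It assumes the sampler starts from the threads the Python interpreter knows about and only then attaches native frames to them, so a thread with no interpreter state is never visited. All thread IDs, frame names, and the `sample_python_driven` helper are hypothetical.

```python
from typing import Dict, List

Stack = List[str]  # frames ordered innermost (most recent call) first


def sample_python_driven(python_stacks: Dict[int, Stack],
                         native_leaf_frames: Dict[int, Stack]) -> Dict[int, Stack]:
    """Toy sampler: iterate over Python threads and attach native frames to each."""
    sample = {}
    for tid, py_frames in python_stacks.items():
        # Native frames are looked up only for threads that have Python frames.
        sample[tid] = native_leaf_frames.get(tid, []) + py_frames
    return sample


# Hypothetical process: thread 1 runs Python and calls into the extension,
# thread 2 was spawned by the extension and never runs the interpreter.
python_stacks = {1: ["compute", "<module>"]}
native_leaf_frames = {
    1: ["extension_entry"],
    2: ["heavy_work", "thread_start"],
}

sample = sample_python_driven(python_stacks, native_leaf_frames)
print(sample)  # thread 2 has native frames but never appears in the sample
```

In this model the native data for thread 2 is available, but nothing ever asks for it, which is consistent with the symptom described in the original report.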

SimonSapin commented 3 years ago

That’s unfortunate. Can you say more about this merging? Does it need to happen?

ogrisel commented 3 years ago

Indeed, it would be very helpful to have py-spy include native threads in its reports, to understand the performance of CPU-intensive Python programs that use data science libraries like numpy, which rely on multi-threaded native linear algebra libraries such as OpenBLAS, MKL and co.

The same goes for machine learning libraries like scikit-learn, lightgbm and xgboost, which use OpenMP threads in the CPU-intensive sections of their code written in Cython or C++.

At the moment, profiling with py-spy record --native --threads --format speedscope and loading the results into the speedscope visualizer makes no sense to me...

Jongy commented 3 years ago

We're using libunwind-ptrace in PyPerf, and we just place native frames on top of the Python frames (stopping at the first native frame that is a PyEval_EvalFrame*, which belongs to the topmost Python function). For a truly native thread with no Python frames, we will just have its native stack.

IIRC py-spy uses libunwind-ptrace as well? So this rather simple scheme could work.
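
A minimal sketch of the scheme described above, under stated assumptions: both stacks are plain lists of names ordered innermost first, and the interpreter loop is recognized by the substring `PyEval_EvalFrame` in the symbol name. The `merge_stacks` helper and the sample frames are hypothetical, and this is neither PyPerf's nor py-spy's real code.

```python
from typing import List


def merge_stacks(native: List[str], python: List[str]) -> List[str]:
    """Place native leaf frames on top of the Python frames.

    Both stacks are ordered innermost (most recent call) first.
    """
    if not python:
        # A truly native thread: keep its native stack as-is.
        return list(native)

    leaf = []
    for symbol in native:
        # Stop at the first interpreter-loop frame; everything below it
        # is already represented by the Python frames.
        if "PyEval_EvalFrame" in symbol:
            break
        leaf.append(symbol)
    return leaf + list(python)


# Thread with Python frames: native leaf frames go on top.
mixed = merge_stacks(
    ["dgemm_kernel", "cblas_dgemm", "_PyEval_EvalFrameDefault", "Py_RunMain", "main"],
    ["dot", "train", "<module>"],
)

# Thread with no Python frames: the native stack is reported unchanged.
native_only = merge_stacks(["heavy_work", "thread_start", "clone"], [])

print(mixed)
print(native_only)
```

The design choice here is that a missing Python stack is treated as a normal case rather than a reason to skip the thread.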

ogrisel commented 3 years ago

> Not right now =( We merge the native stack traces into python frames - but not vice versa. You'll have to profile with other native profiling tools like perf etc. to profile the native thread

@benfred It would be great to have native thread support in py-spy. In my case, some of those native threads are managed by OpenMP via Cython prange loops, so they can call Cython functions, and py-spy's Cython support would be very handy.

Furthermore, if speedscope ever supports multitrack views with time-aligned traces, it would be very helpful to understand when those native threads come into play and interact with the calling Python code.

Would @Jongy's suggested solution above work?