Tracing mixed Python / C++ code?

amzn / hawktracer

HawkTracer is a highly portable, low-overhead, configurable profiling tool built in Amazon Video for getting performance metrics from low-end devices.

MIT License

132 stars 31 forks source link

Tracing mixed Python / C++ code? #46

Closed drewm1980 closed 5 years ago

drewm1980 commented 5 years ago

Has anyone written an integration for tracing python code? If so, do you have a reference?

Greetings from Belgium, by the way; I'm the American guy who asked you a couple questions at FOSDEM.

Cheers, Andrew

loganek commented 5 years ago

Hi @drewm1980 I'm glad you consider using HawkTracer :) I've never written a code for tracing python, and I haven't heard of anybody who did that. I also didn't have plan to do this in the nearest future. However, since there's a potential use of that, I'm very happy to implement that functionality, or at least assist you with doing that. I'd need some more input first, so we can have a proper design that fits your needs, but also is generic enough so other people can benefit from that. Could you please tell me what exactly would you like to trace? Are you only interested in measuring time spent in a particular function, or you want to measure time of arbitrary code? Do you want to be able to define custom classes in python, or having this feature in C++ is enough for now for you? How do you integrate your C++ code with python? Are you just running python interpreter inside your C++ application? I'd like to know a bit more what's your usecase, so I can deliver a solution that works best for you :)

drewm1980 commented 5 years ago

We're writing soft real-time software for robotic pick and place machines, with cycle times usually 100-200 ms. We'd like to see visually when operations are taking longer than we expect, threads are contending, cpu is idle while waiting for the GPU, etc... in the name of reducing worst case latency for analyzing one frame of data. Python is our glue language and calls into C++ code we wrote, and Tensorflow.

Some other profilers we're looking at: RAD games Telemetry (intrusive, native only) http://www.radgametools.com/telemetry.htm That looks like the best intrusive profiler out there, but it's priced for AAA games, so you need a solid business case to use it at all. I saw a video of that in one of Jonathan Blow's live coding videos, and it's what inspired me to improve our profiling tooling past CProfile + Snakeviz. Our ideal would be that, but without the need for manual instrumentation, and have python and native symbols/call stacks resolved correctly.

Uber's pyflame and Ben Frederickson's py-spy (sampling, mixed Python/native) are very promising, but they're very new (as in Ben added the native call stack unwinding parts only last week) and I had issues getting them working in our environment. https://github.com/uber/pyflame/issues/165 https://github.com/benfred/py-spy/issues/86

easy_profiler (intrusive, native only, with a gui) has been around longer, but I couldn't get the client and the code to connect: https://github.com/yse/easy_profiler

We may try perf + cpuprofilify (sampling, native): https://github.com/thlorenz/cpuprofilify

I evaluated VTune; it has some python integration, but it doesn't seem to have a fully functional timeline view; as a tool it seems more focused on reducing overall cpu time, rather than minimizing worst case latency.

There are also big libraries like LTTNG and SystemTap for system-wide profiling, and big libraries for HPC stuff, but it's not clear how to even get started with them.

Nvidia's and AMD's profilers are understandably focused on profiling their GPU's, and don't show much, if anything for the cpu.

Anyway, thanks for the response! We did get hawktracer working on a C++ function in our code; thanks for writing it!

loganek commented 5 years ago

Hi @drewm1980 thanks for the explanation. So it looks like you're basically interested in low-overhead intrusive function-level time profiling for more than just the native stack - that's what HawkTracer was designed for, as our usecase was very similar, but except python, we had lua/javascript.

I'm very happy to work with you getting python bindings to work in HawkTracer.

I see three possible solutions for that:

We expose Start(label)/Stop() functions to python script, so you can do something like:
```
def foo():
hawktracer.Start("some_label")
# your python code here
hawktracer.Stop()
```
This is just modification of the 1). We keep exposing Start()/Stop() methods, but in your code instead of calling those methods directly, you write them as a comment:
```
def foo():
# TracepointStart some_label
# your your python code here
# TracepoinStop
```
Then you have generator that scans the code, and turns it into the code shown in the 1). The advantage of this solution is that in the production code you can very easily completely disable HawkTracer, and just have a special build target where you enable those tracepoints. The disadvantage of this solution is complexity, as you need to have 2 build targets, script for scanning and fixing the code etc. Other option to reduce overhead (to almost zero) in the production code is to just keep the code as it was in 1), and change C++ implementation of Start()/Stop() method to empty functions if you don't want to run the profiler (it could be e.g. some kind of ifdef
Not sure if that's possible in python, but in LUA you can set function call/function exit hooks (http://pgl.yoyo.org/luai/i/lua_sethook) so you can easily trace all the functions called in python. As I said, I'm not sure if that's possible in python. We could have some kind of filtering, in case there's too many tracepoints that we're not interested in.

regarding 1) and 2) - if you want to measure the scope, we could have a class with the deleter, and measure the scope using with statement: For 1), it'd be:

def foo():
  with HawkTracerTracepoint("label") as tp:
    # your python code

For 2):

def foo():
  # TraceScope label
  # your python code

and then have a script that turns the code into 1).

We could also mix 1)/2) with 3).

Please let me know what are your thoughts on that, and if you have any other ideas :)

drewm1980 commented 5 years ago

I'm not sure what the most ergonomic or efficient API for python is... function decorators are likely the pythonic way to do it. The interpreter already has some mechanism for instrumenting at the function level, though, through cProfile. If there's a way of setting up hawktrace hooks for specific functions at module import time, I bet it would be more efficient than having python function decorator code call the C++ functions through an FFI at function call time... But keep in mind I know very little about how the python interpreter works internally.

cProfile has a reputation for high overhead, though we haven't measured it on our project yet (one of our next steps). Maybe restricting it only to instrument functions you care about (and log them through hawktrace so the log is shared with native code) would be fast enough for most uses.

We already spent probably too much dev time on tooling; not sure how much more time we can spend in the short term. We'll likely focus on improving coverage of the C++ parts of our code for now.

loganek commented 5 years ago

Then how about we implement a decorator, something like:

import os
import HawkTracer

def HawkTracerTrace(func):
    if HawkTracerTrace.tracing_enabled is None:
        HawkTracerTrace.tracing_enabled = 'HAWKTRACER_TRACE' in os.environ
        print(HawkTracerTrace.tracing_enabled)

    def wrapper(*args, **kwargs):
        if HawkTracerTrace.tracing_enabled is None:
            HawkTracerTrace.tracing_enabled = 'HAWKTRACER_TRACE' in os.environ

        if HawkTracerTrace.tracing_enabled:
            HawkTracer.start(func.__name__)
            func(*args, **kwargs)
            HawkTracer.stop()
        else:
            func(*args, **kwargs)

    return wrapper

HawkTracerTrace.tracing_enabled = None

Then whenever you want to instrument your function, you can do so:

@HawkTracerTrace
def func1():
    print("Func1()")

@HawkTracerTrace
def func2():
    print("Func2()")

? You can control whether the attribute is enabled or disabled through env variable. Also, you can decide which function you want to trace, as you need to manually instrument your code.

drewm1980 commented 5 years ago

Hi, I think it's better to do the os.environ check outside of the wrapper, and if false, just return the function unmodified (I believe the decorator is evaluated when the interpreter hits the "@"). Otherwise you're adding control flow and environment variable checks even when instrumentation is supposed to be turned off. How about:

in hawktracer.py:

#!usr/bin/env python3

import os
#_tracing_enabled = False
_tracing_enabled = True
if 'HAWKTRACER_TRACE' in os.environ:
    _tracing_enabled = True

def enable():
    _tracing_enabled = True

def disable():
    _tracing_enabled = False

def _start(region_name):
    # Actual code that starts a hawktracer region
    pass

def _stop():
    # Actual code that stops a hawktracer region
    pass

def start(region_name):
    if _tracing_enabled:
        print("Started a trace region called",region_name,'...')
        _start(region_name)
    else:
        print("NOT Starting a trace region called",region_name,'...')

def stop():
    if _tracing_enabled:
        print("Stopped a trace region")
        _stop()

def trace(func,_tracing_enabled=_tracing_enabled):
    if _tracing_enabled:
        print("Instrumenting function",func.__name__)
        def wrapper(*args, **kwargs):
            _start(func.__name__)
            func(*args, **kwargs)
            _stop()
        return wrapper
    else:
        print("NOT instrumenting function",func.__name__)
        return func

in user code:

#!/usr/bin/env python3

import hawktracer
from hawktracer import trace

hawktracer.enable()

from time import sleep

@trace
def foo():
    pass

from hawktracer import start, stop
start("bar")
stop()

hawktracer.disable()

@trace
def baz():
    pass

The above code has a bug I haven't worked out yet... enable() and disable() don't have an effect. Maybe you already tried something like that and it's why you did what you did. There's probably another python intrusive profiler out there that has worked this issue out...

loganek commented 5 years ago

Hi @drewm1980 Thanks a lot for suggestions. I'm aware the code I wrote is not ideal, but the main thing I wanted to point out is the end-user experience. So I assume we both agree that the api should be:

@trace
def foo():
  # this function is going to be traced when called
  pass

hawktracer.enable() # tracing enabled
foo() # this will generate the tracepoint
hawktracer.disable() # tracing disabled
foo() # this function call won't generate tracepoints, because hawktracer is disabled

Not sure about the function names/namespaces yet, but more or less it'll be something like above. If you agree with that design, I'll prototype it over the weekend. If the implementation is not efficient, we can iterate and improve it, but at least the functionality is going to be there and the API will be available for customers (including yourself:) ). Please let me know what do you think about that, so we can start working on that soon.

drewm1980 commented 5 years ago

I'm likewise unsure about what people end up really needing in practice, i.e. if they're more concerned about being able to enable/disable tracing:

globally across python (at import time) and native(at compile time)
globally at import time within python
at import time for groups of functions (or modules?)
at function call time for functions (or modules?)

Each project probably will have an opinion on how to do 0 depending on their build environment. I was trying to make 1. and 2. work. I think your last example is trying to make 3. work.

Do you have internal python users at Amazon to drive your development priorities? Or do you have aspirations for hawktracer to grow into some sort of tool for AWS/cloud? The ordering above would probably be mine, but you're not working for us :)

Being able to promise "no overhead in instrumented functions when disabled" is an important promise to be able to make to put people's minds at ease when sprinkling instrumentation code all over their codebase. I think the way I wrote the decorator probably achieved that... in C++ you probably want to have a preprocessor flag that ensures hawktrace isn't in the compiled code when disabled.

It's the #2 FAQ on Telemetry's web page: http://www.radgametools.com/telemetry/faq.html

loganek commented 5 years ago

hi @drewm1980 Right now there's no usecase at Amazon. However, we have very similar situation for JavaScript and LUA. I don't see much difference between my proposal and yours - they both require instrumenting the function you want to trace, and they both do it through annotations, i.e.:

@trace
def foo():
  pass

The way it's going to be implemented is not that important for me right now - it's easy to change the implementation, but difficult to change the API. I can guarantee that we do our best to introduce as little overhead (ideally, zero overhead) as possible when user decides to disable the HawkTracer system. But as I said, it's less important than deciding the API itself. Having said that, looks like the function annotations is the API we both agree, so if there's no objection, I'll do a prototype over the weekend. :)

drewm1980 commented 5 years ago

I of course have no objection!

I have a point to make RE the implementation affecting the API, but I found it easier to write the code than to explain it clearly. I'll fork and create a pull request.

drewm1980 commented 5 years ago

Here's the pull request:

https://github.com/amzn/hawktracer/pull/47

No offense if you don't want to pull for whatever reason; cutting and pasting here was just getting clunky. I got 2. working, but it required changing how the API is used (see code). 3. is broken.

loganek commented 5 years ago

Hi @drewm1980 your proposal looks good. I have a few improvements though that would allow us to do 3 as well. I changed the _trace_ function:

def _trace(func):
    global _tracing_enabled

    print("Instrumenting function",func.__name__)
    def wrapper(*args, **kwargs):
        if _tracing_enabled:
            _start(func.__name__)
        func(*args, **kwargs)
        if _tracing_enabled:
            _stop()
    return wrapper

so it checks in the wrapper if tracing is enabled - it's only an overhead (one if statement) if function was instrumented (i.e. decorators were enabled), and it's just one condition, which is not significant compared to other hawktracer stuff.

I also changed enable_decorator() method:

def enable_decorator():
    global _tracing_enabled
    global trace
    if _tracing_enabled:
        trace = _trace

so it only enables the decorator when the tracing is enabled. After that, the example from your pull request works as expected.

I also changed default value of _tracing_enabled (I set it to False by default) and default value of trace variable (I set it to _identity).

What do you think about those changes?

loganek commented 5 years ago

Hi @drewm1980 I've done a prototype over the weekend on my personal github account. Please have a look and let me know what do you think about it. Here is example usage: https://github.com/loganek/hawktracer/blob/python-lib-bindings/bindings/python3/instrumentation-example.py And here's the implementation: https://github.com/loganek/hawktracer/blob/python-lib-bindings/bindings/python3/hawktracer_core_python.cpp

I'm very open for any suggestion, feel free to make pull request to that branch as well. Once we think it's ready, I'll merge it to this repository's master branch.

drewm1980 commented 5 years ago

Hi @loganek Very cool! So it looks like for the decorator disabling code you're doing what my python code was doing, but ported to C:

static PyObject *
ht_python_core_trace(PyObject* Py_UNUSED(self), PyObject* args)
{
    PyObject *traced_function;

    if (!PyArg_ParseTuple(args, "O", &traced_function))
    {
        return NULL;
    }

    if (trace_method)
    {
        return PyCFunction_New(trace_method, traced_function);
    }
    else
    {
        Py_XINCREF(traced_function);
        return traced_function;
    }
}

Does having the state for whether the decorator is enabled in C mean that users can do import HawkTracer.Core.trace as trace and then just do @trace and have decorator disabling still work?

static PyObject*
ht_python_core_trace_function(PyObject* function, PyObject *args)
{
    uintptr_t address = 0;

    if (!PyCallable_Check(function))
    {
        Py_RETURN_NONE;
    }

    if (tracing_enabled)
    {
        PyObject* function_name_attr = PyObject_GetAttrString(function, "__name__");

        if (function_name_attr)
        {
            address = ht_python_core_get_label_address(function_name_attr);
            ht_feature_callstack_start_int(ht_global_timeline_get(), address);
            Py_DECREF(function_name_attr);
        }
    }

    PyEval_CallObject(function, args);

    if (address)
    {
        ht_feature_callstack_stop(ht_global_timeline_get());
    }

    Py_RETURN_NONE;
}

I'm studying how you're caching the function names... I'm wondering if there will be issues with functions that have the same name... in particular, in python multiple instances of the same string can end up sharing memory as an optimization, and this even holds for function names, in or out of a class:

In [7]: x = 'foo'                                                                                   

In [8]: y = 'foo'                                                                                   

In [9]: id(x)                                                                                       
Out[9]: 140149505679912

In [10]: id(y)                                                                                      
Out[10]: 140149505679912

In [13]: def f(): 
    ...:     pass 
    ...:                                                                                                  

In [14]: g = f                                                                                            

In [15]: def f(): 
    ...:     pass 
    ...:                                                                                                  

In [16]: g.__name__                                                                                       
Out[16]: 'f'

In [17]: f.__name__                                                                                       
Out[17]: 'f'

In [19]: id(g.__name__)                                                                                   
Out[19]: 140149571706928

In [20]: id(f.__name__)                                                                                   
Out[20]: 140149571706928

In [26]: class Foo(): 
    ...:     def __init__(self): 
    ...:         pass 
    ...:     def f(self): 
    ...:         pass 

In [28]: id(Foo.f.__name__)                                
Out[28]: 140149571706928

That seems contrived, but in my codebase, there are multiple functions with the same name as members of various classes in various modules. If I understand correctly, your tracepoint_map is using function names as UUID's, and that's not going to work in a lot of python code bases; python's not like C where function names have to be unique or they collide, or like in C++ where they're transparently mangled so they don't collide when there is overloading.

Could tracepoint_map be constructed in the decorator function, so that the overhead of setting it up is at decoration-time, rather than the first run through the code?

Dunno if it's significant or not; just wanted to send some quick feedback after skimming the code. My team member who tried out hawktracer on C++ in our code will give your new python wrapper a shot.

loganek commented 5 years ago

Hi @drewm1980

Hi @loganek Very cool! So it looks like for the decorator disabling code you're doing what my python code was doing, but ported to C

Yes, it's exactly like that, with a few little improvements.

Does having the state for whether the decorator is enabled in C mean that users can do import HawkTracer.Core.trace as trace and then just do @trace and have decorator disabling still work?

Yes, I've tested various scenarios, but feel free to double-check.

I'm studying how you're caching the function names... I'm wondering if there will be issues with functions that have the same name... in particular, in python multiple instances of the same string can end up sharing memory as an optimization, and this even holds for function names, in or out of a class: That seems contrived, but in my codebase, there are multiple functions with the same name as members of various classes in various modules. If I understand correctly, your tracepoint_map is using function names as UUID's, and that's not going to work in a lot of python code bases; python's not like C where function names have to be unique or they collide, or like in C++ where they're transparently mangled so they don't collide when there is overloading.

In general I don't see any problem with using the same map entry for 2 different functions - as long as the name is correct, it's fine. But as far as I understand, you suggest to not use __name__ property as it might be confusing for the user? If so, what would you suggest instead? Ideally, I'd like to have the full name (including module, e.g. module.submodule.function but not sure how to get it in python (although I'm sure there's a way to do that). Also, please note I use python dict object instead of c++ unordered_map. The reason why is that initially I was doing it in C, therefore didn't have access to STL. I might change it in the future, but haven't decided yet.

Could tracepoint_map be constructed in the decorator function, so that the overhead of setting it up is at decoration-time, rather than the first run through the code?

Yeah, I think that'd be a nice improvement, which could be easily implemented (I hope). As I said in one of my previous comments, I'm more focusing on the interface and user experience first, so I didn't pay much attention to actual implementation. There might still be quite a few potential improvements in my code :)

Dunno if it's significant or not; just wanted to send some quick feedback after skimming the code. My team member who tried out hawktracer on C++ in our code will give your new python wrapper a shot.

Any feedback is very appreciated, especially as constructive as yours. Thanks a lot, and looking forward too see more of your comments :)

drewm1980 commented 5 years ago

Does having the state for whether the decorator is enabled in C mean that users can do import HawkTracer.Core.trace as trace and then just do @trace and have decorator disabling still work?

Yes, I've tested various scenarios, but feel free to double-check.

cool! When I showed your example code to another developer, his initial impression was "I have to do all this stuff?". Of course it's not, but it would probably be good for adoption to have a minimal_example.py file that just traces one function without showing off any other optional features.

In general I don't see any problem with using the same map entry for 2 different functions - as long as the name is correct, it's fine. But as far as I understand, you suggest to not use __name__ property as it might be confusing for the user? If so, what would you suggest instead? Ideally, I'd like to have the full name (including module, e.g. module.submodule.function but not sure how to get it in python (although I'm sure there's a way to do that). Also, please note I use python dict object instead of c++ unordered_map. The reason why is that initially I was doing it in C, therefore didn't have access to STL. I might change it in the future, but haven't decided yet.

You can disregard that comment for now; we'll let you know if it's an issue in practice. If the user annotates their own nested regions, that turns into something like a call stack in the output anyway, so maybe the name shadowing isn't an issue as long as hawktracer still gets the nesting right when regions have the same name.

Any feedback is very appreciated, especially as constructive as yours. Thanks a lot, and looking forward too see more of your comments :)

No need to thank me; you're doing all the hard work!

loganek commented 5 years ago

would probably be good for adoption to have a minimal_example.py file that just traces one function without showing off any other optional features.

Yeah, the minimalistic example would be:

import HawkTracer.Core

@HawkTracer.Core.trace
def foo():
    print("This is foo.")

foo()

and run the program with HT_PYTHON_TRACE_ENABLED set. I'll create another, simplified example.

You can disregard that comment for now; we'll let you know if it's an issue in practice. If the user annotates their own nested regions, that turns into something like a call stack in the output anyway, so maybe the name shadowing isn't an issue as long as hawktracer still gets the nesting right when regions have the same name.

Sure, let me know if you have any real issues, so we can work to fix it.

I think I'll merge my changes soon. I want to re-review it and fix potential problems. After that, I'll close the issue. Feel free to open another one, if you find some bugs or you have any future feature requests :)

drewm1980 commented 5 years ago

This is bikeshedding, but IMHO it's prettier as:

from HawkTracer.Core import trace

@trace
def foo():
    print("This is foo.")

foo()

Doesn't it still need an init function call? Or is that already handled in the import? If so, nice!

Also bikeshedding, but python has a convention for package and module name capitalization:

From PEP 8: Package and Module Names Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.

It's not uncommon for users to just start typing stuff in an ipython shell to discover an API. Such users will try the lowercase module name first, curse, maybe check your docs, then discover the capitalized version.

loganek commented 5 years ago

This is bikeshedding, but IMHO it's prettier as:

Sure, agree.

Doesn't it still need an init function call? Or is that already handled in the import? If so, nice!

It's not upstream yet, but I have that change locally.

It's not uncommon for users to just start typing stuff in an ipython shell to discover an API. Such users will try the lowercase module name first, curse, maybe check your docs, then discover the capitalized version.

Good to know, will rename it to hawk_tracer.core then

loganek commented 5 years ago

Hey @drewm1980 I finished the code and opened a PR https://github.com/amzn/hawktracer/pull/48 . Please review, any suggestion is very appreciated :)

loganek commented 5 years ago

PR #48 has been merged, closing the issue. Feel free to open a new one in case there's some missing functionality.