NVIDIA / NVTX

The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resources in your applications.
Apache License 2.0
312 stars 48 forks source link

Implement the `NVTX3_FUNC_RANGE_IF_IN` and `NVTX3_FUNC_RANGE_IF` macros #28

Closed AntoineFroger closed 3 years ago

AntoineFroger commented 3 years ago

The NVTX++ interface currently have the NVTX3_FUNC_RANGE and NVTX3_FUNC_RANGE_IN macros which allow the generation of an NVTX range from the lifetime of the block.

There are scenarios where we only want to conditionally generate NVTX annotations. For example, some developers might want to annotate their libraries but have some kind of verbosity control. In this case, they might want to control whether an annotation is emitted or not dynamically.

One possible implementation would be to add an additional class similar to domain_thread_range which would enable the move constructor. The macro implementation would then be the following:

#define NVTX3_V1_FUNC_RANGE_IF_IN(C, D) \
  ::nvtx3::v1::movable_domain_thread_range<D> const nvtx3_range__{}; \
  if (C) { \
    static ::nvtx3::v1::registered_string<D> const nvtx3_func_name__{__func__}; \
    static ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__}; \
    nvtx3_range__ = ::nvtx3::v1::movable_domain_thread_range<D>{nvtx3_func_attr__}; \
  }

If the user wants the condition to only be evaluated once for the whole duration of the program execution, he can cache the result in a static variable.

The downside of making a class that allows a thread range to be movable is that it can allow misuse of the API. For e.g. a user might create such thread range an move it into a functor which is executed in another thread. If this is problematic, this class could be implemented into the detail namespace and documented to warn the users of those invalid cases.

AntoineFroger commented 3 years ago

I have some code written to implement those macros. Just need to test it before I submit it for review.

jrhemstad commented 3 years ago

The other thing this would require is for the default ctor of movable_domain_thread_range to emit no range. I think we avoid introducing a whole new range type and avoid making thread_range movable by using an IILE and introducing a tag type in domain_thread_range::no_range with a domain_thread_range::domain_thread_range(domain_thread_range::no_range) ctor that emits no range. Something like this:

#define NVTX3_V1_FUNC_RANGE_IF_IN(C, D) \
  ::nvtx3::v1::movable_domain_thread_range<D> const nvtx3_range__ = [&](){
  if (C) { \
    static ::nvtx3::v1::registered_string<D> const nvtx3_func_name__{__func__}; \
    static ::nvtx3::v1::event_attributes const nvtx3_func_attr__{nvtx3_func_name__}; \
    return ::nvtx3::v1::domain_thread_range<D>{nvtx3_func_attr__}; \
   }  else{
      return ::nvtx3::v1::domain_thread_range<D>{::nvtx3::v1::domain_thread_range::no_range};
   } 
};
}
jrhemstad commented 3 years ago

Hm, on second thought, my IILE idea won't work because it still requires the object to be movable...

Can we just use start_range/end_range here instead?

AntoineFroger commented 3 years ago

Hm, on second thought, my IILE idea won't work because it still requires the object to be movable...

If we don't touch domain_thread_range and add the movable_domain_thread_range this might not be an issue. What do you think?

Can we just use start_range/end_range here instead?

I would rather not for three reasons:

  1. Tool might process and visualize push/pop and start/end ranges differently. This is the case for Nsight Systems
  2. Semantically, NVTX3_V1_FUNC_RANGE and NVTX3_V1_FUNC_RANGE_IN represent a thread range, not a process range. So it would be weird to rely on domain_process_range for the IF variant
  3. The overhead for start/end might be slightly higher than for push/pop. This is the case for Nsight Systems
jrhemstad commented 3 years ago

If we don't touch domain_thread_range and add the movable_domain_thread_range this might not be an issue. What do you think?

Adding a whole additional type for this one use case is kind of a lot of code.

AntoineFroger commented 3 years ago

Adding a whole additional type for this one use case is kind of a lot of code.

We can create a base domain_thread_range_base class which contains most of the shared code (relevant constructors & deleted operators). The domain_thread_range and movable_domain_thread_range could inherit from it and only implemented the pieces specific to their class (move ctor/operator, destructor)

jrhemstad commented 3 years ago

I still don't love adding a new range type if we can avoid it. Here's how we can do it just by adding a tag type to pass to domain_thread_range that indicates no range should be generated: https://godbolt.org/z/v9MYzc

jrhemstad commented 3 years ago

Also, I'm wondering about if nvtx3.hpp really needs to provide this macro. I'm not doubting if it's useful, I'm just questioning if it really needs to be provided by the library.

There are no doubt numerous variants of this macro some people would find useful, e.g., variants that allow specifying parts of the event_attributes. The lack of overloading on macros makes this nasty as we then have to have different names for all of them. As a rule, I think nvtx3.hpp should provide the bare minimum number of convenience macros to avoid bloat and complexity. The question is, do we think NVTX3_FUNC_RANGE_IF and NVTX3_FUNC_RANGE_IF_IN are essential enough to be considered bare minimum.

Also the naming of NVTX3_FUNC_RANGE_IF_IN is potentially confusing. It sounds like the range would only be generated if it's in something.