NVIDIA / nvbench

CUDA Kernel Benchmarking Library
Apache License 2.0

[FEA] The `--profile` flag should disable CUPTI metrics #99

Open jrhemstad opened 2 years ago

jrhemstad commented 2 years ago

The --profile flag executes a benchmark a single time so that it can be profiled with tools like Nsight Systems and Nsight Compute.

These tools are incompatible with concurrent use of CUPTI:

> When profiling a framework or application that uses CUPTI, like some versions of TensorFlow(tm), Nsight Systems will not be able to trace CUDA usage due to limitations in CUPTI. These limitations will be corrected in a future version of CUPTI. Consider turning off the application's use of CUPTI if CUDA tracing is required.

Therefore, the --profile flag should also disable any use of CUPTI for gathering metrics.

davidwendt commented 2 years ago

I looked into this a bit and was not able to find any CUPTI calls occurring when the --profile flag is used. My analysis started here: https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/option_parser.cu#L427-L430 where the --profile flag sets the run_once and disable_blocking_kernel state flags.

I then traced the logic to here: https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/detail/state_exec.cuh#L119-L139 Note that even if is_cupti_required() is true, this path will not execute if run_once is enabled.

Further tracing into nvbench::detail::measure_cold did not show any CUPTI calls either when run_once is enabled.

Finally, I found that all the CUPTI calls in nvbench are encapsulated in https://github.com/NVIDIA/nvbench/blob/main/nvbench/cupti_profiler.cxx where all the API calls are checked by https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/cupti_profiler.cxx#L41 I added a printf to this function, and it never printed when using --profile.

Of course, I may have missed something and would welcome any feedback on the above.
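The instrumentation approach described above can be illustrated with a minimal sketch. It assumes a single helper through which every profiling API call is checked, so one print statement is enough to reveal whether any call happens at all. The names `status`, `check_call`, and `g_checked_calls` are hypothetical and chosen for illustration; they are not nvbench's or CUPTI's actual API.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the status code returned by a profiling API call.
enum class status { success, failure };

// Number of times the checking helper has run. If this stays at zero for a
// whole benchmark run, no wrapped API call was made.
int g_checked_calls = 0;

// Hypothetical checking helper: every wrapped API call funnels through here.
// The fprintf is the temporary instrumentation described in the comment above.
void check_call(status s, const char *what)
{
  ++g_checked_calls;
  std::fprintf(stderr, "### checked call: %s\n", what);
  if (s != status::success)
  {
    throw std::runtime_error(std::string{"call failed: "} + what);
  }
}
```

If nothing is printed and the counter is still zero after a run with --profile, the wrapped API was never invoked on that path, which is the observation reported here.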

PointKernel commented 2 years ago

IIUC, CUPTI will be used only if any of these auto throughput measurements are requested: https://github.com/NVIDIA/nvbench/blob/5d70492714d05f2207e2193be8a8cc0a85eefc76/examples/auto_throughput.cu#L64-L68

We need to explicitly set the bools below to false when --profile is present: https://github.com/NVIDIA/nvbench/blob/1a13a2e724b8aa8aee27649ac6878babb63862a6/nvbench/state.cuh#L288-L292

jrhemstad commented 2 years ago

@PointKernel is correct. CUPTI collection will only occur when a benchmark explicitly opts in via collect_dram_throughput() and similar calls.

The --profile flag should override a benchmark that uses collect_dram_throughput().

In fact, it may make sense to just disable any output altogether when using --profile. My intuition is that someone using --profile doesn't care about the output from nvbench.
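The proposed override could look roughly like the sketch below. It is a simplified model and not nvbench's actual state class: `bench_flags` and `apply_profile` are hypothetical names, and a single `collect_dram_throughput` bool stands in for all of the CUPTI opt-in flags.

```cpp
// Hypothetical, simplified model of the per-benchmark flags discussed in this
// issue. This is not nvbench's real state class.
struct bench_flags
{
  bool run_once{false};
  bool disable_blocking_kernel{false};
  bool collect_dram_throughput{false}; // stands in for every CUPTI opt-in
};

// Sketch of the proposal: --profile takes precedence over a benchmark's
// CUPTI opt-in, in addition to the two flags it already sets.
void apply_profile(bench_flags &flags)
{
  flags.run_once                = true;
  flags.disable_blocking_kernel = true;
  flags.collect_dram_throughput = false;
}
```

With this shape, a benchmark that opted in to DRAM throughput collection would report no CUPTI requirement once --profile has been applied.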

davidwendt commented 2 years ago

These are all wrapped by a single is_cupti_required() call: https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/state.cuh#L217-L226 This is checked here: https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/detail/state_exec.cuh#L125-L127 but only if run_once is not enabled, and run_once is enabled when --profile is specified.
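The control flow described in this comment can be modeled in a few lines. This is a sketch under the assumption that the CUPTI pass is guarded by both conditions as stated above; `state_model` and `runs_cupti_pass` are hypothetical names, and the real logic lives in the linked state_exec.cuh.

```cpp
// Hypothetical, simplified model of the two pieces of state that matter here.
struct state_model
{
  bool run_once{false};       // set by --profile
  bool cupti_required{false}; // set when a benchmark opts in to CUPTI metrics

  bool is_cupti_required() const { return cupti_required; }
};

// The CUPTI measurement pass is only reached when run_once is NOT set, so the
// value of is_cupti_required() is irrelevant under --profile.
bool runs_cupti_pass(const state_model &s)
{
  return !s.run_once && s.is_cupti_required();
}
```

In this model, is_cupti_required() can return true while no CUPTI pass runs, which is consistent with the runs reported in this thread.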

I ran the throughput benchmark with --profile and did not see any CUPTI calls. Perhaps there is something else that I'm missing?

jrhemstad commented 2 years ago

Hm, it could be that simply linking with CUPTI causes the incompatibility with GPU metric collection in Nsight.

davidwendt commented 2 years ago

> Hm, it could be that simply linking with CUPTI causes the incompatibility with GPU metric collection in Nsight.

There is an NVBENCH_HAS_CUPTI compile flag so I could try out that theory.

PointKernel commented 2 years ago

> I ran the throughput benchmark with --profile and did not see any CUPTI calls.

Ideally, we would expect is_cupti_required() to return false when --profile is present. It looks like CUPTI APIs won't be called when --profile is used, regardless of the return value of is_cupti_required().

To verify this, I added a printf of the return value of is_cupti_required() at the beginning of state::exec(). When testing with auto_throughput, is_cupti_required() still returned true when I ran the benchmark with the --profile flag:


    (base) yunsongw@yunsongw-dt:~/Work/nvbench/build$ ./bin/nvbench.example.auto_throughput --profile
    # Devices

    ## [0] `Quadro RTX 8000`
    * SM Version: 750 (PTX Version: 750)
    * Number of SMs: 72
    * SM Default Clock Rate: 1770 MHz
    * Global Memory: 48410 MiB Free / 48601 MiB Total
    * Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
    * Max Shared Memory: 64 KiB/SM, 48 KiB/Block
    * L2 Cache Size: 6144 KiB
    * Maximum Active Blocks: 16/SM
    * Maximum Active Threads: 1024/SM, 1024/Block
    * Available Registers: 65536/SM, 65536/Block
    * ECC Enabled: No

    # Log

    Run:  [1/4] throughput_bench [Device=0 T=1 Stride=1]
    ### is_cupti_required: 1
    ### is_cupti_required: 1
    ### is_cupti_required: 1
    ### cold is_cupti_required: 1
    Pass: Cold: 0.487680ms GPU, 0.493811ms CPU, 0.00s total GPU, 0.00s total wall, 1x
    Run:  [2/4] throughput_bench [Device=0 T=1 Stride=4]
    ### is_cupti_required: 1
    ### is_cupti_required: 1
    ### is_cupti_required: 1
    ### cold is_cupti_required: 1
    Pass: Cold: 1.232896ms GPU, 1.237910ms CPU, 0.00s total GPU, 0.00s total wall, 1x
    Run:  [3/4] throughput_bench [Device=0 T=2 Stride=1]
    ### is_cupti_required: 1
    ### is_cupti_required: 1
    ### is_cupti_required: 1
    ### cold is_cupti_required: 1
    Pass: Cold: 1.026048ms GPU, 1.031796ms CPU, 0.00s total GPU, 0.00s total wall, 1x
    Run:  [4/4] throughput_bench [Device=0 T=2 Stride=4]
    ### is_cupti_required: 1
    ### is_cupti_required: 1
    ### is_cupti_required: 1
    ### cold is_cupti_required: 1
    Pass: Cold: 2.453760ms GPU, 2.459609ms CPU, 0.00s total GPU, 0.00s total wall, 1x

davidwendt commented 2 years ago

> Finally, I found that all the CUPTI calls in nvbench are encapsulated in https://github.com/NVIDIA/nvbench/blob/main/nvbench/cupti_profiler.cxx where all the API calls are checked by
>
> https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/cupti_profiler.cxx#L41
>
> I added a printf into this function and it never printed when using --profile

Perhaps you can try the above to print out whether CUPTI APIs are being called? They are not being called for me when --profile is enabled.

PointKernel commented 2 years ago

> They are not being called for me when --profile is enabled.

Yeah, you are right. Though is_cupti_required() returns true, CUPTI APIs are not called when --profile is used.

davidwendt commented 2 years ago

> > Hm, it could be that simply linking with CUPTI causes the incompatibility with GPU metric collection in Nsight.
>
> There is an NVBENCH_HAS_CUPTI compile flag so I could try out that theory.

I rebuilt STRINGS_NVBENCH with CUPTI disabled (in CMake), so that it does not even link to libcupti.so. The nsys results still include missing/inconsistent data with CUPTI completely out of the picture. So it seems that only running nsys with --gpu-metrics-frequency=100 fixes this problem.

GregoryKimball commented 2 years ago

Thank you @PointKernel and @davidwendt for investigating the missing/inconsistent GPU metrics data with nvbench. If it's not the CUPTI dependency, what is the root cause of the problem? Lowering the GPU metrics frequency is not an adequate workaround, because most functions take <10 ms. Also, even with the 100 Hz GPU metrics frequency setting, I see the same error for STRINGS_NVBENCH on my A100 instance running Nsight Systems 2022.3.4.35-5490857:

    Events fetch failed: Source ID=Type=ErrorInformation (18) Error information: ProcessEventsError (4005)
    Properties:
    ErrorText (100)=/build/agent/work/323cb361ab84164c/QuadD/Host/Analysis/EventHandler/GpuMetricsEventHandler.cpp(202): Throw in function void QuadDAnalysis::EventHandler::GpuMetricsEventHandler::PutEvent(QuadDAnalysis::EventHandler::GpuMetricsEventHandler::EventPtr)
    Dynamic exception type: boost::wrapexcept
    std::exception::what: ChronologicalOrderError
    [QuadDCommon::tag_message*] = GPU Metrics event chronological order was broken.

What other parts of nvbench could be causing the missing/inconsistent data?
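The concern about lowering the GPU metrics frequency can be made concrete with a little arithmetic: a sampler running at 100 Hz takes one sample every 10 ms, so a function shorter than 10 ms is expected to receive less than one sample on average. The helpers below are a hypothetical sketch of that calculation and are not part of nvbench or Nsight Systems.

```cpp
#include <cassert>

// Sampling period in milliseconds for a sampling frequency given in Hz.
double sample_period_ms(double frequency_hz)
{
  assert(frequency_hz > 0.0);
  return 1000.0 / frequency_hz;
}

// Average number of samples expected to land inside an interval of the given
// duration, assuming samples are evenly spaced at the given frequency.
double expected_samples(double duration_ms, double frequency_hz)
{
  return duration_ms / sample_period_ms(frequency_hz);
}
```

For example, the longest cold pass in the log above (2.453760 ms) would be expected to receive well under one sample at 100 Hz.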