Open jrhemstad opened 2 years ago
I looked into this a bit and was not able to find any CUPTI calls occurring when the --profile flag is used.
Analysis started here:
https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/option_parser.cu#L427-L430
Here the --profile flag is used to set the run_once and disable_blocking_kernel state flags.
I then traced the logic to here:
https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/detail/state_exec.cuh#L119-L139
Note that even if is_cupti_required() is true, this path will not execute if run_once is enabled.
Further tracing into nvbench::detail::measure_cold did not show any CUPTI calls either when run_once is enabled.
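The control flow traced above can be modeled in a few lines. This is only a simplified sketch, not the actual nvbench code: `state_model`, `apply_args`, and `planned_passes` are hypothetical names, and the only behavior taken from the thread is that --profile sets run_once and disable_blocking_kernel, and that the CUPTI path is skipped whenever run_once is set.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the benchmark state (not nvbench's type).
struct state_model
{
  bool run_once                = false;
  bool disable_blocking_kernel = false;
  bool cupti_required          = false; // benchmark opted into a CUPTI metric
};

// Mimics the option handling described above: --profile sets both flags.
inline void apply_args(state_model &s, const std::vector<std::string> &args)
{
  for (const auto &arg : args)
  {
    if (arg == "--profile")
    {
      s.run_once                = true;
      s.disable_blocking_kernel = true;
    }
  }
}

// Returns the measurement passes that would run, in order.
// The cold pass always runs; the CUPTI pass is skipped when run_once is set,
// even if the benchmark asked for CUPTI metrics.
inline std::vector<std::string> planned_passes(const state_model &s)
{
  std::vector<std::string> passes{"cold"};
  if (s.cupti_required && !s.run_once)
  {
    passes.push_back("cupti");
  }
  return passes;
}
```

Under this model, a state with cupti_required set plans two passes without --profile and only the cold pass with it, which matches the observation that no CUPTI calls show up when --profile is given.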
Finally, I found that all the CUPTI calls in nvbench are encapsulated in https://github.com/NVIDIA/nvbench/blob/main/nvbench/cupti_profiler.cxx
where all the API calls are checked by https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/cupti_profiler.cxx#L41
I added a printf to this function and it never printed when using --profile.
Of course, I may have missed something and would welcome any feedback on the above.
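For anyone who wants to repeat this check, the idea is to log inside the single function that every profiling API call goes through. The sketch below only illustrates that pattern: `api_status`, `checked_call`, and `checked_call_count` are hypothetical names, not the real contents of cupti_profiler.cxx.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical status type standing in for a profiling-API result code.
enum class api_status
{
  success,
  failure
};

// Counts how many API calls passed through the checker.
inline int &checked_call_count()
{
  static int count = 0;
  return count;
}

// Every API call is routed through this function, so one log line here
// reveals whether any call happens at all during a run.
inline void checked_call(api_status status, const char *what)
{
  ++checked_call_count();
  std::fprintf(stderr, "### checked_call: %s\n", what);
  if (status != api_status::success)
  {
    throw std::runtime_error(std::string("API call failed: ") + what);
  }
}
```

If nothing is printed and the counter stays at zero for a whole run, no call went through the wrapper.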
IIUC, CUPTI will be used only if any of these auto throughput measurements are required. https://github.com/NVIDIA/nvbench/blob/5d70492714d05f2207e2193be8a8cc0a85eefc76/examples/auto_throughput.cu#L64-L68
We need to explicitly set the bools below to false when --profile is present:
https://github.com/NVIDIA/nvbench/blob/1a13a2e724b8aa8aee27649ac6878babb63862a6/nvbench/state.cuh#L288-L292
@PointKernel is correct. CUPTI collection will only occur when a benchmark explicitly opts in via collect_dram_throughput(), etc.
The --profile flag should override a benchmark that uses collect_dram_throughput().
In fact, it may make sense to just disable output altogether when using --profile. My intuition is that someone using --profile doesn't care about the output from nvbench.
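A rough sketch of the override being proposed, assuming a simplified state type. Only collect_dram_throughput() and is_cupti_required() are names from this thread. `bench_state`, `collect_other_metric`, and `disable_cupti_metrics` are hypothetical and stand in for whatever the real change would look like.

```cpp
// Hypothetical, simplified model of the opt-in flags (not nvbench's state class).
class bench_state
{
public:
  // A benchmark opts in to a CUPTI-backed metric by calling a collect_* method.
  void collect_dram_throughput() { m_collect_dram_throughput = true; }
  void collect_other_metric() { m_collect_other_metric = true; } // hypothetical

  // True if any CUPTI-backed metric was requested.
  bool is_cupti_required() const
  {
    return m_collect_dram_throughput || m_collect_other_metric;
  }

  // Hypothetical override: what handling --profile could call so that
  // a benchmark's opt-in no longer requests CUPTI.
  void disable_cupti_metrics()
  {
    m_collect_dram_throughput = false;
    m_collect_other_metric    = false;
  }

private:
  bool m_collect_dram_throughput = false;
  bool m_collect_other_metric    = false;
};
```

With an override like this, is_cupti_required() would report false under --profile instead of relying on run_once to skip the CUPTI path.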
These are all wrapped by a single is_cupti_required() call:
https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/state.cuh#L217-L226
which is checked here: https://github.com/NVIDIA/nvbench/blob/2ce4e425eeaf7453ee10ead99f6408d41c733813/nvbench/detail/state_exec.cuh#L125-L127
but only if run_once is not enabled, and run_once is enabled when --profile is specified.
I ran the throughput benchmark with --profile and did not see any CUPTI calls.
Perhaps there is something else that I'm missing?
Hm, it could be that simply linking with cupti causes the incompatibility with GPU metric collection in Nsight.
> Hm, it could be that simply linking with cupti causes the incompatibility with GPU metric collection in Nsight.
There is an NVBENCH_HAS_CUPTI compile flag, so I could try out that theory.
> I ran the throughput benchmark with --profile and did not see any CUPTI calls.
By all means, we would expect is_cupti_required() to return false when --profile is present. Looks like when --profile is used, CUPTI APIs won't be called regardless of the return value of is_cupti_required().
To verify this, I printed the return value of is_cupti_required() at the beginning of state::exec(). When testing with auto_throughput, I did see is_cupti_required() still returning true when I ran the benchmark with the --profile flag:
(base) yunsongw@yunsongw-dt:~/Work/nvbench/build$ ./bin/nvbench.example.auto_throughput --profile
# Devices
## [0] `Quadro RTX 8000`
* SM Version: 750 (PTX Version: 750)
* Number of SMs: 72
* SM Default Clock Rate: 1770 MHz
* Global Memory: 48410 MiB Free / 48601 MiB Total
* Global Memory Bus Peak: 672 GB/sec (384-bit DDR @7001MHz)
* Max Shared Memory: 64 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 16/SM
* Maximum Active Threads: 1024/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No
# Log
Run: [1/4] throughput_bench [Device=0 T=1 Stride=1]
### is_cupti_required: 1
### is_cupti_required: 1
### is_cupti_required: 1
### cold is_cupti_required: 1
Pass: Cold: 0.487680ms GPU, 0.493811ms CPU, 0.00s total GPU, 0.00s total wall, 1x
Run: [2/4] throughput_bench [Device=0 T=1 Stride=4]
### is_cupti_required: 1
### is_cupti_required: 1
### is_cupti_required: 1
### cold is_cupti_required: 1
Pass: Cold: 1.232896ms GPU, 1.237910ms CPU, 0.00s total GPU, 0.00s total wall, 1x
Run: [3/4] throughput_bench [Device=0 T=2 Stride=1]
### is_cupti_required: 1
### is_cupti_required: 1
### is_cupti_required: 1
### cold is_cupti_required: 1
Pass: Cold: 1.026048ms GPU, 1.031796ms CPU, 0.00s total GPU, 0.00s total wall, 1x
Run: [4/4] throughput_bench [Device=0 T=2 Stride=4]
### is_cupti_required: 1
### is_cupti_required: 1
### is_cupti_required: 1
### cold is_cupti_required: 1
Pass: Cold: 2.453760ms GPU, 2.459609ms CPU, 0.00s total GPU, 0.00s total wall, 1x
> Finally, I found that all the CUPTI calls in nvbench are encapsulated in https://github.com/NVIDIA/nvbench/blob/main/nvbench/cupti_profiler.cxx where all the API calls are checked by
> I added a printf into this function and it never printed when using --profile
Perhaps you can try the above to print out whether CUPTI APIs are being called?
They are not being called for me when --profile is enabled.
> They are not being called for me when --profile is enabled.
Yeah, you are right. Though is_cupti_required returns true, CUPTI APIs are not called when --profile is used.
> Hm, it could be that simply linking with cupti causes the incompatibility with GPU metric collection in Nsight.
> There is an NVBENCH_HAS_CUPTI compile flag so I could try out that theory.
I rebuilt STRINGS_NVBENCH with CUPTI disabled (in cmake), which causes it to not even link to libcupti.so.
Still, the nsys results include missing/inconsistent data with CUPTI completely out of the picture.
So it seems only the nsys option --gpu-metrics-frequency=100 fixes this problem.
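For context on what building with CUPTI disabled changes, a compile-time guard on NVBENCH_HAS_CUPTI might look roughly like the sketch below. The macro name comes from this thread. `built_with_cupti` and `should_measure_cupti` are hypothetical helpers, and the real code may be structured differently.

```cpp
// Reports whether this translation unit was compiled with CUPTI support.
// NVBENCH_HAS_CUPTI is the compile definition mentioned in this thread.
constexpr bool built_with_cupti()
{
#ifdef NVBENCH_HAS_CUPTI
  return true;
#else
  return false;
#endif
}

// Hypothetical helper: the CUPTI pass is only considered when support is
// compiled in, the benchmark asked for it, and run_once is not set.
constexpr bool should_measure_cupti(bool cupti_required, bool run_once)
{
  return built_with_cupti() && cupti_required && !run_once;
}
```

In a build without the definition, should_measure_cupti() is false for every input, so a build like the one described above could not reach CUPTI through this path.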
Thank you @PointKernel and @davidwendt for investigating the missing/inconsistent GPU metrics data with nvbench. If it's not the CUPTI dependency, what is the root cause of the problem? Lowering the GPU metrics frequency is not an adequate workaround, because most functions take <10 ms. Also, even with the 100 Hz GPU metrics frequency setting, I see the same error for STRINGS_NVBENCH on my A100 instance running NsightSystems 2022.3.4.35-5490857:
Events fetch failed: Source ID=Type=ErrorInformation (18) Error information: ProcessEventsError (4005) Properties: ErrorText (100)=/build/agent/work/323cb361ab84164c/QuadD/Host/Analysis/EventHandler/GpuMetricsEventHandler.cpp(202): Throw in function void QuadDAnalysis::EventHandler::GpuMetricsEventHandler::PutEvent(QuadDAnalysis::EventHandler::GpuMetricsEventHandler::EventPtr)Dynamic exception type: boost::wrapexceptstd::exception::what: ChronologicalOrderError[QuadDCommon::tag_message*] = GPU Metrics event chronological order was broken.
What other parts of nvbench could be causing the missing/inconsistent data?
The --profile flag executes a benchmark a single time to enable profiling a benchmark with tools like Nsight Systems and Nsight Compute. These tools are incompatible with concurrent use of CUPTI.
Therefore, the --profile flag should also disable any use of CUPTI for gathering metrics.