NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
788 stars 228 forks

[FEA] Simplified Nsight tracing #10632

Open jlowe opened 6 months ago

jlowe commented 6 months ago

Is your feature request related to a problem? Please describe.

It's currently complicated to set up and collect an Nsight Systems trace of one or more executors, especially in non-standalone environments. There needs to be a simpler solution so users can collect these traces easily.

Describe the solution you'd like

A new config flag, e.g.: spark.rapids.nsight.tracePrefix, that specifies a URI prefix where Nsight traces will be stored. Setting this config indicates that the user wants tracing enabled on all executors. The Nsight tracing libraries would be included in the jar and loaded by the executors before the CUDA context is established, so that tracing is active from the start. On executor shutdown, tracing would be stopped and the trace data collected and uploaded to the URI prefix with a unique ID appended (e.g.: application ID and executor ID). Ideally the trace data is already a qdrep file ready to be loaded into the Nsight Systems viewer. Once the data is written, the executor should send a message back to the driver so the driver can log where each executor placed its trace file.
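To make the naming scheme concrete, here is a minimal sketch of how a per-executor trace location could be derived from the prefix. The function name `trace_destination`, the `<appId>-<executorId>` layout, and the example prefix are all assumptions for illustration; the proposal only says that some unique ID is appended to the prefix.

```python
def trace_destination(prefix: str, app_id: str, executor_id: str) -> str:
    """Append a unique per-executor name to the configured URI prefix.

    Hypothetical layout: <prefix><appId>-<executorId>.qdrep
    The prefix is used verbatim, so it may end in '/' or in a file-name stem.
    """
    return f"{prefix}{app_id}-{executor_id}.qdrep"


if __name__ == "__main__":
    # Example prefix as it might be set via spark.rapids.nsight.tracePrefix
    print(trace_destination("hdfs:///traces/", "app-20240101000000-0001", "3"))
```

Because the prefix is used as-is, two executors in the same application can never collide as long as their executor IDs differ, and traces from different applications stay distinguishable by application ID.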

A separate config, e.g.: spark.rapids.nsight.executor, could be used to limit which executor(s) are traced. This could be a comma-separated list of executor IDs and/or dash-separated ID ranges, where only the listed executors capture a trace. For example, 0,2-5,10 would capture traces only on executors 0, 2, 3, 4, 5, and 10. Setting it to "all" or leaving the config unset would trace all executors. Alternatively, we could trace only executor 0 by default and let the user set this to "all" if they really want all executors traced.
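The executor list syntax described above could be parsed roughly as follows. This is a sketch under assumptions: the helper names `parse_executor_spec` and `should_trace` are hypothetical, an unset or "all" value is taken to mean every executor (the first of the two defaults discussed), and non-numeric executor IDs such as "driver" are never traced when an explicit list is given.

```python
from typing import Optional, Set


def parse_executor_spec(spec: Optional[str]) -> Optional[Set[int]]:
    """Parse a spec like "0,2-5,10" into a set of executor IDs.

    Returns None when the spec is unset, empty, or "all", meaning
    every executor should be traced (an assumed default).
    """
    if spec is None or spec.strip() == "" or spec.strip().lower() == "all":
        return None
    ids: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "0,,2"
        if "-" in part:
            low, high = part.split("-", 1)
            start, end = int(low), int(high)
            if start > end:
                raise ValueError(f"invalid executor range: {part!r}")
            ids.update(range(start, end + 1))  # ranges are inclusive
        else:
            ids.add(int(part))
    return ids


def should_trace(executor_id: str, selected: Optional[Set[int]]) -> bool:
    """Decide whether this executor captures a trace."""
    if selected is None:
        return True
    return executor_id.isdigit() and int(executor_id) in selected


if __name__ == "__main__":
    selected = parse_executor_spec("0,2-5,10")
    print(sorted(selected))
    print(should_trace("3", selected), should_trace("1", selected))
```

Switching to the alternative default (trace only executor 0 unless "all" is requested) would only require returning `{0}` instead of `None` for an unset spec.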

Describe alternatives you've considered

If the libraries for tracing are too large to be included in the RAPIDS Accelerator jar by default, we could have a separate jar that is used for tracing.

jlowe commented 6 months ago

After chatting with the Nsight Systems team, we can probably cover most of our tracing needs by leveraging CUPTI. This won't generate a qdrep file directly, but it might be easy to post-process the CUPTI trace data into a qdrep file.