Closed: syifan closed this issue 2 months ago.
Are you using release or dev branch? Can you try dev?
I am using release. Let me try dev.
Junrui, can you see if you can reproduce this on release? We may need a push to release.
We see the same behavior on the dev branch. Please see the screenshot below. We added a print statement ("Work 1") just to help us identify where it stops. Should we modify any parameters?
The NVBit version is still 1.5.5; the latest dev uses 1.7.
Can you please try deleting the nvbit_release folder, running make clean, and trying again with dev?
OK. We have upgraded to the most recent version, but the result is still the same.
We tried on V100 and A30, with CUDA 11.7 and CUDA 12.2. Both run fine.
pan251@tgrogers-gpu01:tracer_nvbit$ LD_PRELOAD=./tracer_tool/tracer_tool.so ./nvbit_release/test-apps/vectoradd/vectoradd
------------- NVBit (NVidia Binary Instrumentation Tool v1.7) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
ACK_CTX_INIT_LIMITATION = 0 - if set, no warning will be printed for nvbit_at_ctx_init()
NVDISASM = nvdisasm - override default nvdisasm found in PATH
NOBANNER = 0 - if set, does not print this banner
NO_EAGER_LOAD = 0 - eager module loading is turned on by NVBit to prevent potential NVBit tool deadlock, turn it off if you want to use the lazy module loading feature
---------------------------------------------------------------------------------
INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
EXCLUDE_PRED_OFF = 1 - Exclude predicated off instruction from count
TRACE_LINEINFO = 0 - Include source code line info at the start of each traced line. The target binary must be compiled with -lineinfo or --generate-line-info
DYNAMIC_KERNEL_LIMIT_END = 0 - Limit of the number kernel to be printed, 0 means no limit
DYNAMIC_KERNEL_LIMIT_START = 0 - start to report kernel from this kernel id, 0 means starts from the beginning, i.e. first kernel
ACTIVE_FROM_START = 1 - Start instruction tracing from start or wait for cuProfilerStart and cuProfilerStop. If set to 0, DYNAMIC_KERNEL_LIMIT options have no effect
TOOL_VERBOSE = 0 - Enable verbosity inside the tool
TOOL_COMPRESS = 1 - Enable traces compression
TOOL_TRACE_CORE = 0 - write the core id in the traces
TERMINATE_UPON_LIMIT = 0 - Stop the process once the current kernel > DYNAMIC_KERNEL_LIMIT_END
USER_DEFINED_FOLDERS = 0 - Uses the user defined folder TRACES_FOLDER path environment
TRACE_FILE_COMPRESS = 0 - Create xz-compressed tracefile
----------------------------------------------------------------------------------------------------
WARNING: Do not call CUDA memory allocation in nvbit_at_ctx_init(). It will cause deadlocks. Do them in nvbit_tool_init(). If you encounter deadlocks, remove CUDA API calls to debug.
Writing results to /home/tgrogers-raid/a/pan251/accel-sim-framework-dev/util/tracer_nvbit/traces//kernel-1.trace
Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)
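Since the thread turned on which NVBit version was actually loaded (1.5.5 versus 1.7), a quick way to check is to pull the version out of the banner line shown in the log above. The following is only a sketch: the function name nvbit_banner_version and the log file name run.log are made up for illustration, and the pattern assumes the banner keeps the exact wording seen in this log.

```shell
#!/bin/sh
# Hypothetical helper: read tracer output on stdin and print the NVBit
# version found in the banner line, e.g. "1.7".
# Assumes the banner wording matches the log in this thread.
nvbit_banner_version() {
    sed -n 's/.*NVBit (NVidia Binary Instrumentation Tool v\([0-9.]*\)).*/\1/p' | head -n 1
}
```

For example, after saving the tracer output to a file (run.log is an assumed name), `nvbit_banner_version < run.log` prints the version. If it still prints 1.5.5 after switching to dev, the old nvbit_release folder is probably still in use.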
I'm not sure about the cause right now. What's your OS version, CUDA, and driver version? I'll try to reproduce it.
OK. The problem is now solved. The secret is to use the dev branch. But after switching to the dev branch, we needed to delete the nvbit_release directory, run make clean, and run make again.
Thanks for your help!
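For anyone landing here later, the fix described above (delete nvbit_release, make clean, make again, all on the dev branch) could be scripted roughly as follows. This is a sketch, not the project's official procedure: the function name clean_rebuild_tracer and the DRY_RUN switch are made up for illustration, and it assumes the build re-fetches NVBit once the old folder is gone. If it does not, rerun whatever install step your checkout documents before the final make.

```shell
#!/bin/sh
# Hypothetical helper: clean rebuild of the tracer after switching branches.
# $1 is the tracer directory (util/tracer_nvbit in the checkout); defaults to ".".
# Set DRY_RUN=1 to print the commands instead of running them.
clean_rebuild_tracer() {
    tracer_dir=${1:-.}
    run() {
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "+ $*"
        else
            "$@"
        fi
    }
    # Remove the stale NVBit copy, then clean and rebuild the tracer.
    run rm -rf "$tracer_dir/nvbit_release" &&
    run make -C "$tracer_dir" clean &&
    run make -C "$tracer_dir"
}
```

Running it first with DRY_RUN=1 shows the three commands without touching anything, which is a cheap way to confirm the directory is the intended one before the rm -rf.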
I am trying to run the tracing tool, but the program hangs.
I am testing on an NVIDIA A100 PCIe 80GB. I am using this command:
LD_PRELOAD=./tracer_tool/tracer_tool.so ./nvbit_release/test-apps/vectoradd/vectoradd
as directed in the instructions.
Using GDB, I can see that the program hangs here: https://github.com/accel-sim/accel-sim-framework/blob/release/util/tracer_nvbit/tracer_tool/tracer_tool.cu#L673. I cannot trace further, since the program does not seem to enter the channel_host.init method but gets stuck in _cudaInitModule.
Is anyone facing the same issue? Any suggestions on how to solve the problem?
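When reproducing a hang like this, it can help to run the command under a time limit so the shell comes back on its own. A minimal sketch, assuming the coreutils timeout command is available (it exits with status 124 when the limit is hit); the wrapper name run_with_timeout and the 60-second limit in the usage note are arbitrary choices, not something from the project.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a time limit and report a
# suspected hang. Relies on coreutils `timeout`, which returns 124 when
# the command is killed for exceeding the limit.
run_with_timeout() {
    secs=$1
    shift
    timeout "$secs" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang suspected: no exit after ${secs}s" >&2
    fi
    return "$status"
}
```

For the command in this issue, that would be `run_with_timeout 60 env LD_PRELOAD=./tracer_tool/tracer_tool.so ./nvbit_release/test-apps/vectoradd/vectoradd`. Going through env keeps LD_PRELOAD scoped to the traced program rather than to timeout itself.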