accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

WARNING: this function has not been implemented yet.WARNING: this function has not been implemented yet.ERROR file nvbit_imp.cpp line 358: @_X Ht[_hHt #309

Closed macto94 closed 1 month ago

macto94 commented 1 month ago

I am trying to generate traces for my own applications. I've just followed the instructions in the README.

export CUDA_INSTALL_PATH=/usr/local/cuda-11.4
export PATH=$CUDA_INSTALL_PATH/bin:$PATH  
./util/tracer_nvbit/install_nvbit.sh
make -C ./util/tracer_nvbit/
LD_PRELOAD=./tracer_tool/tracer_tool.so ../../gpu-simulator/gpgpu-sim/examples/sim_mp_kernels/src/async

but it doesn't work, and I get the following log:

root@a8da1a446c6c:~/accel-sim-framework/util/tracer_nvbit# LD_PRELOAD=./tracer_tool/tracer_tool.so ../../gpu-simulator/gpgpu-sim/examples/sim_mp_kernels/src/async
------------- NVBit (NVidia Binary Instrumentation Tool v1.5.5) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
    EXCLUDE_PRED_OFF = 1 - Exclude predicated off instruction from count
      TRACE_LINEINFO = 0 - Include source code line info at the start of each traced line. The target binary must be compiled with -lineinfo or --generate-line-info
DYNAMIC_KERNEL_LIMIT_END = 0 - Limit of the number kernel to be printed, 0 means no limit
DYNAMIC_KERNEL_LIMIT_START = 0 - start to report kernel from this kernel id, 0 means starts from the beginning, i.e. first kernel
   ACTIVE_FROM_START = 1 - Start instruction tracing from start or wait for cuProfilerStart and cuProfilerStop. If set to 0, DYNAMIC_KERNEL_LIMIT options have no effect
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
       TOOL_COMPRESS = 1 - Enable traces compression
     TOOL_TRACE_CORE = 0 - write the core id in the traces
TERMINATE_UPON_LIMIT = 0 - Stop the process once the current kernel > DYNAMIC_KERNEL_LIMIT_END
USER_DEFINED_FOLDERS = 0 - Uses the user defined folder TRACES_FOLDER path environment
 TRACE_FILE_COMPRESS = 0 - Create xz-compressed tracefile
----------------------------------------------------------------------------------------------------
cudaGetExportTable: UUID = 0x6e 0x16 0x3f 0xbe 0xb9 0x58 0x44 0x4d 0x83 0x5c 0xe1 0x82 0xaf 0xf1 0x99 0x1e
cudaGetExportTable: UUID = 0x35 0x77 0xf 0x1b 0x9 0x2e 0x3 0x48 0xa4 0x8e 0x5 0x6f 0xc4 0x23 0x96 0x8d
cudaGetExportTable: UUID = 0xbf 0xdb 0x43 0x2d 0xbf 0x3c 0x5a 0x4a 0x94 0x5e 0xb3 0x40 0x29 0xe8 0x1e 0x75
cudaGetExportTable: UUID = 0x21 0x31 0x8c 0x60 0x97 0x14 0x32 0x48 0x8c 0xa6 0x41 0xff 0x73 0x24 0xc8 0xf2
cudaGetExportTable: UUID = 0x42 0xd8 0x5a 0x81 0x23 0xf6 0xcb 0x47 0x82 0x98 0xf6 0xe7 0x8a 0x3a 0xec 0xdc
cudaGetExportTable: UUID = 0xb1 0x5 0x41 0xe1 0xf7 0xc7 0xc7 0x4a 0x9f 0x64 0xf2 0x23 0xbe 0x99 0xf1 0xe2
cudaGetExportTable: UUID = 0xa6 0xb1 0xff 0x99 0xec 0xc4 0xc9 0x4f 0x92 0xf9 0x19 0x28 0x66 0x3d 0x55 0x85
cudaGetExportTable: UUID = 0xf8 0x8c 0xc9 0x3e 0x53 0xfd 0x9e 0x46 0xba 0x59 0x1e 0x2b 0x87 0x3e 0xf 0x91
WARNING: this function has not been implemented yet.WARNING: this function has not been implemented yet.ERROR file nvbit_imp.cpp line 358: @_X Ht[ÐhHt

My kernel is a simple GEMM example using asynchronous copy.

I tried to run the simulation in PTX mode instead of trace-driven mode, but I encountered the following error message, so I thought that asynchronous copy might not be implemented in GPGPU-Sim. I then tried trace-driven mode, but similarly, the trace was not generated, as shown above. Is CUDA's asynchronous copy not implemented?

Extracting specific PTX file named async.1.sm_80.ptx
Extracting PTX file and ptxas options    1: async.1.sm_80.ptx -arch=sm_80
GPGPU-Sim PTX: __cudaRegisterFunction _Z15wmmaAsyncKernelPK6__halfS1_PS_mmm : hostFun 0x0x63e77f8026ab, fat_cubin_handle = 1
GPGPU-Sim PTX: Parsing async.1.sm_80.ptx
GPGPU-Sim PTX: allocating shared region for "smem" from 0x0 to 0x0 (shared memory space)
async.1.sm_80.ptx:322 Syntax error:

        cp.async.cg.shared.global.L2::128B [%r20], [%rd43], 16;
          ^

GPGPU-Sim PTX: finished parsing EMBEDDED .ptx file async.1.sm_80.ptx
cesar-avalos3 commented 1 month ago

This error usually happens when you are trying to trace the simulation and not the hardware. Try generating the traces from a clean environment (not running any setup_environments).

macto94 commented 1 month ago

Thank you @cesar-avalos3. Is it right that cp.async is not implemented in PTX-driven mode? PTX mode would be more convenient for me to use.

cesar-avalos3 commented 1 month ago

I think unfortunately most of the Turing+ instructions are not yet supported by gpgpu-sim (PTX mode).

macto94 commented 1 month ago

Okay, thank you for the quick response.

macto94 commented 1 month ago

@cesar-avalos3 Sorry for bothering you. Thanks to your help, I'm now able to simulate my kernel using the trace-driven method, but I have a question. I want to investigate the impact of increasing the shared memory size in the adaptive cache. (I'm using a generated A100 config.)

The gpgpusim.config is as follows. My initial approach was simply to increase the l1d_size, shmem_size, and shmem_per_block by 1.5x. However, the simulator log says

GPGPU-Sim: Reconfigure L1 cache to 124KB.
GPGPU-Sim uArch: ERROR ** deadlock detected: last writeback core 0 @ gpu_sim_cycle 6110 (+ gpu_tot_sim_cycle 4294867296) (93890 cycles ago)
GPGPU-Sim uArch: DEADLOCK  shader cores no longer committing instructions [core(# threads)]:
GPGPU-Sim uArch: DEADLOCK  0(256)

I thought it should be 42KB, considering 288KB (192 x 1.5) - 246KB (164 x 1.5) = 42KB. So it seems that the changes I made to the config are not being applied properly.

Could you advise on how to approach the study of kernel performance differences when varying the shared memory size? how should I modify the config?

-gpgpu_adaptive_cache_config 1
-gpgpu_shmem_option 0,8,16,32,64,164
-gpgpu_unified_l1d_size 192
# L1 cache configuration
-gpgpu_l1_banks 4
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:64,16:0,32
-gpgpu_l1_latency 20
-gpgpu_gmem_skip_L1D 0
-gpgpu_flush_l1_cache 1
-gpgpu_n_cluster_ejection_buffer_size 32
-gpgpu_l1_cache_write_ratio 25

# shared memory  configuration
-gpgpu_shmem_size 167936
-gpgpu_shmem_sizeDefault 167936
-gpgpu_shmem_per_block 49152
#-gpgpu_shmem_per_block 98304
-gpgpu_smem_latency 20
# shared memory bankconflict detection
-gpgpu_shmem_num_banks 32
-gpgpu_shmem_limited_broadcast 0
-gpgpu_shmem_warp_parts 1
-gpgpu_coalesce_arch 80
JRPan commented 1 month ago

I don't think you can explore that by simply modifying the configuration. The instruction executed for a shared memory load is different from a global load.

Also, shared memory is explicitly managed. Simply increasing the shared memory size won't increase shared memory consumption. You would need to rewrite the kernel to load more data into shared memory.

macto94 commented 1 month ago

@JRPan Hmm, sorry, I don't quite understand what you're saying. Do you mean that I can't set the size of shared memory through the config? Or are you suggesting that I should forcibly increase the shared memory size that my kernel uses?

When implementing a kernel, there are situations where excessive use of shared memory limits the number of thread blocks that can be active simultaneously due to hardware limitations. For example, if an SM can have up to 164KB of shared memory, but an extreme case uses 80KB for a shared memory array, only 2 thread blocks can be allocated simultaneously.

In such situations, I want to investigate things like, "How would performance change if the SM supported up to 320KB of shared memory?". I've thought Accel-Sim could provide a solution to this kind of problem. Am I thinking wrong?
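The occupancy arithmetic in the question above can be sketched as follows. This is a rough sketch that models only the shared-memory limiter; real occupancy also depends on registers, warp slots, and block slots per SM, which are ignored here:

```python
# Rough sketch: how per-block shared memory usage caps the number of
# resident thread blocks per SM. Only the SMEM limiter is modeled.

def blocks_limited_by_smem(smem_per_sm_kb, smem_per_block_kb):
    """Thread blocks that fit within the SM's shared memory budget."""
    return smem_per_sm_kb // smem_per_block_kb

# 164 KB per SM, 80 KB per block -> only 2 resident blocks
print(blocks_limited_by_smem(164, 80))   # -> 2

# A hypothetical SM with 320 KB of shared memory would double that
print(blocks_limited_by_smem(320, 80))   # -> 4
```

This is exactly the kind of "what if the SM had more SMEM" experiment being asked about: the extra capacity only helps if SMEM is the occupancy limiter.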

JRPan commented 1 month ago

Understood. I was suggesting that the execution time of each warp should be the same, regardless of the SMEM size. But you are correct that you can launch more warps if SMEM is the occupancy limiter.

shmem_per_block is not used in trace-driven.

Change gpgpu_shmem_size, gpgpu_shmem_option, and gpgpu_unified_l1d_size:

- gpgpu_unified_l1d_size is the total L1D/SMEM (unified L1 cache): L1D + SMEM = gpgpu_unified_l1d_size.
- gpgpu_shmem_option is the list of SMEM sizes to choose from. The simulator will choose one based on the kernel usage.
- gpgpu_shmem_size is the max SMEM of each SM. This is used to calculate occupancy.

The largest gpgpu_shmem_option value should equal gpgpu_shmem_size and be smaller than gpgpu_unified_l1d_size.
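Under these rules, the adaptive L1D/SMEM split can be sketched as below. The selection rule (picking the smallest configured SMEM carve-out that covers the kernel's usage, with the remainder of the unified size going to L1D) is an assumption based on the description above, not the exact simulator code, but it is consistent with the "Reconfigure L1 cache to 124KB" line in the earlier log: 288 - 164 = 124.

```python
# Sketch of the adaptive cache split: SMEM carve-out chosen from
# gpgpu_shmem_option, remainder of gpgpu_unified_l1d_size becomes L1D.
# Selection rule is an assumption, not the exact simulator logic.

def split_unified_cache(unified_l1d_kb, shmem_options_kb, kernel_smem_kb):
    # Smallest configured SMEM option that fits the kernel's usage.
    smem = min(o for o in shmem_options_kb if o >= kernel_smem_kb)
    l1d = unified_l1d_kb - smem          # remainder goes to L1D
    return smem, l1d

# Options from the A100 config above; unified size scaled by 1.5x (288KB).
# Kernel SMEM usage of 100KB is a hypothetical value for illustration.
options = [0, 8, 16, 32, 64, 164]
print(split_unified_cache(288, options, 100))  # -> (164, 124)
```

This also shows why scaling only gpgpu_unified_l1d_size (without scaling the gpgpu_shmem_option list) leaves the largest SMEM carve-out at 164KB and hands the rest to L1D.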

Please let me know if you have any other questions.

macto94 commented 1 month ago

@JRPan Oh, I see now. I was curious about why shmem_per_block was not affecting performance. After properly adjusting the shmem_option, I confirmed that the changes are applied. If I want to increase the unified L1D size, can I just increase gpgpu_unified_l1d_size, or do I also need to change elements like the sets and associativity in the L1 cache configuration?

Thank you for the perfect answer!

JRPan commented 1 month ago

That part of the code is pretty messy. gpgpu_unified_l1d_size defines the max L1D size. -gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32 means sets = 4, cache line size = 128, assoc = 64, so the base L1D size is 4 x 128 x 64 = 32KB. Then there is a multiplier calculated from gpgpu_unified_l1d_size. The assoc is changed dynamically based on SMEM usage.

Changing gpgpu_unified_l1d_size should work. But if any assertion fails, check what the error is, update accordingly, and it should just work.
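The sizing arithmetic in that explanation can be checked with a quick sketch. The multiplier derivation here is an assumption based on the description above (unified size divided by the base config size), not the exact simulator code:

```python
# Base L1D size encoded in -gpgpu_cache:dl1 S:4:128:64,...
sets, line_size, assoc = 4, 128, 64
base_l1d_bytes = sets * line_size * assoc
print(base_l1d_bytes // 1024)        # -> 32 (KB), as stated above

# gpgpu_unified_l1d_size then scales this base config; e.g. a 192KB
# unified cache implies a 6x multiplier on the 32KB base L1D size.
unified_kb = 192
multiplier = unified_kb // (base_l1d_bytes // 1024)
print(multiplier)                    # -> 6
```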

macto94 commented 1 month ago

@JRPan @cesar-avalos3 Your answers are incredibly helpful. Thank you so much !