Closed: macto94 closed this issue 5 months ago.
This error usually happens when you are trying to trace the simulation and not the hardware. Try generating the traces from a clean environment (not running any setup_environments).
Thank you @cesar-avalos3
Is it right that cp.async is not implemented in PTX-driven mode? I feel like PTX mode is more convenient for me to use.
I think unfortunately most of the Turing+ instructions are not yet supported by gpgpu-sim (PTX mode).
Okay, thank you for the quick response.
@cesar-avalos3 Sorry for bothering you. Thanks to your help, I'm now able to simulate my kernel using the trace-driven method, but I have a question. I want to investigate the impact of increasing the shared memory size in the adaptive cache. (I'm using a generated A100 config.)
The gpgpusim.config is as follows. My initial approach was to simply increase the l1d_size, shmem_size, and shmem_per_block by 1.5 times. However, the simulator log says:
GPGPU-Sim: Reconfigure L1 cache to 124KB.
GPGPU-Sim uArch: ERROR ** deadlock detected: last writeback core 0 @ gpu_sim_cycle 6110 (+ gpu_tot_sim_cycle 4294867296) (93890 cycles ago)
GPGPU-Sim uArch: DEADLOCK shader cores no longer committing instructions [core(# threads)]:
GPGPU-Sim uArch: DEADLOCK 0(256)
I thought it should be 42KB, considering 288 (192 * 1.5) - 246 (164 * 1.5) = 42. So it seems that the changes I made to the config are not being applied properly.
Could you advise on how to approach studying kernel performance differences when varying the shared memory size? How should I modify the config?
-gpgpu_adaptive_cache_config 1
-gpgpu_shmem_option 0,8,16,32,64,164
-gpgpu_unified_l1d_size 192
# L1 cache configuration
-gpgpu_l1_banks 4
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:64,16:0,32
-gpgpu_l1_latency 20
-gpgpu_gmem_skip_L1D 0
-gpgpu_flush_l1_cache 1
-gpgpu_n_cluster_ejection_buffer_size 32
-gpgpu_l1_cache_write_ratio 25
# shared memory configuration
-gpgpu_shmem_size 167936
-gpgpu_shmem_sizeDefault 167936
-gpgpu_shmem_per_block 49152
#-gpgpu_shmem_per_block 98304
-gpgpu_smem_latency 20
# shared memory bankconflict detection
-gpgpu_shmem_num_banks 32
-gpgpu_shmem_limited_broadcast 0
-gpgpu_shmem_warp_parts 1
-gpgpu_coalesce_arch 80
I don't think you can explore that by simply modifying the configuration. The instruction executed for a shared memory load is different from a global load.
Also, shared memory is explicitly managed. Simply increasing the shared memory size won't increase the kernel's shared memory consumption. You would need to rewrite the kernel to load more data into shared memory.
@JRPan Hmm, sorry, I don't quite understand what you're saying. Do you mean that I can't set the size of shared memory through the config? Or are you suggesting that I should forcibly increase the shared memory size that my kernel uses?
When implementing a kernel, there are situations where heavy use of shared memory limits the number of thread blocks that can be active simultaneously due to hardware limits. For example, if an SM can hold up to 164KB of shared memory but, in an extreme case, each block uses 80KB for a shared memory array, only 2 thread blocks can be resident at the same time.
In such situations, I want to investigate questions like, "How would performance change if the SM supported up to 320KB of shared memory?" I thought Accel-Sim could provide a solution to this kind of problem. Am I thinking about this wrong?
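On real hardware, this shared-memory-limited occupancy can be checked directly with the CUDA occupancy API; below is a minimal sketch (the kernel and the 80KB / 256-thread numbers are illustrative only, mirroring the example above):

#include <cstdio>
#include <cuda_runtime.h>

// Kernel that requests a large amount of dynamic shared memory per block.
__global__ void big_smem_kernel(float *out)
{
    extern __shared__ float buf[];   // sized at launch/query time (80KB here)
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    int smem_bytes = 80 * 1024;
    // Opt in to more than 48KB of dynamic shared memory (required on Volta and newer).
    cudaFuncSetAttribute(big_smem_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes);

    int blocks_per_sm = 0;
    // How many 256-thread blocks, each using 80KB of SMEM, fit on one SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, big_smem_kernel, 256, smem_bytes);
    printf("resident blocks per SM: %d\n", blocks_per_sm);   // roughly 2 on an A100 (164KB SMEM per SM)
    return 0;
}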
Understood. I was suggesting that the execution time of each warp should be the same, not related to the SMEM size. But you are correct that you can launch more warps if SMEM is the occupancy limiter.
shmem_per_block is not used in trace-driven mode. Change gpgpu_shmem_size, gpgpu_shmem_option, and gpgpu_unified_l1d_size.
gpgpu_unified_l1d_size is the total L1D/SMEM (unified L1 cache): L1D + SMEM = gpgpu_unified_l1d_size.
gpgpu_shmem_option is the list of SMEM sizes to choose from. The simulator will choose one based on the kernel usage.
gpgpu_shmem_size is the max SMEM of each SM. This is used to calculate occupancy.
The largest gpgpu_shmem_option should equal gpgpu_shmem_size and be smaller than gpgpu_unified_l1d_size.
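For example, a minimal sketch of a 1.5x-scaled setup following these rules (the values are illustrative, assuming gpgpu_shmem_size and gpgpu_shmem_sizeDefault are given in bytes while gpgpu_shmem_option and gpgpu_unified_l1d_size are in KB, as in the config quoted above):
-gpgpu_unified_l1d_size 288
-gpgpu_shmem_option 0,8,16,32,64,246
-gpgpu_shmem_size 251904
-gpgpu_shmem_sizeDefault 251904
Here the largest option (246KB = 251904 bytes) equals gpgpu_shmem_size and stays below gpgpu_unified_l1d_size (288KB).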
Please let me know if you have any other questions.
@JRPan Oh, I see now. I was curious about why shmem_per_block was not affecting performance. After properly adjusting the shmem_option, I confirmed that the changes are applied.
If I want to increase the unified L1D size, can I just increase gpgpu_unified_l1d_size, or do I also need to change other config elements as well?
Thank you for the perfect answer!
That part of the code is pretty messy. gpgpu_unified_l1d_size defines the max L1D size.
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:512:8,16:0,32
means sets = 4, cache line size = 128, assoc = 64, so the L1D size is 4 * 128 * 64 = 32KB.
Then there is a multiplier calculated from gpgpu_unified_l1d_size. The assoc is changed dynamically based on SMEM usage.
Changing gpgpu_unified_l1d_size should work. But if any assertion fails, check what the error is, update accordingly, and it should just work.
@JRPan @cesar-avalos3 Your answers are incredibly helpful. Thank you so much!
I am trying to generate traces for my own applications. I've just followed the instructions in the README,
but it doesn't work, with the following log:
My kernel is a simple GEMM example using asynchronous copy.
I tried to run the simulation in PTX mode instead of trace-driven mode. However, I encountered the following error message. So, I thought that asynchronous copy might not be implemented in gpgpu-sim. I then tried trace-driven mode, but similarly, the trace was not generated, as described above. Is CUDA's asynchronous copy not implemented?
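For reference, here is a minimal sketch (not the actual kernel from this issue) of the kind of asynchronous global-to-shared copy being discussed; the __pipeline_* primitives below should compile to cp.async (LDGSTS at the SASS level) on sm_80 and newer, assuming CUDA 11 or later:

#include <cuda_pipeline_primitives.h>

// Assumes blockDim.x == 256 so that threadIdx.x indexes the shared tile directly.
__global__ void async_tile_copy(const float *__restrict__ g_in, float *g_out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Asynchronously copy one 4-byte element from global to shared memory (cp.async).
        __pipeline_memcpy_async(&tile[threadIdx.x], &g_in[i], sizeof(float));
        __pipeline_commit();
        __pipeline_wait_prior(0);   // wait until the copy has landed in shared memory
    }
    __syncthreads();
    if (i < n)
        g_out[i] = 2.0f * tile[threadIdx.x];
}

Per the discussion above, this is the instruction that the trace-driven flow can capture but that the PTX-mode (gpgpu-sim) front end does not yet implement.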