accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Rodinia simulation config #318

Open Leon924 opened 2 weeks ago

Leon924 commented 2 weeks ago

Hi, accel-sim developers:

$ nvidia-smi
Tue Jul  9 16:45:54 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN RTX           Off  | 00000000:3B:00.0 Off |                  N/A |
| 44%   41C    P0    58W / 280W |      0MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           Off  | 00000000:5E:00.0 Off |                  N/A |
| 49%   39C    P0    57W / 280W |      0MiB / 24220MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           Off  | 00000000:B1:00.0 Off |                  N/A |
| 38%   36C    P0    63W / 280W |      0MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           Off  | 00000000:D9:00.0 Off |                  N/A |
| 22%   36C    P0    39W / 280W |      0MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is the relevant part of /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/configs/define-standard-cfgs.yml:

# Basefile Configs

# Pascal
TITANX:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM6_TITANX/gpgpusim.config"
TITANXX:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/TITANX-pascal/gpgpusim.config"

# Kepler
TITANK:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM3_KEPLER_TITAN/gpgpusim.config"

# Ampere RTX 3070
RTX3070:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM86_RTX3070/gpgpusim.config"

# Turing
RTX2060:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM75_RTX2060/gpgpusim.config"

# Turing
RTX2060_S:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM75_RTX2060_S/gpgpusim.config"

# Volta
TITANV:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_TITANV/gpgpusim.config"

# Volta
TITANV_OLD:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_TITANV_OLD/gpgpusim.config"

QV100:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_QV100/gpgpusim.config"

QV100_64SM:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_QV100_SMs/gpgpusim.config"

QV100_SASS:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_QV100_SASS/gpgpusim.config"

QV100_old:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_QV100_old/gpgpusim.config"

# Fermi
GTX480:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM2_GTX480/gpgpusim.config"

# To keep your configurations straight - we recommend specifying
# if you are using SASS or PTX in the config:
# For example: QV100-SASS or QV100-PTX.
SASS:
    extra_params: "#SASS-Driven Accel-Sim"
PTX:
    extra_params: "#PTX-Driven GPGPU-Sim"

JRPan commented 2 weeks ago

We don't have one for the TITAN RTX. You can adapt the RTX 2060 config: change the number of SMs and the clock frequencies to match your card, and change the L2 size as well. The SM-internal parameters should be similar.
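For concreteness, here is a minimal sketch of the options in the copied SM75_RTX2060 gpgpusim.config that would need retuning. The values below are only illustrative guesses assembled from the Turing whitepaper and the numbers quoted elsewhere in this thread, not a validated config:

  # sketch only -- illustrative values, check them against the whitepaper
  -gpgpu_n_clusters 72                      # TITAN RTX (TU102) has 72 SMs
  -gpgpu_n_cores_per_cluster 1
  -gpgpu_clock_domains 1350:1350:1350:7001  # <Core>:<ICNT>:<L2>:<DRAM> clocks in MHz
  -gpgpu_n_mem 12                           # 12 memory controllers on the 384-bit GDDR6 bus
  # and scale -gpgpu_cache:dl2 so the total L2 across sub-partitions reaches TU102's 6 MB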

Leon924 commented 2 weeks ago

Thanks. I have tried to generate a config file with the accelsim-tuner; here is my turing_TITANRTX_hw_def.h file. Starting from the turing_rtx2060 header file, I changed the frequency, the L1 cache size, and WARP_SCHEDS_PER_SM.


  // TITAN RTX HW def file
  // based on TU102
  #ifndef TURING_TITANRTX_DEF_H
  #define TURING_TITANRTX_DEF_H

  #include "./common/common.h"
  #include "./common/deviceQuery.h"

  #define L1_SIZE (96 * 1024) // Max L1 size in bytes, NVIDIA-Turing-Architecture-Whitepaper page13

  #define CLK_FREQUENCY 1350 // Base frequency in MHz (Boost clock can go up to 1770 MHz)

  #define ISSUE_MODEL issue_model::single   // single issue core or dual issue
  #define CORE_MODEL core_model::subcore    // subcore model or shared model: 
  #define DRAM_MODEL dram_model::GDDR6      // memory type; checked
  #define WARP_SCHEDS_PER_SM 1           // number of warp schedulers per SM; NVIDIA-Turing-Architecture-Whitepaper page17

  // number of SASS HMMA per 16x16 PTX WMMA for FP16 - FP32 accumulate operation
  #define SASS_hmma_per_PTX_wmma 4 

  // These vars are almost constant between HW generation
  // see slide 24 from Nvidia at
  // https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21730-inside-the-nvidia-ampere-architecture.pdf
  #define L2_BANKS_PER_MEM_CHANNEL 2
  #define L2_BANK_WIDTH_in_BYTE 32

  #endif

I used tuner.py to generate the config file and moved it into the required folders. Then, when I use the run_simulations.py script to launch the base config plus all 16 parameter combinations, the following error happens:

../job_launching/run_simulations.py \
 -T /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/hw_run/traces/device-0/11.0 \
 -C TITANRTX-SASS,\
TITANRTX-SASS-LINEAR-RR-32B-FRFCFS,\
TITANRTX-SASS-LINEAR-RR-32B-FCFS,\
TITANRTX-SASS-LINEAR-RR-256B-FRFCFS,\
TITANRTX-SASS-LINEAR-RR-256B-FCFS,\
TITANRTX-SASS-LINEAR-GTO-32B-FRFCFS,\
TITANRTX-SASS-LINEAR-GTO-32B-FCFS,\
TITANRTX-SASS-LINEAR-GTO-256B-FRFCFS,\
TITANRTX-SASS-LINEAR-GTO-256B-FCFS,\
TITANRTX-SASS-IPOLY-RR-32B-FRFCFS,\
TITANRTX-SASS-IPOLY-RR-32B-FCFS,\
TITANRTX-SASS-IPOLY-RR-256B-FRFCFS,\
TITANRTX-SASS-IPOLY-RR-256B-FCFS,\
TITANRTX-SASS-IPOLY-GTO-32B-FRFCFS,\
TITANRTX-SASS-IPOLY-GTO-32B-FCFS,\
TITANRTX-SASS-IPOLY-GTO-256B-FRFCFS,\
TITANRTX-SASS-IPOLY-GTO-256B-FCFS \
-N tuning -B GPU_Microbenchmark
------------------
**********************************************************
**********************************************************
l1_bw_32f_unroll_large-NO_ARGS--TITANRTX-SASS-IPOLY-GTO-256B-FCFS. Status=COMPLETE_ERR_FILE_HAS_CONTENTS
Last 10 line of /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/../../sim_run_11.0/l1_bw_32f_unroll_large/NO_ARGS/TITANRTX-SASS-IPOLY-GTO-256B-FCFS/l1_bw_32f_unroll_large-NO_ARGS.accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0.o574
------------------

        *** GPGPU-Sim Simulator Version 4.2.0  [build gpgpu-sim_git-commit-6aa7ed16_modified_0.0] ***

Accel-Sim [build accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0]
doing:  /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/../../sim_run_11.0/gpgpu-sim-builds/accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0/accel-sim.out  -config ./gpgpusim.config -trace ./traces/kernelslist.g
doing export CUDA_LAUNCH_BLOCKING=1
doing: export PATH=/home/data/userhome/liqiang/lab/gpu/accel-sim-framework/gpu-simulator/gpgpu-sim/bin:/usr/local/cuda-11.0/bin:/home/data/userhome/liqiang/tool/makedepend106/bin:/usr/local/cuda-11.0/bin:/home/data/userhome/liqiang/tool/vscode/VSCode-linux-x64/bin:/home/data/userhome/liqiang/tool/swig402/bin:/home/data/userhome/liqiang/lab/package/cmake-3.18.3-Linux-x86_64/bin:/home/data/userhome/liqiang/tool/cmake-3.18.3-Linux-x86_64/bin:home/data/userhome/liqiang/lab/hpvm-release/hpvm/build:/home/data/userhome/liqiang/tool/pycharm-community-2021.3/bin:/home/data/userhome/liqiang/Downloads/GmSSl/bin:/home/data/userhome/liqiang/Downloads/valgrind-3.19.0/bin:/home/data/userhome/liqiang/Downloads/cmake-3.23.1/bin:/home/data/userhome/liqiang/Downloads/qemu-riscv/bin:/home/data/userhome/liqiang/Documents:/home/data/userhome/liqiang/tool/anaconda3/envs/accelsim/bin:/home/data/userhome/liqiang/tool/anaconda3/condabin:/home/data/userhome/liqiang/Downloads/rar:/home/data/userhome/liqiang/Downloads/verilator/bin:/home/data/userhome/liqiang/.cargo/bin:/home/data/userhome/liqiang/.vscode-server/bin/dc96b837cf6bb4af9cd736aa3af08cf8279f7685/bin/remote-cli:/usr/local/cuda-10.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/user/Documents/llvm/bin:/usr/local/cuda-10.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/data/userhome/wanli/Downloads/clion-2021.2.3/bin:/newlib/bin:/linux/bin
doing
doing: export OPENCL_REMOTE_GPU_HOST=REPLACE_REMOTE_HOST
------------------

Contents of /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/../../sim_run_11.0/l1_bw_32f_unroll_large/NO_ARGS/TITANRTX-SASS-IPOLY-GTO-256B-FCFS/l1_bw_32f_unroll_large-NO_ARGS.accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0.e574
------------------

GPGPU-Sim ** ERROR: Unknown Option: '-memory_partition_indexing' 

------------------
**********************************************************

Sleeping for 30s

accel-sim-framework: release
gpgpu-sim: dev

Only one config, TITANRTX-SASS, runs the simulation successfully; all the others fail. How can I fix these errors? It would be great if you could give me some hints.

Leon924 commented 2 weeks ago

Finally, I found the mistake, in ./accel-sim-framework/util/job_launching/configs/define-standard-cfgs.yml: the parameters need the "gpgpu" prefix, like:

LINEAR:
    extra_params: "-gpgpu_memory_partition_indexing 0"
IPOLY:
    extra_params: "-gpgpu_memory_partition_indexing 2"
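My understanding, which I have not verified in the scripts, is that run_simulations.py splits a hyphenated config name into tokens, takes the first token's base_file and appends the extra_params of each later token. Under that assumption, the other tokens in my names would map to entries roughly like the following sketch, based on the scheduler options visible in the config dump further down (I am not sure which option the 32B/256B tokens control, so I leave those out):

RR:
    extra_params: "-gpgpu_scheduler lrr"
GTO:
    extra_params: "-gpgpu_scheduler gto"
FRFCFS:
    extra_params: "-gpgpu_dram_scheduler 1"  # 1 = FR-FCFS
FCFS:
    extra_params: "-gpgpu_dram_scheduler 0"  # 0 = FIFO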

However, all the IPOLY-related trials still fail. When I run /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/sim_run_11.0/l1_bw_32f/NO_ARGS/TITANRTX-SASS-IPOLY-GTO-32B-FRFCFS/justrun.sh directly, it shows:

Accel-Sim [build accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_4.0]

        *** GPGPU-Sim Simulator Version 4.2.0  [build gpgpu-sim_git-commit-6aa7ed16_modified_4.0] ***

GPGPU-Sim: Configuration options:

-save_embedded_ptx                      0 # saves ptx files embedded in binary as <n>.ptx
-keep                                   0 # keep intermediate files created by GPGPU-Sim when interfacing with external programs
-gpgpu_ptx_save_converted_ptxplus                    0 # Saved converted ptxplus to a file
-gpgpu_occupancy_sm_number                   75 # The SM number to pass to ptxas when getting register usage for computing GPU occupancy. This parameter is required in the config.
-ptx_opcode_latency_int           4,4,4,4,21 # Opcode latencies for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,19,25,145,32
-ptx_opcode_latency_fp           4,4,4,4,39 # Opcode latencies for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,30
-ptx_opcode_latency_dp      54,54,54,54,330 # Opcode latencies for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,335
-ptx_opcode_latency_sfu                   21 # Opcode latencies for SFU instructionsDefault 8
-ptx_opcode_latency_tesnor                   32 # Opcode latencies for Tensor instructionsDefault 64
-ptx_opcode_initiation_int            2,2,2,2,2 # Opcode initiation intervals for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,4,4,32,4
-ptx_opcode_initiation_fp            2,2,2,2,4 # Opcode initiation intervals for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,5
-ptx_opcode_initiation_dp      64,64,64,64,130 # Opcode initiation intervals for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,130
-ptx_opcode_initiation_sfu                    8 # Opcode initiation intervals for sfu instructionsDefault 8
-ptx_opcode_initiation_tensor                   32 # Opcode initiation intervals for tensor instructionsDefault 64
-cdp_latency         7200,8000,100,12000,1600 # CDP API latency <cudaStreamCreateWithFlags, cudaGetParameterBufferV2_init_perWarp, cudaGetParameterBufferV2_perKernel, cudaLaunchDeviceV2_init_perWarp, cudaLaunchDevicV2_perKernel>Default 7200,8000,100,12000,1600
-network_mode                           2 # Interconnection network mode
-inter_config_file                   mesh # Interconnection network config file
-icnt_in_buffer_limit                  512 # in_buffer_limit
-icnt_out_buffer_limit                  512 # out_buffer_limit
-icnt_subnets                           2 # subnets
-icnt_arbiter_algo                      1 # arbiter_algo
-icnt_verbose                           0 # inct_verbose
-icnt_grant_cycles                      1 # grant_cycles
-gpgpu_ptx_use_cuobjdump                    1 # Use cuobjdump to extract ptx and sass from binaries
-gpgpu_experimental_lib_support                    0 # Try to extract code from cuda libraries [Broken because of unknown cudaGetExportTable]
-checkpoint_option                      0 #  checkpointing flag (0 = no checkpoint)
-checkpoint_kernel                      1 #  checkpointing during execution of which kernel (1- 1st kernel)
-checkpoint_CTA                         0 #  checkpointing after # of CTA (< less than total CTA)
-resume_option                          0 #  resume flag (0 = no resume)
-resume_kernel                          0 #  Resume from which kernel (1= 1st kernel)
-resume_CTA                             0 #  resume from which CTA 
-checkpoint_CTA_t                       0 #  resume from which CTA 
-checkpoint_insn_Y                      0 #  resume from which CTA 
-gpgpu_ptx_convert_to_ptxplus                    0 # Convert SASS (native ISA) to ptxplus and run ptxplus
-gpgpu_ptx_force_max_capability                   75 # Force maximum compute capability
-gpgpu_ptx_inst_debug_to_file                    0 # Dump executed instructions' debug information to file
-gpgpu_ptx_inst_debug_file       inst_debug.txt # Executed instructions' debug output file
-gpgpu_ptx_inst_debug_thread_uid                    1 # Thread UID for executed instructions' debug output
-gpgpu_simd_model                       1 # 1 = post-dominator
-gpgpu_shader_core_pipeline              1024:32 # shader core pipeline config, i.e., {<nthread>:<warpsize>}
-gpgpu_tex_cache:l1  N:4:128:256,L:R:m:N:L,T:512:8,128:2 # per-shader L1 texture cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>:<rf>}
-gpgpu_const_cache:l1 N:128:64:8,L:R:f:N:L,S:2:64,4 # per-shader L1 constant memory cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>} 
-gpgpu_cache:il1     N:64:128:16,L:R:f:N:L,S:2:48,4 # shader L1 instruction cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>} 
-gpgpu_cache:dl1     S:4:128:64,L:T:m:L:L,A:256:32,16:0,32 # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_l1_cache_write_ratio                   25 # L1D write ratio
-gpgpu_l1_banks                         4 # The number of L1 cache banks
-gpgpu_l1_banks_byte_interleaving                   32 # l1 banks byte interleaving granularity
-gpgpu_l1_banks_hashing_function                    0 # l1 banks hashing function
-gpgpu_l1_latency                      32 # L1 Hit Latency
-gpgpu_smem_latency                    30 # smem Latency
-gpgpu_cache:dl1PrefL1                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_cache:dl1PrefShared                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_gmem_skip_L1D                    0 # global memory access skip L1D cache (implements -Xptxas -dlcm=cg, default=no skip)
-gpgpu_perfect_mem                      0 # enable perfect memory mode (no cache miss)
-n_regfile_gating_group                    4 # group of lanes that should be read/written together)
-gpgpu_clock_gated_reg_file                    0 # enable clock gated reg file for power calculations
-gpgpu_clock_gated_lanes                    0 # enable clock gated lanes for power calculations
-gpgpu_shader_registers                65536 # Number of registers per shader core. Limits number of concurrent CTAs. (default 8192)
-gpgpu_registers_per_block                65536 # Maximum number of registers per CTA. (default 8192)
-gpgpu_ignore_resources_limitation                    0 # gpgpu_ignore_resources_limitation (default 0)
-gpgpu_shader_cta                      16 # Maximum number of concurrent CTAs in shader (default 32)
-gpgpu_num_cta_barriers                   16 # Maximum number of named barriers per CTA (default 16)
-gpgpu_n_clusters                      72 # number of processing clusters
-gpgpu_n_cores_per_cluster                    1 # number of simd cores per cluster
-gpgpu_n_cluster_ejection_buffer_size                   32 # number of packets in ejection buffer
-gpgpu_n_ldst_response_buffer_size                    2 # number of response packets in ld/st unit ejection buffer
-gpgpu_shmem_per_block                49152 # Size of shared memory per thread block or CTA (default 48kB)
-gpgpu_shmem_size                   65536 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_option       0,8,16,32,64,64 # Option list of shared memory sizes
-gpgpu_unified_l1d_size                  128 # Size of unified data cache(L1D + shared memory) in KB
-gpgpu_adaptive_cache_config                    1 # adaptive_cache_config
-gpgpu_shmem_sizeDefault                65536 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefL1                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefShared                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_num_banks                   32 # Number of banks in the shared memory in each shader core (default 16)
-gpgpu_shmem_limited_broadcast                    0 # Limit shared memory to do one broadcast per cycle (default on)
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check 
-gpgpu_mem_unit_ports                    1 # The number of memory transactions allowed per core cycle
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check 
-gpgpu_warpdistro_shader                   -1 # Specify which shader core to collect the warp size distribution from
-gpgpu_warp_issue_shader                    0 # Specify which shader core to collect the warp issue distribution from
-gpgpu_local_mem_map                    1 # Mapping from local memory space address to simulated GPU physical address space (default = enabled)
-gpgpu_num_reg_banks                   16 # Number of register banks (default = 8)
-gpgpu_reg_bank_use_warp_id                    0 # Use warp ID in mapping registers to banks (default = off)
-gpgpu_sub_core_model                    1 # Sub Core Volta/Pascal model (default = off)
-gpgpu_enable_specialized_operand_collector                    0 # enable_specialized_operand_collector
-gpgpu_operand_collector_num_units_sp                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_dp                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_sfu                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_int                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_tensor_core                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_mem                    2 # number of collector units (default = 2)
-gpgpu_operand_collector_num_units_gen                    8 # number of collector units (default = 0)
-gpgpu_operand_collector_num_in_ports_sp                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_dp                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_int                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_tensor_core                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_gen                    8 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sp                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_dp                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_int                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_tensor_core                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_gen                    8 # number of collector unit in ports (default = 0)
-gpgpu_coalesce_arch                   75 # Coalescing arch (GT200 = 13, Fermi = 20)
-gpgpu_num_sched_per_core                    4 # Number of warp schedulers per core
-gpgpu_max_insn_issue_per_warp                    1 # Max number of instructions that can be issued per warp in one cycle by scheduler (either 1 or 2)
-gpgpu_dual_issue_diff_exec_units                    1 # should dual issue use two different execution unit resources (Default = 1)
-gpgpu_simt_core_sim_order                    1 # Select the simulation order of cores in a cluster (0=Fix, 1=Round-Robin)
-gpgpu_pipeline_widths 4,4,4,4,4,4,4,4,4,4,8,4,4 # Pipeline widths ID_OC_SP,ID_OC_DP,ID_OC_INT,ID_OC_SFU,ID_OC_MEM,OC_EX_SP,OC_EX_DP,OC_EX_INT,OC_EX_SFU,OC_EX_MEM,EX_WB,ID_OC_TENSOR_CORE,OC_EX_TENSOR_CORE
-gpgpu_tensor_core_avail                    1 # Tensor Core Available (default=0)
-gpgpu_num_sp_units                     4 # Number of SP units (default=1)
-gpgpu_num_dp_units                     4 # Number of DP units (default=0)
-gpgpu_num_int_units                    4 # Number of INT units (default=0)
-gpgpu_num_sfu_units                    4 # Number of SF units (default=1)
-gpgpu_num_tensor_core_units                    4 # Number of tensor_core units (default=1)
-gpgpu_num_mem_units                    1 # Number if ldst units (default=1) WARNING: not hooked up to anything
-gpgpu_scheduler                      lrr # Scheduler configuration: < lrr | gto | two_level_active > If two_level_active:<num_active_warps>:<inner_prioritization>:<outer_prioritization>For complete list of prioritization values see shader.h enum scheduler_prioritization_typeDefault: gto
-gpgpu_concurrent_kernel_sm                    0 # Support concurrent kernels on a SM (default = disabled)
-gpgpu_perfect_inst_const_cache                    1 # perfect inst and const cache mode, so all inst and const hits in the cache(default = disabled)
-gpgpu_inst_fetch_throughput                    4 # the number of fetched intruction per warp each cycle
-gpgpu_reg_file_port_throughput                    2 # the number ports of the register file
-specialized_unit_1         1,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_2       1,4,200,4,4,TEX # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_3      1,4,2,4,4,TENSOR # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_4         1,4,4,4,4,UDP # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_5         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_6         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_7         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_8         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-gpgpu_perf_sim_memcpy                    1 # Fill the L2 cache on memcpy
-gpgpu_simple_dram_model                    0 # simple_dram_model with fixed latency and BW
-gpgpu_dram_scheduler                    1 # 0 = fifo, 1 = FR-FCFS (defaul)
-gpgpu_dram_partition_queues          64:64:64:64 # i2$:$2d:d2$:$2i
-l2_ideal                               0 # Use a ideal L2 cache that always hit
-gpgpu_cache:dl2     S:512:128:16,L:B:m:L:X,A:192:4,32:0,32 # unified banked L2 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl2_texture_only                    0 # L2 cache used for texture only
-gpgpu_n_mem                            3 # number of memory modules (e.g. memory controllers) in gpu
-gpgpu_n_sub_partition_per_mchannel                    2 # number of memory subpartition in each memory module
-gpgpu_n_mem_per_ctrlr                    1 # number of memory chips per memory controller
-gpgpu_memlatency_stat                   14 # track and display latency statistics 0x2 enables MC, 0x4 enables queue logs
-gpgpu_frfcfs_dram_sched_queue_size                   64 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_return_queue_size                  192 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_buswidth                   16 # default = 4 bytes (8 bytes per cycle at DDR)
-gpgpu_dram_burst_length                    2 # Burst length of each DRAM request (default = 4 data bus cycle)
-dram_data_command_freq_ratio                    2 # Frequency ratio between DRAM data bus and command bus (default = 2 times, i.e. DDR)
-gpgpu_dram_timing_opt nbk=16:CCD=1:RRD=29:RCD=99:RAS=232:RP=99:RC=330:CL=99:WL=15:CDLR=22:WR=85:nbkgrp=4:CCDL=15:RTPL=29 # DRAM timing parameters = {nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tCDLR:tWR:nbkgrp:tCCDL:tRTPL}
-gpgpu_l2_rop_latency                  198 # ROP queue latency (default 85)
-dram_latency                          94 # DRAM latency (default 30)
-dram_dual_bus_interface                    1 # dual_bus_interface (default = 0) 
-dram_bnk_indexing_policy                    0 # dram_bnk_indexing_policy (0 = normal indexing, 1 = Xoring with the higher bits) (Default = 0)
-dram_bnkgrp_indexing_policy                    1 # dram_bnkgrp_indexing_policy (0 = take higher bits, 1 = take lower bits) (Default = 0)
-dram_seperate_write_queue_enable                    0 # Seperate_Write_Queue_Enable
-dram_write_queue_size             32:28:16 # Write_Queue_Size
-dram_elimnate_rw_turnaround                    0 # elimnate_rw_turnaround i.e set tWTR and tRTW = 0
-icnt_flit_size                        40 # icnt_flit_size
-gpgpu_mem_addr_mapping dramid@5;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RBBBCCCC.BCCSSSSS # mapping memory address to dram model {dramid@<start bit>;<memory address map>}
-gpgpu_mem_addr_test                    0 # run sweep test to check address mapping for aliased address
-gpgpu_mem_address_mask                    1 # 0 = old addressing mask, 1 = new addressing mask, 2 = new add. mask + flipped bank sel and chip sel bits
-gpgpu_memory_partition_indexing                    2 # 0 = no indexing, 1 = bitwise xoring, 2 = IPoly, 3 = custom indexing
-accelwattch_xml_file accelwattch_sass_sim.xml # AccelWattch XML file
-power_simulation_enabled                    0 # Turn on power simulator (1=On, 0=Off)
-power_per_cycle_dump                    0 # Dump detailed power output each cycle
-hw_perf_file_name            hw_perf.csv # Hardware Performance Statistics file
-hw_perf_bench_name                       # Kernel Name in Hardware Performance Statistics file
-power_simulation_mode                    0 # Switch performance counter input for power simulation (0=Sim, 1=HW, 2=HW-Sim Hybrid)
-dvfs_enabled                           0 # Turn on DVFS for power model
-aggregate_power_stats                    0 # Accumulate power across all kernels
-accelwattch_hybrid_perfsim_L1_RH                    0 # Get L1 Read Hits for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L1_RM                    0 # Get L1 Read Misses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L1_WH                    0 # Get L1 Write Hits for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L1_WM                    0 # Get L1 Write Misses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L2_RH                    0 # Get L2 Read Hits for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L2_RM                    0 # Get L2 Read Misses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L2_WH                    0 # Get L2 Write Hits for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L2_WM                    0 # Get L2 Write Misses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_CC_ACC                    0 # Get Constant Cache Acesses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_SHARED_ACC                    0 # Get Shared Memory Acesses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_DRAM_RD                    0 # Get DRAM Reads for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_DRAM_WR                    0 # Get DRAM Writes for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_NOC                    0 # Get Interconnect Acesses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_PIPE_DUTY                    0 # Get Pipeline Duty Cycle Acesses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_NUM_SM_IDLE                    0 # Get Number of Idle SMs for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_CYCLES                    0 # Get Executed Cycles for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_VOLTAGE                    0 # Get Chip Voltage for Accelwattch-Hybrid from Accel-Sim
-power_trace_enabled                    0 # produce a file for the power trace (1=On, 0=Off)
-power_trace_zlevel                     6 # Compression level of the power trace output log (0=no comp, 9=highest)
-steady_power_levels_enabled                    0 # produce a file for the steady power levels (1=On, 0=Off)
-steady_state_definition                  8:4 # allowed deviation:number of samples
-gpgpu_max_cycle                        0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_insn                         0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_cta                          0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_completed_cta                    0 # terminates gpu simulation early (0 = no limit)
-gpgpu_runtime_stat                   500 # display runtime statistics such as dram utilization {<freq>:<flag>}
-liveness_message_freq                    1 # Minimum number of seconds between simulation liveness messages (0 = always print)
-gpgpu_compute_capability_major                    7 # Major compute capability version number
-gpgpu_compute_capability_minor                    5 # Minor compute capability version number
-gpgpu_flush_l1_cache                    1 # Flush L1 cache at the end of each kernel call
-gpgpu_flush_l2_cache                    0 # Flush L2 cache at the end of each kernel call
-gpgpu_deadlock_detect                    1 # Stop the simulation at deadlock (1=on (default), 0=off)
-gpgpu_ptx_instruction_classification                    0 # if enabled will classify ptx instruction types per kernel (Max 255 kernels now)
-gpgpu_ptx_sim_mode                     0 # Select between Performance (default) or Functional simulation (1)
-gpgpu_clock_domains  1200:1200:1200:7001 # Clock Domain Frequencies in MhZ {<Core Clock>:<ICNT Clock>:<L2 Clock>:<DRAM Clock>}
-gpgpu_max_concurrent_kernel                   32 # maximum kernels that can run concurrently on GPU, set this value according to max resident grids for your compute capability
-gpgpu_cflog_interval                    0 # Interval between each snapshot in control flow logger
-visualizer_enabled                     0 # Turn on visualizer output (1=On, 0=Off)
-visualizer_outputfile                 NULL # Specifies the output log file for visualizer
-visualizer_zlevel                      6 # Compression level of the visualizer output log (0=no comp, 9=highest)
-gpgpu_stack_size_limit                 1024 # GPU thread stack size
-gpgpu_heap_size_limit              8388608 # GPU malloc heap size 
-gpgpu_runtime_sync_depth_limit                    2 # GPU device runtime synchronize depth
-gpgpu_runtime_pending_launch_count_limit                 2048 # GPU device runtime pending launch count
-trace_enabled                          0 # Turn on traces
-trace_components                    none # comma seperated list of traces to enable. Complete list found in trace_streams.tup. Default none
-trace_sampling_core                    0 # The core which is printed using CORE_DPRINTF. Default 0
-trace_sampling_memory_partition                   -1 # The memory partition which is printed using MEMPART_DPRINTF. Default -1 (i.e. all)
-enable_ptx_file_line_stats                    1 # Turn on PTX source line statistic profiling. (1 = On)
-ptx_line_stats_filename gpgpu_inst_stats.txt # Output file for PTX source line statistics.
-gpgpu_kernel_launch_latency                 7027 # Kernel launch latency in cycles. Default: 0
-gpgpu_cdp_enabled                      0 # Turn on CDP
-gpgpu_TB_launch_latency                    0 # thread block launch latency in cycles. Default: 0
-trace               ./traces/kernelslist.g # traces kernel filetraces kernel file directory
-trace_opcode_latency_initiation_int                  4,2 # Opcode latencies and initiation for integers in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sp                  4,2 # Opcode latencies and initiation for sp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_dp                54,64 # Opcode latencies and initiation for dp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sfu                 21,8 # Opcode latencies and initiation for sfu in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_tensor                  2,2 # Opcode latencies and initiation for tensor in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_spec_op_1                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_2                200,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_3                  2,2 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_4                  4,1 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_5                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_6                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_7                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_8                  4,4 # specialized unit config <latency,initiation>
DRAM Timing Options:
nbk                                    16 # number of banks
CCD                                     1 # column to column delay
RRD                                    29 # minimal delay between activation of rows in different banks
RCD                                    99 # row to column delay
RAS                                   232 # time needed to activate row
RP                                     99 # time needed to precharge (deactivate) row
RC                                    330 # row cycle time
CDLR                                   22 # switching from write to read (changes tWTR)
WR                                     85 # last data-in to row precharge
CL                                     99 # CAS latency
WL                                     15 # Write latency
nbkgrp                                  4 # number of bank groups
CCDL                                   15 # column to column delay between accesses to different bank groups
RTPL                                   29 # read to precharge delay between accesses to different bank groups
Total number of memory sub partition = 6
addr_dec_mask[CHIP]  = 0000000000000000         high:64 low:0
addr_dec_mask[BK]    = 0000000000007080         high:15 low:7
addr_dec_mask[ROW]   = 000000000fff8000         high:28 low:15
addr_dec_mask[COL]   = 0000000000000f7f         high:12 low:0
addr_dec_mask[BURST] = 000000000000001f         high:5 low:0
sub_partition_id_mask = 0000000000000080
GPGPU-Sim uArch: clock freqs: 1200000000.000000:1200000000.000000:1200000000.000000:7001000000.000000
GPGPU-Sim uArch: clock periods: 0.00000000083333333333:0.00000000083333333333:0.00000000083333333333:0.00000000014283673761
*** Initializing Memory Statistics ***
GPGPU-Sim uArch: performance model initialization complete.
Processing kernel ./traces/kernel-1.traceg
-kernel name = _Z6l1_latPjS_PmS0_
-kernel id = 1
-grid dim = (1,1,1)
-block dim = (1,1,1)
-shmem = 0
-nregs = 24
-binary version = 70
-cuda stream id = 0
-shmem base_addr = 0x00007f587c000000
-local mem base_addr = 0x00007f587e000000
-nvbit version = 1.5.3
-accelsim tracer version = 3
Header info loaded for kernel command : ./traces/kernel-1.traceg
launching kernel name: _Z6l1_latPjS_PmS0_ uid: 1
GPGPU-Sim uArch: Shader 0 bind to kernel 1 '_Z6l1_latPjS_PmS0_'
GPGPU-Sim uArch: CTA/core = 16, limited by: cta_limit
GPGPU-Sim: Reconfigure L1 cache to 128KB
thread block = 0,0,0
accel-sim.out: hashing.cc:89: unsigned int ipoly_hash_function(new_addr_type, unsigned int, unsigned int): Assertion `"\nmemory_partition_indexing error: The number of " "channels should be " "16, 32 or 64 for the hashing IPOLY index function. other banks " "numbers are not supported. Generate it by yourself! \n" && 0' failed.
./justrun.sh: line 1: 3267182 Aborted                 (core dumped) /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/../../sim_run_11.0/gpgpu-sim-builds/accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_4.0/accel-sim.out -config ./gpgpusim.config -trace ./traces/kernelslist.g

which indicates that my memory channel/bank count is wrong. Which option in my config file should be revised?

JRPan commented 2 weeks ago

First, don't mix release and dev: if you are using the dev branch of gpgpu-sim, then use the dev branch of accel-sim as well.

The last line just tells you the problem.

accel-sim.out: hashing.cc:89: unsigned int ipoly_hash_function(new_addr_type, unsigned int, unsigned int): Assertion `"\nmemory_partition_indexing error: The number of " "channels should be " "16, 32 or 64 for the hashing IPOLY index function. other banks " "numbers are not supported. Generate it by yourself! \n" && 0' failed.

Why is gpgpu_n_mem only 3 in your config? Is this expected?
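To spell the arithmetic out: with -gpgpu_n_mem 3 and -gpgpu_n_sub_partition_per_mchannel 2, the simulator ends up with 3 x 2 = 6 memory sub-partitions (the log prints "Total number of memory sub partition = 6"), and neither 3 nor 6 is one of the 16/32/64 channel counts the IPOLY hash supports, hence the abort.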

Leon924 commented 2 weeks ago

The gpgpu_n_mem parameter was generated by tuner.py, and it is wrong for the TITAN RTX. According to the Turing architecture whitepaper, TU102 (TITAN RTX) has 12 memory controllers, the same as the RTX2060 config. So how does tuner.py, using GPU_Microbenchmark, end up with a gpgpu_n_mem of 3? Here is the stats.txt obtained by running ./GPU_Microbenchmark/run_all.sh | tee stats.txt; the relevant part of the output shows:

/////////////////////////////////
running ./mem_config microbenchmark
Global memory size = 24 GB
Memory Clock rate = 7001 Mhz
Memory Bus Width = 384 bit
Memory type = HBM
Memory channels = 3

//Accel_Sim config: 
-gpgpu_n_mem 3
-gpgpu_n_mem_per_ctrlr 1
-gpgpu_dram_buswidth 16
-gpgpu_dram_burst_length 2
-dram_data_command_freq_ratio 2
-dram_dual_bus_interface 1
-gpgpu_dram_timing_opt nbk=16:CCD=1:RRD=29:RCD=99:RAS=232:RP=99:RC=330:CL=99:WL=15:CDLR=22:WR=85:nbkgrp=4:CCDL=15:RTPL=29
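One detail that looks suspicious to me: the benchmark reports "Memory type = HBM", and 384 bit / 128 bit per HBM channel = 3, whereas 384 bit / 32 bit per GDDR6 channel = 12. So the channel count of 3 may simply follow from the memory type being treated as HBM rather than GDDR6; I have not confirmed this against the mem_config source.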

Why does the microbenchmark give us the wrong number? Here is my newest hw_def file:

// TITAN RTX HW def file
// based on TU102
#ifndef TURING_TITANRTX_DEF_H
#define TURING_TITANRTX_DEF_H

#include "./common/common.h"
#include "./common/deviceQuery.h"

#define L1_SIZE (96 * 1024) // Max L1 size in bytes, NVIDIA-Turing-Architecture-Whitepaper page13

#define CLK_FREQUENCY 1350 // Base frequency in MHz (Boost clock can go up to 1770 MHz)

#define ISSUE_MODEL issue_model::single   // single issue core or dual issue
#define CORE_MODEL core_model::subcore    // subcore model or shared model: 
#define DRAM_MODEL dram_model::GDDR6      // memory type; checked
#define WARP_SCHEDS_PER_SM 4         // number of warp schedulers per SM; NVIDIA-Turing-Architecture-Whitepaper page17, each processing block has one, each SM has four processing blocks.

// number of SASS HMMA per 16x16 PTX WMMA for FP16 - FP32 accumulate operation
#define SASS_hmma_per_PTX_wmma 4 

// These vars are almost constant between HW generation
// see slide 24 from Nvidia at
// https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21730-inside-the-nvidia-ampere-architecture.pdf
#define L2_BANKS_PER_MEM_CHANNEL 2 // 6 L2 banks, 3 memory channels
#define L2_BANK_WIDTH_in_BYTE 32   // 32 * 6 banks = 192 L2 cache BW

#endif