Open Leon924 opened 4 months ago
We don't have one for TITAN RTX. You can change the 2060 config. Change the number of SMs and clock frequency to match your card. SM-related stuff should be similar. Change L2 size as well.
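As a rough sketch, these are the gpgpusim.config knobs that advice maps to (the values below are assumptions to verify against the TU102 whitepaper, not a tested TITAN RTX config):

-gpgpu_n_clusters 72 # TU102 / TITAN RTX: 72 SMs
-gpgpu_n_cores_per_cluster 1
-gpgpu_clock_domains 1350.0:1350.0:1350.0:7001.0 # <Core>:<ICNT>:<L2>:<DRAM> in MHz; DRAM value as reported by deviceQuery
-gpgpu_n_mem 12 # memory controllers: 384-bit bus split into 32-bit GDDR6 channels
-gpgpu_cache:dl2 S:128:128:16,L:B:m:L:X,A:192:4,32:0,32 # per-sub-partition L2

The L2 option is per memory sub-partition, so total capacity = nsets x line size x associativity x (gpgpu_n_mem x gpgpu_n_sub_partition_per_mchannel); with the values above that is 128 x 128 B x 16 x 24 = 6 MB, which is what the whitepaper lists for TU102.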
Thanks. I have tried to generate a config file with the accel-sim tuner; here is my turing_TITANRTX_hw_def.h file. Starting from the turing_rtx2060 header, I changed the frequency, the L1 cache size, and WARP_SCHEDS_PER_SM.
// TITAN RTX HW def file
// based on TU102
#ifndef TURING_TITANRTX_DEF_H
#define TURING_TITANRTX_DEF_H
#include "./common/common.h"
#include "./common/deviceQuery.h"
#define L1_SIZE (96 * 1024) // Max L1 size in bytes, NVIDIA-Turing-Architecture-Whitepaper page13
#define CLK_FREQUENCY 1350 // Base frequency in MHz (Boost clock can go up to 1770 MHz)
#define ISSUE_MODEL issue_model::single // single issue core or dual issue
#define CORE_MODEL core_model::subcore // subcore model or shared model:
#define DRAM_MODEL dram_model::GDDR6 // memory type; checked
#define WARP_SCHEDS_PER_SM 1 // number of warp schedulers per SM; NVIDIA-Turing-Architecture-Whitepaper page17
// number of SASS HMMA per 16x16 PTX WMMA for the FP16 -> FP32 accumulate operation
#define SASS_hmma_per_PTX_wmma 4
// These vars are almost constant between HW generations
// see slide 24 from Nvidia at
// https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21730-inside-the-nvidia-ampere-architecture.pdf
#define L2_BANKS_PER_MEM_CHANNEL 2
#define L2_BANK_WIDTH_in_BYTE 32
#endif
I used tuner.py to generate the config file and moved it to the required folders. Then, when I use the run_simulations.py script to launch all 16 possible combinations, the following error happens:
../job_launching/run_simulations.py \
-T /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/hw_run/traces/device-0/11.0 \
-C TITANRTX-SASS,\
TITANRTX-SASS-LINEAR-RR-32B-FRFCFS,\
TITANRTX-SASS-LINEAR-RR-32B-FCFS,\
TITANRTX-SASS-LINEAR-RR-256B-FRFCFS,\
TITANRTX-SASS-LINEAR-RR-256B-FCFS,\
TITANRTX-SASS-LINEAR-GTO-32B-FRFCFS,\
TITANRTX-SASS-LINEAR-GTO-32B-FCFS,\
TITANRTX-SASS-LINEAR-GTO-256B-FRFCFS,\
TITANRTX-SASS-LINEAR-GTO-256B-FCFS,\
TITANRTX-SASS-IPOLY-RR-32B-FRFCFS,\
TITANRTX-SASS-IPOLY-RR-32B-FCFS,\
TITANRTX-SASS-IPOLY-RR-256B-FRFCFS,\
TITANRTX-SASS-IPOLY-RR-256B-FCFS,\
TITANRTX-SASS-IPOLY-GTO-32B-FRFCFS,\
TITANRTX-SASS-IPOLY-GTO-32B-FCFS,\
TITANRTX-SASS-IPOLY-GTO-256B-FRFCFS,\
TITANRTX-SASS-IPOLY-GTO-256B-FCFS \
-N tuning -B GPU_Microbenchmark
------------------
**********************************************************
**********************************************************
l1_bw_32f_unroll_large-NO_ARGS--TITANRTX-SASS-IPOLY-GTO-256B-FCFS. Status=COMPLETE_ERR_FILE_HAS_CONTENTS
Last 10 line of /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/../../sim_run_11.0/l1_bw_32f_unroll_large/NO_ARGS/TITANRTX-SASS-IPOLY-GTO-256B-FCFS/l1_bw_32f_unroll_large-NO_ARGS.accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0.o574
------------------
*** GPGPU-Sim Simulator Version 4.2.0 [build gpgpu-sim_git-commit-6aa7ed16_modified_0.0] ***
Accel-Sim [build accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0]
doing: /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/../../sim_run_11.0/gpgpu-sim-builds/accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0/accel-sim.out -config ./gpgpusim.config -trace ./traces/kernelslist.g
doing export CUDA_LAUNCH_BLOCKING=1
doing: export PATH=/home/data/userhome/liqiang/lab/gpu/accel-sim-framework/gpu-simulator/gpgpu-sim/bin:/usr/local/cuda-11.0/bin:/home/data/userhome/liqiang/tool/makedepend106/bin:/usr/local/cuda-11.0/bin:/home/data/userhome/liqiang/tool/vscode/VSCode-linux-x64/bin:/home/data/userhome/liqiang/tool/swig402/bin:/home/data/userhome/liqiang/lab/package/cmake-3.18.3-Linux-x86_64/bin:/home/data/userhome/liqiang/tool/cmake-3.18.3-Linux-x86_64/bin:home/data/userhome/liqiang/lab/hpvm-release/hpvm/build:/home/data/userhome/liqiang/tool/pycharm-community-2021.3/bin:/home/data/userhome/liqiang/Downloads/GmSSl/bin:/home/data/userhome/liqiang/Downloads/valgrind-3.19.0/bin:/home/data/userhome/liqiang/Downloads/cmake-3.23.1/bin:/home/data/userhome/liqiang/Downloads/qemu-riscv/bin:/home/data/userhome/liqiang/Documents:/home/data/userhome/liqiang/tool/anaconda3/envs/accelsim/bin:/home/data/userhome/liqiang/tool/anaconda3/condabin:/home/data/userhome/liqiang/Downloads/rar:/home/data/userhome/liqiang/Downloads/verilator/bin:/home/data/userhome/liqiang/.cargo/bin:/home/data/userhome/liqiang/.vscode-server/bin/dc96b837cf6bb4af9cd736aa3af08cf8279f7685/bin/remote-cli:/usr/local/cuda-10.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/user/Documents/llvm/bin:/usr/local/cuda-10.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/data/userhome/wanli/Downloads/clion-2021.2.3/bin:/newlib/bin:/linux/bin
doing
doing: export OPENCL_REMOTE_GPU_HOST=REPLACE_REMOTE_HOST
------------------
Contents of /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/../../sim_run_11.0/l1_bw_32f_unroll_large/NO_ARGS/TITANRTX-SASS-IPOLY-GTO-256B-FCFS/l1_bw_32f_unroll_large-NO_ARGS.accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_0.0.e574
------------------
GPGPU-Sim ** ERROR: Unknown Option: '-memory_partition_indexing'
------------------
**********************************************************
Sleeping for 30s
Versions in use: accel-sim-framework: release, gpgpu-sim: dev.
Only one config, TITANRTX-SASS, successfully ran the simulation; all the others failed. How can I fix these errors? It would be great if you could give me some hints.
Finally, I found the mistake in the repo, in ./accel-sim-framework/util/job_launching/configs/define-standard-cfgs.yml: the parameters need the "gpgpu_" prefix in the LINEAR and IPOLY extra_params entries, as spelled out below.
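Spelled out, the corrected entries in define-standard-cfgs.yml would look like this (indentation assumed to match the existing entries in that file):

LINEAR:
    extra_params: "-gpgpu_memory_partition_indexing 0"
IPOLY:
    extra_params: "-gpgpu_memory_partition_indexing 2"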
However, all the IPOLY-related simulations still failed. I tried to run /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/sim_run_11.0/l1_bw_32f/NO_ARGS/TITANRTX-SASS-IPOLY-GTO-32B-FRFCFS/justrun.sh directly, and it shows:
Accel-Sim [build accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_4.0]
*** GPGPU-Sim Simulator Version 4.2.0 [build gpgpu-sim_git-commit-6aa7ed16_modified_4.0] ***
GPGPU-Sim: Configuration options:
-save_embedded_ptx 0 # saves ptx files embedded in binary as <n>.ptx
-keep 0 # keep intermediate files created by GPGPU-Sim when interfacing with external programs
-gpgpu_ptx_save_converted_ptxplus 0 # Saved converted ptxplus to a file
-gpgpu_occupancy_sm_number 75 # The SM number to pass to ptxas when getting register usage for computing GPU occupancy. This parameter is required in the config.
-ptx_opcode_latency_int 4,4,4,4,21 # Opcode latencies for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,19,25,145,32
-ptx_opcode_latency_fp 4,4,4,4,39 # Opcode latencies for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,30
-ptx_opcode_latency_dp 54,54,54,54,330 # Opcode latencies for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,335
-ptx_opcode_latency_sfu 21 # Opcode latencies for SFU instructionsDefault 8
-ptx_opcode_latency_tesnor 32 # Opcode latencies for Tensor instructionsDefault 64
-ptx_opcode_initiation_int 2,2,2,2,2 # Opcode initiation intervals for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,4,4,32,4
-ptx_opcode_initiation_fp 2,2,2,2,4 # Opcode initiation intervals for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,5
-ptx_opcode_initiation_dp 64,64,64,64,130 # Opcode initiation intervals for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,130
-ptx_opcode_initiation_sfu 8 # Opcode initiation intervals for sfu instructionsDefault 8
-ptx_opcode_initiation_tensor 32 # Opcode initiation intervals for tensor instructionsDefault 64
-cdp_latency 7200,8000,100,12000,1600 # CDP API latency <cudaStreamCreateWithFlags, cudaGetParameterBufferV2_init_perWarp, cudaGetParameterBufferV2_perKernel, cudaLaunchDeviceV2_init_perWarp, cudaLaunchDevicV2_perKernel>Default 7200,8000,100,12000,1600
-network_mode 2 # Interconnection network mode
-inter_config_file mesh # Interconnection network config file
-icnt_in_buffer_limit 512 # in_buffer_limit
-icnt_out_buffer_limit 512 # out_buffer_limit
-icnt_subnets 2 # subnets
-icnt_arbiter_algo 1 # arbiter_algo
-icnt_verbose 0 # inct_verbose
-icnt_grant_cycles 1 # grant_cycles
-gpgpu_ptx_use_cuobjdump 1 # Use cuobjdump to extract ptx and sass from binaries
-gpgpu_experimental_lib_support 0 # Try to extract code from cuda libraries [Broken because of unknown cudaGetExportTable]
-checkpoint_option 0 # checkpointing flag (0 = no checkpoint)
-checkpoint_kernel 1 # checkpointing during execution of which kernel (1- 1st kernel)
-checkpoint_CTA 0 # checkpointing after # of CTA (< less than total CTA)
-resume_option 0 # resume flag (0 = no resume)
-resume_kernel 0 # Resume from which kernel (1= 1st kernel)
-resume_CTA 0 # resume from which CTA
-checkpoint_CTA_t 0 # resume from which CTA
-checkpoint_insn_Y 0 # resume from which CTA
-gpgpu_ptx_convert_to_ptxplus 0 # Convert SASS (native ISA) to ptxplus and run ptxplus
-gpgpu_ptx_force_max_capability 75 # Force maximum compute capability
-gpgpu_ptx_inst_debug_to_file 0 # Dump executed instructions' debug information to file
-gpgpu_ptx_inst_debug_file inst_debug.txt # Executed instructions' debug output file
-gpgpu_ptx_inst_debug_thread_uid 1 # Thread UID for executed instructions' debug output
-gpgpu_simd_model 1 # 1 = post-dominator
-gpgpu_shader_core_pipeline 1024:32 # shader core pipeline config, i.e., {<nthread>:<warpsize>}
-gpgpu_tex_cache:l1 N:4:128:256,L:R:m:N:L,T:512:8,128:2 # per-shader L1 texture cache (READ-ONLY) config {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>:<rf>}
-gpgpu_const_cache:l1 N:128:64:8,L:R:f:N:L,S:2:64,4 # per-shader L1 constant memory cache (READ-ONLY) config {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:il1 N:64:128:16,L:R:f:N:L,S:2:48,4 # shader L1 instruction cache config {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl1 S:4:128:64,L:T:m:L:L,A:256:32,16:0,32 # per-shader L1 data cache config {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_l1_cache_write_ratio 25 # L1D write ratio
-gpgpu_l1_banks 4 # The number of L1 cache banks
-gpgpu_l1_banks_byte_interleaving 32 # l1 banks byte interleaving granularity
-gpgpu_l1_banks_hashing_function 0 # l1 banks hashing function
-gpgpu_l1_latency 32 # L1 Hit Latency
-gpgpu_smem_latency 30 # smem Latency
-gpgpu_cache:dl1PrefL1 none # per-shader L1 data cache config {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_cache:dl1PrefShared none # per-shader L1 data cache config {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_gmem_skip_L1D 0 # global memory access skip L1D cache (implements -Xptxas -dlcm=cg, default=no skip)
-gpgpu_perfect_mem 0 # enable perfect memory mode (no cache miss)
-n_regfile_gating_group 4 # group of lanes that should be read/written together)
-gpgpu_clock_gated_reg_file 0 # enable clock gated reg file for power calculations
-gpgpu_clock_gated_lanes 0 # enable clock gated lanes for power calculations
-gpgpu_shader_registers 65536 # Number of registers per shader core. Limits number of concurrent CTAs. (default 8192)
-gpgpu_registers_per_block 65536 # Maximum number of registers per CTA. (default 8192)
-gpgpu_ignore_resources_limitation 0 # gpgpu_ignore_resources_limitation (default 0)
-gpgpu_shader_cta 16 # Maximum number of concurrent CTAs in shader (default 32)
-gpgpu_num_cta_barriers 16 # Maximum number of named barriers per CTA (default 16)
-gpgpu_n_clusters 72 # number of processing clusters
-gpgpu_n_cores_per_cluster 1 # number of simd cores per cluster
-gpgpu_n_cluster_ejection_buffer_size 32 # number of packets in ejection buffer
-gpgpu_n_ldst_response_buffer_size 2 # number of response packets in ld/st unit ejection buffer
-gpgpu_shmem_per_block 49152 # Size of shared memory per thread block or CTA (default 48kB)
-gpgpu_shmem_size 65536 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_option 0,8,16,32,64,64 # Option list of shared memory sizes
-gpgpu_unified_l1d_size 128 # Size of unified data cache(L1D + shared memory) in KB
-gpgpu_adaptive_cache_config 1 # adaptive_cache_config
-gpgpu_shmem_sizeDefault 65536 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefL1 16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefShared 16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_num_banks 32 # Number of banks in the shared memory in each shader core (default 16)
-gpgpu_shmem_limited_broadcast 0 # Limit shared memory to do one broadcast per cycle (default on)
-gpgpu_shmem_warp_parts 1 # Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_mem_unit_ports 1 # The number of memory transactions allowed per core cycle
-gpgpu_shmem_warp_parts 1 # Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_warpdistro_shader -1 # Specify which shader core to collect the warp size distribution from
-gpgpu_warp_issue_shader 0 # Specify which shader core to collect the warp issue distribution from
-gpgpu_local_mem_map 1 # Mapping from local memory space address to simulated GPU physical address space (default = enabled)
-gpgpu_num_reg_banks 16 # Number of register banks (default = 8)
-gpgpu_reg_bank_use_warp_id 0 # Use warp ID in mapping registers to banks (default = off)
-gpgpu_sub_core_model 1 # Sub Core Volta/Pascal model (default = off)
-gpgpu_enable_specialized_operand_collector 0 # enable_specialized_operand_collector
-gpgpu_operand_collector_num_units_sp 4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_dp 0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_sfu 4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_int 0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_tensor_core 4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_mem 2 # number of collector units (default = 2)
-gpgpu_operand_collector_num_units_gen 8 # number of collector units (default = 0)
-gpgpu_operand_collector_num_in_ports_sp 1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_dp 0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_sfu 1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_int 0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_tensor_core 1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_mem 1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_gen 8 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sp 1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_dp 0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sfu 1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_int 0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_tensor_core 1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_mem 1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_gen 8 # number of collector unit in ports (default = 0)
-gpgpu_coalesce_arch 75 # Coalescing arch (GT200 = 13, Fermi = 20)
-gpgpu_num_sched_per_core 4 # Number of warp schedulers per core
-gpgpu_max_insn_issue_per_warp 1 # Max number of instructions that can be issued per warp in one cycle by scheduler (either 1 or 2)
-gpgpu_dual_issue_diff_exec_units 1 # should dual issue use two different execution unit resources (Default = 1)
-gpgpu_simt_core_sim_order 1 # Select the simulation order of cores in a cluster (0=Fix, 1=Round-Robin)
-gpgpu_pipeline_widths 4,4,4,4,4,4,4,4,4,4,8,4,4 # Pipeline widths ID_OC_SP,ID_OC_DP,ID_OC_INT,ID_OC_SFU,ID_OC_MEM,OC_EX_SP,OC_EX_DP,OC_EX_INT,OC_EX_SFU,OC_EX_MEM,EX_WB,ID_OC_TENSOR_CORE,OC_EX_TENSOR_CORE
-gpgpu_tensor_core_avail 1 # Tensor Core Available (default=0)
-gpgpu_num_sp_units 4 # Number of SP units (default=1)
-gpgpu_num_dp_units 4 # Number of DP units (default=0)
-gpgpu_num_int_units 4 # Number of INT units (default=0)
-gpgpu_num_sfu_units 4 # Number of SF units (default=1)
-gpgpu_num_tensor_core_units 4 # Number of tensor_core units (default=1)
-gpgpu_num_mem_units 1 # Number if ldst units (default=1) WARNING: not hooked up to anything
-gpgpu_scheduler lrr # Scheduler configuration: < lrr | gto | two_level_active > If two_level_active:<num_active_warps>:<inner_prioritization>:<outer_prioritization>For complete list of prioritization values see shader.h enum scheduler_prioritization_typeDefault: gto
-gpgpu_concurrent_kernel_sm 0 # Support concurrent kernels on a SM (default = disabled)
-gpgpu_perfect_inst_const_cache 1 # perfect inst and const cache mode, so all inst and const hits in the cache(default = disabled)
-gpgpu_inst_fetch_throughput 4 # the number of fetched intruction per warp each cycle
-gpgpu_reg_file_port_throughput 2 # the number ports of the register file
-specialized_unit_1 1,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_2 1,4,200,4,4,TEX # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_3 1,4,2,4,4,TENSOR # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_4 1,4,4,4,4,UDP # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_5 0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_6 0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_7 0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_8 0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-gpgpu_perf_sim_memcpy 1 # Fill the L2 cache on memcpy
-gpgpu_simple_dram_model 0 # simple_dram_model with fixed latency and BW
-gpgpu_dram_scheduler 1 # 0 = fifo, 1 = FR-FCFS (defaul)
-gpgpu_dram_partition_queues 64:64:64:64 # i2$:$2d:d2$:$2i
-l2_ideal 0 # Use a ideal L2 cache that always hit
-gpgpu_cache:dl2 S:512:128:16,L:B:m:L:X,A:192:4,32:0,32 # unified banked L2 data cache config {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl2_texture_only 0 # L2 cache used for texture only
-gpgpu_n_mem 3 # number of memory modules (e.g. memory controllers) in gpu
-gpgpu_n_sub_partition_per_mchannel 2 # number of memory subpartition in each memory module
-gpgpu_n_mem_per_ctrlr 1 # number of memory chips per memory controller
-gpgpu_memlatency_stat 14 # track and display latency statistics 0x2 enables MC, 0x4 enables queue logs
-gpgpu_frfcfs_dram_sched_queue_size 64 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_return_queue_size 192 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_buswidth 16 # default = 4 bytes (8 bytes per cycle at DDR)
-gpgpu_dram_burst_length 2 # Burst length of each DRAM request (default = 4 data bus cycle)
-dram_data_command_freq_ratio 2 # Frequency ratio between DRAM data bus and command bus (default = 2 times, i.e. DDR)
-gpgpu_dram_timing_opt nbk=16:CCD=1:RRD=29:RCD=99:RAS=232:RP=99:RC=330:CL=99:WL=15:CDLR=22:WR=85:nbkgrp=4:CCDL=15:RTPL=29 # DRAM timing parameters = {nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tCDLR:tWR:nbkgrp:tCCDL:tRTPL}
-gpgpu_l2_rop_latency 198 # ROP queue latency (default 85)
-dram_latency 94 # DRAM latency (default 30)
-dram_dual_bus_interface 1 # dual_bus_interface (default = 0)
-dram_bnk_indexing_policy 0 # dram_bnk_indexing_policy (0 = normal indexing, 1 = Xoring with the higher bits) (Default = 0)
-dram_bnkgrp_indexing_policy 1 # dram_bnkgrp_indexing_policy (0 = take higher bits, 1 = take lower bits) (Default = 0)
-dram_seperate_write_queue_enable 0 # Seperate_Write_Queue_Enable
-dram_write_queue_size 32:28:16 # Write_Queue_Size
-dram_elimnate_rw_turnaround 0 # elimnate_rw_turnaround i.e set tWTR and tRTW = 0
-icnt_flit_size 40 # icnt_flit_size
-gpgpu_mem_addr_mapping dramid@5;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RBBBCCCC.BCCSSSSS # mapping memory address to dram model {dramid@<start bit>;<memory address map>}
-gpgpu_mem_addr_test 0 # run sweep test to check address mapping for aliased address
-gpgpu_mem_address_mask 1 # 0 = old addressing mask, 1 = new addressing mask, 2 = new add. mask + flipped bank sel and chip sel bits
-gpgpu_memory_partition_indexing 2 # 0 = no indexing, 1 = bitwise xoring, 2 = IPoly, 3 = custom indexing
-accelwattch_xml_file accelwattch_sass_sim.xml # AccelWattch XML file
-power_simulation_enabled 0 # Turn on power simulator (1=On, 0=Off)
-power_per_cycle_dump 0 # Dump detailed power output each cycle
-hw_perf_file_name hw_perf.csv # Hardware Performance Statistics file
-hw_perf_bench_name # Kernel Name in Hardware Performance Statistics file
-power_simulation_mode 0 # Switch performance counter input for power simulation (0=Sim, 1=HW, 2=HW-Sim Hybrid)
-dvfs_enabled 0 # Turn on DVFS for power model
-aggregate_power_stats 0 # Accumulate power across all kernels
-accelwattch_hybrid_perfsim_L1_RH 0 # Get L1 Read Hits for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L1_RM 0 # Get L1 Read Misses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L1_WH 0 # Get L1 Write Hits for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L1_WM 0 # Get L1 Write Misses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L2_RH 0 # Get L2 Read Hits for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L2_RM 0 # Get L2 Read Misses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L2_WH 0 # Get L2 Write Hits for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_L2_WM 0 # Get L2 Write Misses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_CC_ACC 0 # Get Constant Cache Acesses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_SHARED_ACC 0 # Get Shared Memory Acesses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_DRAM_RD 0 # Get DRAM Reads for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_DRAM_WR 0 # Get DRAM Writes for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_NOC 0 # Get Interconnect Acesses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_PIPE_DUTY 0 # Get Pipeline Duty Cycle Acesses for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_NUM_SM_IDLE 0 # Get Number of Idle SMs for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_CYCLES 0 # Get Executed Cycles for Accelwattch-Hybrid from Accel-Sim
-accelwattch_hybrid_perfsim_VOLTAGE 0 # Get Chip Voltage for Accelwattch-Hybrid from Accel-Sim
-power_trace_enabled 0 # produce a file for the power trace (1=On, 0=Off)
-power_trace_zlevel 6 # Compression level of the power trace output log (0=no comp, 9=highest)
-steady_power_levels_enabled 0 # produce a file for the steady power levels (1=On, 0=Off)
-steady_state_definition 8:4 # allowed deviation:number of samples
-gpgpu_max_cycle 0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_insn 0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_cta 0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_completed_cta 0 # terminates gpu simulation early (0 = no limit)
-gpgpu_runtime_stat 500 # display runtime statistics such as dram utilization {<freq>:<flag>}
-liveness_message_freq 1 # Minimum number of seconds between simulation liveness messages (0 = always print)
-gpgpu_compute_capability_major 7 # Major compute capability version number
-gpgpu_compute_capability_minor 5 # Minor compute capability version number
-gpgpu_flush_l1_cache 1 # Flush L1 cache at the end of each kernel call
-gpgpu_flush_l2_cache 0 # Flush L2 cache at the end of each kernel call
-gpgpu_deadlock_detect 1 # Stop the simulation at deadlock (1=on (default), 0=off)
-gpgpu_ptx_instruction_classification 0 # if enabled will classify ptx instruction types per kernel (Max 255 kernels now)
-gpgpu_ptx_sim_mode 0 # Select between Performance (default) or Functional simulation (1)
-gpgpu_clock_domains 1200:1200:1200:7001 # Clock Domain Frequencies in MhZ {<Core Clock>:<ICNT Clock>:<L2 Clock>:<DRAM Clock>}
-gpgpu_max_concurrent_kernel 32 # maximum kernels that can run concurrently on GPU, set this value according to max resident grids for your compute capability
-gpgpu_cflog_interval 0 # Interval between each snapshot in control flow logger
-visualizer_enabled 0 # Turn on visualizer output (1=On, 0=Off)
-visualizer_outputfile NULL # Specifies the output log file for visualizer
-visualizer_zlevel 6 # Compression level of the visualizer output log (0=no comp, 9=highest)
-gpgpu_stack_size_limit 1024 # GPU thread stack size
-gpgpu_heap_size_limit 8388608 # GPU malloc heap size
-gpgpu_runtime_sync_depth_limit 2 # GPU device runtime synchronize depth
-gpgpu_runtime_pending_launch_count_limit 2048 # GPU device runtime pending launch count
-trace_enabled 0 # Turn on traces
-trace_components none # comma seperated list of traces to enable. Complete list found in trace_streams.tup. Default none
-trace_sampling_core 0 # The core which is printed using CORE_DPRINTF. Default 0
-trace_sampling_memory_partition -1 # The memory partition which is printed using MEMPART_DPRINTF. Default -1 (i.e. all)
-enable_ptx_file_line_stats 1 # Turn on PTX source line statistic profiling. (1 = On)
-ptx_line_stats_filename gpgpu_inst_stats.txt # Output file for PTX source line statistics.
-gpgpu_kernel_launch_latency 7027 # Kernel launch latency in cycles. Default: 0
-gpgpu_cdp_enabled 0 # Turn on CDP
-gpgpu_TB_launch_latency 0 # thread block launch latency in cycles. Default: 0
-trace ./traces/kernelslist.g # traces kernel filetraces kernel file directory
-trace_opcode_latency_initiation_int 4,2 # Opcode latencies and initiation for integers in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sp 4,2 # Opcode latencies and initiation for sp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_dp 54,64 # Opcode latencies and initiation for dp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sfu 21,8 # Opcode latencies and initiation for sfu in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_tensor 2,2 # Opcode latencies and initiation for tensor in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_spec_op_1 4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_2 200,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_3 2,2 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_4 4,1 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_5 4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_6 4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_7 4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_8 4,4 # specialized unit config <latency,initiation>
DRAM Timing Options:
nbk 16 # number of banks
CCD 1 # column to column delay
RRD 29 # minimal delay between activation of rows in different banks
RCD 99 # row to column delay
RAS 232 # time needed to activate row
RP 99 # time needed to precharge (deactivate) row
RC 330 # row cycle time
CDLR 22 # switching from write to read (changes tWTR)
WR 85 # last data-in to row precharge
CL 99 # CAS latency
WL 15 # Write latency
nbkgrp 4 # number of bank groups
CCDL 15 # column to column delay between accesses to different bank groups
RTPL 29 # read to precharge delay between accesses to different bank groups
Total number of memory sub partition = 6
addr_dec_mask[CHIP] = 0000000000000000 high:64 low:0
addr_dec_mask[BK] = 0000000000007080 high:15 low:7
addr_dec_mask[ROW] = 000000000fff8000 high:28 low:15
addr_dec_mask[COL] = 0000000000000f7f high:12 low:0
addr_dec_mask[BURST] = 000000000000001f high:5 low:0
sub_partition_id_mask = 0000000000000080
GPGPU-Sim uArch: clock freqs: 1200000000.000000:1200000000.000000:1200000000.000000:7001000000.000000
GPGPU-Sim uArch: clock periods: 0.00000000083333333333:0.00000000083333333333:0.00000000083333333333:0.00000000014283673761
*** Initializing Memory Statistics ***
GPGPU-Sim uArch: performance model initialization complete.
Processing kernel ./traces/kernel-1.traceg
-kernel name = _Z6l1_latPjS_PmS0_
-kernel id = 1
-grid dim = (1,1,1)
-block dim = (1,1,1)
-shmem = 0
-nregs = 24
-binary version = 70
-cuda stream id = 0
-shmem base_addr = 0x00007f587c000000
-local mem base_addr = 0x00007f587e000000
-nvbit version = 1.5.3
-accelsim tracer version = 3
Header info loaded for kernel command : ./traces/kernel-1.traceg
launching kernel name: _Z6l1_latPjS_PmS0_ uid: 1
GPGPU-Sim uArch: Shader 0 bind to kernel 1 '_Z6l1_latPjS_PmS0_'
GPGPU-Sim uArch: CTA/core = 16, limited by: cta_limit
GPGPU-Sim: Reconfigure L1 cache to 128KB
thread block = 0,0,0
accel-sim.out: hashing.cc:89: unsigned int ipoly_hash_function(new_addr_type, unsigned int, unsigned int): Assertion `"\nmemory_partition_indexing error: The number of " "channels should be " "16, 32 or 64 for the hashing IPOLY index function. other banks " "numbers are not supported. Generate it by yourself! \n" && 0' failed.
./justrun.sh: line 1: 3267182 Aborted (core dumped) /home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/../../sim_run_11.0/gpgpu-sim-builds/accelsim-commit-2260456ea5e6a1420f5734f145a4b7d8ab1d4737_modified_4.0/accel-sim.out -config ./gpgpusim.config -trace ./traces/kernelslist.g
This indicates I have a wrong channel/bank-count setting. Which option in my config file should be revised?
First, don't mix release/dev. If you are using dev for gpgpu-sim then use dev for accel-sim as well.
The last line just tells you the problem.
accel-sim.out: hashing.cc:89: unsigned int ipoly_hash_function(new_addr_type, unsigned int, unsigned int): Assertion `"\nmemory_partition_indexing error: The number of " "channels should be " "16, 32 or 64 for the hashing IPOLY index function. other banks " "numbers are not supported. Generate it by yourself! \n" && 0' failed.
Why is gpgpu_n_mem only 3 in your config? Is this expected?
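For context, a rough reading of the numbers in the log above (an inference from the assertion text, not a description of the hashing code): with -gpgpu_n_mem 3 and -gpgpu_n_sub_partition_per_mchannel 2 the run reports "Total number of memory sub partition = 6", and neither 3 nor 6 is one of the 16/32/64 channel counts the IPOLY assertion lists as supported, so the check fires on the first hashed address. Getting the tuner to emit a sensible channel count is therefore the prerequisite for the IPOLY configs.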
The gpgpu_n_mem parameter was generated by tuner.py, and it is wrong for TITAN RTX. According to the Turing architecture whitepaper, TU102 (TITAN RTX) has 12 memory controllers, the same as in the RTX2060 config. So how does tuner.py end up with a gpgpu_n_mem of 3 from the GPU_Microbenchmark results? Here is the stats.txt obtained by running
./GPU_Microbenchmark/run_all.sh | tee stats.txt
The output shows:
/////////////////////////////////
running ./mem_config microbenchmark
Global memory size = 24 GB
Memory Clock rate = 7001 Mhz
Memory Bus Width = 384 bit
Memory type = HBM
Memory channels = 3
//Accel_Sim config:
-gpgpu_n_mem 3
-gpgpu_n_mem_per_ctrlr 1
-gpgpu_dram_buswidth 16
-gpgpu_dram_burst_length 2
-dram_data_command_freq_ratio 2
-dram_dual_bus_interface 1
-gpgpu_dram_timing_opt nbk=16:CCD=1:RRD=29:RCD=99:RAS=232:RP=99:RC=330:CL=99:WL=15:CDLR=22:WR=85:nbkgrp=4:CCDL=15:RTPL=29
Why does the microbenchmark give a mistaken number? Here is my newest hw_def file:
// TITAN RTX HW def file
// based on TU102
#ifndef TURING_TITANRTX_DEF_H
#define TURING_TITANRTX_DEF_H
#include "./common/common.h"
#include "./common/deviceQuery.h"
#define L1_SIZE (96 * 1024) // Max L1 size in bytes, NVIDIA-Turing-Architecture-Whitepaper page13
#define CLK_FREQUENCY 1350 // Base frequency in MHz (Boost clock can go up to 1770 MHz)
#define ISSUE_MODEL issue_model::single // single issue core or dual issue
#define CORE_MODEL core_model::subcore // subcore model or shared model:
#define DRAM_MODEL dram_model::GDDR6 // memory type; checked
#define WARP_SCHEDS_PER_SM 4 // number of warp schedulers per SM; NVIDIA-Turing-Architecture-Whitepaper page17: each processing block has one scheduler, and each SM has four processing blocks
// number of SASS HMMA per 16x16 PTX WMMA for the FP16 -> FP32 accumulate operation
#define SASS_hmma_per_PTX_wmma 4
// These vars are almost constant between HW generations
// see slide 24 from Nvidia at
// https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21730-inside-the-nvidia-ampere-architecture.pdf
#define L2_BANKS_PER_MEM_CHANNEL 2 // 6 L2 banks, 3 memory channels
#define L2_BANK_WIDTH_in_BYTE 32 // 32 B * 6 banks = 192 B L2 cache BW
#endif
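My guess from the printed values (not verified against the mem_config source): the microbenchmark reports "Memory type = HBM" even though the hw_def above selects dram_model::GDDR6, and 3 channels is exactly what a 384-bit bus gives if it is split into 128-bit HBM channels, while splitting the same bus into 32-bit GDDR6 channels gives the expected 12. A minimal sketch of that arithmetic (hypothetical, not the tuner's actual code; the per-channel widths are my assumptions):

// hypothetical sanity check, not part of accel-sim
#include <stdio.h>
int main(void) {
    const int bus_width_bits     = 384; // deviceQuery: Memory Bus Width = 384 bit
    const int gddr6_channel_bits = 32;  // assumed width of one GDDR6 channel
    const int hbm_channel_bits   = 128; // assumed width of one HBM channel
    printf("GDDR6 channels: %d\n", bus_width_bits / gddr6_channel_bits); // 12
    printf("HBM channels:   %d\n", bus_width_bits / hbm_channel_bits);   // 3
    return 0;
}

If that reading is right, it would be worth double-checking that the microbenchmarks were rebuilt after editing the hw_def file, so that mem_config actually picks up DRAM_MODEL dram_model::GDDR6.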
Hi, accel-sim developers:
$ nvidia-smi
Tue Jul  9 16:45:54 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN RTX           Off  | 00000000:3B:00.0 Off |                  N/A |
| 44%   41C    P0    58W / 280W |      0MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN RTX           Off  | 00000000:5E:00.0 Off |                  N/A |
| 49%   39C    P0    57W / 280W |      0MiB / 24220MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN RTX           Off  | 00000000:B1:00.0 Off |                  N/A |
| 38%   36C    P0    63W / 280W |      0MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN RTX           Off  | 00000000:D9:00.0 Off |                  N/A |
| 22%   36C    P0    39W / 280W |      0MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
/home/data/userhome/liqiang/lab/gpu/accel-sim-framework/util/job_launching/configs/define-standard-cfgs.yml
# Basefile Configs
# Pascal
TITANX:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM6_TITANX/gpgpusim.config"
TITANXX:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/TITANX-pascal/gpgpusim.config"
# Kepler
TITANK:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM3_KEPLER_TITAN/gpgpusim.config"
# Ampere RTX 3070
RTX3070:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM86_RTX3070/gpgpusim.config"
# Turing
RTX2060:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM75_RTX2060/gpgpusim.config"
# Turing
RTX2060_S:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM75_RTX2060_S/gpgpusim.config"
# Volta
TITANV:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_TITANV/gpgpusim.config"
# Volta
TITANV_OLD:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_TITANV_OLD/gpgpusim.config"
QV100:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_QV100/gpgpusim.config"
QV100_64SM:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_QV100_SMs/gpgpusim.config"
QV100_SASS:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_QV100_SASS/gpgpusim.config"
QV100_old:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM7_QV100_old/gpgpusim.config"
# Fermi
GTX480:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM2_GTX480/gpgpusim.config"

# To keep your configurations straight - we recommend specifying
# if you are using SASS or PTX in the config:
# For example: QV100-SASS or QV100-PTX.
SASS:
    extra_params: "#SASS-Driven Accel-Sim"
PTX:
    extra_params: "#PTX-Driven GPGPU-Sim"
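For reference, run_simulations.py composes a configuration name by splitting it on hyphens: the first token selects a base_file entry from this YAML and each following token appends that entry's extra_params, which is why names like QV100-SASS or TITANRTX-SASS-IPOLY-GTO-32B-FRFCFS work. A TITAN RTX run therefore also needs a base entry along these lines (the path is a placeholder for wherever the tuner-generated gpgpusim.config was placed, not a file that ships with the repo):

# Turing TITAN RTX (user-generated; hypothetical path)
TITANRTX:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM75_TITANRTX/gpgpusim.config"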