accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Segmentation fault on 3080 #49

Closed: mahmoodn closed this issue 3 years ago

mahmoodn commented 3 years ago

Hi. With the dev branch, I have created a trace for parboil/sgemm on a 3080 and everything looks fine. However, the following simulator command fails with a segmentation fault.

$ ./gpu-simulator/bin/release/accel-sim.out -trace ./hw_run/traces/device-0/11.2/sgemm/_i__home_mnaderan_test_input_matrix1_txt__home_mnaderan_test_input_matrix2t_txt__home_mnaderan_test_input_matrix2t_txt__o__home_mnaderan_test_output_matrix3_txt/traces/kernelslist.g -config ./gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM86_RTX3070/gpgpusim.config -config ./gpu-simulator/configs/tested-cfgs/SM86_RTX3070/trace.config
Accel-Sim [build accelsim-commit-16c3068431c7bf46d425aac134947ceb3e6a8f42_modified_3.0]

        *** GPGPU-Sim Simulator Version 4.1.0  [build gpgpu-sim_git-commit-6ad461a95ac71e0597274c4f750ce03bb3a6871e_modified_0.0] ***

GPGPU-Sim: Configuration options:

-save_embedded_ptx                      0 # saves ptx files embedded in binary as <n>.ptx
-keep                                   0 # keep intermediate files created by GPGPU-Sim when interfacing with external programs
-gpgpu_ptx_save_converted_ptxplus                    0 # Saved converted ptxplus to a file
-gpgpu_occupancy_sm_number                   86 # The SM number to pass to ptxas when getting register usage for computing GPU occupancy. This parameter is required in the config.
-ptx_opcode_latency_int           4,4,4,4,21 # Opcode latencies for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,19,25,145,32
-ptx_opcode_latency_fp           4,4,4,4,39 # Opcode latencies for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,30
-ptx_opcode_latency_dp      64,64,64,64,330 # Opcode latencies for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,335
-ptx_opcode_latency_sfu                   21 # Opcode latencies for SFU instructionsDefault 8
-ptx_opcode_latency_tesnor                   64 # Opcode latencies for Tensor instructionsDefault 64
-ptx_opcode_initiation_int            2,2,2,2,2 # Opcode initiation intervals for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,4,4,32,4
-ptx_opcode_initiation_fp            1,1,1,1,2 # Opcode initiation intervals for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,5
-ptx_opcode_initiation_dp      64,64,64,64,130 # Opcode initiation intervals for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,130
-ptx_opcode_initiation_sfu                    8 # Opcode initiation intervals for sfu instructionsDefault 8
-ptx_opcode_initiation_tensor                   64 # Opcode initiation intervals for tensor instructionsDefault 64
-cdp_latency         7200,8000,100,12000,1600 # CDP API latency <cudaStreamCreateWithFlags, cudaGetParameterBufferV2_init_perWarp, cudaGetParameterBufferV2_perKernel, cudaLaunchDeviceV2_init_perWarp, cudaLaunchDevicV2_perKernel>Default 7200,8000,100,12000,1600
-network_mode                           2 # Interconnection network mode
-inter_config_file                   mesh # Interconnection network config file
-icnt_in_buffer_limit                  512 # in_buffer_limit
-icnt_out_buffer_limit                  512 # out_buffer_limit
-icnt_subnets                           2 # subnets
-icnt_arbiter_algo                      1 # arbiter_algo
-icnt_verbose                           0 # inct_verbose
-icnt_grant_cycles                      1 # grant_cycles
-gpgpu_ptx_use_cuobjdump                    1 # Use cuobjdump to extract ptx and sass from binaries
-gpgpu_experimental_lib_support                    0 # Try to extract code from cuda libraries [Broken because of unknown cudaGetExportTable]
-checkpoint_option                      0 #  checkpointing flag (0 = no checkpoint)
-checkpoint_kernel                      1 #  checkpointing during execution of which kernel (1- 1st kernel)
-checkpoint_CTA                         0 #  checkpointing after # of CTA (< less than total CTA)
-resume_option                          0 #  resume flag (0 = no resume)
-resume_kernel                          0 #  Resume from which kernel (1= 1st kernel)
-resume_CTA                             0 #  resume from which CTA
-checkpoint_CTA_t                       0 #  resume from which CTA
-checkpoint_insn_Y                      0 #  resume from which CTA
-gpgpu_ptx_convert_to_ptxplus                    0 # Convert SASS (native ISA) to ptxplus and run ptxplus
-gpgpu_ptx_force_max_capability                   86 # Force maximum compute capability
-gpgpu_ptx_inst_debug_to_file                    0 # Dump executed instructions' debug information to file
-gpgpu_ptx_inst_debug_file       inst_debug.txt # Executed instructions' debug output file
-gpgpu_ptx_inst_debug_thread_uid                    1 # Thread UID for executed instructions' debug output
-gpgpu_simd_model                       1 # 1 = post-dominator
-gpgpu_shader_core_pipeline              1536:32 # shader core pipeline config, i.e., {<nthread>:<warpsize>}
-gpgpu_tex_cache:l1  N:4:128:256,L:R:m:N:L,T:512:8,128:2 # per-shader L1 texture cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>:<rf>}
-gpgpu_const_cache:l1 N:128:64:8,L:R:f:N:L,S:2:64,4 # per-shader L1 constant memory cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:il1     N:64:128:16,L:R:f:N:L,S:2:48,4 # shader L1 instruction cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl1     S:4:128:256,L:T:m:L:L,A:384:48,16:0,32 # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_l1_cache_write_ratio                   25 # L1D write ratio
-gpgpu_l1_banks                         4 # The number of L1 cache banks
-gpgpu_l1_banks_byte_interleaving                   32 # l1 banks byte interleaving granularity
-gpgpu_l1_banks_hashing_function                    0 # l1 banks hashing function
-gpgpu_l1_latency                      39 # L1 Hit Latency
-gpgpu_smem_latency                    29 # smem Latency
-gpgpu_cache:dl1PrefL1                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_cache:dl1PrefShared                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_gmem_skip_L1D                    0 # global memory access skip L1D cache (implements -Xptxas -dlcm=cg, default=no skip)
-gpgpu_perfect_mem                      0 # enable perfect memory mode (no cache miss)
-n_regfile_gating_group                    4 # group of lanes that should be read/written together)
-gpgpu_clock_gated_reg_file                    0 # enable clock gated reg file for power calculations
-gpgpu_clock_gated_lanes                    0 # enable clock gated lanes for power calculations
-gpgpu_shader_registers                65536 # Number of registers per shader core. Limits number of concurrent CTAs. (default 8192)
-gpgpu_registers_per_block                65536 # Maximum number of registers per CTA. (default 8192)
-gpgpu_ignore_resources_limitation                    0 # gpgpu_ignore_resources_limitation (default 0)
-gpgpu_shader_cta                      32 # Maximum number of concurrent CTAs in shader (default 8)
-gpgpu_num_cta_barriers                   16 # Maximum number of named barriers per CTA (default 16)
-gpgpu_n_clusters                      46 # number of processing clusters
-gpgpu_n_cores_per_cluster                    1 # number of simd cores per cluster
-gpgpu_n_cluster_ejection_buffer_size                   32 # number of packets in ejection buffer
-gpgpu_n_ldst_response_buffer_size                    2 # number of response packets in ld/st unit ejection buffer
-gpgpu_shmem_per_block                49152 # Size of shared memory per thread block or CTA (default 48kB)
-gpgpu_shmem_size                  102400 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_option      0,8,16,32,64,100 # Option list of shared memory sizes
-gpgpu_unified_l1d_size                  128 # Size of unified data cache(L1D + shared memory) in KB
-gpgpu_adaptive_cache_config                    1 # adaptive_cache_config
-gpgpu_shmem_sizeDefault               102400 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefL1                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefShared                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_num_banks                   32 # Number of banks in the shared memory in each shader core (default 16)
-gpgpu_shmem_limited_broadcast                    0 # Limit shared memory to do one broadcast per cycle (default on)
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_mem_unit_ports                    1 # The number of memory transactions allowed per core cycle
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_warpdistro_shader                   -1 # Specify which shader core to collect the warp size distribution from
-gpgpu_warp_issue_shader                    0 # Specify which shader core to collect the warp issue distribution from
-gpgpu_local_mem_map                    1 # Mapping from local memory space address to simulated GPU physical address space (default = enabled)
-gpgpu_num_reg_banks                    8 # Number of register banks (default = 8)
-gpgpu_reg_bank_use_warp_id                    0 # Use warp ID in mapping registers to banks (default = off)
-gpgpu_sub_core_model                    1 # Sub Core Volta/Pascal model (default = off)
-gpgpu_enable_specialized_operand_collector                    0 # enable_specialized_operand_collector
-gpgpu_operand_collector_num_units_sp                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_dp                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_sfu                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_int                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_tensor_core                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_mem                    2 # number of collector units (default = 2)
-gpgpu_operand_collector_num_units_gen                    8 # number of collector units (default = 0)
-gpgpu_operand_collector_num_in_ports_sp                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_dp                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_int                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_tensor_core                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_gen                    8 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sp                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_dp                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_int                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_tensor_core                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_gen                    8 # number of collector unit in ports (default = 0)
-gpgpu_coalesce_arch                   86 # Coalescing arch (GT200 = 13, Fermi = 20)
-gpgpu_num_sched_per_core                    4 # Number of warp schedulers per core
-gpgpu_max_insn_issue_per_warp                    1 # Max number of instructions that can be issued per warp in one cycle by scheduler (either 1 or 2)
-gpgpu_dual_issue_diff_exec_units                    1 # should dual issue use two different execution unit resources (Default = 1)
-gpgpu_simt_core_sim_order                    1 # Select the simulation order of cores in a cluster (0=Fix, 1=Round-Robin)
-gpgpu_pipeline_widths 4,4,4,4,4,4,4,4,4,4,8,4,4 # Pipeline widths ID_OC_SP,ID_OC_DP,ID_OC_INT,ID_OC_SFU,ID_OC_MEM,OC_EX_SP,OC_EX_DP,OC_EX_INT,OC_EX_SFU,OC_EX_MEM,EX_WB,ID_OC_TENSOR_CORE,OC_EX_TENSOR_CORE
-gpgpu_tensor_core_avail                    1 # Tensor Core Available (default=0)
-gpgpu_num_sp_units                     4 # Number of SP units (default=1)
-gpgpu_num_dp_units                     4 # Number of DP units (default=0)
-gpgpu_num_int_units                    4 # Number of INT units (default=0)
-gpgpu_num_sfu_units                    4 # Number of SF units (default=1)
-gpgpu_num_tensor_core_units                    4 # Number of tensor_core units (default=1)
-gpgpu_num_mem_units                    1 # Number if ldst units (default=1) WARNING: not hooked up to anything
-gpgpu_scheduler                      gto # Scheduler configuration: < lrr | gto | two_level_active > If two_level_active:<num_active_warps>:<inner_prioritization>:<outer_prioritization>For complete list of prioritization values see shader.h enum scheduler_prioritization_typeDefault: gto
-gpgpu_concurrent_kernel_sm                    0 # Support concurrent kernels on a SM (default = disabled)
-gpgpu_perfect_inst_const_cache                    1 # perfect inst and const cache mode, so all inst and const hits in the cache(default = disabled)
-gpgpu_inst_fetch_throughput                    4 # the number of fetched intruction per warp each cycle
-gpgpu_reg_file_port_throughput                    2 # the number ports of the register file
-specialized_unit_1         1,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_2       1,4,200,4,4,TEX # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_3     1,4,32,4,4,TENSOR # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_4         1,4,4,4,4,UDP # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_5         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_6         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_7         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_8         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-gpgpu_perf_sim_memcpy                    1 # Fill the L2 cache on memcpy
-gpgpu_simple_dram_model                    0 # simple_dram_model with fixed latency and BW
-gpgpu_dram_scheduler                    1 # 0 = fifo, 1 = FR-FCFS (defaul)
-gpgpu_dram_partition_queues          64:64:64:64 # i2$:$2d:d2$:$2i
-l2_ideal                               0 # Use a ideal L2 cache that always hit
-gpgpu_cache:dl2     S:64:128:16,L:B:m:L:P,A:192:4,32:0,32 # unified banked L2 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl2_texture_only                    0 # L2 cache used for texture only
-gpgpu_n_mem                           16 # number of memory modules (e.g. memory controllers) in gpu
-gpgpu_n_sub_partition_per_mchannel                    2 # number of memory subpartition in each memory module
-gpgpu_n_mem_per_ctrlr                    1 # number of memory chips per memory controller
-gpgpu_memlatency_stat                   14 # track and display latency statistics 0x2 enables MC, 0x4 enables queue logs
-gpgpu_frfcfs_dram_sched_queue_size                   64 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_return_queue_size                  192 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_buswidth                    2 # default = 4 bytes (8 bytes per cycle at DDR)
-gpgpu_dram_burst_length                   16 # Burst length of each DRAM request (default = 4 data bus cycle)
-dram_data_command_freq_ratio                    4 # Frequency ratio between DRAM data bus and command bus (default = 2 times, i.e. DDR)
-gpgpu_dram_timing_opt nbk=16:CCD=4:RRD=12:RCD=24:RAS=55:RP=24:RC=78:CL=24:WL=8:CDLR=10:WR=24:nbkgrp=4:CCDL=6:RTPL=4 # DRAM timing parameters = {nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tCDLR:tWR:nbkgrp:tCCDL:tRTPL}
-gpgpu_l2_rop_latency                  187 # ROP queue latency (default 85)
-dram_latency                         254 # DRAM latency (default 30)
-dram_dual_bus_interface                    0 # dual_bus_interface (default = 0)
-dram_bnk_indexing_policy                    0 # dram_bnk_indexing_policy (0 = normal indexing, 1 = Xoring with the higher bits) (Default = 0)
-dram_bnkgrp_indexing_policy                    1 # dram_bnkgrp_indexing_policy (0 = take higher bits, 1 = take lower bits) (Default = 0)
-dram_seperate_write_queue_enable                    0 # Seperate_Write_Queue_Enable
-dram_write_queue_size             32:28:16 # Write_Queue_Size
-dram_elimnate_rw_turnaround                    0 # elimnate_rw_turnaround i.e set tWTR and tRTW = 0
-icnt_flit_size                        40 # icnt_flit_size
-gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RBBBCCCC.BCCSSSSS # mapping memory address to dram model {dramid@<start bit>;<memory address map>}
-gpgpu_mem_addr_test                    0 # run sweep test to check address mapping for aliased address
-gpgpu_mem_address_mask                    1 # 0 = old addressing mask, 1 = new addressing mask, 2 = new add. mask + flipped bank sel and chip sel bits
-gpgpu_memory_partition_indexing                    2 # 0 = no indexing, 1 = bitwise xoring, 2 = IPoly, 3 = custom indexing
-gpuwattch_xml_file         gpuwattch.xml # GPUWattch XML file
-power_simulation_enabled                    0 # Turn on power simulator (1=On, 0=Off)
-power_per_cycle_dump                    0 # Dump detailed power output each cycle
-power_trace_enabled                    0 # produce a file for the power trace (1=On, 0=Off)
-power_trace_zlevel                     6 # Compression level of the power trace output log (0=no comp, 9=highest)
-steady_power_levels_enabled                    0 # produce a file for the steady power levels (1=On, 0=Off)
-steady_state_definition                  8:4 # allowed deviation:number of samples
-gpgpu_max_cycle                        0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_insn                         0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_cta                          0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_completed_cta                    0 # terminates gpu simulation early (0 = no limit)
-gpgpu_runtime_stat                   500 # display runtime statistics such as dram utilization {<freq>:<flag>}
-liveness_message_freq                    1 # Minimum number of seconds between simulation liveness messages (0 = always print)
-gpgpu_compute_capability_major                    8 # Major compute capability version number
-gpgpu_compute_capability_minor                    6 # Minor compute capability version number
-gpgpu_flush_l1_cache                    1 # Flush L1 cache at the end of each kernel call
-gpgpu_flush_l2_cache                    0 # Flush L2 cache at the end of each kernel call
-gpgpu_deadlock_detect                    1 # Stop the simulation at deadlock (1=on (default), 0=off)
-gpgpu_ptx_instruction_classification                    0 # if enabled will classify ptx instruction types per kernel (Max 255 kernels now)
-gpgpu_ptx_sim_mode                     0 # Select between Performance (default) or Functional simulation (1)
-gpgpu_clock_domains 1132:1132:1132:3500.5 # Clock Domain Frequencies in MhZ {<Core Clock>:<ICNT Clock>:<L2 Clock>:<DRAM Clock>}
-gpgpu_max_concurrent_kernel                    8 # maximum kernels that can run concurrently on GPU
-gpgpu_cflog_interval                    0 # Interval between each snapshot in control flow logger
-visualizer_enabled                     0 # Turn on visualizer output (1=On, 0=Off)
-visualizer_outputfile                 NULL # Specifies the output log file for visualizer
-visualizer_zlevel                      6 # Compression level of the visualizer output log (0=no comp, 9=highest)
-gpgpu_stack_size_limit                 1024 # GPU thread stack size
-gpgpu_heap_size_limit              8388608 # GPU malloc heap size
-gpgpu_runtime_sync_depth_limit                    2 # GPU device runtime synchronize depth
-gpgpu_runtime_pending_launch_count_limit                 2048 # GPU device runtime pending launch count
-trace_enabled                          0 # Turn on traces
-trace_components                    none # comma seperated list of traces to enable. Complete list found in trace_streams.tup. Default none
-trace_sampling_core                    0 # The core which is printed using CORE_DPRINTF. Default 0
-trace_sampling_memory_partition                   -1 # The memory partition which is printed using MEMPART_DPRINTF. Default -1 (i.e. all)
-enable_ptx_file_line_stats                    1 # Turn on PTX source line statistic profiling. (1 = On)
-ptx_line_stats_filename gpgpu_inst_stats.txt # Output file for PTX source line statistics.
-gpgpu_kernel_launch_latency                 5000 # Kernel launch latency in cycles. Default: 0
-gpgpu_cdp_enabled                      0 # Turn on CDP
-gpgpu_TB_launch_latency                    0 # thread block launch latency in cycles. Default: 0
-trace               ./hw_run/traces/device-0/11.2/sgemm/_i__home_mnaderan_test_input_matrix1_txt__home_mnaderan_test_input_matrix2t_txt__home_mnaderan_test_input_matrix2t_txt__o__home_mnaderan_test_output_matrix3_txt/traces/kernelslist.g # traces kernel filetraces kernel file directory
-trace_opcode_latency_initiation_int                  4,2 # Opcode latencies and initiation for integers in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sp                  4,1 # Opcode latencies and initiation for sp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_dp                64,64 # Opcode latencies and initiation for dp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sfu                 21,8 # Opcode latencies and initiation for sfu in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_tensor                32,32 # Opcode latencies and initiation for tensor in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_spec_op_1                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_2                200,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_3                32,32 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_4                  4,1 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_5                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_6                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_7                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_8                  4,4 # specialized unit config <latency,initiation>
DRAM Timing Options:
nbk                                    16 # number of banks
CCD                                     4 # column to column delay
RRD                                    12 # minimal delay between activation of rows in different banks
RCD                                    24 # row to column delay
RAS                                    55 # time needed to activate row
RP                                     24 # time needed to precharge (deactivate) row
RC                                     78 # row cycle time
CDLR                                   10 # switching from write to read (changes tWTR)
WR                                     24 # last data-in to row precharge
CL                                     24 # CAS latency
WL                                      8 # Write latency
nbkgrp                                  4 # number of bank groups
CCDL                                    6 # column to column delay between accesses to different bank groups
RTPL                                    4 # read to precharge delay between accesses to different bank groups
Total number of memory sub partition = 32
addr_dec_mask[CHIP]  = 0000000000000f00         high:12 low:8
addr_dec_mask[BK]    = 0000000000070080         high:19 low:7
addr_dec_mask[ROW]   = 00000000fff80000         high:32 low:19
addr_dec_mask[COL]   = 000000000000f07f         high:16 low:0
addr_dec_mask[BURST] = 000000000000001f         high:5 low:0
sub_partition_id_mask = 0000000000000080
GPGPU-Sim uArch: clock freqs: 1132000000.000000:1132000000.000000:1132000000.000000:3500500000.000000
GPGPU-Sim uArch: clock periods: 0.00000000088339222615:0.00000000088339222615:0.00000000088339222615:0.00000000028567347522
*** Initializing Memory Statistics ***
GPGPU-Sim uArch: performance model initialization complete.
launching memcpy command : MemcpyHtoD,0x00007f806c600000,4063232
launching memcpy command : MemcpyHtoD,0x00007f806ca00000,4190208
Processing kernel ./hw_run/traces/device-0/11.2/sgemm/_i__home_mnaderan_test_input_matrix1_txt__home_mnaderan_test_input_matrix2t_txt__home_mnaderan_test_input_matrix2t_txt__o__home_mnaderan_test_output_matrix3_txt/traces/kernel-1.traceg
-kernel name = _Z9mysgemmNTPKfiS0_iPfiiff
-kernel id = 1
-grid dim = (8,66,1)
-block dim = (16,8,1)
-shmem = 512
-nregs = 52
-binary version = 86
-cuda stream id = 0
-shmem base_addr = 0x00007f8094000000
-local mem base_addr = 0x00007f8092000000
-nvbit version = 1.5.3
-accelsim tracer version = 3
launching kernel command : ./hw_run/traces/device-0/11.2/sgemm/_i__home_mnaderan_test_input_matrix1_txt__home_mnaderan_test_input_matrix2t_txt__home_mnaderan_test_input_matrix2t_txt__o__home_mnaderan_test_output_matrix3_txt/traces/kernel-1.traceg
GPGPU-Sim uArch: Shader 0 bind to kernel 1 '_Z9mysgemmNTPKfiS0_iPfiiff'
GPGPU-Sim uArch: CTA/core = 9, limited by: regs
GPGPU-Sim: Reconfigure L1 cache to 120KB
thread block = 0,0,0
Segmentation fault (core dumped)

I would like to narrow down the problem: is it related to an incomplete configuration file, to the trace, or to the simulator itself? NVBit 1.5.3 supports SM86 and is fetched by the Accel-Sim install script. Although I used the 3070 configuration file, I doubt the mismatch between the physical GPU that produced the trace and the simulated configuration is the cause.

mahmoodn commented 3 years ago

OK. It seems that the problem is related to gcc 9.3, similar to this issue.

mahmoodn commented 3 years ago

I found that if GPGPU-Sim is built in debug mode, there is no problem. Only the release build throws the segmentation fault.
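
That debug-works/release-crashes symptom is a classic sign of undefined behavior that only surfaces under optimization. In C++, flowing off the end of a non-void function is undefined behavior: an unoptimized debug build often happens to leave a usable value in the return register, while an optimized release build may treat the missing return as unreachable and hand back a garbage pointer that segfaults when dereferenced. A minimal standalone sketch of the pattern (illustrative only, not Accel-Sim code):

  #include <cstdio>

  struct widget_t {
    int value = 0;
  };

  // Declared to return widget_t*, but there is no return statement.
  // Falling off the end of a non-void function is undefined behavior:
  // a -O0 build may "work" by accident, while -O2/-O3 may return a
  // garbage pointer.
  widget_t *set_value(widget_t *w) {
    w->value = 7;  // the side effect happens; the return is forgotten
  }

  int main() {
    widget_t w;
    widget_t *p = set_value(&w);    // p is indeterminate in release builds
    std::printf("%d\n", p->value);  // may print 7 at -O0, crash at -O2
  }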

mkhairy commented 3 years ago

OK, looks weird. Not sure why it fails with gcc 9.1. It is on our to-do list. If you figure it out, feel free to PR the fix and we will accept it.

mahmoodn commented 3 years ago

Hi @mkhairy. May I know the logic of this function? It is declared to return something, but it actually doesn't return anything:

  trace_warp_inst_t *set_kernel(trace_kernel_info_t *kernel_info) {
    m_kernel_info = kernel_info;
  }

I just want to know whether it is safe to change the return type to void, or whether I am misunderstanding something.
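
If no call site actually consumes the pointer that set_kernel is declared to return (an assumption here; the call sites would need checking), changing the return type to void is the safe fix: the only observable effect of the function is the assignment to m_kernel_info, and removing the bogus return type removes the undefined behavior. A minimal compilable sketch with stand-in types:

  // Stand-in definitions; the real types live in Accel-Sim's
  // trace-driven frontend.
  struct trace_kernel_info_t {};

  class trace_warp_inst_t {
   public:
    // Declared void: there is nothing meaningful to return, so the
    // missing-return undefined behavior disappears.
    void set_kernel(trace_kernel_info_t *kernel_info) {
      m_kernel_info = kernel_info;
    }

   private:
    trace_kernel_info_t *m_kernel_info = nullptr;
  };

  int main() {
    trace_kernel_info_t info;
    trace_warp_inst_t inst;
    inst.set_kernel(&info);  // plain setter call; nothing is returned
  }

If some caller does chain on the result, the alternative is to keep the pointer return type and add an explicit return statement rather than falling off the end of the function.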