accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

Dealing with non 4B aligned address #59

Closed mahmoodn closed 3 years ago

mahmoodn commented 3 years ago

After creating a trace for a workload, I see this assertion error in shader.cc:

assert(
        localaddr % 4 ==
        0);  // Address must be 4B aligned - required if accessing 4B per
             // request, otherwise access will overflow into next thread's space

Based on my debugging, localaddr is 376922898 (0x16776312), which is not 4B aligned since 376922898/4 = 94230724.5. What I am confused about is whose fault this is: the tracer or GPGPU-Sim? I haven't been able to figure it out. Any thoughts?
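
For concreteness, the failing check is just the remainder of the address modulo 4. A minimal standalone sketch, using the value reported above:

#include <cstdio>

int main() {
  unsigned long long localaddr = 376922898ULL;  // 0x16776312, from the assertion
  // shader.cc requires localaddr % 4 == 0; here the remainder is 2, so the
  // address is off 4-byte alignment by two bytes.
  std::printf("%llu %% 4 = %llu -> %s\n", localaddr, localaddr % 4,
              localaddr % 4 == 0 ? "aligned" : "NOT 4B aligned");
  return 0;
}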

mkhairy commented 3 years ago

Could you please send me the offending instruction trace (i.e., the memory instruction line in the trace file that causes this error)?

mahmoodn commented 3 years ago

I see some print functions, but I wasn't able to find out how to enable them.

// Dumps this warp's instruction buffer (ibuffer) contents to fout.
void shd_warp_t::print_ibuffer(FILE *fout) const {
  fprintf(fout, "  ibuffer[%2u] : ", m_warp_id);
  for (unsigned i = 0; i < IBUFFER_SIZE; i++) {
    const inst_t *inst = m_ibuffer[i].m_inst;
    if (inst)
      inst->print_insn(fout);
    else if (m_ibuffer[i].m_valid)
      fprintf(fout, " <invalid instruction> ");
    else
      fprintf(fout, " <empty> ");
  }
  fprintf(fout, "\n");
}
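
As far as I can tell, prints like this are not enabled by a configuration option; they are meant to be invoked from a debugger (or by the .gdbinit macros). A sketch of one way to call it, assuming you are stopped at a breakpoint where some shd_warp_t pointer, here the hypothetical warp, is in scope:

(gdb) call warp->print_ibuffer(stdout)

Depending on the debug info, the stream may need a cast, e.g. call warp->print_ibuffer((FILE *) stdout).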

The command I use is

./gpu-simulator/bin/debug/accel-sim.out -trace /path/to/kernelslist.g -config ./gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM86_RTX3070/gpgpusim.config -config ./gpu-simulator/configs/tested-cfgs/SM86_RTX3070/trace.config
mahmoodn commented 3 years ago

I tried to use the .gdbinit macros, and according to this output the file has been read by GDB. However, as you can see, no debug information is written to the screen. I also searched for a newly created log file but couldn't find one.

$ gdb --args ~/accel-sim-framework/gpu-simulator/bin/debug/accel-sim.out -config ./gpgpusim.config -trace ./traces/kernelslist.g
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/mahmood/accel-sim-framework/gpu-simulator/bin/debug/accel-sim.out...

  ** loading GPGPU-Sim debugging macros... **

(gdb) r
Starting program: /home/mahmood/accel-sim-framework/gpu-simulator/bin/debug/accel-sim.out -config ./gpgpusim.config -trace ./traces/kernelslist.g
warning: the debug information found in "/lib64/ld-2.31.so" does not match "/lib64/ld-linux-x86-64.so.2" (CRC mismatch).

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Accel-Sim [build accelsim-commit-16c3068431c7bf46d425aac134947ceb3e6a8f42_modified_11.0]

        *** GPGPU-Sim Simulator Version 4.1.0  [build gpgpu-sim_git-commit-6ad461a95ac71e0597274c4f750ce03bb3a6871e_modified_3.0] ***

GPGPU-Sim: Configuration options:

-save_embedded_ptx                      0 # saves ptx files embedded in binary as <n>.ptx
-keep                                   0 # keep intermediate files created by GPGPU-Sim when interfacing with external programs
-gpgpu_ptx_save_converted_ptxplus                    0 # Saved converted ptxplus to a file
-gpgpu_occupancy_sm_number                   86 # The SM number to pass to ptxas when getting register usage for computing GPU occupancy. This parameter is required in the config.
-ptx_opcode_latency_int           4,4,4,4,21 # Opcode latencies for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,19,25,145,32
-ptx_opcode_latency_fp           4,4,4,4,39 # Opcode latencies for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,30
-ptx_opcode_latency_dp      64,64,64,64,330 # Opcode latencies for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,335
-ptx_opcode_latency_sfu                   21 # Opcode latencies for SFU instructionsDefault 8
-ptx_opcode_latency_tesnor                   64 # Opcode latencies for Tensor instructionsDefault 64
-ptx_opcode_initiation_int            2,2,2,2,2 # Opcode initiation intervals for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,4,4,32,4
-ptx_opcode_initiation_fp            1,1,1,1,2 # Opcode initiation intervals for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,5
-ptx_opcode_initiation_dp      64,64,64,64,130 # Opcode initiation intervals for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,130
-ptx_opcode_initiation_sfu                    8 # Opcode initiation intervals for sfu instructionsDefault 8
-ptx_opcode_initiation_tensor                   64 # Opcode initiation intervals for tensor instructionsDefault 64
-cdp_latency         7200,8000,100,12000,1600 # CDP API latency <cudaStreamCreateWithFlags, cudaGetParameterBufferV2_init_perWarp, cudaGetParameterBufferV2_perKernel, cudaLaunchDeviceV2_init_perWarp, cudaLaunchDevicV2_perKernel>Default 7200,8000,100,12000,1600
-network_mode                           2 # Interconnection network mode
-inter_config_file                   mesh # Interconnection network config file
-icnt_in_buffer_limit                  512 # in_buffer_limit
-icnt_out_buffer_limit                  512 # out_buffer_limit
-icnt_subnets                           2 # subnets
-icnt_arbiter_algo                      1 # arbiter_algo
-icnt_verbose                           0 # inct_verbose
-icnt_grant_cycles                      1 # grant_cycles
-gpgpu_ptx_use_cuobjdump                    1 # Use cuobjdump to extract ptx and sass from binaries
-gpgpu_experimental_lib_support                    0 # Try to extract code from cuda libraries [Broken because of unknown cudaGetExportTable]
-checkpoint_option                      0 #  checkpointing flag (0 = no checkpoint)
-checkpoint_kernel                      1 #  checkpointing during execution of which kernel (1- 1st kernel)
-checkpoint_CTA                         0 #  checkpointing after # of CTA (< less than total CTA)
-resume_option                          0 #  resume flag (0 = no resume)
-resume_kernel                          0 #  Resume from which kernel (1= 1st kernel)
-resume_CTA                             0 #  resume from which CTA
-checkpoint_CTA_t                       0 #  resume from which CTA
-checkpoint_insn_Y                      0 #  resume from which CTA
-gpgpu_ptx_convert_to_ptxplus                    0 # Convert SASS (native ISA) to ptxplus and run ptxplus
-gpgpu_ptx_force_max_capability                   86 # Force maximum compute capability
-gpgpu_ptx_inst_debug_to_file                    0 # Dump executed instructions' debug information to file
-gpgpu_ptx_inst_debug_file       inst_debug.txt # Executed instructions' debug output file
-gpgpu_ptx_inst_debug_thread_uid                    1 # Thread UID for executed instructions' debug output
-gpgpu_simd_model                       1 # 1 = post-dominator
-gpgpu_shader_core_pipeline              1536:32 # shader core pipeline config, i.e., {<nthread>:<warpsize>}
-gpgpu_tex_cache:l1  N:4:128:256,L:R:m:N:L,T:512:8,128:2 # per-shader L1 texture cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>:<rf>}
-gpgpu_const_cache:l1 N:128:64:8,L:R:f:N:L,S:2:64,4 # per-shader L1 constant memory cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:il1     N:64:128:16,L:R:f:N:L,S:2:48,4 # shader L1 instruction cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl1     S:4:128:256,L:T:m:L:L,A:384:48,16:0,32 # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_l1_cache_write_ratio                   25 # L1D write ratio
-gpgpu_l1_banks                         4 # The number of L1 cache banks
-gpgpu_l1_banks_byte_interleaving                   32 # l1 banks byte interleaving granularity
-gpgpu_l1_banks_hashing_function                    0 # l1 banks hashing function
-gpgpu_l1_latency                      39 # L1 Hit Latency
-gpgpu_smem_latency                    29 # smem Latency
-gpgpu_cache:dl1PrefL1                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_cache:dl1PrefShared                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_gmem_skip_L1D                    0 # global memory access skip L1D cache (implements -Xptxas -dlcm=cg, default=no skip)
-gpgpu_perfect_mem                      0 # enable perfect memory mode (no cache miss)
-n_regfile_gating_group                    4 # group of lanes that should be read/written together)
-gpgpu_clock_gated_reg_file                    0 # enable clock gated reg file for power calculations
-gpgpu_clock_gated_lanes                    0 # enable clock gated lanes for power calculations
-gpgpu_shader_registers                65536 # Number of registers per shader core. Limits number of concurrent CTAs. (default 8192)
-gpgpu_registers_per_block                65536 # Maximum number of registers per CTA. (default 8192)
-gpgpu_ignore_resources_limitation                    0 # gpgpu_ignore_resources_limitation (default 0)
-gpgpu_shader_cta                      32 # Maximum number of concurrent CTAs in shader (default 8)
-gpgpu_num_cta_barriers                   16 # Maximum number of named barriers per CTA (default 16)
-gpgpu_n_clusters                      46 # number of processing clusters
-gpgpu_n_cores_per_cluster                    1 # number of simd cores per cluster
-gpgpu_n_cluster_ejection_buffer_size                   32 # number of packets in ejection buffer
-gpgpu_n_ldst_response_buffer_size                    2 # number of response packets in ld/st unit ejection buffer
-gpgpu_shmem_per_block                49152 # Size of shared memory per thread block or CTA (default 48kB)
-gpgpu_shmem_size                  102400 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_option      0,8,16,32,64,100 # Option list of shared memory sizes
-gpgpu_unified_l1d_size                  128 # Size of unified data cache(L1D + shared memory) in KB
-gpgpu_adaptive_cache_config                    1 # adaptive_cache_config
-gpgpu_shmem_sizeDefault               102400 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefL1                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefShared                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_num_banks                   32 # Number of banks in the shared memory in each shader core (default 16)
-gpgpu_shmem_limited_broadcast                    0 # Limit shared memory to do one broadcast per cycle (default on)
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_mem_unit_ports                    1 # The number of memory transactions allowed per core cycle
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_warpdistro_shader                   -1 # Specify which shader core to collect the warp size distribution from
-gpgpu_warp_issue_shader                    0 # Specify which shader core to collect the warp issue distribution from
-gpgpu_local_mem_map                    1 # Mapping from local memory space address to simulated GPU physical address space (default = enabled)
-gpgpu_num_reg_banks                    8 # Number of register banks (default = 8)
-gpgpu_reg_bank_use_warp_id                    0 # Use warp ID in mapping registers to banks (default = off)
-gpgpu_sub_core_model                    1 # Sub Core Volta/Pascal model (default = off)
-gpgpu_enable_specialized_operand_collector                    0 # enable_specialized_operand_collector
-gpgpu_operand_collector_num_units_sp                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_dp                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_sfu                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_int                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_tensor_core                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_mem                    2 # number of collector units (default = 2)
-gpgpu_operand_collector_num_units_gen                    8 # number of collector units (default = 0)
-gpgpu_operand_collector_num_in_ports_sp                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_dp                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_int                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_tensor_core                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_gen                    8 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sp                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_dp                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_int                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_tensor_core                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_gen                    8 # number of collector unit in ports (default = 0)
-gpgpu_coalesce_arch                   86 # Coalescing arch (GT200 = 13, Fermi = 20)
-gpgpu_num_sched_per_core                    4 # Number of warp schedulers per core
-gpgpu_max_insn_issue_per_warp                    1 # Max number of instructions that can be issued per warp in one cycle by scheduler (either 1 or 2)
-gpgpu_dual_issue_diff_exec_units                    1 # should dual issue use two different execution unit resources (Default = 1)
-gpgpu_simt_core_sim_order                    1 # Select the simulation order of cores in a cluster (0=Fix, 1=Round-Robin)
-gpgpu_pipeline_widths 4,4,4,4,4,4,4,4,4,4,8,4,4 # Pipeline widths ID_OC_SP,ID_OC_DP,ID_OC_INT,ID_OC_SFU,ID_OC_MEM,OC_EX_SP,OC_EX_DP,OC_EX_INT,OC_EX_SFU,OC_EX_MEM,EX_WB,ID_OC_TENSOR_CORE,OC_EX_TENSOR_CORE
-gpgpu_tensor_core_avail                    1 # Tensor Core Available (default=0)
-gpgpu_num_sp_units                     4 # Number of SP units (default=1)
-gpgpu_num_dp_units                     4 # Number of DP units (default=0)
-gpgpu_num_int_units                    4 # Number of INT units (default=0)
-gpgpu_num_sfu_units                    4 # Number of SF units (default=1)
-gpgpu_num_tensor_core_units                    4 # Number of tensor_core units (default=1)
-gpgpu_num_mem_units                    1 # Number if ldst units (default=1) WARNING: not hooked up to anything
-gpgpu_scheduler                      gto # Scheduler configuration: < lrr | gto | two_level_active > If two_level_active:<num_active_warps>:<inner_prioritization>:<outer_prioritization>For complete list of prioritization values see shader.h enum scheduler_prioritization_typeDefault: gto
-gpgpu_concurrent_kernel_sm                    0 # Support concurrent kernels on a SM (default = disabled)
-gpgpu_perfect_inst_const_cache                    1 # perfect inst and const cache mode, so all inst and const hits in the cache(default = disabled)
-gpgpu_inst_fetch_throughput                    4 # the number of fetched intruction per warp each cycle
-gpgpu_reg_file_port_throughput                    2 # the number ports of the register file
-specialized_unit_1         1,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_2       1,4,200,4,4,TEX # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_3     1,4,32,4,4,TENSOR # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_4         1,4,4,4,4,UDP # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_5         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_6         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_7         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_8         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-gpgpu_perf_sim_memcpy                    1 # Fill the L2 cache on memcpy
-gpgpu_simple_dram_model                    0 # simple_dram_model with fixed latency and BW
-gpgpu_dram_scheduler                    1 # 0 = fifo, 1 = FR-FCFS (defaul)
-gpgpu_dram_partition_queues          64:64:64:64 # i2$:$2d:d2$:$2i
-l2_ideal                               0 # Use a ideal L2 cache that always hit
-gpgpu_cache:dl2     S:64:128:16,L:B:m:L:P,A:192:4,32:0,32 # unified banked L2 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl2_texture_only                    0 # L2 cache used for texture only
-gpgpu_n_mem                           16 # number of memory modules (e.g. memory controllers) in gpu
-gpgpu_n_sub_partition_per_mchannel                    2 # number of memory subpartition in each memory module
-gpgpu_n_mem_per_ctrlr                    1 # number of memory chips per memory controller
-gpgpu_memlatency_stat                   14 # track and display latency statistics 0x2 enables MC, 0x4 enables queue logs
-gpgpu_frfcfs_dram_sched_queue_size                   64 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_return_queue_size                  192 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_buswidth                    2 # default = 4 bytes (8 bytes per cycle at DDR)
-gpgpu_dram_burst_length                   16 # Burst length of each DRAM request (default = 4 data bus cycle)
-dram_data_command_freq_ratio                    4 # Frequency ratio between DRAM data bus and command bus (default = 2 times, i.e. DDR)
-gpgpu_dram_timing_opt nbk=16:CCD=4:RRD=12:RCD=24:RAS=55:RP=24:RC=78:CL=24:WL=8:CDLR=10:WR=24:nbkgrp=4:CCDL=6:RTPL=4 # DRAM timing parameters = {nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tCDLR:tWR:nbkgrp:tCCDL:tRTPL}
-gpgpu_l2_rop_latency                  187 # ROP queue latency (default 85)
-dram_latency                         254 # DRAM latency (default 30)
-dram_dual_bus_interface                    0 # dual_bus_interface (default = 0)
-dram_bnk_indexing_policy                    0 # dram_bnk_indexing_policy (0 = normal indexing, 1 = Xoring with the higher bits) (Default = 0)
-dram_bnkgrp_indexing_policy                    1 # dram_bnkgrp_indexing_policy (0 = take higher bits, 1 = take lower bits) (Default = 0)
-dram_seperate_write_queue_enable                    0 # Seperate_Write_Queue_Enable
-dram_write_queue_size             32:28:16 # Write_Queue_Size
-dram_elimnate_rw_turnaround                    0 # elimnate_rw_turnaround i.e set tWTR and tRTW = 0
-icnt_flit_size                        40 # icnt_flit_size
-gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RBBBCCCC.BCCSSSSS # mapping memory address to dram model {dramid@<start bit>;<memory address map>}
-gpgpu_mem_addr_test                    0 # run sweep test to check address mapping for aliased address
-gpgpu_mem_address_mask                    1 # 0 = old addressing mask, 1 = new addressing mask, 2 = new add. mask + flipped bank sel and chip sel bits
-gpgpu_memory_partition_indexing                    2 # 0 = no indexing, 1 = bitwise xoring, 2 = IPoly, 3 = custom indexing
-gpuwattch_xml_file         gpuwattch.xml # GPUWattch XML file
-power_simulation_enabled                    0 # Turn on power simulator (1=On, 0=Off)
-power_per_cycle_dump                    0 # Dump detailed power output each cycle
-power_trace_enabled                    0 # produce a file for the power trace (1=On, 0=Off)
-power_trace_zlevel                     6 # Compression level of the power trace output log (0=no comp, 9=highest)
-steady_power_levels_enabled                    0 # produce a file for the steady power levels (1=On, 0=Off)
-steady_state_definition                  8:4 # allowed deviation:number of samples
-gpgpu_max_cycle                        0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_insn                         0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_cta                          0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_completed_cta                    0 # terminates gpu simulation early (0 = no limit)
-gpgpu_runtime_stat                   500 # display runtime statistics such as dram utilization {<freq>:<flag>}
-liveness_message_freq                    1 # Minimum number of seconds between simulation liveness messages (0 = always print)
-gpgpu_compute_capability_major                    8 # Major compute capability version number
-gpgpu_compute_capability_minor                    6 # Minor compute capability version number
-gpgpu_flush_l1_cache                    1 # Flush L1 cache at the end of each kernel call
-gpgpu_flush_l2_cache                    0 # Flush L2 cache at the end of each kernel call
-gpgpu_deadlock_detect                    1 # Stop the simulation at deadlock (1=on (default), 0=off)
-gpgpu_ptx_instruction_classification                    0 # if enabled will classify ptx instruction types per kernel (Max 255 kernels now)
-gpgpu_ptx_sim_mode                     0 # Select between Performance (default) or Functional simulation (1)
-gpgpu_clock_domains 1132:1132:1132:3500.5 # Clock Domain Frequencies in MhZ {<Core Clock>:<ICNT Clock>:<L2 Clock>:<DRAM Clock>}
-gpgpu_max_concurrent_kernel                    8 # maximum kernels that can run concurrently on GPU
-gpgpu_cflog_interval                    0 # Interval between each snapshot in control flow logger
-visualizer_enabled                     0 # Turn on visualizer output (1=On, 0=Off)
-visualizer_outputfile                 NULL # Specifies the output log file for visualizer
-visualizer_zlevel                      6 # Compression level of the visualizer output log (0=no comp, 9=highest)
-gpgpu_stack_size_limit                 1024 # GPU thread stack size
-gpgpu_heap_size_limit              8388608 # GPU malloc heap size
-gpgpu_runtime_sync_depth_limit                    2 # GPU device runtime synchronize depth
-gpgpu_runtime_pending_launch_count_limit                 2048 # GPU device runtime pending launch count
-trace_enabled                          0 # Turn on traces
-trace_components                    none # comma seperated list of traces to enable. Complete list found in trace_streams.tup. Default none
-trace_sampling_core                    0 # The core which is printed using CORE_DPRINTF. Default 0
-trace_sampling_memory_partition                   -1 # The memory partition which is printed using MEMPART_DPRINTF. Default -1 (i.e. all)
-enable_ptx_file_line_stats                    1 # Turn on PTX source line statistic profiling. (1 = On)
-ptx_line_stats_filename gpgpu_inst_stats.txt # Output file for PTX source line statistics.
-gpgpu_kernel_launch_latency                 5000 # Kernel launch latency in cycles. Default: 0
-gpgpu_cdp_enabled                      0 # Turn on CDP
-gpgpu_TB_launch_latency                    0 # thread block launch latency in cycles. Default: 0
-trace               ./traces/kernelslist.g # traces kernel filetraces kernel file directory
-trace_opcode_latency_initiation_int                  4,2 # Opcode latencies and initiation for integers in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sp                  4,1 # Opcode latencies and initiation for sp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_dp                64,64 # Opcode latencies and initiation for dp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sfu                 21,8 # Opcode latencies and initiation for sfu in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_tensor                32,32 # Opcode latencies and initiation for tensor in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_spec_op_1                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_2                200,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_3                32,32 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_4                  4,1 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_5                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_6                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_7                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_8                  4,4 # specialized unit config <latency,initiation>
DRAM Timing Options:
nbk                                    16 # number of banks
CCD                                     4 # column to column delay
RRD                                    12 # minimal delay between activation of rows in different banks
RCD                                    24 # row to column delay
RAS                                    55 # time needed to activate row
RP                                     24 # time needed to precharge (deactivate) row
RC                                     78 # row cycle time
CDLR                                   10 # switching from write to read (changes tWTR)
WR                                     24 # last data-in to row precharge
CL                                     24 # CAS latency
WL                                      8 # Write latency
nbkgrp                                  4 # number of bank groups
CCDL                                    6 # column to column delay between accesses to different bank groups
RTPL                                    4 # read to precharge delay between accesses to different bank groups
Total number of memory sub partition = 32
addr_dec_mask[CHIP]  = 0000000000000f00         high:12 low:8
addr_dec_mask[BK]    = 0000000000070080         high:19 low:7
addr_dec_mask[ROW]   = 00000000fff80000         high:32 low:19
addr_dec_mask[COL]   = 000000000000f07f         high:16 low:0
addr_dec_mask[BURST] = 000000000000001f         high:5 low:0
sub_partition_id_mask = 0000000000000080
GPGPU-Sim uArch: clock freqs: 1132000000.000000:1132000000.000000:1132000000.000000:3500500000.000000
GPGPU-Sim uArch: clock periods: 0.00000000088339222615:0.00000000088339222615:0.00000000088339222615:0.00000000028567347522
*** Initializing Memory Statistics ***
GPGPU-Sim uArch: performance model initialization complete.
launching memcpy command : MemcpyHtoD,0x00007f75a0000000,95789392
launching memcpy command : MemcpyHtoD,0x00007f7592000000,230834496
launching memcpy command : MemcpyHtoD,0x00007f758c000000,95789392
launching memcpy command : MemcpyHtoD,0x00007f757e000000,230834496
launching memcpy command : MemcpyHtoD,0x00007f761c000000,23947347
launching memcpy command : MemcpyHtoD,0x00007f757a000000,57708624
launching memcpy command : MemcpyHtoD,0x00007f7638101000,28
launching memcpy command : MemcpyHtoD,0x00007f7638101600,491520
launching memcpy command : MemcpyHtoD,0x00007f7638100000,688
launching memcpy command : MemcpyHtoD,0x00007f7638100600,4
launching memcpy command : MemcpyHtoD,0x00007f7638100800,4
launching memcpy command : MemcpyHtoD,0x00007f7638101000,28
Processing kernel ./traces/kernel-4.traceg
-kernel name = foo
-kernel id = 4
-grid dim = (1,1,1)
-block dim = (512,1,1)
-shmem = 0
-nregs = 18
-binary version = 86
-cuda stream id = 93840529697616
-shmem base_addr = 0x00007f767f000000
-local mem base_addr = 0x00007f767d000000
-nvbit version = 1.5.3
-accelsim tracer version = 3
launching kernel command : ./traces/kernel-4.traceg
GPGPU-Sim uArch: Shader 0 bind to kernel 1 'foo'
GPGPU-Sim uArch: CTA/core = 3, limited by: threads
GPGPU-Sim: Reconfigure L1 cache to 128KB
thread block = 0,0,0
accel-sim.out: shader.cc:1656: unsigned int shader_core_ctx::translate_local_memaddr(address_type, unsigned int, unsigned int, unsigned int, new_addr_type*): Assertion `localaddr % 4 == 0' failed.

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb)
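
As an aside, a likely explanation: the macros loaded from GPGPU-Sim's .gdbinit are user-defined GDB commands, so they produce output only when invoked by name at the (gdb) prompt rather than logging automatically. GDB's built-in way to list whatever was loaded is:

(gdb) help user-defined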
mkhairy commented 3 years ago

What I meant is to figure out the instruction that is causing this error. An easier way to do that is to add a print statement here: https://github.com/accel-sim/accel-sim-framework/blob/4c2bf09a79d6b57bb10fe1898700930a5dd5531f/gpu-simulator/trace-driven/trace_driven.cc#L33

Print each instruction to stdout and run your program; when it fails, look at which instructions were printed and find the last local memory instruction among them. You should see a local load (LDL.E.SYS) or a generic memory access (LD). Could you please send us that instruction's text line from the traces?
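
A minimal form of that print could look like this (a sketch only; the field names match those used in the follow-up below, and it assumes #include <iostream> at the top of trace_driven.cc):

    // Right after new_inst->parse_from_trace_struct(...):
    std::cout << "PC 0x" << std::hex << warp_traces[trace_pc].m_pc << std::dec
              << " " << warp_traces[trace_pc].opcode << std::endl;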

mahmoodn commented 3 years ago

OK. I wrote:

    new_inst->parse_from_trace_struct(
        warp_traces[trace_pc], m_kernel_info->OpcodeMap,
        m_kernel_info->m_tconfig, m_kernel_info->m_kernel_trace_info);
    // Debug print: PC, opcode, first destination register, and (for memory
    // instructions) the per-thread addresses of the parsed trace entry.
    // Note that std::hex is sticky, so m_pc and the addresses below are
    // printed in hexadecimal.
    std::cout << "trace_pc=" << trace_pc 
              << std::hex << " warp_traces[trace_pc].m_pc=" << warp_traces[trace_pc].m_pc 
              << " warp_traces[trace_pc].opcode=" << warp_traces[trace_pc].opcode 
              << " warp_traces[trace_pc].reg_dest[0]=" << warp_traces[trace_pc].reg_dest[0] 
              << " -> ";

    if (warp_traces[trace_pc].memadd_info != NULL) {
        // Addresses of each access in the memory instruction, in hex.
        for (int i = 0; i < warp_traces[trace_pc].memadd_info->width; i++) {
            std::cout << std::hex << warp_traces[trace_pc].memadd_info->addrs[i] << " ";
        }
    }
    std::cout << "\n";

    trace_pc++;

In the output I see

trace_pc=4 warp_traces[trace_pc].m_pc=40 warp_traces[trace_pc].opcode=IMAD warp_traces[trace_pc].reg_dest[0]=0 -> 
trace_pc=5 warp_traces[trace_pc].m_pc=50 warp_traces[trace_pc].opcode=ISETP.GE.U32.AND warp_traces[trace_pc].reg_dest[0]=0 -> 
trace_pc=6 warp_traces[trace_pc].m_pc=60 warp_traces[trace_pc].opcode=STL warp_traces[trace_pc].reg_dest[0]=0 -> 16776312 16776312 16776312 16776312 
trace_pc=7 warp_traces[trace_pc].m_pc=70 warp_traces[trace_pc].opcode=EXIT warp_traces[trace_pc].reg_dest[0]=0 -> 
*** Program received signal SIGABRT (Aborted) ***
[Mahmood:shader.cc:1655] 376922898
accel-sim.out: shader.cc:1657: unsigned int shader_core_ctx::translate_local_memaddr(address_type, unsigned int, unsigned int, unsigned int, new_addr_type*): Assertion `localaddr % 4 == 0' failed.

Please note that the addresses above were printed with std::hex, so 0x16776312 is the same as 376922898 in decimal. In the traceg file, the instruction is

-kernel id = 4
-grid dim = (1,1,1)
-block dim = (512,1,1)
-shmem = 0
-nregs = 18
-binary version = 86
-cuda stream id = -685638256
-shmem base_addr = 0x00007f9e58000000
-local mem base_addr = 0x00007f9e56000000
-nvbit version = 1.5.3
-accelsim tracer version = 3

#traces format = threadblock_x threadblock_y threadblock_z warpid_tb PC mask dest_num [reg_dests] opcode src_num [reg_srcs] mem_width [adrrescompress?] [mem_addresses]

#BEGIN_TB

thread block = 0,0,0

warp = 0
insts = 41
0000 ffffffff 1 R1 IMAD.MOV.U32 2 R255 R255 0 
0010 ffffffff 1 R0 S2R 0 0 
0020 ffffffff 1 R1 IADD3 2 R1 R255 0 
0030 ffffffff 1 R3 S2R 0 0 
0040 ffffffff 1 R0 IMAD 2 R0 R3 0 
0050 ffffffff 0 ISETP.GE.U32.AND 1 R0 0 
0060 ffffffff 0 STL 2 R1 R0 4 1 0xfffc78 0 
0070 fffffffe 0 EXIT 0 0 
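
Reading that STL line against the #traces format header, a quick decode and alignment check (a sketch; the base+stride reading of address-compress mode 1 is my interpretation of the format line above):

#include <cstdio>

int main() {
  // 0060 ffffffff 0 STL 2 R1 R0 4 1 0xfffc78 0
  // PC=0x60, full 32-thread active mask, 0 dest regs, opcode STL,
  // 2 src regs (R1, R0), mem_width = 4 bytes, then compress mode 1:
  // a base address followed by a stride (stride 0, i.e. every active
  // thread stores to the same address).
  unsigned long long base = 0xfffc78ULL;  // 16776312 in decimal
  std::printf("base %% 4 = %llu\n", base % 4);  // prints 0: 4B aligned
  return 0;
}

Note that 0xfffc78 is 16776312 in decimal, while the failing address 376922898 is 0x16776312 in hex; this particular trace line is 4B aligned and does not match the assertion's address.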
mahmoodn commented 3 years ago

@mkhairy You may want to look at the traceg file, too. In the kernelslist.g file, you may want to remove kernel-1, 2, and 3 in order to keep only kernel-4.traceg. Files are available here. I am sending the files because I think the memory allocation addresses affect the simulator. If you have any tips for further debugging, please let me know and I will check them.

mahmoodn commented 3 years ago

Sorry, I think I sent the wrong kernel trace. I reran the tracer and saw that the STL does have a non-4B-aligned address. I will try NVBit itself to see whether the problem is in the tracer tool that post-processes the NVBit output or in NVBit itself. For now, I am closing the issue. Thanks for the feedback.