accel-sim / accel-sim-framework

This is the top-level repository for the Accel-Sim framework.
https://accel-sim.github.io

L1D stats are not counted correctly #157

Closed mahmoodn closed 1 year ago

mahmoodn commented 1 year ago

Following this topic, I have created a simple test case and attached the trace file (1 warp, 16 instructions), the config files (1 SM), and the output file (showing the L1D and L2 printfs). In this example, the L1D stats are wrong, so let's dig into this first.
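For context, the trace comes from a tiny one-warp kernel (the real source is in test.zip). Judging from the addresses in the log below, a hypothetical kernel with the same shape, one warp reading two 128-byte lines sector by sector and writing one of them back, would look roughly like this (names and launch parameters are my own, not taken from the attachment):

#include <cuda_runtime.h>

// Hypothetical reconstruction of the test case (the real kernel is in test.zip):
// one warp of 32 threads, each reading a float from a[] and b[] (two 128-byte
// lines, i.e. 4 sectors each) and writing the sum back to b[] (4 sector writes
// that should hit in the L1D, since b[] was just loaded).
__global__ void add_one_warp(const float *a, float *b) {
  int i = threadIdx.x;   // single block, single warp
  b[i] = a[i] + b[i];    // 2 global loads + 1 global store per thread
}

int main() {
  float *a, *b;
  cudaMalloc(&a, 32 * sizeof(float));
  cudaMalloc(&b, 32 * sizeof(float));
  add_one_warp<<<1, 32>>>(a, b);   // 1 block, 1 warp, as in the attached trace
  cudaDeviceSynchronize();
  cudaFree(a);
  cudaFree(b);
  return 0;
}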

gpu-cache.cc

enum cache_request_status tag_array::access(new_addr_type addr, unsigned time,
                                            unsigned &idx, bool &wb,
                                            evicted_block_info &evicted,
                                            mem_fetch *mf) {
  m_access++;
  is_used = true;
  printf("[gpu-cache.cc] %llu %llx\n",    time,    mf->get_addr());
  fflush(stdout);

and

l2cache.cc

      if (!output_full && port_free) {
        std::list<cache_event> events;
        enum cache_request_status status =
            m_L2cache->access(mf->get_addr(), mf,
                              m_gpu->gpu_sim_cycle + m_gpu->gpu_tot_sim_cycle +
                                  m_memcpy_cycle_offset,
                              events);
        printf("[l2ache.cc] %llu %llx\n",
                m_gpu->gpu_sim_cycle + m_gpu->gpu_tot_sim_cycle,      mf->get_addr());
        fflush(stdout);

As you can see in the output file,

L1I_cache:
    L1I_total_cache_accesses = 0
L1D_cache:
    L1D_cache_core[0]: Access = 12, Miss = 8, Miss_rate = 0.667, Pending_hits = 0, Reservation_fails = 0
    L1D_total_cache_accesses = 12
    L1D_total_cache_misses = 8
    L1D_total_cache_miss_rate = 0.6667
    L1D_total_cache_pending_hits = 0
    L1D_total_cache_reservation_fails = 0
    L1D_cache_data_port_util = 0.005
    L1D_cache_fill_port_util = 0.010
L1C_cache:
    L1C_total_cache_accesses = 0
L1T_cache:
    L1T_total_cache_accesses = 0
========= L2 cache stats =========
L2_cache_bank[0]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[1]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[2]: Access = 4, Miss = 4, Miss_rate = 1.000, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[3]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[4]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[5]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[6]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[7]: Access = 8, Miss = 4, Miss_rate = 0.500, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[8]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[9]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[10]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[11]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[12]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[13]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[14]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_cache_bank[15]: Access = 0, Miss = 0, Miss_rate = -nan, Pending_hits = 0, Reservation_fails = 0
L2_total_cache_accesses = 12
L2_total_cache_misses = 8

As you can see, the L1D misses are 8, but the L2 accesses are 12. Also, counting the printf instances shows that the 12 accesses to L2 are correct, but access() is printed 24 times, so the L1D accesses appear to be 24, not 12.
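Worth noting (my reading of the instrumentation, not a confirmed conclusion): the printf sits in tag_array::access(), and tag_array is shared by the L1D and the L2 caches, so the 24 [gpu-cache.cc] lines below would be 12 L1D tag lookups plus 12 L2 tag lookups rather than 24 L1D accesses. A small sketch of that accounting:

#include <cassert>

// Counts taken from the stats above; the split between L1D and L2 tag lookups
// is an assumption based on tag_array::access() being used by both caches.
int main() {
  const int l1d_tag_lookups = 12;  // L1D_total_cache_accesses
  const int l2_tag_lookups  = 12;  // L2_total_cache_accesses
  // Every lookup goes through tag_array::access(), hence 24 printf lines.
  assert(l1d_tag_lookups + l2_tag_lookups == 24);
  return 0;
}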

[gpu-cache.cc] 5079 7f65e1000000
[gpu-cache.cc] 5079 7f65e1000020
[gpu-cache.cc] 5079 7f65e1000040
[gpu-cache.cc] 5079 7f65e1000060
[gpu-cache.cc] 5081 7f6607e00000
[gpu-cache.cc] 5081 7f6607e00020
[gpu-cache.cc] 5081 7f6607e00040
[gpu-cache.cc] 5081 7f6607e00060
[gpu-cache.cc] 10070 7f65e1000000
[l2cache.cc] 5270 7f65e1000000
[gpu-cache.cc] 10071 7f65e1000020
[l2cache.cc] 5271 7f65e1000020
[gpu-cache.cc] 10072 7f65e1000040
[l2cache.cc] 5272 7f65e1000040
[gpu-cache.cc] 10073 7f65e1000060
[l2cache.cc] 5273 7f65e1000060
[gpu-cache.cc] 10074 7f6607e00000
[l2cache.cc] 5274 7f6607e00000
[gpu-cache.cc] 10075 7f6607e00020
[l2cache.cc] 5275 7f6607e00020
[gpu-cache.cc] 10076 7f6607e00040
[l2cache.cc] 5276 7f6607e00040
[gpu-cache.cc] 10077 7f6607e00060
[l2cache.cc] 5277 7f6607e00060
[gpu-cache.cc] 5608 7f6607e00000
[gpu-cache.cc] 5608 7f6607e00020
[gpu-cache.cc] 5608 7f6607e00040
[gpu-cache.cc] 5608 7f6607e00060
[gpu-cache.cc] 10599 7f6607e00000
[l2cache.cc] 5799 7f6607e00000
[gpu-cache.cc] 10600 7f6607e00020
[l2cache.cc] 5800 7f6607e00020
[gpu-cache.cc] 10601 7f6607e00040
[l2cache.cc] 5801 7f6607e00040
[gpu-cache.cc] 10602 7f6607e00060
[l2cache.cc] 5802 7f6607e00060

The cycle numbers from the L1D miss to the L2 access are weird, too. According to the printfs, on an L2 miss, first a [gpu-cache.cc] line is printed and then a [l2cache.cc] line, but with very different cycle numbers; they should be in the same cycle.
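One likely explanation, reading only the snippets quoted above rather than the full source: the time handed to m_L2cache->access(), which is what the [gpu-cache.cc] printf reports, includes m_memcpy_cycle_offset, while the [l2cache.cc] printf omits it, so within each pair the two lines should differ by exactly that offset (10070 - 5270 = 4800 in this run). A minimal sketch of that arithmetic, with the offset value assumed from the log rather than read from the config:

#include <cstdio>

int main() {
  // Values taken from one pair of lines in the log above.
  unsigned long long sim_cycles    = 5270;  // gpu_sim_cycle + gpu_tot_sim_cycle, printed by [l2cache.cc]
  unsigned long long memcpy_offset = 4800;  // assumed m_memcpy_cycle_offset (10070 - 5270)
  // tag_array::access() receives the sum, so the [gpu-cache.cc] line shows the offset value.
  printf("[l2cache.cc]   %llu\n", sim_cycles);                 // 5270
  printf("[gpu-cache.cc] %llu\n", sim_cycles + memcpy_offset); // 10070
  return 0;
}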

test.zip

mahmoodn commented 1 year ago

OK. I found that the statistics are correct in this example.

7f65e1000000 -> read miss
7f65e1000020 -> sector miss
7f65e1000040 -> sector miss
7f65e1000060 -> sector miss
7f6607e00000 -> read miss
7f6607e00020 -> sector miss
7f6607e00040 -> sector miss
7f6607e00060 -> sector miss

7f65e1000000 -> L2 miss
7f65e1000020 -> L2 miss
7f65e1000040 -> L2 miss
7f65e1000060 -> L2 miss
7f6607e00000 -> L2 miss
7f6607e00020 -> L2 miss
7f6607e00040 -> L2 miss
7f6607e00060 -> L2 miss

7f6607e00000 -> write hit
7f6607e00020 -> write hit
7f6607e00040 -> write hit
7f6607e00060 -> write hit

7f6607e00000 -> L2 write hit
7f6607e00020 -> L2 write hit
7f6607e00040 -> L2 write hit
7f6607e00060 -> L2 write hit

So, although the L1D misses are 8, the L2 accesses are 8 (L1D misses) + 4 (write hits, which presumably also reach the L2 because the L1D is write-through), which is 12. I will close the issue.
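For the record, a quick sanity check of that accounting with the numbers from this issue (the per-line breakdown into read vs. sector misses, and the write-through assumption, are mine):

#include <cassert>

int main() {
  // L1D side, from the breakdown above: two 128-byte lines are read.
  const int l1d_read_misses   = 2;  // one read miss per line
  const int l1d_sector_misses = 6;  // three further sector misses per line
  const int l1d_write_hits    = 4;  // the four 32-byte sector writes

  const int l1d_accesses = l1d_read_misses + l1d_sector_misses + l1d_write_hits;  // 12
  const int l1d_misses   = l1d_read_misses + l1d_sector_misses;                   // 8

  // Every L1D miss goes to the L2, and (assuming a write-through L1D) so does
  // every write hit, which is how 8 misses turn into 12 L2 accesses.
  const int l2_accesses = l1d_misses + l1d_write_hits;  // 12

  assert(l1d_accesses == 12 && l1d_misses == 8 && l2_accesses == 12);
  return 0;
}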