CMU-SAFARI / ramulator-pim

A fast and flexible simulation infrastructure for exploring general-purpose processing-in-memory (PIM) architectures. Ramulator-PIM combines a widely-used simulator for out-of-order and in-order processors (ZSim) with Ramulator, a DRAM simulator with memory models for DDRx, LPDDRx, GDDRx, WIOx, HBMx, and HMCx. Ramulator is described in the IEEE CAL 2015 paper by Kim et al. (https://people.inf.ethz.ch/omutlu/pub/ramulator_dram_simulator-ieee-cal15.pdf). Ramulator-PIM is used in the DAC 2019 paper by Singh et al. (https://people.inf.ethz.ch/omutlu/pub/NAPEL-near-memory-computing-performance-prediction-via-ML_dac19.pdf).

can't simulate many cores #13

Closed FanYang98 closed 3 years ago

FanYang98 commented 3 years ago

Hi, when I try to simulate many cores (more than 6), zsim does not simulate all of them; it only simulates 6. For example, when I set cores = 8, the stats file reports cycles, instrs, and IPC as 0 for [core-6] and [core-7]. 😢 Can someone please let me know what I might be missing here?

avacoder42 commented 3 years ago

Hi, if you are still stuck on this, you can use a config like the one below in zsim. I have tried it with 32 cores and it works.

// 32 OOO cores at 2.4GHz, each with private L1I/L1D and L2 caches, sharing a 64MB L3
sys = {
    lineSize = 64;
    frequency = 2400;

    cores = {
        core = {
            type = "OOO";
            cores = 32;
            icache = "l1i";
            dcache = "l1d";
        };
    };

    caches = {
        l1d = {
            array = {
                type = "SetAssoc";
                ways = 8;
            };
            caches = 32;
            latency = 4;
            size = 32768;
        };
        l1i = {
            array = {
                type = "SetAssoc";
                ways = 4;
            };
            caches = 32;
            latency = 3;
            size = 32768;
        };
        l2 = {
            array = {
                type = "SetAssoc";
                ways = 8;
            };
            //type = "Timing";
            //mshrs = 10;
            caches = 32;
            latency = 7;
            children = "l1i|l1d";
            size = 262144;
        };
        l3 = {
            array = {
                hash = "H3";
                type = "SetAssoc";
                ways = 16;
            };
            //type = "Timing";
            //mshrs = 16;
            banks = 32;
            caches = 1;
            latency = 27;
            children = "l2";
            size = 67108864;
        };

    };

    mem = {
        type = "Traces";
        instr_traces = true;
        only_offload = true;
        pim_traces = true;

        outFile = "pim-poly_cholesky_32.out";
    };

};

sim = {
    phaseLength = 10000;
    maxTotalInstrs = 10000000000L;
    statsPhaseInterval = 1000;
    printHierarchy = true;
    // attachDebugger = True;
};

process0 = {
    command = "benchmarks/PolyBench-ACC-master/OpenMP/linear-algebra/kernels/cholesky/cholesky";
    startFastForwarded = True;
//    command = "ls -la";
//    command = "unzip tracesLois.out.gz";
};
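
The part of this config that matters for the core count is that every per-core cache group is sized to match it: cores and the caches entries of l1d, l1i, and l2 are all 32. As a minimal sketch, and assuming zsim expects one l1i, l1d, and l2 instance per core as the config above suggests, the 8-core case from the question would change only those four values. This fragment is untested, and the trailing comments are annotations that are not part of the posted config:

```
sys = {
    lineSize = 64;
    frequency = 2400;

    cores = {
        core = {
            type = "OOO";
            cores = 8;              // number of simulated cores
            icache = "l1i";
            dcache = "l1d";
        };
    };

    caches = {
        l1d = {
            array = { type = "SetAssoc"; ways = 8; };
            caches = 8;             // one L1D per core
            latency = 4;
            size = 32768;
        };
        l1i = {
            array = { type = "SetAssoc"; ways = 4; };
            caches = 8;             // one L1I per core
            latency = 3;
            size = 32768;
        };
        l2 = {
            array = { type = "SetAssoc"; ways = 8; };
            caches = 8;             // one private L2 per core
            latency = 7;
            children = "l1i|l1d";
            size = 262144;
        };
        // l3 stays a single shared cache (caches = 1), as in the config above
    };

    // mem section unchanged from the config above
};
```

The sim and process0 sections stay as posted. The workload itself also has to start 8 threads for all 8 cores to show activity in the stats file.
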

FanYang98 commented 3 years ago

Hi, thanks for your help. I also found that it works well when I use OpenMP. However, I hit the problem described above when using pthreads, which I guess is due to some detail of the simulator.