CMU-SAFARI / ramulator-pim

A fast and flexible simulation infrastructure for exploring general-purpose processing-in-memory (PIM) architectures. Ramulator-PIM combines a widely-used simulator for out-of-order and in-order processors (ZSim) with Ramulator, a DRAM simulator with memory models for DDRx, LPDDRx, GDDRx, WIOx, HBMx, and HMCx. Ramulator is described in the IEEE CAL 2015 paper by Kim et al. at https://people.inf.ethz.ch/omutlu/pub/ramulator_dram_simulator-ieee-cal15.pdf. Ramulator-PIM is used in the DAC 2019 paper by Singh et al. at https://people.inf.ethz.ch/omutlu/pub/NAPEL-near-memory-computing-performance-prediction-via-ML_dac19.pdf.

Speedup due to PIM execution #6

Open veronia-iskandar opened 4 years ago

veronia-iskandar commented 4 years ago

Hello, I'm new to ramulator-pim. When I run the sample traces, the host execution is better than PIM in terms of both IPC and time. Could you point me to an example that shows the benefits of using PIM (i.e., a case where PIM speeds up part of the code)? Thanks!

geraldofojunior commented 4 years ago

Hi, please note that the host traces are filtered through the L3 cache, so they do not account for the L1/L2/L3 hits and misses. To calculate the final number of cycles executed by the host, you need to account for those cache latencies based on the ZSim stats (i.e., you need to include those cycles when computing the final host IPC).
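As a rough illustration only (not the project's official methodology), the sketch below shows one simple additive way to fold per-level cache latencies into the host cycle count before computing IPC. All numbers are placeholders: the hit/miss counters would come from your zsim.out stats, the per-level latencies from your ZSim configuration, and the filtered-trace cycle count from Ramulator's output.

```python
# Rough sketch: adjust the host cycle count with on-chip cache latencies.
# All values below are hypothetical placeholders; substitute the counters
# from your zsim.out stats and the latencies from your ZSim config.

# Assumed per-access latencies in core cycles (config-dependent).
L1_LAT, L2_LAT, L3_LAT = 4, 12, 40

# Example hit/miss counters taken from ZSim stats (placeholders).
l1_hits, l1_misses = 9_000_000, 1_000_000
l2_hits, l2_misses = 700_000, 300_000
l3_hits, l3_misses = 200_000, 100_000

# Cycles reported by Ramulator for the L3-filtered host trace (placeholder),
# and the instruction count of the region of interest (placeholder).
ramulator_host_cycles = 5_000_000
instructions = 10_000_000

# Add the cache access latencies that the filtered trace does not capture.
# This is a simple serial-latency approximation (no overlap with compute).
cache_cycles = (
    (l1_hits + l1_misses) * L1_LAT    # every memory access probes L1
    + (l2_hits + l2_misses) * L2_LAT  # accesses that miss in L1
    + (l3_hits + l3_misses) * L3_LAT  # accesses that miss in L2
)

total_host_cycles = ramulator_host_cycles + cache_cycles
host_ipc = instructions / total_host_cycles
print(f"Adjusted host cycles: {total_host_cycles}, host IPC: {host_ipc:.3f}")
```

This assumes cache accesses are fully serialized with execution, which overestimates the penalty on an out-of-order core; it is meant only to show which ZSim counters need to enter the calculation.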

amir-parvaresh commented 4 years ago

Hi, I have the exact same problem. Would it be possible for you to tell us how we should calculate the overall IPC from the L1/L2/L3 hit/miss ratios? Thank you so much!