bsc-mem / Mess-benchmark


How to use Mess-benchmark to obtain the bandwidth-delay curve for CXL memory? #2

Closed by pudding-art 1 month ago

pudding-art commented 1 month ago

Hello, I would like to use Mess-benchmark to obtain the bandwidth-delay curve for CXL memory. The CXL memory in my system is exposed as NUMA nodes 2 and 3, which have no CPUs attached, as shown in this excerpt from the output of numactl -H:

node 2 cpus:
node 2 size: 64511 MB
node 2 free: 64149 MB
node 3 cpus:
node 3 size: 64504 MB
node 3 free: 64145 MB
node distances:
node   0   1   2   3 
  0:  10  21  24  24 
  1:  21  10  14  14 
  2:  24  14  10  16 
  3:  24  14  16  10
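
A quick way to confirm which NUMA nodes are CPU-less (and therefore candidates for being the CXL memory) is to look for nodes whose cpus: line is empty. The following is a minimal sketch, not part of Mess-benchmark; the function name cpuless_nodes is hypothetical, and the sample text is copied from the excerpt above.

```python
import re

# Excerpt of the `numactl -H` output quoted in this issue.
SAMPLE = """\
node 2 cpus:
node 2 size: 64511 MB
node 2 free: 64149 MB
node 3 cpus:
node 3 size: 64504 MB
node 3 free: 64145 MB
"""

def cpuless_nodes(numactl_output):
    """Return the IDs of NUMA nodes whose 'cpus:' line lists no CPUs."""
    nodes = []
    for line in numactl_output.splitlines():
        m = re.match(r"node (\d+) cpus:(.*)$", line.strip())
        # An empty remainder after 'cpus:' means no CPUs belong to the node.
        if m and not m.group(2).strip():
            nodes.append(int(m.group(1)))
    return nodes

print(cpuless_nodes(SAMPLE))  # [2, 3]
```

On a live system the same function could be fed the real output, for example via subprocess.run(["numactl", "-H"], capture_output=True, text=True).stdout. A CPU-less node is not necessarily CXL memory, so the result should be cross-checked against the platform configuration.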

The current system's CPU information is as follows:

hwt@cxl2:~/workspace/Mess-benchmark-main/CPU/Actual-hardware/x86/Intel-SapphireRapids-Xeon-Platinum-8480$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  192
  On-line CPU(s) list:   0-191
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8468V
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  48
    Socket(s):           2
    Stepping:            8
    CPU max MHz:         3800.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4800.00

Linux Kernel Info: 6.5.0-18-generic

After modifying the config.sh file as shown below, will I be able to correctly measure the bandwidth-delay curve for CXL memory?

config.sh:

NAME=mn5
CPU="Intel Xeon Platinum 8468v"
CPU_FREQ=3.8
MEM_TYPE="DDR5"
OPTANE=
OPTANE_FREQ=
OPTANE_MAX_CHANNELS=
DRAM=
DRAM_FREQ=4800
DRAM_MAX_CHANNELS=8
STLB_HIT_LATENCY=7
STREAM_CORE_LIST="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48"
STREAM_CORE_COUNT=48
STREAM_CORE_COUNT_SOCKET=48
STREAM_NUMACTL_ADDITIONAL_ARGS="numactl --membind=2"
PTRCHASE_CORE="0"
PTRCHASE_NUMACTL_ADDITIONAL_ARGS="--membind 2"
RWRATIO_MIN=0
RWRATIO_MAX=100
RWRATIO_STEP=2
PAUSES="0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 80 90 100 120 140 160 180 200 220 260 300 340 380 450 550 600 700 800 900 1000 1500 2000 3000 5000 40000 100000"
POINT_REPS=3
BW_MEAS_REPS=1
LAT_MEAS_REPS=1
TIME_STREAM_STABILIZE=20
TIME_AFTER_BW_MEAS=4
TIME_AFTER_STREAM_TERMINATION=0
BW_TOOL="likwid"
BW_TOOL_PATH="likwid-perfctr"
BW_TOOL_CORES="2-111"
BW_TOOL_SAMPLE_TIME="5s"
BW_TOOL_COUNTERS="INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,LONGEST_LAT_CACHE_MISS:PMC0,CAS_COUNT_RD:MBOX0C0,CAS_COUNT_WR:MBOX0C1,CAS_COUNT_RD:MBOX1C0,CAS_COUNT_WR:MBOX1C1,CAS_COUNT_RD:MBOX2C0,CAS_COUNT_WR:MBOX2C1,CAS_COUNT_RD:MBOX3C0,CAS_COUNT_WR:MBOX3C1,CAS_COUNT_RD:MBOX4C0,CAS_COUNT_WR:MBOX4C1,CAS_COUNT_RD:MBOX5C0,CAS_COUNT_WR:MBOX5C1,CAS_COUNT_RD:MBOX6C0,CAS_COUNT_WR:MBOX6C1,CAS_COUNT_RD:MBOX7C0,CAS_COUNT_WR:MBOX7C1,CAS_COUNT_RD:MBOX8C0,CAS_COUNT_WR:MBOX8C1,CAS_COUNT_RD:MBOX9C0,CAS_COUNT_WR:MBOX9C1,CAS_COUNT_RD:MBOX10C0,CAS_COUNT_WR:MBOX10C1,CAS_COUNT_RD:MBOX11C0,CAS_COUNT_WR:MBOX11C1,CAS_COUNT_RD:MBOX12C0,CAS_COUNT_WR:MBOX12C1,CAS_COUNT_RD:MBOX13C0,CAS_COUNT_WR:MBOX13C1,CAS_COUNT_RD:MBOX14C0,CAS_COUNT_WR:MBOX14C1,CAS_COUNT_RD:MBOX15C0,CAS_COUNT_WR:MBOX15C1"
BW_TOOL_CUSTOM_CMD=
LAT_TOOL="perf"
LAT_TOOL_PATH="perf"
LAT_TOOL_COUNTERS="cycles:u,instructions:u,r2012:u,r1012:u"
LAT_TOOL_CUSTOM_CMD=
SMOOTH_SAVGOL_WINDOW_LENGTH=11
SMOOTH_SAVGOL_POLYORDER=3
PTRCHASE_NUM_INSTRUCTIONS=200000000
PTRCHASE_NUM_ITERATIONS=5000
PTRCHASE_ARRAY_SIZE="1 * 1024 * 1024 * 1024"
PTRCHASE_WARMUP=
PTRCHASE_TARGET_LEVEL=
STREAM_ARRAY_SIZE=80000000
STREAM_STORE_TYPE="temporal"

Thank you!

poyaesy commented 1 month ago

Hi,

Please use the benchmark at the following address: https://github.com/bsc-mem/Mess-benchmark/tree/main/CPU/Actual-hardware/x86/Intel-SapphireRapids-Xeon-Gold-6448Y-CXL

In the config file, config/sapphireRapids_CXL.toml, you can change the target CXL device by editing the line interleave-numa-nodes ="2". Currently, interleave-numa-nodes is set to NUMA node 2, which is the CXL device configured as a CPU-less memory node.
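
For a system like the one in the question, where a second CXL device sits on NUMA node 3, the edit described above could be scripted. This is only a sketch under assumptions: the key name interleave-numa-nodes and its quoted value come from the reply above, but the rest of the file's layout is not shown in this thread, and the helper set_interleave_nodes is hypothetical rather than part of Mess-benchmark.

```python
import re

def set_interleave_nodes(toml_text, nodes):
    """Return toml_text with the interleave-numa-nodes value replaced by `nodes`.

    Assumes the key appears exactly once, at the start of a line, with a
    double-quoted value, as in the line quoted in this thread.
    """
    pattern = re.compile(
        r'^(\s*interleave-numa-nodes\s*=\s*)"[^"]*"', re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: f'{m.group(1)}"{nodes}"', toml_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one interleave-numa-nodes line, found {count}")
    return new_text

# The line as quoted in the reply above, retargeted to NUMA node 3.
sample = 'interleave-numa-nodes ="2"\n'
print(set_interleave_nodes(sample, "3"), end="")
```

To apply it to the real file, read config/sapphireRapids_CXL.toml into a string, pass it through the function, and write the result back. The regular expression leaves every other line untouched, which avoids depending on the file's full structure.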

pudding-art commented 1 month ago

Thank you!