HewlettPackard / quartz

Quartz: A DRAM-based performance emulator for NVM
https://github.com/HewlettPackard/quartz

What is pflush()? #15

Open wenwen412 opened 6 years ago

wenwen412 commented 6 years ago

Hi,

I found there is a pflush() function in the code. Do we need to call it in our user programs in order to inject the PM latency we want?

guimagalhaes commented 6 years ago

This is probably related to memory write latency emulation, which I am not familiar with. @hvolos, could you provide a summary of pflush?

wenwen412 commented 6 years ago

The reason why I am asking is that I found when I change the read/write latency in nvmemul.ini, the random read latency actually changes accordingly, but random write latency doesn't change much even if I add a custom flush/mfence function.

BTW, the function I use to test random write is:

    mem = (unsigned long **) pmalloc(BUF_SIZE * sizeof(unsigned long *));

    for (i=0; i < BUF_SIZE; ++i) {
        mem[i] = (unsigned long *) pmalloc(BUF_SIZE * sizeof(unsigned long));

    }

    start_time = clock();
    for (i=0; i < BUF_SIZE; ++i) {
        for (j=0; j < BUF_SIZE; ++j) {
            int a, b;
            a = rand()%BUF_SIZE;
            b = rand()%BUF_SIZE;
            mem[a][b] = a * b;
            persistent(&(mem[a][b]), sizeof(long), 1);
        }
    }   
    end_time = clock();
    // rand_latency is the latency of rand() calls
    duration = (double)(end_time - start_time) / CLOCKS_PER_SEC - rand_latency;
    printf( "Time for NVM Writing is %f seconds\n ", duration );

hvolos commented 6 years ago

pflush is meant to emulate the effect of a synchronous cacheline flush from cache to memory (similarly to clwb/clflush in intel x86). The latency is controlled through the latency.write setting in nvmemul.ini.

In your code example, what does persistent do? Could you provide the code executed by persistent? Or otherwise, could you replace persistent with pflush(&mem[a][b]) and see whether this helps?

wenwen412 commented 6 years ago

@hvolos Thank you for your explanation. Here is the persistent function I use:

    #define _mm_clflush(addr)\
            asm volatile("clflush %0" : "+m" (*(volatile char *)(addr)))
    #define _mm_clflushopt(addr)\
            asm volatile(".byte 0x66; clflush %0" : "+m" (*(volatile char *)(addr)))
    #define _mm_clwb(addr)\
            asm volatile(".byte 0x66; xsaveopt %0" : "+m" (*(volatile char *)(addr)))
    #define _mm_pcommit()\
            asm volatile(".byte 0x66, 0x0f, 0xae, 0xf8")

    #define CACHELINE_SIZE 64
    static inline void PERSISTENT_BARRIER(void)
    {
        asm volatile ("sfence\n" : : );
    }

    static inline void persistent(void *buf, int len, int fence)
    {
        int i;
        len = len + ((unsigned long)(buf) & (CACHELINE_SIZE - 1));
        int support_clwb = 0;

        if (support_clwb) {
            for (i = 0; i < len; i += CACHELINE_SIZE)
                    _mm_clwb(buf + i);
        } else {
            for (i = 0; i < len; i += CACHELINE_SIZE)
                    _mm_clflush(buf + i);
        }
        if (fence)
            PERSISTENT_BARRIER();
    }

Does Quartz rely on pflush() to inject write latency? Why doesn't the random write latency change much when I double the latency.write value in the .ini file?

hvolos commented 6 years ago

You would need to modify your macros to use pflush instead of the actual clflush instruction as we don't have a way to interpose on the clflush instruction.

If there was a performance counter that counts the number of clflush invocations then we could perhaps leverage that similarly to how we leverage the cache misses performance counters to introduce latency on the read access path, but I don't think there is such a counter.
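A minimal sketch of what routing the flushes through pflush, as suggested above, could look like. It walks every cache line that overlaps the buffer and calls a flush function once per line. `pflush_stub` and `persistent_emulated` are hypothetical names used only for illustration: the stub merely counts calls so the sketch compiles without Quartz, and the real `pflush` prototype should be checked in the library headers before swapping it in. Whether an extra fence is still needed after `pflush` is not settled in this thread, so the sketch leaves it out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64

/* Hypothetical stand-in for Quartz's pflush(): it only counts calls so this
 * sketch builds without the library. Replace it with the real pflush() after
 * checking its prototype in the Quartz headers. */
static size_t flush_calls = 0;

static void pflush_stub(void *line_addr)
{
    (void)line_addr;
    ++flush_calls;
}

/* Call the flush function once for every cache line overlapping
 * [buf, buf + len). */
static void persistent_emulated(void *buf, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHELINE_SIZE - 1);
    uintptr_t line, last;

    assert(buf != NULL);
    if (len == 0)
        return;

    line = (uintptr_t)buf & mask;              /* first line touched */
    last = ((uintptr_t)buf + len - 1) & mask;  /* last line touched  */
    for (;;) {
        pflush_stub((void *)line);
        if (line == last)
            break;
        line += CACHELINE_SIZE;
    }
}
```

In the benchmark above, the call `persistent(&(mem[a][b]), sizeof(long), 1)` would then become `persistent_emulated(&mem[a][b], sizeof(long))`, so each write pays the emulated flush latency instead of the hardware clflush cost.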

guimagalhaes commented 6 years ago

Please consider adding a note to the README file to explain how pflush is used. Currently, the README indicates that write latency emulation is not supported. Thanks!

wenwen412 commented 6 years ago

[Screenshot: test results, 2017-10-11]

Here are the test results I got (pflush() was not used in these tests). I will try testing random writes with pflush() to see whether the results look normal.

wenwen412 commented 6 years ago

Hi @hvolos ,

I tried pflush(&mem[i][j]), and it takes 33 seconds for 1000*1000 random writes. That means each random write takes 33*10^9/10^6 ns = 33,000 ns, which is much higher than the 1000 ns setting in nvmemul.ini. Meanwhile, 1000*1000 random reads take only 0.015 seconds.
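The per-operation figures follow from dividing the total time by the number of operations. A small sketch of that arithmetic, with `per_op_ns` as a hypothetical helper name:

```c
#include <assert.h>

/* Average time per operation in nanoseconds, given a total in seconds. */
static double per_op_ns(double total_seconds, unsigned long ops)
{
    assert(ops > 0);
    return total_seconds * 1e9 / (double)ops;
}
```

With the numbers reported here, `per_op_ns(33.0, 1000UL * 1000UL)` gives 33,000 ns per write, about 33 times the configured 1000 ns, while `per_op_ns(0.015, 1000UL * 1000UL)` gives roughly 15 ns per read.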

hvolos commented 6 years ago

Would it be easy to post your microbenchmark so that I try to reproduce what you see?

hkundnani commented 5 years ago

Since this issue is still open, I will post here instead of opening a new issue. Does Quartz support write latency emulation using pflush()?

If it does, could the README please be updated with some documentation or an example?

templestorager commented 4 years ago

Since there is no explanation of how to use pflush (it's essentially a clflush instruction plus emulated additional write latency), it's not clear how to use this function in application code. Does the address passed to pflush represent the address to be flushed (let's call it "direct mode") or the address of the variable holding the address to be flushed (let's call it "indirect mode")? These would require the following two implementations (IMO), respectively:

    #define asm_clflush(addr) \
    ({ \
        asm volatile ("clflush %0" : : "m"(addr)); \
    })

and

    #define asm_clflush(addr) \
    ({ \
        asm volatile ("clflush %0" : : "m"(*addr)); \
    })

Note that the values indicated to be stored in memory are (addr) and (*addr), respectively. Confusingly, I have seen usage in the kernel code that uses the indirect mode with a char pointer, where the char pointer denotes the address to be flushed...

I may be misunderstanding the issue, but can anybody shed light on this?
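One way to reason about the two forms is to look at which location a GCC-style "m" operand names. The operand is an lvalue, and the instruction receives that lvalue's address. The portable sketch below models this with the address-of operator and does not execute clflush at all. `OPERAND_ADDRESS` and `operand_names_pointee` are hypothetical names for illustration, not part of Quartz or the kernel.

```c
#include <assert.h>

/* Address that an "m"(lvalue) operand would hand to the instruction.
 * Inline asm uses the location of the lvalue, which is exactly what the
 * address-of operator computes here. */
#define OPERAND_ADDRESS(lvalue) ((const void *)&(lvalue))

/* Returns 1 when "m"(*addr) names the pointed-to data and "m"(addr) names
 * the pointer variable itself. */
static int operand_names_pointee(void)
{
    char data[64] = {0};
    char *addr = data;

    const void *with_addr  = OPERAND_ADDRESS(addr);   /* like "m"(addr)  */
    const void *with_deref = OPERAND_ADDRESS(*addr);  /* like "m"(*addr) */

    return with_deref == (const void *)data &&
           with_addr  == (const void *)&addr;
}
```

Under this model, the dereferenced form flushes the cache line the pointer refers to, which matches the char-pointer usage mentioned above, while the plain form would target the line holding the pointer variable. Which convention pflush itself expects is best confirmed against the Quartz source.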