DMTCP-CRAC / CRAC-early-development


Question about the "determinism" of CUDA #10

Open xial-thu opened 1 year ago

xial-thu commented 1 year ago

Greetings, after reading your paper, I notice:

CRAC replays the entire log in order to guarantee that active memory allocations are restored at the original address. CRAC relies on determinism of the CUDA library allocation. CRAC also disables address space randomization using Linux's personality system call.

I doubt this claim, because my experiment does not lead to the same conclusion.

OS env:

Here's the code:

// a.cu
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <sys/personality.h>

int main(int argc, char **argv) {
    // Disable address space layout randomization for this process,
    // as CRAC does via the personality system call.
    int ret = personality(ADDR_NO_RANDOMIZE);
    if (ret < 0) {
        printf("fail\n");
        return 1;
    }
    // Allocate twice and print the device pointers returned by cudaMalloc.
    for (int i = 0; i < 2; i++) {
        void *devPtr;
        cudaMalloc(&devPtr, 16);
        printf("%p\n", devPtr);
    }
    return 0;
}

compilation: nvcc a.cu --cudart=shared

output:

❯ ./a.out
0x7fa643000000
0x7fa643000200
❯ ./a.out
0x7fe4f7000000
0x7fe4f7000200

Would you explain a bit more about the "determinism"? Am I missing something?

My team has been working on migration of CUDA workloads for several months. We set up a frontend that runs the CUDA application and a backend that runs a TCP server, so the workload can be forwarded to different hosts; this enables seamless migration and scheduling. Forwarding the API calls is easy, but handling fatbins properly is almost impossible (it requires too much reverse engineering).
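
To make the forwarding idea concrete, here is a simplified sketch of shipping a single cudaMalloc call to the backend over a socket. The message layout and names (CudaCallMsg, OP_CUDA_MALLOC, send_all, forward_cuda_malloc) are illustrative only, not our actual code:

// forward_sketch.c -- hypothetical wire format for forwarding one CUDA API call
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

enum { OP_CUDA_MALLOC = 1 };

typedef struct {
    uint32_t op;        // which CUDA API is being forwarded
    uint64_t size;      // argument: allocation size in bytes
    uint64_t dev_ptr;   // reply: device pointer chosen by the backend
    int32_t  cuda_err;  // reply: cudaError_t from the real call on the backend
} CudaCallMsg;

// Send the full message, retrying on short writes.
static int send_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

// Frontend side: intercept cudaMalloc and ship it to the backend.
int forward_cuda_malloc(int sock, size_t size, uint64_t *out_dev_ptr) {
    CudaCallMsg msg = { .op = OP_CUDA_MALLOC, .size = size };
    if (send_all(sock, &msg, sizeof msg) != 0) return -1;
    if (recv(sock, &msg, sizeof msg, MSG_WAITALL) != (ssize_t)sizeof msg) return -1;
    *out_dev_ptr = msg.dev_ptr;   // the backend's real device address
    return msg.cuda_err;
}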

Right now our work can migrate simple CUDA workloads, but it fails on complex ones, such as multiple .cu files with lots of relocatable objects. For memory management, we record the address mapping and rewrite the API call arguments on the backend.
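
To illustrate the address-mapping part (again simplified and illustrative, not our actual implementation): the backend keeps a table from the pointers handed to the application to the pointers it really got from cudaMalloc, and translates every device-pointer argument before issuing the real call:

// addr_map_sketch.c -- hypothetical frontend-to-backend pointer translation
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t frontend_ptr;  // pointer the application saw at record time
    uint64_t backend_ptr;   // pointer actually returned by cudaMalloc on the backend
    size_t   size;
} PtrMapping;

static PtrMapping table[1024];
static size_t     table_len = 0;

// Record one mapping when the backend replays a cudaMalloc.
void record_mapping(uint64_t frontend_ptr, uint64_t backend_ptr, size_t size) {
    table[table_len++] = (PtrMapping){ frontend_ptr, backend_ptr, size };
}

// Translate an application pointer (possibly interior to an allocation)
// before forwarding an argument such as the dst of cudaMemcpy.
uint64_t translate(uint64_t frontend_ptr) {
    for (size_t i = 0; i < table_len; i++) {
        uint64_t base = table[i].frontend_ptr;
        if (frontend_ptr >= base && frontend_ptr < base + table[i].size)
            return table[i].backend_ptr + (frontend_ptr - base);
    }
    return frontend_ptr;  // not a device pointer we know about; pass through unchanged
}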

So I'm interested in the idea of a no-IPC proxy, since it may bridge this gap. Is your team still working on it? I'm looking forward to a deeper technical discussion.

Thanks!