Greetings, after reading your paper, I noticed this claim:

> CRAC replays the entire log in order to guarantee that active memory allocations are restored at the original address. CRAC relies on the determinism of the CUDA library's allocations. CRAC also disables address-space randomization using Linux's `personality` system call.
I doubt this, because in my experiment I could not reproduce that behavior.
OS env:
- CUDA toolkit: 10.2
- OS: Ubuntu 18.04
- kernel: 4.15.0-159-generic
Here's the code:
```cpp
// a.cu
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <sys/personality.h>

int main(int argc, char **argv) {
    // Disable address-space layout randomization, as CRAC does.
    int ret = personality(ADDR_NO_RANDOMIZE);
    if (ret < 0) {
        printf("fail\n");
        return 1;
    }
    // Allocate twice and print the device pointers to check
    // whether the returned addresses are deterministic.
    for (int i = 0; i < 2; i++) {
        void *devPtr;
        if (cudaMalloc(&devPtr, 16) != cudaSuccess) {
            printf("cudaMalloc failed\n");
            return 1;
        }
        printf("%p\n", devPtr);
    }
    return 0;
}
```
compilation:

```shell
nvcc a.cu --cudart=shared
```

output:

Would you explain a bit more about the "determinism"? Am I missing something?

My team has been working on migrating CUDA workloads for several months. We set up a frontend that runs the CUDA application and a backend that runs a TCP server, so that workloads can be forwarded to different hosts, which enables seamless migration and scheduling. Forwarding the API calls is easy, but handling fatbins properly is almost impossible (it requires too much reverse engineering).

Our work can now migrate simple CUDA workloads, but it fails on complex ones, such as projects with multiple .cu files and lots of relocatable objects. For memory management, we record the address mapping and rewrite the API call arguments on the backend.

So I'm interested in the idea of a no-IPC proxy, since it may bridge this gap. Is your team still working on it? I'm looking forward to a deeper technical discussion.
Thanks!