bobye / d2_kmeans

Fast discrete distributions clustering using Wasserstein barycenter with sparse support

intra-node memory optimization (MPI) #35

Open bobye opened 8 years ago

bobye commented 8 years ago

@robbwu

Can MPI do intra-node memory optimization? I have a fairly large read-only region; can the processes sharing memory on a node read it from the same place?

bobye commented 8 years ago

https://github.com/bobye/d2_kmeans/blob/master/src/d2/clustering_io.c#L52

Starting from line 52, we read the header information of the data. This part is read-only, so it can be shared among the processes on the same node.

Note that, depending on the nature of the data, different header fields are read (either p_data->ph[n].dist_mat or p_data->ph[n].vocab_vec).

@robbwu

robbwu commented 8 years ago

My understanding is that all processes need to read the same file into memory, and you want to share the memory among the processes. The usual way to do inter-process memory sharing is:

  1. one process uses shm_open() to create a named memory object (e.g. /mydata);
  2. ftruncate() sets the memory object to the desired size;
  3. every process uses mmap() to map the memory object into its own virtual memory space;
  4. one process reads the file into the mmap'ed memory space;
  5. all processes can access the shared memory object through their own mmap'ed memory space.

In this way, only one copy of the memory object is in the physical memory; all processes will access the same physical memory.

robbwu commented 8 years ago

basically, for the memory you want to share, use mmap instead of malloc.

bobye commented 8 years ago

What is the behavior if a process on another node wants to read this memory object? Is there any cross-node communication? I assume no communication is needed to read data from this memory object.

@robbwu

robbwu commented 8 years ago

The first thing is that on each node we need to select one MPI rank to do the shm_open(), read the file, and write it into the memory object. We can do this by creating a communicator for each node and using the local rank 0 process to do the above work. (http://www.open-mpi.org/doc/v1.8/man3/MPI_Comm_split_type.3.php)

//  newcomm is the communicator for the processes on the same node.
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &newcomm);
MPI_Comm_rank(newcomm, &local_rank);

if (local_rank == 0) { // only one process on each node does the following
   // create a named memory object
   fd = shm_open("/mydata", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
   ftruncate(fd, <the size of shared memory space>);
   rptr = mmap(NULL, <the size of shared memory space>, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
   // now read the file into rptr
   MPI_Barrier(newcomm);
} else {
   MPI_Barrier(newcomm); // wait until local rank 0 has created and filled the object
   fd = shm_open("/mydata", O_RDONLY, 0);
   rptr = mmap(NULL, <the size of shared memory space>, PROT_READ, MAP_SHARED, fd, 0);
}

After that, the pointer rptr points into the shared memory space. Also remember to munmap() and shm_unlink() after use: every rank unmaps its own mapping, and one rank per node unlinks the name once all ranks are done with it.

bobye commented 8 years ago

Looks good. If the data race is acceptable, I guess we can still use MPI-2. Any negative effects?

bobye commented 8 years ago

One more question:

should /mydata be different for different nodes?

@robbwu

robbwu commented 8 years ago

/mydata can be the same on each node. The namespace is confined to a single node, so no interference is possible, and it makes programming easier. Think of it like a local file.

MPI-3 adds the MPI_Comm_split_type function, so you can make the shared-memory feature conditional on MPI-3. If the user does not have MPI-3, they simply don't get shared memory.
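One way to gate the feature is the MPI_VERSION macro, which mpi.h defines per the MPI standard. A sketch (the fallback path is hypothetical and just means keeping the existing per-rank copy):

```c
#include <mpi.h>

#if MPI_VERSION >= 3
  /* MPI-3 available: split a per-node communicator and share one copy. */
  MPI_Comm newcomm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                      rank, MPI_INFO_NULL, &newcomm);
  /* ... the shm_open()/mmap() path sketched earlier in this thread ... */
#else
  /* MPI-2 or older: each rank keeps its own private copy as before. */
  /* ... existing malloc-based code path ... */
#endif
```

Because the check happens at preprocessing time, binaries built against MPI-2 never reference MPI_Comm_split_type at all.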

bobye commented 8 years ago

Yep, that's my plan.