Kevin-Jin / mmap

Forked from https://r-forge.r-project.org/scm/?group_id=648
1 stars 1 forks source link

Refactor mmap indexing functions to eliminate possible O(n*m) memory usage #3

Open Kevin-Jin opened 7 years ago

Kevin-Jin commented 7 years ago

The offending code is found in `[.mmap`() and `[<-.mmap`() in mmap.R, i.e.:

if(missing(i))
  i <- 1:length(x)
if(missing(j))
  j <- 1:length(x$storage.mode)

[...]

if( missing(i))
  i <- 1:dim(x)[1]
if( missing(j))
  j <- 1:dim(x)[2]

And in convert_ij_to_i() in mmap.c, i.e.:

PROTECT( newi = allocVector(INTSXP, leni * lenj));
_newi = INTEGER(newi); 

for(jj=0; jj<lenj; jj++) {
  for(ii=0; ii<leni; ii++) {
    _newi[n++] = (_j[jj] - 1) * _rows + _i[ii];
  }
}

Although a time complexity of O(n*m) is unavoidable, the amount of temporary space that is currently being used is certainly avoidable. It's easy to see how the current solution can be problematic for larger files.

Kevin-Jin commented 7 years ago

A related issue is also properly handling negative indices in these functions (indicating that certain columns should be excluded while all others are retrieved).

Kevin-Jin commented 7 years ago

Explicit calls to rep should also not be necessary for the sake of recycling elements of value in `[<-.mmap`().

Kevin-Jin commented 7 years ago

Sequence detection would also be useful for mprotect() and madvise() since currently a system call is being issued per index passed. The system calls can easily accept contiguous records if we can detect sequential indices.