Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 986 forks source link

fwrite: avoid converting matrix input to data.table #3188

Open mattdowle opened 5 years ago

mattdowle commented 5 years ago

PR #3125 implemented #2613 to enable to fwrite to accept matrix. This was merged in v1.12.0. However it converts matrix to data.table which can be costly. The convert can be avoided as follows. fwriteR.c contains this :

for (int j=0; j<args.ncol; j++) {
    SEXP column = VECTOR_ELT(DFcoerced, j);
    if (args.nrow != length(column))
      error("Column %d's length (%d) is not the same as column 1's length (%d)", j+1, length(column), args.nrow);
    int32_t wf = whichWriter(column);
    if (wf<0) {
      error("Column %d's type is '%s' - not yet implemented in fwrite.", j+1, type2char(TYPEOF(column)));
    }
    args.columns[j] = (wf==WF_CategString ? column : (void *)DATAPTR(column));
    args.whichFun[j] = (uint8_t)wf;
    if (TYPEOF(column)==VECSXP && firstListColumn==0) firstListColumn = j+1;
  }

This loop populates the pointers in arg.columns[] which are then passed to fwrite.c. That DATAPTR returns a pointer to where the data (e.g. int *, double *) for the R vector starts. In the case of matrix, this args.columns[] just needs to be populated with offsets into matrix vector. In R a matrix is a single very long vector just with dimension attribute attached. A matrix in R is columnar, just like a data.table, so there doesn't need to be a transpose. It should be fairly simple to do and work well.

Careful to add tests for a factor matrix as well as a character matrix.

I described how to do it in case @fparages or @MichaelChirico wanted to give it a go.

MichaelChirico commented 5 years ago

Agree the as.data.table approach is a bit of a have and this is right way to do things.

How does one point to matrix column names in C API?

Don't think I'll have time before next year to work on something that for me & my lack of C experience will end up being quite time consuming (only mentioning because I see the 1.12.0 milestone and know you mentioned maybe pushing out soon given the memory fixes)

mattdowle commented 5 years ago

Sure thing.

args.columns[j] = (char *)DATAPTR(matrix) + j*size*nrow;

where size is 4 for int and 8 for double, for example.

jangorecki commented 5 years ago

instead of DATAPTR isn't it more appropriate to use switch(TYPEOF(matrix)) ... INTEGER(matrix) ... REAL(matrix)?

mattdowle commented 5 years ago

It's more convenient and appropriate to use DATAPTR. But we're not supposed to is more the point because it's not part of the R API. I wish it were and I don't see why it isn't. It should be the switch() way to be compliant, yes. We could create our own dataptr() function that does the switch. Seems silly to have to do so though when DATAPTR could just be added to the R API. There must be a good reason why it isnt. Maybe now that INTEGER, REAL and DATAPTR are no longer ever macros, even under USE_RINTERNALS, the original reason for not exposing DATAPTR in R API might have gone away.

raneameya commented 5 years ago

I'd like to draw attention to the fact that matrices can have rownames which in the current version are stripped away in the as.data.table conversion. May I please ask that rownames be written as the first column by default if it is sensible? If not, maybe an additional argument to include rownames, like as.data.table?

mattdowle commented 5 years ago

Re rownames: sure thing. If rownames are the default 1:nrow then can we assume please they aren't really rownames and not write them by default? Internally R uses c(NA,INT_MIN) to represent default row names efficiently, so it's safe and efficient to test that attribute to see if row.names is a length 2 integer vector holding those values. With an argument to explicitly specify row names too of course; we're just talking about the default here.

raneameya commented 5 years ago

Thanks Matt!