armgong / rjulia

R package that integrates R and Julia
GNU General Public License v2.0
146 stars 23 forks

speedier TransArrayToDataArray? #33

Closed phaverty closed 7 years ago

phaverty commented 8 years ago

jl_eval_string seems to be one of the slower bits of the julia C API. One might be able to do a version of TransArrayToDataArray that uses jl_get_function and jl_call2. This discussion seems to offer the solution. I experimented a bit and have not yet been successful getting a pointer to DataArrays.DataArray (and/or the DataArrays module), but I think it should be possible. Compiling on Mac gives a number of warnings related to snprintf, so this alternate strategy would have a nice side effect of removing those warnings.

armgong commented 8 years ago

Last year I tried this in https://github.com/armgong/rjulia/blob/nextgen/src/R_Julia.c on Julia 0.3, but the speed did not improve much. The situation may have changed since then, so you can try again.

By the way, the Julia C API version, R_Julia_MD_NA, starts at line 132 of that file. In that code I use a loop to copy both the values and the NA flags. After some thinking, this function could be improved by copying the "data" array with memcpy. For the "na" array, maybe we can use is.na(rvariable) to create a logical vector in R and then memcpy it into the "na" array.

static jl_value_t *R_Julia_MD_NA(SEXP Var, const char *VarName)
{
  if ((LENGTH(Var)) == 0)
    return (jl_value_t *) jl_nothing;

 jl_tuple_t *dims = RDims_JuliaTuple(Var);
 jl_value_t *ret =NULL;
 jl_value_t *ret1 =NULL;
 jl_value_t *ans=NULL;
 JL_GC_PUSH4(&ret, &ret1,&dims,&ans);
 jl_function_t *DataArray=jl_get_function(jl_main_module,"DataArray");
 jl_function_t *setindex=jl_get_function(jl_main_module,"setindex!");
 switch (TYPEOF(Var))
   {
    case LGLSXP:
    {
      ans=jl_call2(DataArray, (jl_value_t*) jl_bool_type, (jl_value_t*) dims);
      ret =jl_get_field(ans,"data");
      ret1 =jl_get_field(ans,"na");

      char *retData = (char *)jl_array_data(ret);
      for (size_t i = 0; i < jl_array_len(ret); i++)
      {
        if (LOGICAL(Var)[i] == NA_LOGICAL)
        {
          retData[i] = 1;
          jl_call3(setindex,ret1,jl_box_bool(1),jl_box_long(i+1));
        }
        else
        {
          retData[i] = LOGICAL(Var)[i];
          jl_call3(setindex,ret1,jl_box_bool(0),jl_box_long(i+1));
        }
      }
      break;
    }
    case INTSXP:
    {
      ans=jl_call2(DataArray,(jl_value_t*) jl_int32_type,(jl_value_t*) dims);
      ret =jl_get_field(ans,"data");
      ret1 =jl_get_field(ans,"na");

      int *retData = (int *)jl_array_data(ret);
      for (size_t i = 0; i < jl_array_len(ret); i++)
      {
        if (INTEGER(Var)[i] == NA_INTEGER)
        {
          retData[i] = 999;
          jl_call3(setindex,ret1,jl_box_bool(1),jl_box_long(i+1));
        }
        else
        {
          retData[i] = INTEGER(Var)[i];
          jl_call3(setindex,ret1,jl_box_bool(0),jl_box_long(i+1));
        }
      }
      break;
    }
    case REALSXP:
    {
      ans=jl_call2(DataArray,(jl_value_t*) jl_float64_type,(jl_value_t*) dims);
      ret =jl_get_field(ans,"data");
      ret1 =jl_get_field(ans,"na");

      double *retData = (double *)jl_array_data(ret);
      for (size_t i = 0; i < jl_array_len(ret); i++)
      {
        if (ISNAN(REAL(Var)[i]))
        {
          retData[i] = 999.01;
          jl_call3(setindex,ret1,jl_box_bool(1),jl_box_long(i+1));
        }
        else
        {
          retData[i] = REAL(Var)[i];
          jl_call3(setindex,ret1,jl_box_bool(0),jl_box_long(i+1));
        }
      }
      break;
    }
    case STRSXP:
    {
      if (!ISASCII(Var))
        ans=jl_call2(DataArray,(jl_value_t*) jl_utf8_string_type,(jl_value_t*) dims);
      else
        ans=jl_call2(DataArray,(jl_value_t*) jl_ascii_string_type,(jl_value_t*) dims);

      ret =jl_get_field(ans,"data");
      ret1 =jl_get_field(ans,"na");
      jl_value_t **retData = jl_array_data(ret);
      for (size_t i = 0; i < jl_array_len(ret); i++)
      {
        if (STRING_ELT(Var, i) == NA_STRING)
        {
          retData[i] = jl_cstr_to_string("999");
          jl_call3(setindex,ret1,jl_box_bool(1),jl_box_long(i+1));
        }
        else
        {
          if (!ISASCII(Var))
            retData[i] = jl_cstr_to_string(translateCharUTF8(STRING_ELT(Var, i)));
          else
            retData[i] = jl_cstr_to_string(CHAR(STRING_ELT(Var, i)));
          jl_call3(setindex,ret1,jl_box_bool(0),jl_box_long(i+1));
        }
      }
      break;
    }
    default:
      ans=(jl_value_t *) jl_nothing;
      break;
   } // end switch
 if (VarName != NULL && strlen(VarName) > 0)
   jl_set_global(jl_main_module, jl_symbol(VarName), (jl_value_t *)ans);
 JL_GC_POP();
 return ans;
}

armgong commented 8 years ago

I edited the comment on GitHub, so please don't rely on the copy in the email. GitHub does not send edited comments by email.

phaverty commented 8 years ago

I like your idea of using the logical array from R generated by is.na and just copying it to julia. That would simplify the rjulia code a lot. It might have to be copied in a loop with a cast as R logicals are int32 and julia bools are int8, right? Still, that would be dramatically simpler!
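
A minimal sketch of the copy described above, written without the R or Julia headers so it stands alone. It assumes integer data, uses INT_MIN as a stand-in for R's NA_INTEGER (which R defines as INT_MIN), and uses a plain uint8_t buffer as a stand-in for a one-byte-per-element mask. The function name copy_with_na_mask is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for R's NA_INTEGER, which R defines as INT_MIN. */
#define NA_INT_SENTINEL INT_MIN

/* Copy n R-style integers into dst with a single memcpy, then fill an
   8-bit mask: 1 where the source element is NA, 0 elsewhere.
   The mask still needs a loop because each 32-bit comparison result
   is narrowed to one byte. */
static void copy_with_na_mask(const int *src, int *dst, uint8_t *na, size_t n)
{
    assert(n == 0 || (src != NULL && dst != NULL && na != NULL));
    memcpy(dst, src, n * sizeof *src);
    for (size_t i = 0; i < n; i++)
        na[i] = (uint8_t)(src[i] == NA_INT_SENTINEL);
}
```

The data values need no per-element work, so only the mask pays for a loop. The NA slots in dst keep the sentinel value, which should be harmless as long as the mask marks them.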

armgong commented 8 years ago

I think we could avoid the loop by creating a Julia Int32 array, copying the R array into it with memcpy, and then converting the Julia Int32 array to an Int8 array.

julia> X=Array(Int32,10)
10-element Array{Int32,1}:
 -2048134640
           0
 -2048134576
           0
 -2048134512
           0
 -2046260688
           0
 -2048131984
           0

julia> for i in 1:10
        X[i]=i
       end

julia> convert(Array{Int8},X)
10-element Array{Int8,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

armgong commented 8 years ago

Sorry, DataArray.na is a BitArray, not an Array{Int8}, so we need to convert like this:

julia> convert(BitArray,X)
10-element BitArray{1}:
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true

armgong commented 8 years ago

or use

julia> BitArray(X)
10-element BitArray{1}:
 true
 true
 true
 true
 true
 true
 true
 true
 true
 true

phaverty commented 8 years ago

Great, maybe we can call BitArray on the Int32 array when we construct the DataArray:

data_array = DataArrays.DataArray( data_array, BitArray(int32_vector) )

?
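
For context on why the conversion step is needed at all: a BitArray keeps one bit per element, packed into 64-bit chunks, so an R logical vector (one 32-bit int per element) cannot simply be memcpy-ed into it. The sketch below shows that kind of packing in plain C. The lowest-bit-first layout and the names chunk_count and pack_logicals are assumptions for illustration. In practice, calling BitArray(X) on the Julia side does this work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 64-bit chunks needed to hold n bits. */
static size_t chunk_count(size_t n) { return (n + 63) / 64; }

/* Pack n R-style logicals (32-bit ints, nonzero means TRUE) into
   64-bit chunks, lowest bit first. The chunks buffer must have room
   for chunk_count(n) elements. */
static void pack_logicals(const int *lgl, size_t n, uint64_t *chunks)
{
    assert(n == 0 || (lgl != NULL && chunks != NULL));
    for (size_t c = 0; c < chunk_count(n); c++)
        chunks[c] = 0;
    for (size_t i = 0; i < n; i++)
        if (lgl[i] != 0)
            chunks[i / 64] |= (uint64_t)1 << (i % 64);
}
```

Element i lands in chunk i / 64 at bit position i % 64, so 66 logicals need two chunks.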

phaverty commented 8 years ago

I have an implementation of these ideas, based on your 0.5 branch over at https://github.com/phaverty/rjulia

Specifically, I use jl_get_function to look up functions and then call them with jl_call2 (etc.). This results in simpler code. I haven't compared the speed yet. I wonder whether the function lookup can be made static so that we pay for it only once.
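
One way the "pay for the lookup once" idea could look is a function-local static pointer that is filled on first use. The sketch below shows only the caching pattern, not Julia API code: lookup_function is a hypothetical stand-in for a by-name lookup such as jl_get_function, and add_ints and cached_add are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*binary_fn)(int, int);

static int add_ints(int a, int b) { return a + b; }

/* Counts how often the (assumed slow) lookup actually runs. */
static int lookup_calls = 0;

/* Hypothetical stand-in for a by-name lookup such as jl_get_function. */
static binary_fn lookup_function(const char *name)
{
    lookup_calls++;
    return strcmp(name, "add") == 0 ? add_ints : NULL;
}

/* Resolve "add" on the first call only; later calls reuse the pointer. */
static binary_fn cached_add(void)
{
    static binary_fn fn = NULL;
    if (fn == NULL)
        fn = lookup_function("add");
    return fn;
}
```

Two caveats would need checking before using this with the real API: the pattern is not thread-safe as written, and a cached Julia function pointer is only safe if the function stays bound in its module for the life of the session.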

As part of this, I moved the association of C-level objects with Julia global namespace symbols to the top-level functions. This allows functions like R_Julia_MD_NA to be used in other places, such as DataFrame creation. Please let me know whether you like that change.

The current state is that the code for 'r2j' works. I believe my copy of 'j2r' needs some more work.

phaverty commented 8 years ago

The relevant updates have been committed. Closing.