agstudy / rsqlserver

Sql Server driver database interface (DBI) driver for R. This is a DBI-compliant Sql Server driver based on the System.Data.SqlClient.
82 stars 30 forks source link

can't read numeric fields (again) #26

Closed ghost closed 6 years ago

ghost commented 8 years ago

My data table is recorded in SQL Server 2008 database and it contains only the types listed below:

However, when trying to run the script below an error occurs with the type of data (numerical). I solved the problem inefficiently turning all the table fields in char type, which does not allow the best performance rsqlserver package (see about the performance, according to the test here for the numerical data rsqlserver package is much faster, https://github.com/agstudy/rsqlserver/wiki/benchmarking).

Although I have researched a lot about, I found nothing about how to resolve this problem, do not even exist comments for the issue #22, and I believe it is a similar problem to mine.

So I would like to correct this error and gain in performance.

My code:

Pacakge

library(rsqlserver)     

Driver

drv  <- dbDriver("SqlServer")   

My connection

conn <- dbConnect(drv, 
                  host     = "my_ip",
                  dbname   = "mydb", 
                  user     = "my_user", 
                  password = "my_pass") 

Data reading attempt

df <- dbGetQuery(conn, "select * FROM my_table")

Error message

Error in (function ()  : 
  clr_obj_ms_convert_to_SEXP: COM variant type code 14 unsupported. Returning NULL
Error in clrCall(res.Dict, "get_Item", Cnames[i]) : 
  Type:    RDotNet.EvaluationException
Message: Error in (function ()  : 
  clr_obj_ms_convert_to_SEXP: COM variant type code 14 unsupported. Returning NULL

Method:  RDotNet.ProtectedPointer evaluateCall(IntPtr)
Stack trace:
   at RDotNet.Function.evaluateCall(IntPtr call)
   at RDotNet.Function.createCallAndEvaluate(IntPtr argument)
   at RDotNet.Function.InvokeOrderedArguments(SymbolicExpression[] args)
   at Rclr.RDotNetDataConverter.ConvertObject(Object obj) in c:\Users\per202\AppData\Local\Temp\Rtmp8UJrfL\R.INSTALL23944e8d7aab\rClr\src\ClrFacade\RDotNetDataConverter.cs:line 558
   at Rclr.RDotNetDataConverter.ConvertToSexp(Object obj) in c:\Users\per202\AppData\Local\Temp\Rtmp8UJrfL\R.INSTALL23944e8d7aab\rClr\src\ClrFacade\RDotNetDataConverter.cs:line 368
   at Rclr.RDotNetDataConverter.ConvertAll(Object[] objects, Func`2 converter) in c:\Users\per202\AppData\Local\Temp\Rtmp8UJrfL\R.INSTALL23944e8d7aab\rClr\src\ClrFacade\RDotNetDataConverter.cs:line 387
   at Rclr.RDotNetD 
wibeasley commented 8 years ago

Hi @davidfjr3, looking through the rClr.c code, the limitation appears to be in the clr_obj_ms_convert_to_SEXP function. (Sorry I don't know how to reference line numbers with codeplex's web interface.)

It looks like the default clause is catching your condition, which makes sense because I don't see the numeric/decimal condition supported.

Two questions

  1. Are you sure you want SQL Server's numeric data type in R? Notice it's a very different animal than R's numeric data type. The equivalent data type in SQL Server is float. The round() function used in the benchmarking vignette you referenced is producing the equivalent of a SQL Server float.
  2. If the answer is yes above, I guess I don't understand how you're supporting it in R that's not already in a character format. Could
ghost commented 8 years ago

Hi @wibeasley,

I really believed that to read the data of the SQL database server in numeric format would be as fast as we could see in benchmarking vignette.

Your tip to convert the fileds char into a float type worked very well .. and from there reading the numerical data from SQL Server to the R reduced the reading time by 65%.

Thank you for your very helpful participation.

wibeasley commented 8 years ago

@davidfjr3, no problem. I've really benefited from this package, and I'm happy to help.

Is this field representing money or something that needs to be extremely precise? If not, consider keeping the data types as floats in the database.

Did you loose any/much precision? Theoretically if the values aren't well represented in powers-of-two. If it's a concern, consider uploading both versions to the same table and subtract the difference (as DECIMAL/NUMERIC types) to see if you're losing anything.

Like you, I'm surprised there was a substantial speed difference. I had assumed that the bcp utility was mostly text based underneath. But now that you point it out, I see the vignette's vertical axes change for the horizontal facets. I'm not sure the difference is big enough for me to change from my preferred data type choice for each variable, but I'm glad I'm more aware of it.

ghost commented 8 years ago

Hi, @wibeasley

no, I should not have extreme precision.

Now, let me show a summary of the conditions of my data and show the scenario that I got this performance gain:

ghost commented 6 years ago

VT_DECIMAL is the decimal type that may be missing in rClr.c. That being said, you can write a stored procedure that will automatically convert any decimal column to a float and then call it from R similar to

df=dbGetQuery(connection, "exec sp_RSqlServer_Select @Table='Time_Series', @Schema='dbo'")

You also could add @Where parameter as well as parameters to order the dataset, etc.

The first block related to VT_DECIMAL should be

case VT_DECIMAL:
    rVals = (double*)malloc(sizeof(double)*n);
    if (pobj->decVal.sign == 128)
        rVals[0] = -(pobj->decVal.Lo32)*pow(.1, pobj->decVal.scale);
    else
        rVals[0] = (pobj->decVal.Lo32)*pow(.1, pobj->decVal.scale);
    result = make_numeric_sexp(n, rVals);
    free(rVals);
    break;

The block related to VT_ARRAY|VT_DECIMAL might be a modest modification of the code below, but I'd have to get rsqlserver configured in Visual Studio to figure it out. I tried replacing rClr's ClrFacade.dll, but that was not enough.

case VT_ARRAY | VT_DECIMAL : get_array_variant(pobj, &array, &n, &uBound); rVals = (double)malloc(sizeof(double)n); for(long i = 0; i < n ; i++ ) { SafeArrayGetElement(array, &i, &(rVals[i])); } result = make_numeric_sexp(n, rVals); free(rVals); break;

ruaridhw commented 6 years ago

That's a good option if you must run the transformation in-database.

As a result of the performance comparisons above, the rsqlserver function dbBulkWrite was added to be the fastest way to dump tables with erroneous data types as character types to CSV before reading in with fread