Hi @davidfjr3, looking through the rClr.c code, the limitation appears to be in the clr_obj_ms_convert_to_SEXP function. (Sorry, I don't know how to reference line numbers with CodePlex's web interface.)
It looks like the default clause is catching your condition, which makes sense because I don't see the numeric/decimal condition supported.
Two questions
numeric data type in R? Notice it's a very different animal than R's numeric data type. The equivalent data type in SQL Server is float. The round() function used in the benchmarking vignette you referenced is producing the equivalent of a SQL Server float.
Hi @wibeasley,
I really believed that reading data from the SQL Server database in numeric format would be as fast as what the benchmarking vignette shows.
Your tip to convert the char fields to a float type worked very well. After that change, reading the numeric data from SQL Server into R took 65% less time.
Thank you for your very helpful participation.
@davidfjr3, no problem. I've really benefited from this package, and I'm happy to help.
Is this field representing money or something that needs to be extremely precise? If not, consider keeping the data types as floats in the database.
Did you lose any/much precision? Theoretically you could, if the values aren't well represented in powers of two. If it's a concern, consider uploading both versions to the same table and subtracting one from the other (as DECIMAL/NUMERIC types) to see if you're losing anything.
Like you, I'm surprised there was a substantial speed difference. I had assumed that the bcp utility was mostly text based underneath. But now that you point it out, I see the vignette's vertical axes change for the horizontal facets. I'm not sure the difference is big enough for me to change from my preferred data type choice for each variable, but I'm glad I'm more aware of it.
Hi @wibeasley,
No, I don't need extreme precision.
Now, let me summarize the conditions of my data and the scenario in which I got this performance gain:
VT_DECIMAL is the decimal type that may be missing in rClr.c. That being said, you can write a stored procedure that will automatically convert any decimal column to a float and then call it from R similar to
df=dbGetQuery(connection, "exec sp_RSqlServer_Select @Table='Time_Series', @Schema='dbo'")
You could also add a @Where parameter, as well as parameters to order the dataset, etc.
The first block related to VT_DECIMAL should be
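case VT_DECIMAL:
    /* Scalar DECIMAL -> double. Only Lo32 is read, so this covers mantissas below 2^32. */
    rVals = (double*)malloc(sizeof(double)*n);
    /* Lo32 is unsigned: convert to double before applying the sign, otherwise the negation wraps around. */
    rVals[0] = (double)(pobj->decVal.Lo32)*pow(.1, pobj->decVal.scale);
    if (pobj->decVal.sign == 128)  /* 128 == DECIMAL_NEG */
        rVals[0] = -rVals[0];
    result = make_numeric_sexp(n, rVals);
    free(rVals);
    break;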
The block related to VT_ARRAY|VT_DECIMAL might be a modest modification of the code below, but I'd have to get rsqlserver configured in Visual Studio to figure it out. I tried replacing rClr's ClrFacade.dll, but that was not enough.
case VT_ARRAY | VT_DECIMAL :
    get_array_variant(pobj, &array, &n, &uBound);
    rVals = (double*)malloc(sizeof(double)*n);
    for(long i = 0; i < n ; i++ ) {
        SafeArrayGetElement(array, &i, &(rVals[i]));
    }
    result = make_numeric_sexp(n, rVals);
    free(rVals);
    break;
That's a good option if you must run the transformation in-database.
As a result of the performance comparisons above, the rsqlserver function dbBulkWrite was added as the fastest way to dump tables with problematic data types to CSV as character types, before reading them in with fread.
My data table is stored in a SQL Server 2008 database and contains only the types listed below:
However, when I try to run the script below, an error occurs with the numeric data type. I worked around the problem inefficiently by converting all the table fields to char, which prevents the rsqlserver package from performing at its best (according to the benchmark at https://github.com/agstudy/rsqlserver/wiki/benchmarking, rsqlserver is much faster for numeric data).
Although I have researched this a lot, I found nothing about how to resolve the problem. There are not even any comments on issue #22, which I believe is a similar problem to mine.
So I would like to fix this error and gain performance.
My code:
Package
Driver
My connection
Data reading attempt
Error message