jeffwong opened this issue 10 years ago
I haven't given the R version much love and attention recently. The reason is that I want to port my easy2oracle tool, which includes fast Presto access, to R. This would render RPresto obsolete.
easy2oracle should already work with R; see this article I wrote: load-excel-data-into-r. Just replace the executable name with EasyPTOra and optionally remove the 'offset' lines. You can use 'table=something' as shown in the example, or use 'sql=select x,y,z from table join other_table'.
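For reference, a hypothetical .ini file following the description above might look like this (the key names and layout here are assumptions based on the text, not taken from the Easy2Oracle docs, so check the repo for the real format):

```ini
; hypothetical EasyPTOra configuration -- key names are assumptions
db=presto_prod
table=some_schema.some_table
; or, instead of table=, supply a full query:
; sql=select x, y, z from table join other_table
```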
Sometime in the near future I will create an R installer and binary packages for Windows, RHEL, and Ubuntu.
Let me know if this works for you. Regards, Ivo
I have been looking into adapting the C version of the Presto client for R. It works flawlessly, but it is about as slow as the native R version. This is because I am using the read.delim function, which is very slow. Changing the code a bit to use the scan function would speed things up a little, but it would still be more than 10 times slower than, for instance, the fread function from the data.table package. That function is written in C and calls C-level R functions.
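To illustrate the gap described above, here is a small self-contained R sketch that times both readers on a generated file (the data is made up and absolute timings will vary by machine; the point is only the relative difference):

```r
# Write a modest tab-separated file, then read it back two ways.
library(data.table)  # provides fread()

tmp <- tempfile(fileext = ".tsv")
df  <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))
write.table(df, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

system.time(a <- read.delim(tmp))         # base R: parsing happens in R-level code
system.time(b <- fread(tmp, sep = "\t"))  # data.table: C parser, typically much faster

unlink(tmp)
```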
At the moment I can see 3 possible solutions:
My personal favorite is option 2, but at the moment I do not have the time to implement it.
Hi, I was wondering whether you have made any progress on integrating the C library with R? These functions are still significantly faster than the ones found in https://github.com/prestodb/RPresto. I built the C version using cmake and just called it from inside R using system(cprestoclient server query) - while sloppy, it is very fast. When you say the C version of the Presto client works flawlessly, do you mean that the C version is stable/robust?
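For anyone wanting to try the same quick-and-dirty route, a minimal sketch of the pattern (assuming cprestoclient is on the PATH and writes tab-separated rows to stdout; both the CLI shape and the output format are assumptions based on the usage above):

```r
library(data.table)

# Hypothetical wrapper: run the C client and parse its stdout directly,
# without writing an intermediate file. Adjust to cprestoclient's real CLI.
presto_query <- function(server, query) {
  cmd <- sprintf("cprestoclient %s %s", shQuote(server), shQuote(query))
  fread(cmd = cmd, sep = "\t")  # fread can consume a command's output directly
}

# Example (not run here, needs a live server):
# dt <- presto_query("presto.example.com:8080",
#                    "select * from some_table limit 1000")
```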
Hi. The answer is no, and the reason is that other developments rendered it unnecessary. Some people at Facebook developed a DBI driver for R (option 2 from my list) a couple of months ago. I have documented this in the readme for my R client. The ODBC driver developed by Facebook seems stable, and I read there is another ODBC driver in the making from Teradata. All in all, I think any efforts in this field on my part would be in vain. By 'flawlessly' I meant that wrapping the C client in R works as expected and without errors.
Unfortunately, the DBI driver is slow; I think if this code becomes stable it will have a lot of value. Using RPresto from the prestodb/RPresto repo, it takes 45 seconds to pull 100k rows and 22 columns. Using your Python interface to dump the data to a text file and then read it back into R takes a total of 6 seconds (for both steps), and using your C interface in the same way takes 3 seconds. On larger queries the performance difference makes the data workflow difficult. If you have ideas about where the C version could go in terms of stability, I can offer to help. I am not looking for a clean R package with all the C files well organized; a solution that uses system("cprestoclient server query") would do well, as long as you felt it was stable. Do you have any fresh ideas about what could be improved?
Wow, that is a surprisingly big difference. My past experiments with reading big/huge text files into R showed me that R is simply very slow at converting text into its internal memory structures. I haven't used DBI much, but I always assumed it would be faster since it has direct access to creating data.frames. Apparently the DBI overhead is huge.
As an example, the data.table R package has its own methods for creating R memory structures and is much faster as a result.
As for the stability of the C Presto client: in my mind it is ready for production use, since I have never found any errors in it, nor seen any reported. But keep in mind that the user base is probably very small ;) If the 'dump to txt' + 'read into R' approach works for you, then why not keep using it? You might also take a look at my other repo, 'Easy2Oracle'. Don't be misled by its name: it is not just for Oracle. There is also a Presto version in there, using the same C client. What it does: an executable takes the name of an .ini file as input. This file should contain the db name and the query. It executes the query and sends the results back as a stream of text. You can read this stream into R, so there is no in-between storage of text files.
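The streaming workflow described above can be sketched in R like this. The stand-in shell command below only exists so the sketch runs anywhere; in real use you would replace it with the Easy2Oracle-style executable and .ini name (e.g. "EasyPTOra query.ini" - an invocation I am assuming from the description, not one I have verified):

```r
# Pattern: read a child process's stdout straight into R, no temp file on disk.
read_stream <- function(cmd) {
  data.table::fread(cmd = cmd, sep = "\t")
}

# Stand-in command emitting a tiny TSV; swap in e.g. "EasyPTOra query.ini".
dt <- read_stream("printf 'a\\tb\\n1\\t2\\n3\\t4\\n'")
```

The advantage over the dump-and-reload workflow is that results never touch disk: the text stream goes straight from the executable's stdout into data.table's C parser.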
I wrote a blog article long ago showing how it works: load-excel-data-into-r. Just replace the name of the executable and it should work.
I never pursued this any further because the performance bottleneck is still R itself. Any help is of course welcome, but I'm not sure how to proceed.
Would love to see the C version incorporated into RPresto. Is there a way I can help in development?