easydatawarehousing / prestoclient

PrestoClient implements the client protocol to communicate with a Presto server. There are versions in C, Python and R.

Is this still under development? #3

Open jeffwong opened 10 years ago

jeffwong commented 10 years ago

Would love to see the C version incorporated into RPresto. Is there a way I can help in development?

easydatawarehousing commented 10 years ago

I haven't given the R version much love and attention recently. The reason is that I want to port my easy2oracle tool, which includes fast Presto access, to R. This would render RPresto obsolete.

easy2oracle should already work with R; see this article I wrote: load-excel-data-into-r. Just replace the executable name with EasyPTOra and optionally remove the 'offset' lines. You can use 'table=something' as shown in the example, or use 'sql=select x,y,z from table join other_table'.

Sometime in the near future I will create an R installer and binary packages for Windows, RHEL, Ubuntu.

Let me know if this works for you. Regards, Ivo

easydatawarehousing commented 10 years ago

I have been looking into adapting the C version of the presto client for R. It works flawlessly, but it is about as slow as the native R version. This is because I am using the read.delim function, which is very slow. Changing the code a bit to use the scan function would speed things up a little, but it would still be more than 10 times slower than, for instance, the fread function from the data.table package. That function is written in C and calls C-level R functions.
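To make the speed difference concrete, here is a small self-contained sketch that times the three parsing approaches mentioned above on a generated file. It assumes only that the data.table package is installed; the data itself is synthetic.

```r
# Rough comparison of the text-parsing approaches discussed above.
# Writes a sample tab-delimited file, then times three readers.
n <- 100000L
f <- tempfile(fileext = ".tsv")
write.table(data.frame(a = seq_len(n), b = runif(n), c = sample(letters, n, TRUE)),
            f, sep = "\t", row.names = FALSE, quote = FALSE)

t1 <- system.time(d1 <- read.delim(f))            # generic parser, slow
t2 <- system.time(d2 <- scan(f, what = list(a = 0L, b = 0, c = ""),
                             sep = "\t", skip = 1, quiet = TRUE))
t3 <- system.time(d3 <- data.table::fread(f))     # C-level parser, fast

print(rbind(read.delim = t1, scan = t2, fread = t3))
unlink(f)
```

On most machines fread comes out well ahead, which matches the 10x figure quoted above, though the exact ratio depends on the data.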

At the moment I can see 3 possible solutions:

  1. write dedicated C code that integrates Presto client with R
  2. create a database driver for the R 'DBI' package. This sounds more complicated than it is, since you can copy an existing driver (like RSQLite) and change the code where necessary. This would also be C code
  3. Wait until the Presto ODBC driver becomes usable, then use the DBI and RODBC packages in R. I don't know how this would perform.

My personal favorite is option 2. But at the moment I do not have the time to implement it.
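For context, option 2 would give R users the standard DBI calling pattern. A hypothetical sketch of the user-facing side (the `Presto()` driver constructor and connection arguments are placeholders, not an existing package):

```r
# What option 2 would look like from the user's side: the standard
# DBI pattern. Presto() is a hypothetical driver constructor here;
# no such package exists at the time of this discussion.
library(DBI)

con <- dbConnect(Presto(),          # hypothetical driver object
                 host    = "localhost",
                 port    = 8080,
                 catalog = "hive",
                 schema  = "default")

df <- dbGetQuery(con, "SELECT x, y, z FROM some_table LIMIT 100")
dbDisconnect(con)
```

The appeal of this route is that any DBI-compatible tooling in R would work with Presto for free once the driver exists.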

jeffwong commented 9 years ago

Hi, wondering if you have made any progress on integrating the C library with R? These functions are still significantly faster than the ones found in https://github.com/prestodb/RPresto. I built the C version using cmake and just called it from inside R using system(cprestoclient server query) - while sloppy it is very fast. When you say the C version of the presto client works flawlessly, do you mean that the C version is stable/robust?
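The quick-and-dirty approach described above can be wrapped in a small helper. This is a sketch under the assumptions stated in this thread: that the compiled executable is named `cprestoclient`, takes a server and a query as arguments, and emits delimited text on stdout.

```r
# Shell out to the compiled C client and parse its delimited output.
# The executable name, argument order, and output format are all
# assumptions based on this thread, not a documented interface.
run_query <- function(server, sql) {
  out <- system2("cprestoclient", args = c(server, shQuote(sql)),
                 stdout = TRUE)
  # Parse the captured text with data.table's fast C-level parser.
  data.table::fread(text = paste(out, collapse = "\n"))
}

# Example (requires a running Presto server and the built client):
# df <- run_query("http://presto-host:8080", "SELECT * FROM t LIMIT 10")
```

Sloppy, as noted, but it keeps all the parsing work in fread rather than in slow base R readers.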

easydatawarehousing commented 9 years ago

Hi. The answer is no, and the reason is that other developments have rendered it unnecessary. Some people at Facebook developed a DBI driver for R (option 2 from my list) a couple of months ago; I have documented this in the readme for my R client. The DBI driver developed by Facebook seems stable, and I read there is also an ODBC driver in the making by Teradata. All in all, I think any efforts in this field on my part would be in vain. By 'flawlessly' I meant that wrapping the C client in R works as expected and without errors.

jeffwong commented 9 years ago

Unfortunately the DBI driver is slow; I think if this code becomes stable it will have a lot of value. Using RPresto from the prestodb/RPresto repo takes 45 seconds to pull 100k rows and 22 columns. Using your Python interface to dump the data to a text file and then read it back into R takes a total of 6 seconds (for both steps), and using your C interface in the same way takes 3 seconds. On larger queries the performance difference makes the data workflow difficult. If you have ideas of where the C version could go in terms of stability, I can offer to help. I am not looking for a clean R package with all the C files well organized; a solution that uses system("cprestoclient server query") would do well, as long as you felt it was stable. Do you have any fresh ideas about what could be improved?

easydatawarehousing commented 9 years ago

Wow, that is a surprisingly big difference. My past experiments with reading big or huge text files into R showed me that R is simply very slow at converting text into its internal memory structures. I haven't used DBI much, but I always assumed it would be faster since it has direct access to creating data.frames. Apparently the DBI overhead is huge.

As an example, the data.table R package has its own methods for creating R memory structures and is way faster as a result.

As for the stability of the C presto client: in my mind it is ready for production use, since I have never found any errors in it, nor seen any errors reported. But keep in mind that the user base is probably very small ;) If 'dump to txt' + 'read into R' works for you, then why not keep using that? You may also take a look at my other repo, 'Easy2Oracle'. Don't be misled by its name; it is not just for Oracle. There is also a Presto version in there, using the same C client. What it does: an executable takes the name of an .ini file as input. This file should contain the db name and the query. It executes the query and sends the results back as a stream of text. You can read this stream into R, so there is no in-between storage of text files.
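The streaming setup described above could look something like this from the R side. Treat the executable name `EasyPTOra` and the .ini convention as described in this thread; the exact invocation is an assumption.

```r
# Stream the executable's stdout straight into R via a pipe,
# avoiding the intermediate text file entirely.
# 'EasyPTOra' and 'query.ini' follow the convention described above;
# the .ini file is assumed to contain the db name and the query.
df <- read.delim(pipe("EasyPTOra query.ini"))

# For large results, data.table can read from a command directly:
# df <- data.table::fread(cmd = "EasyPTOra query.ini")
```

When read.delim is given an unopened connection it opens and closes it itself, so no explicit cleanup is needed.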

I wrote a blog article long ago showing how it works: load-excel-data-into-r. Just replace the name of the executable and it should work.

I never pursued this any further because the performance bottleneck is still R itself. Any help is of course welcome, but I'm not sure how to proceed.