WinVector / rqdatatable

Implement the rquery piped query algebra in R using data.table. Distributed under choice of GPL-2 or GPL-3 license.
https://winvector.github.io/rqdatatable/

Recommended way to input large data sets into rqdatatable from Google BigQuery? #4

Closed heoa closed 5 years ago

heoa commented 5 years ago

What is the recommended way to get large data sets from Google BigQuery into rqdatatable?

I tested

library(rqdatatable)
library(bigrquery)

bigrquery::bq_auth(path = "~/key.json")
con <- bigrquery::dbConnect(bigrquery::bigquery(), project = '6666666666666',
                            path = "~/key.json", dataset = '777')

rs  <- bigrquery::dbSendQuery(con, "SELECT * FROM myDB.myTAB;")
res <- bigrquery::dbFetch(rs)  # THIS STEP OVERUSES RAM WITH LARGE DATA SAMPLE

optree <- local_td(res) %.>%
  extend_nse(., revSes := sum(revenue)/sum(SessionIDs),
             partitionby = c('date', 'id')) %.>%
  normalize_cols(., "revSes", partitionby = c('date', 'id')) %.>%
  select_columns(., c('cat', 'date', 'id')) %.>%
  orderby(., cols = c('cat', 'date', 'id'))

res %.>% optree

but I am running into memory problems, such as "Error in file(path, open = "wb") : cannot open the connection" ("tmp" has only 0 bytes of disk space remaining, i.e. local tmp is exhausted) and "Exceeded rate limits" (a BigQuery error with very large data sets), already at the dbFetch(rs) step that produces the res used by local_td(res) in the optree.

Is there some way to use rqdatatable without having to download the massive data sets into RAM? Is there some way to make the algorithms use SSD disk instead of RAM?

JohnMount commented 5 years ago

The recommended way to work with big data in Google BigQuery is to send the query from R to Google BigQuery, instead of moving the data from Google BigQuery to R. To do this you would not use bigrquery::dbFetch(), and you would not use rqdatatable. Instead you would use the rquery package: use mk_td() to define a table description containing the name of the remote table and the remote column names. Then, instead of res %.>% optree, use rquery::materialize() to run the query remotely and land the result as a new table in the database.
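A minimal sketch of that pattern, reusing the table and column names from your example (the landed table name "rquery_result" is made up for the example, and per the caveat below this is untested against BigQuery):

library(rquery)
library(bigrquery)

bigrquery::bq_auth(path = "~/key.json")
con <- bigrquery::dbConnect(bigrquery::bigquery(),
                            project = '6666666666666', dataset = '777')

# describe the remote table; no rows are moved into R
table_desc <- mk_td("myDB.myTAB",
                    c("cat", "date", "id", "revenue", "SessionIDs"))

optree <- table_desc %.>%
  extend_nse(., revSes := sum(revenue)/sum(SessionIDs),
             partitionby = c('date', 'id')) %.>%
  normalize_cols(., "revSes", partitionby = c('date', 'id')) %.>%
  select_columns(., c('cat', 'date', 'id')) %.>%
  orderby(., cols = c('cat', 'date', 'id'))

# run the query remotely and land the result as a new table in the
# database; only a small table description comes back to R
res_desc <- materialize(con, optree, table_name = "rquery_result")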

That being said, we haven't tested rquery with Google BigQuery, as we haven't had a client ask for that yet. If/when we do, we would find and fix any issues particular to Google BigQuery's SQL dialect and create and distribute an rquery per-database SQL adapter. The databases we have worked with are listed here.
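One way to check for dialect issues up front is to print the SQL rquery generates for a given connection (to_sql() is a standard rquery function; whether BigQuery accepts its output is exactly the untested part):

# print the SQL rquery would send over this connection; a quick way
# to spot constructs BigQuery's dialect might reject
cat(to_sql(optree, con))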