DiskFrame / disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
https://diskframe.com

Subsetting a large dataset on SQL Server #320

Closed ehcarine closed 3 years ago

ehcarine commented 3 years ago

I am trying to analyze 1.1 billion rows of medical claims data (multiple rows per patient over 20 years) via a VM in a secure research environment with a limited number of processors (4) and limited RAM (16 GB). The main goal of this project is to identify all observations meeting certain conditions, along with the corresponding dates. Even after selecting only the columns required to flag these cases, I had no luck creating disk.frames due to memory limitations, so I have been trying to find a workaround.

I was wondering if there is a way to automate the following two-stage subsetting process:

  1. Loading a subset of the data from the SQL Server (saved as a data frame) via dbGetQuery within a for-loop or apply functions
  2. Creating a disk.frame for each subset and combining them (a rough sketch of what I have in mind follows this list)
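
For context, this is roughly the loop I had in mind. The table and column names, the id_batches list, and the use of rbindlist.disk.frame to combine the pieces are just placeholders and assumptions on my part, not something I have working:

library(DBI)
library(odbc)        # assuming an ODBC connection to the SQL Server
library(disk.frame)

setup_disk.frame(workers = 4)  # match the 4 available processors

con <- dbConnect(odbc::odbc(), dsn = "claims_db")  # hypothetical DSN

# Stage 1: query one "chewable" subset of patients at a time.
# id_batches is a hypothetical list; each element is a vector of patient IDs.
dfs <- lapply(id_batches, function(ids) {
  sql <- sprintf(
    "SELECT patient_id, claim_date, diag_code FROM claims WHERE patient_id IN (%s)",
    paste(ids, collapse = ", ")  # assumes numeric, trusted IDs
  )
  subset_df <- dbGetQuery(con, sql)

  # Stage 2a: write each subset out as its own disk.frame
  as.disk.frame(subset_df, outdir = tempfile())
})

# Stage 2b: combine the per-subset disk.frames into one
claims_df <- rbindlist.disk.frame(dfs)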

I initially tried creating 100 ID files (subdividing the patient IDs over all years) and then using join statements to load the data for the individuals in each ID file one at a time. However, I wasn't sure if there was a better way to optimize the process. It seems a bit tricky, as I have to work with two levels of chunks: a "chewable" dataset queried from the SQL Server, and then the disk.frame chunks built from it. Do you have any examples or advice?

Thank you so much for your help in advance.

xiaodaigh commented 3 years ago
  1. Loading a subset of the data from the SQL Server (saved as a data frame) via dbGetQuery within a for-loop or apply functions

There isn't a built-in way to load data from a database table, but I have made some examples before that are hard to find.

You can do something like this:


## setup a dummy database
library(RSQLite)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "iris", iris)

Now I will load 50 rows from the table at a time and create a disk.frame:

library(disk.frame)

# create an empty disk.frame in a temporary directory
diskf <- disk.frame(tempfile())

# send the query, but fetch the result incrementally instead of all at once
rs <- dbSendQuery(con, 'SELECT * FROM iris')

while (!dbHasCompleted(rs)) {
  # fetch the next 50 rows and append them to the disk.frame as a new chunk
  df <- dbFetch(rs, n = 50)
  print(nrow(df))

  add_chunk(diskf, df)
}

dbClearResult(rs)

diskf
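
Once the chunks are written, the disk.frame can be queried with the usual dplyr-style verbs without loading everything into RAM; for example (the filter threshold here is arbitrary):

library(dplyr)

# how many chunks ended up on disk
nchunks(diskf)

# lazy filter/select across chunks; collect() brings the (small) result into RAM
diskf %>%
  filter(Sepal.Length > 6) %>%
  select(Species, Sepal.Length) %>%
  collect()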

This has been asked many times, so I will make a helper function to do it. Thanks.

xiaodaigh commented 3 years ago

I initially tried creating 100 ID files (subdividing the patient IDs over all years) and then using join statements to load the data for the individuals in each ID file one at a time. However, I wasn't sure if there was a better way to optimize the process. It seems a bit tricky, as I have to work with two levels of chunks: a "chewable" dataset queried from the SQL Server, and then the disk.frame chunks built from it. Do you have any examples or advice?

Not quite sure if I understand this, but I think you are saying that you load from the database 100 times, each time using a different query like select * from table where ID = "ID1".
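
If that is the case, you don't need 100 intermediate files; you can append each per-ID query result straight into a single disk.frame. A minimal sketch, assuming a claims table with a patient_id column and a patient_ids vector taken from one of your ID files (the ? placeholder syntax depends on your DBI backend):

library(DBI)
library(disk.frame)

diskf <- disk.frame(tempfile())

for (id in patient_ids) {
  # one parameterised query per ID; in practice, batch several IDs per query
  chunk <- dbGetQuery(
    con,
    "SELECT * FROM claims WHERE patient_id = ?",
    params = list(id)
  )
  if (nrow(chunk) > 0) add_chunk(diskf, chunk)
}

diskf

In practice you would want each iteration to pull a batch of IDs rather than a single patient, so that each chunk has a reasonable number of rows instead of creating thousands of tiny chunks.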

ehcarine commented 3 years ago


Yes, that was essentially the idea - reading the data in multiple times to achieve the same goal, just because the initial data on the server was too big!


Thank you so much for the example code. It's very helpful.

xiaodaigh commented 3 years ago

Thank you so much for the example code.

Since this gets asked a lot, I am preparing a function to load data from a DBMS using DBI.
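
As a rough sketch of what such a helper could look like (the name dbi_to_disk.frame and its arguments are hypothetical, not an existing disk.frame API), it would just wrap the dbSendQuery/dbFetch loop above:

library(DBI)
library(disk.frame)

# hypothetical helper: stream a DBI query result into a disk.frame, n rows at a time
dbi_to_disk.frame <- function(con, sql, outdir = tempfile(), n = 100000) {
  df <- disk.frame(outdir)

  rs <- dbSendQuery(con, sql)
  on.exit(dbClearResult(rs), add = TRUE)  # always release the result set

  while (!dbHasCompleted(rs)) {
    chunk <- dbFetch(rs, n = n)
    if (nrow(chunk) > 0) add_chunk(df, chunk)
  }

  df
}

# hypothetical usage:
# claims_df <- dbi_to_disk.frame(con, "SELECT * FROM claims", n = 1e6)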