OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/
11 stars 9 forks source link

Date is changed to integer on selecting using dplyr #28

Open gowthamrao opened 2 years ago

gowthamrao commented 2 years ago

Observed

anAndromedaObject <- Andromeda::andromeda()
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(),
                                                     anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate

todayDate anotherDate

1 2021-08-10 2021-08-10
anAndromedaObject$dataFrameWithDate <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(.data$anotherDate, .data$todayDate)
anAndromedaObject$dataFrameWithDate

anotherDate todayDate

1 18849 18849
ablack3 commented 2 years ago

Still trying to figure out why this is happening.

library(Andromeda)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# this works
anAndromedaObject <- andromeda()
x <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())
anAndromedaObject$dataFrameWithDate
#> # Source:   table<dataFrameWithDate> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   todayDate  anotherDate
#>   <date>     <date>     
#> 1 2021-10-05 2021-10-05

# this does not
anAndromedaObject$dataFrameWithDate2 <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(anotherDate, todayDate)

anAndromedaObject$dataFrameWithDate2 
#> # Source:   table<dataFrameWithDate2> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   anotherDate todayDate
#>         <int>     <int>
#> 1       18905     18905

# but this does
dataFrameWithDate2 <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(anotherDate, todayDate) %>% 
  collect()

dataFrameWithDate2
#> # A tibble: 1 x 2
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2021-10-05  2021-10-05

anAndromedaObject$dataFrameWithDate2 <- dataFrameWithDate2

anAndromedaObject$dataFrameWithDate2
#> # Source:   table<dataFrameWithDate2> [?? x 2]
#> # Database: sqlite 3.35.5
#> #   [C:\Users\ADAM~1.DES\AppData\Local\Temp\RtmpA1JwqP\file2ec775a37e9.sqlite]
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2021-10-05  2021-10-05

Created on 2021-10-05 by the reprex package (v2.0.1)

ablack3 commented 2 years ago
dateTable <- dplyr::tibble(todayDate = Sys.Date(), anotherDate = Sys.Date())

con <- RSQLite::dbConnect(RSQLite::SQLite(), ":memory:", extended_types = TRUE)

RSQLite::dbWriteTable(con, "table1", dateTable)

RSQLite::dbGetQuery(con, "select * from table1")
#>    todayDate anotherDate
#> 1 2021-10-05  2021-10-05

RSQLite::dbExecute(con, "create table table2 as select * from table1")
#> [1] 0

RSQLite::dbGetQuery(con, "select * from table2")
#>   todayDate anotherDate
#> 1     18905       18905

RSQLite::dbDisconnect(con)

Created on 2021-10-05 by the reprex package (v2.0.1)

Seem like an RSQLite issue. Made a note here https://github.com/r-dbi/RSQLite/issues/379. I'll try to come up with a temporary fix but I'm really not sure how to do this without bringing the data into R. I'm assuming "create a modified copy" must all happen in SQLite.

ablack3 commented 2 years ago

I'm pretty tempted to swap out SQLite for duckdb which has real date support.

dateTable <- dplyr::tibble(todayDate = wakefield::date_stamp(10, T), anotherDate = wakefield::date_stamp(10, T))

con <- duckdb::dbConnect(duckdb::duckdb(), ":memory:")

DBI::dbWriteTable(con, "table1", dateTable)

DBI::dbGetQuery(con, "select * from table1")
#>     todayDate anotherDate
#> 1  2021-03-05  2020-12-05
#> 2  2021-01-05  2021-02-05
#> 3  2021-10-05  2021-08-05
#> 4  2021-03-05  2021-09-05
#> 5  2020-12-05  2021-10-05
#> 6  2021-07-05  2021-01-05
#> 7  2021-05-05  2021-09-05
#> 8  2020-12-05  2021-10-05
#> 9  2020-11-05  2021-07-05
#> 10 2021-07-05  2021-10-05

DBI::dbExecute(con, "create table table2 as select * from table1")
#> [1] 10

DBI::dbGetQuery(con, "select * from table2")
#>     todayDate anotherDate
#> 1  2021-03-05  2020-12-05
#> 2  2021-01-05  2021-02-05
#> 3  2021-10-05  2021-08-05
#> 4  2021-03-05  2021-09-05
#> 5  2020-12-05  2021-10-05
#> 6  2021-07-05  2021-01-05
#> 7  2021-05-05  2021-09-05
#> 8  2020-12-05  2021-10-05
#> 9  2020-11-05  2021-07-05
#> 10 2021-07-05  2021-10-05

DBI::dbDisconnect(con)

Created on 2021-10-05 by the reprex package (v2.0.1)

schuemie commented 2 years ago

Have you compared performance (mostly speed) between SQLite and DuckDB?

ablack3 commented 2 years ago

I'm working on creating a duckdb version of Andromeda. Then I'll run some comparison tests.

I don't understand the purpose for these tests. Is it important that we can append with mismatched column names. duckdb throws an error when I try to do this.

test_that("Append from other andromeda with switched column order", {
  andromeda <- andromeda()
  andromeda$cars <- cars

  andromeda2 <- andromeda()
  andromeda2$cars <- cars
  appendToTable(andromeda$cars, andromeda2$cars %>% filter(speed > 10) %>% select(dist, speed))

  carsPlus2 <- andromeda$cars %>% collect()

  carsPlus <- rbind(cars, cars[cars$speed > 10, ])
  expect_equivalent(carsPlus2, carsPlus)
  close(andromeda)
  close(andromeda2)
})

test_that("Append from data frame with switched column order", {
  andromeda <- andromeda()
  andromeda$cars <- cars

  appendToTable(andromeda$cars, cars[cars$speed > 10, c("dist", "speed")])

  carsPlus2 <- andromeda$cars %>% collect()

  carsPlus <- rbind(cars, cars[cars$speed > 10, ])
  expect_equivalent(carsPlus2, carsPlus)
  close(andromeda)
})
ablack3 commented 2 years ago

Test using SQLite version

library(Andromeda)
library(wakefield)
summarizeSomeData <- function(){
  db <- andromeda(survey = r_data_theme(1e6, "survey"))

  db$survey %>% 
    group_by(Item_1) %>% 
    summarize(mean_Item_2 = mean(Item_2, na.rm = TRUE))
}

microbenchmark::microbenchmark(
  summarizeSomeData(), 
  times = 5)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> Unit: seconds
#>                 expr      min       lq     mean   median       uq      max
#>  summarizeSomeData() 2.585798 2.605132 2.822737 2.642374 2.674033 3.606348
#>  neval
#>      5

Test using duckDB version

library(Andromeda)
library(wakefield)
summarizeSomeData <- function(){
  db <- andromeda(survey = r_data_theme(1e6, "survey"))

  db$survey %>% 
    group_by(Item_1) %>% 
    summarize(mean_Item_2 = mean(Item_2, na.rm = TRUE))
}

microbenchmark::microbenchmark(
  summarizeSomeData(), 
  times = 5)
#> Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
#> Please use `tibble::as_tibble()` instead.
#> Unit: seconds
#>                 expr      min       lq     mean   median       uq      max
#>  summarizeSomeData() 2.685436 2.692851 2.992721 2.711518 2.725933 4.147866
#>  neval
#>      5

Created on 2021-10-06 by the reprex package (v2.0.1)

So they were basically the same on this very simple test that included loading data in and querying. I still have a work to do to get all the andromeda functions working on duckdb. Is there a test you'd like me try? Perhaps a particularly slow step of your normal workflow.

duckdb branch is at https://github.com/OHDSI/Andromeda/tree/duckdb

schuemie commented 2 years ago

Could you test joins between large tables? That is often slow.

schuemie commented 2 years ago

About the test with different column orders: this is guarantee that I think all tidyverse functions also make (e.g. bind_rows()). If I have data with columns conceptId, count, and I append to that data with columns count, conceptId, it will match the data on column name, not column order, so at the end the conceptId column still only has concept IDs, and the count column only has counts.

ablack3 commented 2 years ago

I did a few simple tests on joins both with and without indexes and posted one result here. SQLite was faster in all my tests. According the FAQ duckdb has not been fully optimized for performance yet. I'll keep working on implementing a duckdb Andromeda version on the duckdb branch and we can see if it gets better at some point.

You could use the restoreDate() function (which I should un-deprecate) but that requires bringing the data into R which is probably not what we want.

Any ideas for how to make date handling better in Andromeda? Maybe RSQLite should handle this by keeping track of date columns. This seems like a hard thing to do since RSQLite would have to infer the type of new columns from SQL.

schuemie commented 2 years ago

This seems like a SQLite issue, right? Apparently the 'extended types' extension only applies to columns of physical tables (permanent or temporary), but don't exist in any intermediate results like (nested) queries.

ablack3 commented 2 years ago

Yes I guess RSQLite should handle this, but it's not clear to me if it will be possible to propagate date types in more complex queries. I posted an example on the RSQLite issue here.

ablack3 commented 1 year ago

This will be addressed in the next release (arrow_S4 branch)

library(Andromeda)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
anAndromedaObject <- Andromeda::andromeda()
anAndromedaObject$dataFrameWithDate <- dplyr::tibble(todayDate = Sys.Date(),
                                                     anotherDate = Sys.Date())
collect(anAndromedaObject$dataFrameWithDate)
#> # A tibble: 1 × 2
#>   todayDate  anotherDate
#>   <date>     <date>     
#> 1 2022-11-20 2022-11-20

anAndromedaObject$dataFrameWithDate <- anAndromedaObject$dataFrameWithDate %>% 
  dplyr::select(.data$anotherDate, .data$todayDate)
collect(anAndromedaObject$dataFrameWithDate)
#> # A tibble: 1 × 2
#>   anotherDate todayDate 
#>   <date>      <date>    
#> 1 2022-11-20  2022-11-20

Created on 2022-11-20 with reprex v2.0.2