Bioconductor / SQLDataFrame

DataFrame representation of SQL database table
GNU Lesser General Public License v3.0
2 stars 3 forks source link

mutate throws "Error: row value misused" #3

Closed jorainer closed 4 years ago

jorainer commented 4 years ago

Dear developers, I wanted to add columns to an existing SQLDataFrame but that does not seem to work:

> library(SQLDataFrame)
> library(RSQLite)
> dframe <- data.frame(pkey = 1:5,
+                      letters = c("a", "b", "c", "d", "e"),
+                      numbers = rnorm(5),
+                      stringsAsFactors = FALSE)
> con <- dbConnect(SQLite(), tempfile())
> dbWriteTable(con, name = "test_table", dframe)
> sdf <- SQLDataFrame(con, dbtable = "test_table", dbkey = "pkey")
> sdf
SQLDataFrame with 5 rows and 2 columns
      pkey |     letters    numbers
 <integer> | <character>  <numeric>
         1 |           a -0.5388634
         2 |           b  0.3230734
         3 |           c -1.2372244
         4 |           d  1.3798320
         5 |           e -1.9733573
> sdf_2 <- mutate(sdf, new_col = c("b", "a", "b", "r", "g"))
> sdf_2
SQLDataFrame with 5 rows and 3 columns
Error: row value misused
> as.data.frame(sdf_2)
Error: row value misused

Seems this error is thrown by SQLite. Thankful for any comment on this.

My session info:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] RSQLite_2.2.0       SQLDataFrame_1.0.0  S4Vectors_0.24.4   
[4] BiocGenerics_0.32.0 dbplyr_1.4.2        dplyr_0.8.5        
[7] BiocManager_1.30.10

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       magrittr_1.5     bit_1.1-15.2     tidyselect_1.0.0
 [5] R6_2.4.1         rlang_0.4.5      fansi_0.4.1      blob_1.2.1      
 [9] tools_3.6.3      cli_2.0.2        DBI_1.1.0        ellipsis_0.3.0  
[13] lazyeval_0.2.2   digest_0.6.25    bit64_0.9-7      assertthat_0.2.1
[17] tibble_3.0.0     lifecycle_0.2.0  crayon_1.3.4     purrr_0.3.3     
[21] vctrs_0.2.4      memoise_1.1.0    glue_1.4.0       compiler_3.6.3  
[25] pillar_1.4.3     pkgconfig_2.0.3 
LiNk-NY commented 4 years ago

Thank you for the bug report. Tagging @Liubuntu

Liubuntu commented 4 years ago

HI @jorainer

This error is actually due to the inability of mutate.tbl_lazy* defined in dbplyr in adding new columns with fixed-length values. The SQLDataFrame is internally using dbplyr in lazily representing the database tables. The @tblData slot is of class tbl_lazy, and the mutate.SQLDataFrame directly calls the mutate.tbl_lazy for processing.

I can recreate your error. But the mutate.tbl_lazy actually works when the value is of length 1, and operations of existing columns. Sorry that the function documentation is not clearly stating this situation, I'll modify it (updated in 1.1.1).

> tblData(sdf)
# Source:   table<test_table> [?? x 3]                                                                                                                                                                      
# Database: sqlite 3.30.1                                                                                                                                                                                   
#   [/var/folders/7t/9l4kkf_j2sqbpn321y9g5558z96ck_/T//RtmpkrJIF9/file162c922eebc11]                                                                                                                        
   pkey letters numbers
  <int> <chr>     <dbl>
1     1 a        -0.941
2     2 b         1.76
3     3 c         1.58
4     4 d        -1.28
5     5 e        -2.77

> sdf_2 <- mutate(sdf, new_col = c("b", "a", "b", "r", "g"))
> sdf_2
SQLDataFrame with 5 rows and 3 columns
Error: row value misused

library(dplyr)
tibble(dframe) %>% mutate(new_col = c("b", "a", "b", "r", "g"))
# A tibble: 5 x 2                                                                                                                                                                                           
  dframe$pkey $letters $numbers new_col
        <int> <chr>       <dbl> <chr>
1           1 a          -0.941 b
2           2 b           1.76  a
3           3 c           1.58  b
4           4 d          -1.28  r
5           5 e          -2.77  g

> library(dbplyr)
> dat <- tbl(con, "test_table")
> dat %>% mutate(new_col = c("b", "a", "b", "r", "g"))
Error: row value misused
> dat %>% mutate(new_col = "a")
# Source:   lazy query [?? x 4]                                                                                                                                                                             
# Database: sqlite 3.30.1                                                                                                                                                                                   
#   [/var/folders/7t/9l4kkf_j2sqbpn321y9g5558z96ck_/T//RtmpkrJIF9/file162c922eebc11]                                                                                                                        
   pkey letters numbers new_col
  <int> <chr>     <dbl> <chr>
1     1 a        -0.941 a
2     2 b         1.76  a
3     3 c         1.58  a
4     4 d        -1.28  a
5     5 e        -2.77  a
> dat %>% mutate(new_col = numbers)
# Source:   lazy query [?? x 4]                                                                                                                                                                             
# Database: sqlite 3.30.1                                                                                                                                                                                   
#   [/var/folders/7t/9l4kkf_j2sqbpn321y9g5558z96ck_/T//RtmpkrJIF9/file162c922eebc11]                                                                                                                        
   pkey letters numbers new_col
  <int> <chr>     <dbl>   <dbl>
1     1 a        -0.941  -0.941
2     2 b         1.76    1.76
3     3 c         1.58    1.58
4     4 d        -1.28   -1.28
5     5 e        -2.77   -2.77
> dat %>% mutate(new_col = numbers > 0)
# Source:   lazy query [?? x 4]                                                                                                                                                                             
# Database: sqlite 3.30.1                                                                                                                                                                                   
#   [/var/folders/7t/9l4kkf_j2sqbpn321y9g5558z96ck_/T//RtmpkrJIF9/file162c922eebc11]                                                                                                                        
   pkey letters numbers new_col
  <int> <chr>     <dbl>   <int>
1     1 a        -0.941       0
2     2 b         1.76        1
3     3 c         1.58        1
4     4 d        -1.28        0
5     5 e        -2.77        0
> sdf_2 %>% mutate(new_col = numbers > 0)
SQLDataFrame with 5 rows and 3 columns
      pkey |     letters    numbers   new_col
 <integer> | <character>  <numeric> <integer>
         1 |           a -0.9409161         0
         2 |           b  1.7561005         1
         3 |           c  1.5790539         1
         4 |           d -1.2770926         0
         5 |           e -2.7703379         0
Liubuntu commented 4 years ago

Also I want to mention that there is a lighter version of SQLDataFrame to use and is currently saved as the "reimplement" branch of this repo. I am planning to submit this after the new release to replace the current SQLDataFrame package. https://github.com/Bioconductor/SQLDataFrame/tree/reimplement

If you are using SQLDataFrame for simply representing SQLite tables (or Google BigQuery, MySQL), I would recommend this lighter version. If your work involves cross-database operations, i.e., joining two MySQL tables with different user credentials, you can use the current version. After submitting the new package, I might rename the current version as SQLDataFrameComplex.

jorainer commented 4 years ago

Thanks for your explanations @Liubuntu !

What I was actually looking for is a way to add additional columns (with arbitrary values) to a (SQL)DataFrame or to replace existing columns. Will this be possible with the new implementation?

Liubuntu commented 4 years ago

Hello,

The current and reimplemented versions don't support adding or replacing a column with arbitrary values right now, since they are basically rely on dbplyr for the existing functions. However, I think it is possible to add this feature through some SQL procedures which was used somehow in the current version, but unfortunately I won't be able to investigate further very soon.

Best, Qian

On Fri, Apr 17, 2020, 7:43 AM Johannes Rainer notifications@github.com wrote:

Thanks for your explanations @Liubuntu https://github.com/Liubuntu !

What I was actually looking for is a way to add additional columns (with arbitrary values) to a (SQL)DataFrame or to replace existing columns. Will this be possible with the new implementation?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Bioconductor/SQLDataFrame/issues/3#issuecomment-615199496, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADAYVSIJLTP5KIUDAS7AFADRNA6FPANCNFSM4MII2CHA .

jorainer commented 4 years ago

Thanks for the feedback!