DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/
Other
45 stars 5 forks source link

Sub-query fails with dplyr interface indicating "Only one sql statement is allowed" #190

Closed nicholsn closed 1 year ago

nicholsn commented 1 year ago

Issue Description

When following the example for how to submit a query using dplyr I get an error indicating that "Only one sql statement is allowed".

From the error statement it looks like dplyr or noctua is injecting a SELECT * FROM (...) to wrap around my query that includes a sub-query in the FROM clause.

Error: InvalidRequestException (HTTP 400). Only one sql statement is allowed. Got: SELECT *
FROM (
SELECT crf.variant_id,
    rsid,
    chrom,
    pos,
    ref,
    alt,
    gene,
    n_cases,
    n_controls,
    af,
    beta,
    se,
    pvalue
FROM reference.dbsnp,
    (
        SELECT REPLACE(
                REPLACE(
                    REPLACE(REPLACE(markerid, 'chr', ''), ':', '-'),
                    '_',
                    '-'
                ),
                '/',
                '-'
            ) AS variant_id,
            gene,
        n_cases,
        n_controls,
        af,
        beta,
        se,
        pvalue
        FROM genebass.chronic_renal_failure
    ) crf
WHERE dbsnp.variant_id = crf.variant_id;
) "q01"
WHERE (0 = 1)

Reproducible Example

I don't have a way of reproducing this since I think it is specific to Athena only being able to process one query at a time, but I can share the query.

library(DBI)
library(noctua)
library(dplyr)
library(tibble)

# Create a connection to Athena
con <- dbConnect(
  noctua::athena(), 
  s3_staging_dir = 's3://aws-athena-query-results-123456789-us-west-2/'
  )

query <- sql("SELECT crf.variant_id,
    rsid,
    chrom,
    pos,
    ref,
    alt,
    gene,
    n_cases,
    n_controls,
    af,
    beta,
    se,
    pvalue
FROM reference.dbsnp,
    (
        SELECT REPLACE(
                REPLACE(
                    REPLACE(REPLACE(markerid, 'chr', ''), ':', '-'),
                    '_',
                    '-'
                ),
                '/',
                '-'
            ) AS variant_id,
            gene,
        n_cases,
        n_controls,
        af,
        beta,
        se,
        pvalue
        FROM genebass.chronic_renal_failure
    ) crf
WHERE dbsnp.variant_id = crf.variant_id;")

tbl(con, query) %>% collect() -> df

The above throws the error however this works

res <- dbExecute(con, query)
df <- as_tibble(dbFetch(res)

Any thoughts on why Athena is interpreting this as two SELECT statements?

Session Info ```r devtools::session_info() R version 4.2.2 (2022-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: macOS Big Sur 11.6.1 Matrix products: default LAPACK: /Users/nnichols/micromamba/envs/mr/lib/libopenblasp-r0.3.21.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] tibble_3.1.8 tidyr_1.2.1 dplyr_1.0.10 noctua_2.6.0 [5] gwasvcf_0.1.0 VariantAnnotation_1.44.0 Rsamtools_2.14.0 Biostrings_2.66.0 [9] XVector_0.38.0 SummarizedExperiment_1.28.0 Biobase_2.58.0 GenomicRanges_1.50.0 [13] GenomeInfoDb_1.34.1 IRanges_2.32.0 S4Vectors_0.36.0 MatrixGenerics_1.10.0 [17] matrixStats_0.62.0 BiocGenerics_0.44.0 DBI_1.1.3 BiocManager_1.30.19 loaded via a namespace (and not attached): [1] bitops_1.0-7 bit64_4.0.5 filelock_1.0.2 progress_1.2.2 httr_1.4.4 [6] tools_4.2.2 utf8_1.2.2 R6_2.5.1 withr_2.5.0 tidyselect_1.2.0 [11] prettyunits_1.1.1 bit_4.0.4 curl_4.3.3 compiler_4.2.2 cli_3.4.1 [16] xml2_1.3.3 paws_0.1.12 DelayedArray_0.24.0 rtracklayer_1.58.0 readr_2.1.3 [21] paws.analytics_0.1.12 rappdirs_0.3.3 stringr_1.4.1 digest_0.6.30 pkgconfig_2.0.3 [26] dbplyr_2.2.1 fastmap_1.1.0 BSgenome_1.66.1 rlang_1.0.6 rstudioapi_0.14 [31] RSQLite_2.2.18 BiocIO_1.8.0 generics_0.1.3 jsonlite_1.8.3 BiocParallel_1.32.0 [36] RCurl_1.98-1.9 magrittr_2.0.3 GenomeInfoDbData_1.2.9 Matrix_1.5-3 Rcpp_1.0.9 [41] fansi_1.0.3 lifecycle_1.0.3 stringi_1.7.8 yaml_2.3.6 zlibbioc_1.44.0 [46] BiocFileCache_2.6.0 grid_4.2.2 blob_1.2.3 parallel_4.2.2 crayon_1.5.2 [51] lattice_0.20-45 GenomicFeatures_1.50.2 hms_1.1.2 KEGGREST_1.38.0 knitr_1.40 [56] pillar_1.8.1 uuid_1.1-0 rjson_0.2.21 codetools_0.2-18 paws.common_0.5.1 [61] biomaRt_2.54.0 XML_3.99-0.12 glue_1.6.2 data.table_1.14.4 png_0.1-7 [66] vctrs_0.5.0 tzdb_0.3.0 purrr_0.3.5 assertthat_0.2.1 cachem_1.0.6 [71] xfun_0.34 paws.storage_0.1.12 mime_0.12 restfulr_0.0.15 GenomicAlignments_1.34.0 [76] AnnotationDbi_1.60.0 memoise_2.0.1 ellipsis_0.3.2 ```
DyfanJones commented 1 year ago

Hi @nicholsn, interesting. Have you tried dplyr::show_query() to find out the query it is trying to send to athena?

nicholsn commented 1 year ago

Hey @DyfanJones, I did try to pipe it into show_query but the error is thrown even with just tbl(con, query) - no collect() needed.

DyfanJones commented 1 year ago

If it is ok, i will have a deeper look tomorrow :) i need to see if i can replicate this issue :) i am guessing it is erroring when it is trying to get the column names for the tbl class.

DyfanJones commented 1 year ago

OOO just a thought, I think i know what is going. For tbl, dplyr does an initial query to what ever database it is connected to. It does this by sub-querying the table so it will look something like:

-- example of sub-query
select var1, var2, var3 from blah.table

dplyr's wrapper

select * from (
select var1, var2, var3 from blah.table
) as sub_query
where 1 = 0

However in your query you have ; at the end of your query, so dplyr will do the following:

select * from (
select var1, var2, var3 from blah.table;
) as sub_query
where 1 = 0

As there a ; inside sub-query it will fail. So can you try without ;. I hope this helps to explain what is going on in the backend. Please let me know if this fixes your issue :)

DyfanJones commented 1 year ago

I can confirm this is the case. You need to remove ; from your sub-query. Here is an example demonstrating this.

library(DBI)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

con <- dbConnect(noctua::athena())
tbl(con, sql("select * from sampledb.elb_logs;"))
#> Error: InvalidRequestException (HTTP 400). Only one sql statement is allowed. Got: SELECT *
#> FROM (select * from sampledb.elb_logs;) "q01"
#> WHERE (0 = 1)
tbl(con, sql("select * from sampledb.elb_logs"))
#> INFO: (Data scanned: 0 Bytes)
#> INFO: (Data scanned: 2.52 MB)
#> # Source:   SQL [?? x 19]
#> # Database: Athena 0.1.12 [default@eu-west-1/default]
#>    request_tim…¹ elb_n…² reque…³ reque…⁴ backe…⁵ backe…⁶ reque…⁷ backe…⁸ clien…⁹
#>    <chr>         <chr>   <chr>     <int> <chr>     <int>   <dbl>   <dbl>   <dbl>
#>  1 2015-01-01T0… elb_de… 248.46…   31279 172.40…      80 3.26e-4 7.85e-4 6.91e-4
#>  2 2015-01-01T0… elb_de… 248.11…   22272 172.46…    8888 6.53e-4 9.53e-4 7.61e-4
#>  3 2015-01-01T0… elb_de… 253.19…   30529 172.30…     443 1.26e-4 8.1 e-4 3.52e-4
#>  4 2015-01-01T0… elb_de… 247.24…   13001 172.41…    8888 1.43e-3 1.39e-3 5.56e-4
#>  5 2015-01-01T0… elb_de… 248.21…    6024 172.52…    8888 9.31e-4 2.62e-4 6.53e-4
#>  6 2015-01-01T0… elb_de… 245.20…   16173 172.38…     443 1.1 e-3 1.99e-3 1.38e-3
#>  7 2015-01-01T0… elb_de… 253.13…   29168 172.33…      80 5.37e-4 1.67e-3 6.13e-4
#>  8 2015-01-01T0… elb_de… 250.16…   26741 172.49…    8888 1.52e-3 1.08e-3 9.06e-4
#>  9 2015-01-01T0… elb_de… 251.12…   16934 172.32…      80 2.12e-4 9.59e-4 9.69e-4
#> 10 2015-01-01T0… elb_de… 241.13…   18980 172.55…    8888 8.86e-4 9.97e-4 1.32e-3
#> # … with more rows, 10 more variables: elb_response_code <chr>,
#> #   backend_response_code <chr>, received_bytes <int64>, sent_bytes <int64>,
#> #   request_verb <chr>, url <chr>, protocol <chr>, user_agent <chr>,
#> #   ssl_cipher <chr>, ssl_protocol <chr>, and abbreviated variable names
#> #   ¹​request_timestamp, ²​elb_name, ³​request_ip, ⁴​request_port, ⁵​backend_ip,
#> #   ⁶​backend_port, ⁷​request_processing_time, ⁸​backend_processing_time,
#> #   ⁹​client_response_time

Created on 2022-12-15 with reprex v2.0.2

I hope this helps

nicholsn commented 1 year ago

That was it - nice catch! The error message from dplyr is accurate if you can spot that they're wrapping your query and making it a sub-query. Thank you for your help, @DyfanJones!