DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/
Other
45 stars 5 forks source link

Unload option returns `null` results when `s3_staging_dir` is a bucket only #214

Closed dfsnow closed 2 months ago

dfsnow commented 3 months ago

Issue

Using dbGetQuery() with unload = TRUE returns a 0-row data frame when s3_staging_dir is set to a bucket with no key prefix e.g. s3://temp-bucket instead of s3://temp-bucket/results. This is due to a couple separate issues:

1.

The helper function split_s3_uri returns the bucket name when the s3_staging_dir is a bucket only and has no trailing slash. E.g.:

s3_bucket <- "s3://temp-bucket"
split_s3_uri(s3_bucket)
#> $bucket
#> [1] "temp-bucket"
#> 
#> $key
#> [1] "temp-bucket"

This results in the following key value construction in the .fetch_unload helper:

s3_bucket <- "s3://temp-bucket"
result_info <- split_s3_uri(s3_bucket)
result_info[["key"]] <- file.path(gsub("/$", "", result_info[["key"]]), res@info[["UnloadDir"]])

result_info
#> $bucket
#> [1] "temp-bucket"
#> 
#> $key
#> [1] "temp-bucket/41ed5dfd-1f1a-41c7-9793-fa939bc369de"

Which is passed to list_objects_v2 and results in a null set of keys, even if they actually exist.

2.

Similarly, the split_s3_uri helper returns the following when the staging directory has a trailing slash:

s3_bucket <- "s3://temp-bucket/"
split_s3_uri(s3_bucket)
#> $bucket
#> [1] "temp-bucket"
#> 
#> $key
#> [1] ""

Which results in the following bucket and key combo inside .fetch_unload:

s3_bucket <- "s3://temp-bucket/"
result_info <- split_s3_uri(s3_bucket)
result_info[["key"]] <- file.path(gsub("/$", "", result_info[["key"]]), res@info[["UnloadDir"]])

result_info
#> $bucket
#> [1] "temp-bucket"
#> 
#> $key
#> [1] "/41ed5dfd-1f1a-41c7-9793-fa939bc369de"

The prefixed forward slash in the key again results in null results when passed to list_objects_v2 inside .fetch_unload.

DyfanJones commented 3 months ago

Thanks for identifying the issue :)