Closed OssiLehtinen closed 3 years ago
@OssiLehtinen thanks for finding this out :)
I will have remove the extra '/'
Happy holidays to you as well 😄
Actually we need to be smart with this. By simply removing the extra "/" this will cause issues with tables with similar names for example:
glue <- paws::glue()
s3 <- paws::s3()
all_keys <- list()
s3_path <- noctua:::split_s3_uri(glue$get_table(
DatabaseName = "default",
Name = "df_bigint")$Table$StorageDescriptor$Location)
objects <- s3$list_objects_v2(Bucket=s3_path$bucket, Prefix=s3_path$key)
all_keys <- lapply(objects$Contents, function(x) list(Key=x$Key))
all_keys
[[1]]
[[1]]$Key
[1] "default/df_bigint/46582453-e4db-4a2d-8787-4c1776b1c7b9.tsv"
[[2]]
[[2]]$Key
[1] "default/df_bigint_2.txt"
This find the correct table "df_bigint" however it also returns "df_bigint2.txt" which could be another table
Possible solution would be to detect if key ends with "/" and if it has "." within string 🤔
glue <- paws::glue()
s3 <- paws::s3()
all_keys <- list()
s3_path <- noctua:::split_s3_uri(glue$get_table(
DatabaseName = "default",
Name = "df_bigint")$Table$StorageDescriptor$Location)
# edit: fixed check for "/" and "."
if(!grepl("\\.|/$", s3_path$key))
s3_path$key <- sprintf("%s/", s3_path$key)
objects <- s3$list_objects_v2(Bucket=s3_path$bucket, Prefix=s3_path$key)
all_keys <- lapply(objects$Contents, function(x) list(Key=x$Key))
all_keys
[[1]]
[[1]]$Key
[1] "default/df_bigint/46582453-e4db-4a2d-8787-4c1776b1c7b9.tsv"
Fast reaction from you as always!
We were also speculating about those extraneous matches without the trailing '/' with a colleague.
In principle, that is a symptom of a misspecified table in glue. In such a case you describe Athena would try to read data from both "default/df_bigint/46582453-e4db-4a2d-8787-4c1776b1c7b9.tsv" and "default/df_bigint_2.txt", so in principle those are data from the same table and should be deleted. Of course this can set a user up for some surprises, but nevertheless, the real mistake has been made already earlier...
I'm not a 100% sure on that, we can test it anyway 😄 Lets upload another df_bigint
table but call it df_bigint_2
library(DBI)
s3.location <- Sys.getenv("noctua_s3_tbl")
con <- dbConnect(noctua::athena())
df2 <- data.frame(var1 = sample(letters, 10, replace = T),
var2 = bit64::as.integer64(1:10),
stringsAsFactors = F)
dbWriteTable(con, "df_bigint", df2, overwrite = T, s3.location = s3.location)
dbWriteTable(con, "df_bigint_2", df2, overwrite = T, s3.location = s3.location)
dbGetQuery(con, "select * from df_bigint")
# Info: (Data scanned: 51 Bytes)
# var1 var2
# 1: h 1
# 2: r 2
# 3: r 3
# 4: e 4
# 5: p 5
# 6: g 6
# 7: z 7
# 8: k 8
# 9: t 9
# 10: h 10
dbGetQuery(con, "select * from df_bigint_2")
# Info: (Data scanned: 51 Bytes)
# var1 var2
# 1: h 1
# 2: r 2
# 3: r 3
# 4: e 4
# 5: p 5
# 6: g 6
# 7: z 7
# 8: k 8
# 9: t 9
# 10: h 10
So in this example Athena has 2 tables within database "default", "df_bigint" and "df_bigint_2". It appears AWS Athena can happily read from each table without getting confused by anything.
glue <- paws::glue()
s3 <- paws::s3()
all_keys <- list()
s3_path <- noctua:::split_s3_uri(glue$get_table(
DatabaseName = "default",
Name = "df_bigint")$Table$StorageDescriptor$Location)
s3_path
# $key
# [1] "default/df_bigint"
objects <- s3$list_objects_v2(Bucket=s3_path$bucket, Prefix=s3_path$key)
all_keys1 <- lapply(objects$Contents, function(x) list(Key=x$Key))
all_keys1
# [[1]]
# [[1]]$Key
# [1] "default/df_bigint/d88d8a53-90fd-4d5c-91e6-962042bb392d.tsv"
#
# [[2]]
# [[2]]$Key
# [1] "default/df_bigint_2/8797d27f-3577-4d8b-8579-64416619d1d9.tsv"
When not adding the "/" at the end it looks like it brings back both "df_bigint" and "df_bigint_2" source files.
# check for "/" and "."
if(!grepl("\\.|/$", s3_path$key))
s3_path$key <- sprintf("%s/", s3_path$key)
objects <- s3$list_objects_v2(Bucket=s3_path$bucket, Prefix=s3_path$key)
all_keys2 <- lapply(objects$Contents, function(x) list(Key=x$Key))
all_keys2
# [[1]]
# [[1]]$Key
# [1] "default/df_bigint/d88d8a53-90fd-4d5c-91e6-962042bb392d.tsv"
When we add in the checking method then s3 returns the correct source file 😄
Update check mechanic:
check <- list("check1/", "check2", "check.3", "/check4", "check.5/")
for(str in check){
if(!grepl("\\.|/$", str))
print(sprintf("%s/", str))
else(print(str))
}
# [1] "check1/"
# [1] "check2/"
# [1] "check.3"
# [1] "/check4/"
# [1] "check.5/"
Did a small check list to see if new check mechanic would work. It seems fairly reliable. Please feel free to critique 😄
You are probably right, but I'm a bit surprised. Does the following command show a trailing '/' or not for your example table? conn@ptr$glue$get_table(DatabaseName = "YOUR_DATABASE_HERE", Name = "df_bigint")$Table$StorageDescriptor$Location
Aside from that, probably just my inexperience with regex, but would a path with dots earlier cause problems there? Something like "filesfrom_11.03.20/datafiles".
Yeah my example doesn't have a "/" at the end of the key when calling glue:
# $key
# [1] "default/df_bigint"
AWS must do a transform in the background. From noctua
's side, we create the s3 locations following best practises from here: https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html
When you specify the LOCATION in the CREATE TABLE statement, use the following guidelines: Use a trailing slash. Use:
s3://bucketname/folder/
Do not use any of the following items for specifying the LOCATION for your data.
- Do not use filenames, underscores, wildcards, or glob patterns for specifying file locations.
- Do not add the full HTTP notation, such as s3.amazon.com to the Amazon S3 bucket path.
- Do not specify an Amazon S3 access point in the LOCATION clause. The table location can only be specified as a URI.
- Do not use empty folders like // in the path, as follows: S3://bucketname/folder//folder/. While this is a valid Amazon S3 path, Athena does not allow it and changes it to s3://bucketname/folder/folder/, removing the extra /.
Do not use:
s3://path_to_bucket s3://path_to_bucket/* s3://path_to_bucket/mySpecialFile.dat s3://bucketname/prefix/filename.csv s3://test-bucket.s3.amazon.com S3://bucket/prefix//prefix/ arn:aws:s3:::bucketname/prefix s3://arn:aws:s3:<region>:<account_id>:accesspoint/<accesspointname> https://<accesspointname>-<number>.s3-accesspoint.<region>.amazonaws.com
Yes a file path like filesfrom_11.03.20/datafiles
would cause issues. The current proposed check wouldn't do anything and just send it to paws::s3()$list_objects_v2
. list_objects_v2
will then return everything from filesfrom_11.03.20/datafiles/
including if there is something like this filesfrom_11.03.20/datafiles_something_else/
However as "." seem to part of the do not use section we could just say we don't support "." for AWS Athena table locations.
Ah the best practices is a good guideline and making a disclaimer about dots in paths sounds good too.
closing this issue due to merging PR #126
Hi Dyfan, bumped into another bug.
If a glue table definition has a trailing '/' in the s3 path, dbRemoveTable doesn't work. (note, such a table definition is allowed in glue)
I think the reason is on this line and the pasting of the extra '/' at the end of the s3_path$key: https://github.com/DyfanJones/noctua/blob/e3ce929e58d475f4f8953ef0d360015db10d2aa7/R/Connection.R#L608
For example, if we have a path like "s3://mybucket/mypath/mysubpath/", this will produce a key "s3://mybucket/mypath/mysubpath//" and things will break down.
Also, if a table definition points directly a single file, as in, "s3://mybucket/mypath/mysubpath/myfile.csv", dbRemoveTable will not work either
As a fix, I would propose omitting the extra '/'.
All the best and happy holidays, Ossi