DyfanJones / RAthena

Connect R to Athena using Boto3 SDK (DBI Interface)
https://dyfanjones.github.io/RAthena/

Possibly better formatting for the returned partitions of an AWS Athena table #129

Closed DyfanJones closed 3 years ago

DyfanJones commented 3 years ago

Currently RAthena has a function dbGetPartition which returns a data.table in the default AWS Athena format:

library(DBI)
library(RAthena)
library(data.table)

con <- dbConnect(athena())

test_df2_partitions = dbGetPartition(con, "test_df2")

#                   partition
# 1: year=2020/month=11/day=17

This format isn't too bad, as it simply returns the partitions in the format Athena provides. Would it be useful to reformat it into one column per partition key, as follows?

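get_partitions = function(dt){
  # Split "year=2020/month=11/day=17" into one "key=value" column per partition level
  dt = dt[, tstrsplit(partition, split = "/", fixed = TRUE)]
  # Take the partition names (the keys) from the first row
  partitions = sapply(names(dt), function(x) strsplit(dt[[x]][1], split = "=", fixed = TRUE)[[1]][1])
  # Keep only the values in each column
  for (col in names(dt)) set(dt, j = col, value = tstrsplit(dt[[col]], split = "=", fixed = TRUE)[[2]])
  setnames(dt, old = names(dt), new = partitions)
  return(dt)
}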

get_partitions(test_df2_partitions)

#   year month day
# 1: 2020    11  17

The problem is that dbGetPartition has been in the package for some time, and changing its output now could break code that users have built on it.
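
To make the compatibility concern concrete, here is a minimal sketch of how downstream code could break. The helper `rows_for_2020` and both example tables are hypothetical and built by hand, so no Athena connection is needed; they only mimic the two output shapes shown above.

```r
library(data.table)

# Legacy-shaped result: a single `partition` column, as dbGetPartition returns today.
legacy <- data.table(
  partition = c("year=2020/month=11/day=17", "year=2019/month=12/day=01")
)

# Hypothetical user code written against the legacy shape.
rows_for_2020 <- function(parts) {
  parts[grepl("year=2020", partition, fixed = TRUE)]
}

rows_for_2020(legacy)  # works: the `partition` column exists

# Reformatted shape: one column per partition key, no `partition` column.
reformatted <- data.table(
  year = c("2020", "2019"), month = c("11", "12"), day = c("17", "01")
)

# The same call now fails because `partition` can no longer be found;
# tryCatch captures the error message instead of stopping the script.
res <- tryCatch(
  rows_for_2020(reformatted),
  error = function(e) conditionMessage(e)
)
print(res)
```

Any script that filters or parses the `partition` column would hit this kind of error if the default output changed.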

DyfanJones commented 3 years ago

A possible solution is to add a formatting parameter to dbGetPartition:

https://github.com/DyfanJones/RAthena/blob/07d3c8fd990a3b32e4863dfbba3b428fdb633317/R/Connection.R#L768-L797

This way the previous behaviour is maintained; a user who wants the new behaviour can opt in as follows.

library(DBI)
library(data.table)
library(RAthena)

con <- dbConnect(athena())

dbGetPartition(con, "test_df2", .format = TRUE)

# Info: (Data scanned: 0 Bytes)
#    year month day
# 1: 2020    11  17

dbGetPartition(con, "test_df2")

# Info: (Data scanned: 0 Bytes)
#                    partition
# 1: year=2020/month=11/day=17
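
The opt-in pattern can be sketched without an Athena connection. This is not the code from the linked commit: `format_partitions` and `get_partition_sketch` are hypothetical names, and `raw` is a hand-built stand-in for the table Athena returns. Only the `.format` argument name is taken from the example above.

```r
library(data.table)

# Hypothetical helper: turn "key=value/key=value" strings into one column per key.
format_partitions <- function(dt) {
  # One "key=value" column per partition level
  parts <- dt[, tstrsplit(partition, split = "/", fixed = TRUE)]
  # Column names come from the keys in the first row
  keys <- vapply(
    parts,
    function(x) strsplit(x[1], split = "=", fixed = TRUE)[[1]][1],
    character(1),
    USE.NAMES = FALSE
  )
  # Keep only the value part of each "key=value" entry
  for (col in names(parts)) {
    set(parts, j = col, value = tstrsplit(parts[[col]], split = "=", fixed = TRUE)[[2]])
  }
  setnames(parts, keys)
  parts
}

# Hypothetical stand-in for dbGetPartition(): the flag defaults to FALSE,
# so existing callers keep receiving the legacy single-column output.
get_partition_sketch <- function(raw, .format = FALSE) {
  if (!.format) return(raw)
  format_partitions(raw)
}

raw <- data.table(
  partition = c("year=2020/month=11/day=17", "year=2020/month=11/day=18")
)

get_partition_sketch(raw)                  # legacy shape
get_partition_sketch(raw, .format = TRUE)  # one column per partition key
```

Defaulting the flag to FALSE is what keeps the change backwards compatible: nothing changes for callers who never pass it. Note that in this sketch the reformatted values stay as character strings.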

DyfanJones commented 3 years ago

PR #1330 adds the new format.