DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/

Throttling Exception when trying to pull 62GBs of data from AWS Athena #58

Closed ricardojcc closed 4 years ago

ricardojcc commented 4 years ago

Issue Description

Hi,

I have been trying to connect to Athena from SageMaker. I have been successful with smaller queries, but with slightly bigger data, when I run the following code:

install.packages('noctua')
install.packages('DBI')

con <- DBI::dbConnect(noctua::athena(),
                      s3_staging_dir = "s3://aws-athena...",
                      region_name = 'us-west-2')

res <- DBI::dbExecute(con, 'SELECT * FROM db.table LIMIT 10000000')
df <- DBI::dbFetch(res)

I get the following error in SageMaker:

Error: ThrottlingException: Rate exceeded
Traceback:

At first I thought that I needed more RAM, so I changed my instance to one with 64 GB, since the previous one had 16 GB. But the error still persists.

I am fairly new to AWS and its intricacies. While looking for a way to overcome this issue I stumbled upon the following article:

https://aws.amazon.com/blogs/messaging-and-targeting/how-to-handle-a-throttling-maximum-sending-rate-exceeded-error/

Looking at the article, I guess my question would be: is there a way to use some sort of rate-limiting function, and is that supported by paws?
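
(For illustration only: the pattern the linked article describes is client-side retry with exponential backoff. A minimal R sketch of such a wrapper is below; the retry_with_backoff helper is hypothetical and not part of noctua or paws.)

# Hypothetical retry-with-backoff wrapper, illustrative only (not a noctua/paws API)
retry_with_backoff <- function(fn, max_attempts = 5, base_wait = 1) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    # only retry on throttling errors, re-raise anything else
    if (!grepl("ThrottlingException|Rate exceeded", conditionMessage(result))) stop(result)
    Sys.sleep(base_wait * 2^(attempt - 1))  # exponential backoff: 1s, 2s, 4s, ...
  }
  stop("Still throttled after ", max_attempts, " attempts")
}

# usage: pass the query as a zero-argument function
res <- retry_with_backoff(function() DBI::dbExecute(con, 'SELECT * FROM db.table LIMIT 10000000'))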

P.S. I hope I wrote this in the correct format and place.

Any help is greatly appreciated,

Ricardo

DyfanJones commented 4 years ago

Hi Ricardo,

First off thanks for raising this issue, and I am more than happy to help and rectify this.

If it is ok with you, can you do two things for me?

1) I want to rule out whether it is a paws SDK issue: can you try the same query with RAthena and let me know if it can pull that data back for you? (A minimal connection sketch is at the end of this comment.)

2) I want to check if it is a SageMaker issue: can you return something a bit more sizeable using the noctua package, say 16 GB of data, in your SageMaker instance?

In the meantime I will spin up a SageMaker instance and upload a large data frame to do some testing and identify the issue.
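
For step 1, a minimal sketch of the equivalent RAthena connection (assuming the same staging directory and region as the noctua connection above):

install.packages('RAthena')
library(DBI)

con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = "s3://aws-athena...",
                 region_name = 'us-west-2')

df <- dbGetQuery(con, 'SELECT * FROM db.table LIMIT 10000000')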

DyfanJones commented 4 years ago

In the meantime, I believe this could be an issue with the poll rate. I will do some reading around the AWS "Throttling – Maximum sending rate exceeded" error.

DyfanJones commented 4 years ago

Ricardo,

Can you try the following query:

install.packages('noctua')

library(DBI)

con <- dbConnect(noctua::athena(), 
   s3_staging_dir = "s3://aws-athena...",
   region_name = 'us-west-2',
   poll_interval = 1) # this will increase the poll interval to 1 second

df <- dbGetQuery(con, 'SELECT * FROM db.table LIMIT 10000000')

ricardojcc commented 4 years ago

Hi Dyfan,

I tried with a sizeable amount of data, around 20 GB to start off, using noctua with poll_interval = 1, and got the exception. I went back and did the same with RAthena, without the poll_interval argument, and RAthena returned the full query result. I also tried a slightly bigger data set, since this is just a project I'm working on at my job, and RAthena is doing the job.

DyfanJones commented 4 years ago

Thanks, I will investigate why there is an issue with noctua and will contact the paws developers to see if they know what could be the cause. In the meantime feel free to use RAthena if it is working for you :)

DyfanJones commented 4 years ago

Can you please share the connection information with me using:

dbGetInfo(con)

I will try and replicate your issue

ricardojcc commented 4 years ago

This is the info,

$profile_name 'default'
$s3_staging 's3://aws-athena-...-us-west-2/'
$dbms.name 'default'
$work_group NULL
$poll_interval 1
$encryption_option NULL
$kms_key NULL
$expiration NULL
$region_name 'us-west-2'
$paws '0.1.6'
$noctua '1.5.0'

This is what the function returns. I know it does not really return useful info. I can't really disclose my staging dir because of the nature of the data; I apologise for that. Nonetheless, I am more than open to help wherever I can.

DyfanJones commented 4 years ago

Don't worry about the specific S3 bucket, I can utilise my own for testing. I am just checking the paws version and whether any system variables are being used in the connection parameters. I believe it could be a polling issue and increasing the interval might resolve it, however I need to do more research around this error to double check. Plus, there may be something in the way paws is calling AWS that needs to take something else into account.

DyfanJones commented 4 years ago

Ricardo,

Can you let me know at what point the error happens? Does it happen on dbExecute or dbFetch?

If possible, can you split these two calls and execute them independently?
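
For clarity, a minimal sketch of the split, mirroring the code from the original report (based on this thread, the polling against Athena happens in dbExecute, while dbFetch downloads the result):

res <- DBI::dbExecute(con, 'SELECT * FROM db.table LIMIT 10000000')  # submit the query and wait for Athena
df <- DBI::dbFetch(res)                                              # retrieve the result set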

ricardojcc commented 4 years ago

Thanks Dyfan. The error occurs on dbExecute. RAthena is working great!

DyfanJones commented 4 years ago

Perfect, that means it is a polling issue
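
(Editorial note: a simplified illustration of the status-polling pattern, assuming a paws Athena client such as paws::athena(). This is not noctua's actual implementation, but it shows why a very short poll interval can exceed the GetQueryExecution rate limit.)

# simplified sketch of polling a query's status until it finishes
wait_for_query <- function(client, execution_id, poll_interval = 1) {
  repeat {
    status <- client$get_query_execution(QueryExecutionId = execution_id)
    state <- status$QueryExecution$Status$State
    if (state %in% c("SUCCEEDED", "FAILED", "CANCELLED")) return(state)
    Sys.sleep(poll_interval)  # a longer interval means fewer API calls, so less throttling
  }
}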

DyfanJones commented 4 years ago

I am not getting the AWS error ThrottlingException: Rate exceeded when I test with NYC taxi data 2018: 01-06, data size 53,925,735 x 11 (7 GB). However, I did get an R error with writeBin.

This is due to: Only 2^31 - 1 bytes can be written in a single call (and that is the maximum capacity of a raw vector on 32-bit platforms).

Will modify dbFetch to read the binary stream in chunks and test whether this fixes the issue: https://stackoverflow.com/questions/50297237/raw-binary-data-too-big-to-write-to-disk-how-to-write-chunk-wise-to-disk-app
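
A sketch of the chunk-wise writeBin workaround described in the linked Stack Overflow answer (illustrative only, not the exact change made in noctua): writeBin() can write at most 2^31 - 1 bytes in a single call, so the raw vector is split and written piece by piece.

# hypothetical helper: write a large raw vector to disk in chunks below the 2^31 - 1 byte limit
write_bin_chunked <- function(raw_vec, path, chunk_size = 1e9) {
  con <- file(path, "wb")
  on.exit(close(con))
  starts <- seq(1, length(raw_vec), by = chunk_size)
  for (s in starts) {
    end <- min(s + chunk_size - 1, length(raw_vec))
    writeBin(raw_vec[s:end], con)  # each call stays well under the single-call limit
  }
  invisible(path)
}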

DyfanJones commented 4 years ago

Will increase the data size to test whether the new implementation resolves the issue.

DyfanJones commented 4 years ago

@ricardojcc does PR #62 fix your issue? The backend has been updated. If you still get the issue, please let me know.

DyfanJones commented 4 years ago

Used 2018 yellow taxi data from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

data size: ~14.8 Gb
nrows: 102,804,250
ncolumns: 17

Used gzip compression to reduce cost in this test.

library(data.table)
library(DBI)

con <- dbConnect(noctua::athena())

FILES <- list.files("nyc_taxi_2018", full.names = T)
nyc_taxi_2018 <- rbindlist(lapply(FILES, fread))

dim(nyc_taxi_2018)
# 102,804,250        17

format(object.size(nyc_taxi_2018),units = "auto")
"14.8 Gb"

# write data to Athena using gzip compression
dbWriteTable(con, "nyc_taxi_2018", nyc_taxi_2018, overwrite = T, compress = T)

nyc_taxi_2018 <- dbGetQuery(con, "select * from nyc_taxi_2018")
# Info: (Data scanned: 1.68 GB)

dim(nyc_taxi_2018)
# 102,804,250        17

format(object.size(nyc_taxi_2018), units = "auto")
# "14.8 Gb"

It is reassuring to see that the compression has reduced the data scanned from 14.8 Gb (the in-memory size) to 1.68 GB. @ricardojcc can you try the same data set to see if it is an account issue?

I am going to close this ticket as it looks like the upcoming package version resolves this issue. If the problem reappears, please re-open this ticket or raise a new one.

ricardojcc commented 4 years ago

Hey @DyfanJones I apologize for the late reply. Yeah everything's working like a charm. Thanks man!

DyfanJones commented 4 years ago

These changes have now been pushed to CRAN.