cloudyr / googleCloudStorageR

Google Cloud Storage API to R
https://code.markedmondson.me/googleCloudStorageR

Resumable upload issues #114

Open ben519 opened 4 years ago

ben519 commented 4 years ago

I keep running into errors when attempting to upload some .rds datasets to Google Cloud Storage. For example, here's one step of a big data pipeline:

gcs_upload(file = "data/processed/events.rds", bucket = "my-bucket")

2019-12-18 16:04:03 -- File size detected as 19.7 Mb
2019-12-18 16:04:03 -- Found resumeable upload URL: https://www.googleapis.com/upload/storage/v1/b/my-bucket/o/?uploadType=resumable&name=data%2Fprocessed%2Fevents.rds&predefinedAcl=private&upload_id=EAnB2UqydhvneCC4ius3M0mRep13I9p_CAu50iFmqDJenPcsAORi23utVi9jIpKX_uL6DWT4OIWjNLyV97E13Yi4m2Kq0w
2019-12-18 16:10:18 -- File upload failed, trying to resume...
2019-12-18 16:10:18 -- Retry 3 of 3
Error in gcs_retry_upload(upload_url = upload_url, file = temp, type = type) : 
  Must supply either retry_object or all of upload_url, file and type
In addition: Warning messages:
1: No JSON content detected 
2: In doHttrRequest(req_url, shiny_access_token = shiny_access_token,  :
  API checks failed, returning request without JSON parsing

Oddly, running this a second time without changing anything works (albeit with warnings):

gcs_upload(file = "data/processed/events.rds", bucket = "my-bucket")

2019-12-18 16:16:07 -- File size detected as 19.7 Mb
2019-12-18 16:16:08 -- Found resumeable upload URL: https://www.googleapis.com/upload/storage/v1/b/my-bucket/o/?uploadType=resumable&name=data%2Fprocessed%2Fevents.rds&predefinedAcl=private&upload_id=EAnB2UqydhvneC4ius3M0mRep13I9p_CAu50iFqDJenPcsAORi23utVi9jIpKX_uL6WT4OIWjNLyV97E13Yi4m2Kq0w
Warning messages:
1: No JSON content detected 
2: In doHttrRequest(req_url, shiny_access_token = shiny_access_token,  :
  API checks failed, returning request without JSON parsing
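
From the docs, I'd expect a failed resumable upload to hand back a gcs_upload_retry object that can be passed to gcs_retry_upload(), rather than erroring out. A minimal sketch of that pattern, assuming that's how it's supposed to behave (it never gets that far for me because of the error above):

upload <- gcs_upload(file = "data/processed/events.rds", bucket = "my-bucket")

if (inherits(upload, "gcs_upload_retry")) {
  # resume from where the interrupted upload left off
  upload <- gcs_retry_upload(retry_object = upload)
}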

Another example:

gcs_upload(file = "data/processed/eventperformers.rds", bucket = "my-bucket")

2019-12-18 16:21:17 -- File size detected as 2.4 Mb
2019-12-18 16:27:20> Request Status Code: 408
Error : lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^

<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 408 (Request Timeout)!!1</title>
  <style>…</style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>408.</b> <ins>That’s an error.</ins>
  <p>Your client has taken too long to issue its request.  <ins>That’s all we know.</ins>
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^

In this case, the upload actually succeeds even though the function errors (although it took about 10 minutes to upload this 2.4 Mb file).
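
In the meantime, a minimal check to confirm the object really did land despite the error. This assumes, as the resumable-upload URLs above suggest, that the object name defaults to the file path, and that gcs_list_objects() returns a data.frame with a name column:

try(gcs_upload(file = "data/processed/eventperformers.rds", bucket = "my-bucket"))

# verify the object exists in the bucket regardless of the error
objs <- gcs_list_objects(bucket = "my-bucket")
"data/processed/eventperformers.rds" %in% objs$name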

Am I using gcs_upload() properly? Any advice on how I can make it run more smoothly? Note that I'm using googleCloudStorageR v0.5.1. Much appreciated!

MarkEdmondson1234 commented 4 years ago

It looks like you are using it fine; it's just that this function is hard for me to test, since it only kicks in when intermittent errors affect your upload. It has an auto-retry method to help with larger uploads in the TB range, so it's odd you are running into issues at much smaller sizes. Perhaps you have a weak connection or a proxy or something that means you see it more often? 10 minutes to upload 2.4 MB is very long; it takes me seconds to upload similar sizes on around a 20 Mbit internet connection, which is what you'd expect. I guess there is something special about your connection?

ben519 commented 4 years ago

Thanks for the prompt reply. I figured it'd be a hard one to resolve since you can't exactly debug it, but I wanted to log the issue in case others are experiencing the same trouble.

AFAIK there's nothing special about my connection. I've experienced the same issues on many different networks. Perhaps it has something to do with the files. I'll keep tinkering with this. Thanks.

AndrewMarritt commented 4 years ago

Hi Mark,

I'm having a similar issue with gcs_upload:

upload_try <- gcs_upload(file = parsed_download, name = "ft_bucket/trainTestUpload2.csv")

2020-03-25 10:52:35 -- File size detected as 64.6 Mb
2020-03-25 10:52:35 -- Found resumeable upload URL: https://www.googleapis.com/upload/storage/v1/b/my-bucket/o/?uploadType=resumable&name=ft_bucket%2FtrainTestUpload2.csv&predefinedAcl=private&upload_id=AEnB2UqidgMfd84OiQNPmjEaddPFJFIP_OBvMoA2-lneWdmAd4T3z9LydRDHNBuWgdUoFKCrJVepXBlnwTPwcgpDwO2sO7zEnw
Warning messages:
1: No JSON content detected 
2: In doHttrRequest(req_url, shiny_access_token = shiny_access_token,  :
  API checks failed, returning request without JSON parsing

This uploads and I can see the file via the web interface. If I download it that way, I get a CSV which I can read into R.

However, if I try to use gcs_get_object...

parsed_download2 <- gcs_get_object("ft_bucket/trainTestUpload2.csv")

Downloaded ft_bucket%2FtrainTestUpload2.csv
Object parsed to class: raw

I'm unable to use gcs_parse_download() on this object.
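
In case it helps, two workarounds that should sidestep the parsing step. This is a sketch assuming gcs_get_object()'s saveToDisk argument behaves as documented, and that the returned object really is a plain raw vector as the message above says:

# 1. Skip parsing entirely and write straight to disk, then read normally
gcs_get_object("ft_bucket/trainTestUpload2.csv", saveToDisk = "trainTestUpload2.csv")
df <- read.csv("trainTestUpload2.csv")

# 2. Or convert the raw vector that came back into text and parse that
csv_text <- rawToChar(parsed_download2)
df <- read.csv(text = csv_text)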

aaelony-fb commented 7 months ago

I think trying various values of the upload limit (the file-size threshold above which gcs_upload() switches from a single simple upload to the resumable flow) will get one past this issue:

options(googleCloudStorageR.upload_limit = 1000000000L)

For the case of a large file whose upload needs to be resumed this makes sense, but my understanding is that sometimes the file is corrupted and you really just want it to not resume but to overwrite the upload instead.
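
For that overwrite case, a sketch of the bluntest approach I know: delete the (possibly corrupt) object with gcs_delete_object() and start the upload from scratch instead of resuming. The file name here is just illustrative:

# remove the partial/corrupt object, then re-upload cleanly
gcs_delete_object("data/processed/events.rds", bucket = "my-bucket")
gcs_upload(file = "data/processed/events.rds", bucket = "my-bucket")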

Happy to learn about a better way to resolve this as well.