basak / glacier-cli

Command-line interface to Amazon Glacier

Exception ends upload of very large file #8

Open · khagler opened this issue 12 years ago

khagler commented 12 years ago

Attempts to upload a very large (288.64 GB) file run for about an hour or two, then fail with the following output:

```
frontier:glacier-cli khagler$ ./glacier.py archive upload Photos ~/Documents/photos.tgz
Traceback (most recent call last):
  File "./glacier.py", line 618, in <module>
    App().main()
  File "./glacier.py", line 604, in main
    args.func(args)
  File "./glacier.py", line 416, in archive_upload
    archive_id = vault.create_archive_from_file(file_obj=args.file, description=name)
  File "/Users/khagler/glacier/glacier-cli/boto/glacier/vault.py", line 141, in create_archive_from_file
    writer.write(data)
  File "/Users/khagler/glacier/glacier-cli/boto/glacier/writer.py", line 152, in write
    self.send_part()
  File "/Users/khagler/glacier/glacier-cli/boto/glacier/writer.py", line 141, in send_part
    content_range, part)
  File "/Users/khagler/glacier/glacier-cli/boto/glacier/layer1.py", line 626, in upload_part
    response_headers=response_headers)
  File "/Users/khagler/glacier/glacier-cli/boto/glacier/layer1.py", line 83, in make_request
    data=data)
  File "/Users/khagler/glacier/glacier-cli/boto/connection.py", line 913, in make_request
    return self._mexe(http_request, sender, override_num_retries)
  File "/Users/khagler/glacier/glacier-cli/boto/connection.py", line 859, in _mexe
    raise e
socket.gaierror: [Errno 8] nodename nor servname provided, or not known
```

OS: Mac OS X 10.7.5 Server; Python 2.7.1

I tried uploading the same file using FastGlacier with the part size set to 1 GB. It would upload some of each part before failing with a message about the remote host dropping the connection. After setting the part size to 256 MB, it was able to upload individual parts successfully.

Addendum:

After a bit more investigation, I think I've figured out what might be going on. According to the Amazon documentation, the maximum number of parts for a multi-part upload is 10,000. For this (very large) archive to be split evenly into 10,000 parts, each part would have to be about 27.5 MB--or, given the limits on allowable part sizes, 32 MB. It looks like you're using a default part size (which I didn't realize at the time I could change) of 8 MB. If I'm right about that, then an 80 GB file would be a (marginally less painful) valid test.
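
To make the arithmetic concrete, here's a minimal sketch, assuming Glacier's documented multipart limits (at most 10,000 parts per upload, with part sizes of 1 MiB times a power of two, up to 4 GiB); `choose_part_size` is a hypothetical helper, not anything in boto or glacier-cli:

```python
# Sketch of the arithmetic above. Assumes Glacier's documented limits:
# at most 10,000 parts per upload, part size = 1 MiB * a power of two.
# choose_part_size is a hypothetical helper, not boto or glacier-cli code.
MiB = 1024 * 1024
MAX_PARTS = 10000

def choose_part_size(archive_size, minimum=4 * MiB):
    """Smallest allowed part size that fits archive_size into 10,000 parts."""
    part_size = minimum
    while archive_size > part_size * MAX_PARTS:
        part_size *= 2
    return part_size

# 288.64 GB / 10,000 parts is ~27.5 MiB per part, so the next allowed
# size is 32 MiB; the 4 MiB default runs out at 10,000 * 4 MiB = ~40 GiB.
print(choose_part_size(int(288.64e9)) // MiB)  # -> 32
```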

basak commented 12 years ago

Sorry you're having problems, and I appreciate the report. I also saw an upload failure, which I assumed was some kind of network problem, so I didn't save the traceback. Can you tell me how this reproduces: every time or intermittently (and if so, roughly what proportion), and is there any pattern to how much data is uploaded before it fails?

My own single failure prompted me to write automatic upload resume support. This is working and I expect to push it shortly. But despite that, I'd still like uploads to work the first time!

khagler commented 12 years ago

It happens every time. I don't know how much data is being uploaded, but it runs for a pretty long time before failing. I'm almost certain that what's happening here is that it starts the upload with the default 4 MB part size, and then 40,000 MB worth of uploading later it tries to upload the next part and Amazon rejects it because the 10,000 part limit has been reached. Exactly how long that takes varies depending on what else I'm doing with my connection at the time (and how many of my neighbors are bittorrenting their favorite TV shows ;-), which accounts for the variable but long time to failure.

I've written a fix that checks the size of the archive to be uploaded and determines the smallest part size that will work if 4 MB is too small. I created a 50 GB dummy file, and found that it did indeed fail to upload as expected without the fix. I'm trying it now with the fix, and it's still running. I'll update when it eventually either finishes or fails.

fbueno commented 12 years ago

Same here.

- 301 MB: OK
- 2.6 GB: OK
- 37 GB: failed

basak commented 12 years ago

Based on the code, it looks like the problem is that the particular boto.glacier method I'm using doesn't let me pick a part size and arbitrarily chooses 4 MiB. So the fix would be to automatically determine a suitable part size, as khagler described, but I think this would need to go into boto rather than glacier-cli.

khagler: is this what you're working on, or shall I?

khagler commented 12 years ago

Yes, basically. I had modified your archive_upload so that it modified vault.DefaultPartSize, but I agree that this really ought to be done in boto, so I'll see about moving it there.
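
For concreteness, something along these lines is what I have in mind. It's a rough sketch against the boto 2.x Vault API visible in the tracebacks above (`create_archive_writer`, `writer.write`); the exact signatures should be double-checked rather than taken as a finished patch:

```python
# Rough sketch only. Assumes boto 2.x's Vault.create_archive_writer() accepts
# part_size, and that the Writer exposes write()/close()/get_archive_id() the
# way vault.create_archive_from_file() uses them; verify against your boto.
import os

MiB = 1024 * 1024
MAX_PARTS = 10000

def choose_part_size(archive_size, minimum=4 * MiB):
    # Same helper as in the earlier sketch.
    part_size = minimum
    while archive_size > part_size * MAX_PARTS:
        part_size *= 2
    return part_size

def upload_large_archive(vault, path, description=None):
    part_size = choose_part_size(os.path.getsize(path))
    writer = vault.create_archive_writer(part_size=part_size,
                                         description=description)
    with open(path, 'rb') as f:
        while True:
            data = f.read(part_size)
            if not data:
                break
            writer.write(data)
    writer.close()
    return writer.get_archive_id()
```

The key point is just that the part size gets chosen from the archive size up front instead of always using `DefaultPartSize`.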

I've run a few tests, and while my fix does take care of the original problem, it exposes a new one: it seems to be pretty common for individual part uploads to fail (I've been seeing about a 1% failure rate in FastGlacier, which reports when it happens). Unfortunately, boto doesn't seem to have a way to detect this and retry failed parts.
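
To be clear, nothing like this exists in boto today; a generic wrapper along these lines, put around each part upload, is roughly what I'd want, assuming the transient failures surface as socket errors like the ones in the tracebacks above:

```python
# Not a boto feature; just a generic retry wrapper around a part upload,
# assuming transient failures raise socket errors as in the tracebacks above.
import socket
import time

def with_retries(func, attempts=3, delay=5):
    """Call func(), retrying a few times on transient network errors."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except socket.error:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff

# e.g. with_retries(lambda: writer.write(data))
```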

fastfwd75 commented 11 years ago

I am also getting this. I tried with 4 GB splits and it failed; I tried with 1023 MB splits and it also failed. I can't realistically go smaller. Any hope of a fix for this? The same 1023 MB split uploaded without trouble in "Simple Amazon Glacier Uploader", but I prefer command-line tools.

socket.gaierror: [Errno 8] nodename nor servname provided, or not known

fastfwd75 commented 11 years ago

Full error text:

```
Traceback (most recent call last):
  File "/Users/jonathan/glacier/glacier-cli/glacier", line 694, in <module>
    App().main()
  File "/Users/jonathan/glacier/glacier-cli/glacier", line 680, in main
    args.func(args)
  File "/Users/jonathan/glacier/glacier-cli/glacier", line 482, in archive_upload
    archive_id = vault.create_archive_from_file(file_obj=args.file, description=name)
  File "/Users/jonathan/glacier/glacier-cli/boto/glacier/vault.py", line 141, in create_archive_from_file
    writer.write(data)
  File "/Users/jonathan/glacier/glacier-cli/boto/glacier/writer.py", line 152, in write
    self.send_part()
  File "/Users/jonathan/glacier/glacier-cli/boto/glacier/writer.py", line 141, in send_part
    content_range, part)
  File "/Users/jonathan/glacier/glacier-cli/boto/glacier/layer1.py", line 626, in upload_part
    response_headers=response_headers)
  File "/Users/jonathan/glacier/glacier-cli/boto/glacier/layer1.py", line 83, in make_request
    data=data)
  File "/Users/jonathan/glacier/glacier-cli/boto/connection.py", line 913, in make_request
    return self._mexe(http_request, sender, override_num_retries)
  File "/Users/jonathan/glacier/glacier-cli/boto/connection.py", line 859, in _mexe
    raise e
socket.gaierror: [Errno 8] nodename nor servname provided, or not known
```

dengkai commented 11 years ago

I'm running into this issue as well with large files (> 1GB), same error:

```
Traceback (most recent call last):
  File "./glacier.py", line 730, in <module>
    App().main()
  File "./glacier.py", line 716, in main
    self.args.func()
  File "./glacier.py", line 498, in archive_upload
    file_obj=self.args.file, description=name)
  File "/Users/user/glacier/glacier-cli/boto/glacier/vault.py", line 141, in create_archive_from_file
    writer.write(data)
  File "/Users/user/glacier/glacier-cli/boto/glacier/writer.py", line 152, in write
    self.send_part()
  File "/Users/user/glacier/glacier-cli/boto/glacier/writer.py", line 141, in send_part
    content_range, part)
  File "/Users/user/glacier/glacier-cli/boto/glacier/layer1.py", line 626, in upload_part
    response_headers=response_headers)
  File "/Users/user/glacier/glacier-cli/boto/glacier/layer1.py", line 83, in make_request
    data=data)
  File "/Users/user/glacier/glacier-cli/boto/connection.py", line 913, in make_request
    return self._mexe(http_request, sender, override_num_retries)
  File "/Users/user/glacier/glacier-cli/boto/connection.py", line 859, in _mexe
    raise e
socket.gaierror: [Errno 8] nodename nor servname provided, or not known
```

basak commented 11 years ago

This is an issue within boto, not in glacier-cli directly. Could anyone still affected please post the version of boto you're using, and try the latest?

dengkai commented 11 years ago

Was seeing the issue with boto 2.5.2 but I just updated to 2.9.6 and the issue persists.

strobe33333 commented 9 years ago

I too am having this problem.