Open epicfaace opened 3 years ago
Better logs:
158324.62MiB / 158324.62MiB (100.0%) [39.51MiB/sec] About to read the response... url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Socket timeout, retrying url: /bundles/0xe3f8cd0908a34598bad6ccaeb3f453cf/contents/blob/
Traceback (most recent call last):
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 18, in wrapper
return f(*args, **kwargs)
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 660, in upload_contents_blob
progress_callback=progress_callback,
File "/home/azureuser/codalab-worksheets/codalab/worker/rest_client.py", line 188, in _upload_with_chunked_encoding
StringIO(response.read().decode()),
urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/azureuser/venv/bin/cl", line 33, in <module>
sys.exit(load_entry_point('codalab', 'console_scripts', 'cl')())
File "/home/azureuser/codalab-worksheets/codalab/bin/cl.py", line 10, in main
cli.do_command(sys.argv[1:])
File "/home/azureuser/codalab-worksheets/codalab/lib/bundle_cli.py", line 954, in do_command
structured_result = command_fn()
File "/home/azureuser/codalab-worksheets/codalab/lib/bundle_cli.py", line 948, in <lambda>
command_fn = lambda: args.function(self, args)
File "/home/azureuser/codalab-worksheets/codalab/lib/bundle_cli.py", line 1422, in do_upload_command
progress_callback=progress.update,
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 38, in wrapper
sys.exc_info()[2],
File "/home/azureuser/venv/lib/python3.6/site-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 18, in wrapper
return f(*args, **kwargs)
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 660, in upload_contents_blob
progress_callback=progress_callback,
File "/home/azureuser/codalab-worksheets/codalab/worker/rest_client.py", line 188, in _upload_with_chunked_encoding
StringIO(response.read().decode()),
codalab.client.json_api_client.JsonApiException: Unable to upload contents of bundle 0xe3f8cd0908a34598bad6ccaeb3f453cf: Gateway Timeout - <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.12.0</center>
</body>
</html>
real 86m51.066s
user 4m30.364s
sys 3m50.470s
Test this without blob storage. If it doesn't work there either, just go ahead and bypass the server for blob storage uploads.
A similar issue happens without blob storage as well:
Bundle: https://worksheets-dev.codalab.org/bundles/0x79862fd6ae754dffa011a071dbe18212
CLI traceback:
vm-clws-dev-server-0:~/ashwin% time cl upload inet.tar.gz
Preparing upload archive...
Uploading inet.tar.gz (0x79862fd6ae754dffa011a071dbe18212) to https://worksheets-dev.codalab.org
Sent 68.53MiB / 158324.62MiB (0.0%) [50.61MiB/sec]
Traceback (most recent call last):
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 18, in wrapper
return f(*args, **kwargs)
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 660, in upload_contents_blob
progress_callback=progress_callback,
File "/home/azureuser/codalab-worksheets/codalab/worker/rest_client.py", line 157, in _upload_with_chunked_encoding
conn.send(b'%X\r\n%s\r\n' % (len(to_send), to_send))
File "/usr/lib/python3.6/http/client.py", line 1002, in send
self.sock.sendall(data)
File "/usr/lib/python3.6/ssl.py", line 975, in sendall
v = self.send(byte_view[count:])
File "/usr/lib/python3.6/ssl.py", line 944, in send
return self._sslobj.write(data)
File "/usr/lib/python3.6/ssl.py", line 642, in write
return self._sslobj.write(data)
socket.timeout: The write operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/azureuser/venv/bin/cl", line 33, in <module>
sys.exit(load_entry_point('codalab', 'console_scripts', 'cl')())
File "/home/azureuser/codalab-worksheets/codalab/bin/cl.py", line 10, in main
cli.do_command(sys.argv[1:])
File "/home/azureuser/codalab-worksheets/codalab/lib/bundle_cli.py", line 954, in do_command
structured_result = command_fn()
File "/home/azureuser/codalab-worksheets/codalab/lib/bundle_cli.py", line 948, in <lambda>
command_fn = lambda: args.function(self, args)
File "/home/azureuser/codalab-worksheets/codalab/lib/bundle_cli.py", line 1422, in do_upload_command
progress_callback=progress.update,
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 52, in wrapper
sys.exc_info()[2],
File "/home/azureuser/venv/lib/python3.6/site-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 18, in wrapper
return f(*args, **kwargs)
File "/home/azureuser/codalab-worksheets/codalab/client/json_api_client.py", line 660, in upload_contents_blob
progress_callback=progress_callback,
File "/home/azureuser/codalab-worksheets/codalab/worker/rest_client.py", line 157, in _upload_with_chunked_encoding
conn.send(b'%X\r\n%s\r\n' % (len(to_send), to_send))
File "/usr/lib/python3.6/http/client.py", line 1002, in send
self.sock.sendall(data)
File "/usr/lib/python3.6/ssl.py", line 975, in sendall
v = self.send(byte_view[count:])
File "/usr/lib/python3.6/ssl.py", line 944, in send
return self._sslobj.write(data)
File "/usr/lib/python3.6/ssl.py", line 642, in write
return self._sslobj.write(data)
codalab.client.json_api_client.JsonApiException: Unable to upload contents of bundle 0x79862fd6ae754dffa011a071dbe18212: The write operation timed out
real 1m2.959s
user 0m1.530s
sys 0m0.309s
Server traceback for bundle:
Traceback (most recent call last):
File "/opt/codalab-worksheets/codalab/lib/zip_util.py", line 65, in unpack
un_tar_directory(source, dest_path, 'gz')
File "/opt/codalab-worksheets/codalab/worker/un_tar_directory.py", line 30, in un_tar_directory
tar.extract(member, directory_path)
File "/usr/lib/python3.6/tarfile.py", line 2054, in extract
numeric_owner=numeric_owner)
File "/usr/lib/python3.6/tarfile.py", line 2124, in _extract_member
self.makefile(tarinfo, targetpath)
File "/usr/lib/python3.6/tarfile.py", line 2173, in makefile
copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
File "/usr/lib/python3.6/tarfile.py", line 249, in copyfileobj
buf = src.read(bufsize)
File "/usr/lib/python3.6/tarfile.py", line 539, in read
buf = self._read(size)
File "/usr/lib/python3.6/tarfile.py", line 552, in _read
buf = self.read(self.bufsize)
File "/usr/lib/python3.6/tarfile.py", line 572, in read
buf = self.fileobj.read(self.bufsize)
File "/usr/local/lib/python3.6/dist-packages/gunicorn/http/body.py", line 215, in read
data = self.reader.read(1024)
File "/usr/local/lib/python3.6/dist-packages/gunicorn/http/body.py", line 30, in read
self.buf.write(next(self.parser))
File "/usr/local/lib/python3.6/dist-packages/gunicorn/http/body.py", line 65, in parse_chunked
raise NoMoreData()
gunicorn.http.errors.NoMoreData: No more data after: None
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/codalab-worksheets/codalab/rest/bundles.py", line 798, in _update_bundle_contents_blob
use_azure_blob_beta=use_azure_blob_beta,
File "/opt/codalab-worksheets/codalab/lib/upload_manager.py", line 217, in upload_to_bundle_store
bundle, source, git, unpack
File "/opt/codalab-worksheets/codalab/lib/upload_manager.py", line 74, in upload_to_bundle_store
self.write_fileobj(source_ext, source_fileobj, bundle_path, unpack_archive=True)
File "/opt/codalab-worksheets/codalab/lib/upload_manager.py", line 139, in write_fileobj
zip_util.unpack(source_ext, source_fileobj, bundle_path)
File "/opt/codalab-worksheets/codalab/lib/zip_util.py", line 79, in unpack
raise UsageError('Invalid archive upload: failed to unpack archive.')
codalab.common.UsageError: Invalid archive upload: failed to unpack archive.
Server docker logs:
2021-07-14 00:48:28,476 Invalid archive upload: failed to unpack archive: No more data after: None
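The client and server tracebacks above look like two sides of the same event: the client frames each chunk with `conn.send(b'%X\r\n%s\r\n' % ...)`, and the server's chunked-body parser gives up when the stream stops before the terminating zero-length chunk. A minimal sketch of that framing follows. `frame_chunk`, `read_chunks`, and `TruncatedBody` are hypothetical names for illustration, not CodaLab or gunicorn code, and the parser ignores chunk extensions and trailers.

```python
import io


def frame_chunk(data: bytes) -> bytes:
    """Frame one non-empty chunk: hex size, CRLF, payload, CRLF."""
    return b"%X\r\n%s\r\n" % (len(data), data)


# A zero-length chunk marks the end of a chunked request body.
TERMINATOR = b"0\r\n\r\n"


class TruncatedBody(Exception):
    """Hypothetical stand-in for a server-side 'no more data' error."""


def read_chunks(stream):
    """Yield chunk payloads; raise if the stream ends before the terminator."""
    while True:
        size_line = stream.readline()
        if not size_line.endswith(b"\r\n"):
            raise TruncatedBody("stream ended before a chunk size was read")
        size = int(size_line.strip(), 16)
        if size == 0:
            stream.readline()  # consume the CRLF after the last chunk
            return
        payload = stream.read(size)
        trailer = stream.read(2)
        if len(payload) < size or trailer != b"\r\n":
            raise TruncatedBody("stream ended inside a chunk")
        yield payload


complete = frame_chunk(b"hello") + frame_chunk(b"world") + TERMINATOR
print(list(read_chunks(io.BytesIO(complete))))

# Simulates a client whose write timed out before it could finish the body.
truncated = frame_chunk(b"hello")
try:
    list(read_chunks(io.BytesIO(truncated)))
except TruncatedBody as exc:
    print("truncated:", exc)
```

Under this reading, the "failed to unpack archive" error on the server is a symptom of the client's write timeout, not a sign that the archive itself was invalid.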
Similar issue reported (without blob storage): https://worksheets.codalab.org/bundles/0x9adb6e45443f4b66bf06e2fe2cc70b83
Another issue reported by @teetone (with blob storage): https://worksheets.codalab.org/bundles/0x5ea7af4eb2fc41ee9b7bffe9531e6975
Run tests for:
Blob storage (Azure), Upload ImageNet, from NLP machine to Dev environment.
Disk storage, Upload ImageNet, from NLP machine to Dev environment.
Upload to disk storage works well for large files. This issue can be closed if the problem does not happen again.
Reopening this issue because I hit this problem again when uploading from the NLP server to the prod environment. Last time, the upload succeeded from my local PC to the dev environment (without blob storage).
Possible reason:
Will test it out and check the gateway timeout issue.
codalab@scdt:~$ time cl upload /juice/scr/nlp/imagenet
WARNING:root:Error when list the child path. Ignore the files under path: /juice/scr/nlp/imagenet/.zfs/shares
WARNING:root:Error when list the child path. Ignore the files under path: /juice/scr/nlp/imagenet/imagenet-a
Preparing upload archive...
tar: ./.zfs/shares: Cannot open: Stale file handle
Uploading imagenet.tar.gz (0x7245cea1fe204393b97fa3fd3206a4dd) to https://worksheets.codalab.org
Sent 148089.25MiB [6.74MiB/sec] tar: ./imagenet-a: Cannot open: Permission denied
Sent 164859.02MiB [6.76MiB/sec] tar: Exiting with failure status due to previous errors
Sent 164859.02MiB [6.76MiB/sec]
Traceback (most recent call last):
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/client/json_api_client.py", line 18, in wrapper
return f(*args, **kwargs)
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/client/json_api_client.py", line 660, in upload_contents_blob
progress_callback=progress_callback,
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/worker/rest_client.py", line 143, in _upload_with_chunked_encoding
progress_callback=progress_callback,
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/worker/upload_util.py", line 102, in upload_with_chunked_encoding
StringIO(response.read().decode()),
urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/u/nlp/bin/cl", line 10, in <module>
sys.exit(main())
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/bin/cl.py", line 10, in main
cli.do_command(sys.argv[1:])
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/lib/bundle_cli.py", line 951, in do_command
structured_result = command_fn()
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/lib/bundle_cli.py", line 945, in <lambda>
command_fn = lambda: args.function(self, args)
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/lib/bundle_cli.py", line 1481, in do_upload_command
destination_bundle_store=metadata.get('store'),
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/lib/upload_manager.py", line 490, in upload_to_bundle_store
progress_callback=progress.update,
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/client/json_api_client.py", line 38, in wrapper
sys.exc_info()[2],
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/client/json_api_client.py", line 18, in wrapper
return f(*args, **kwargs)
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/client/json_api_client.py", line 660, in upload_contents_blob
progress_callback=progress_callback,
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/worker/rest_client.py", line 143, in _upload_with_chunked_encoding
progress_callback=progress_callback,
File "/u/nlp/anaconda/main/anaconda3/envs/default-py37/lib/python3.7/site-packages/codalab/worker/upload_util.py", line 102, in upload_with_chunked_encoding
StringIO(response.read().decode()),
codalab.client.json_api_client.JsonApiException: Unable to upload contents of bundle 0x7245cea1fe204393b97fa3fd3206a4dd: Gateway Timeout - <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.12.0</center>
</body>
</html>
real 428m54.356s
user 7m28.461s
sys 8m49.227s
For non-blob storage (disk storage):
Nginx logs:
2023/01/03 22:33:49 [error] 9#9: *2220383 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 152.3.43.43, server: localhost, request: "PUT https://worksheets.codalab.org/rest/bundles/0xe84d63706c3442109c2aaff9defef85f/contents/blob/?filename=imagenet.tar.gz&unpack=1&state_on_success=ready&finalize_on_success=1&use_azure_blob_beta=0&store=&finalize_on_failure=1 HTTP/1.1", upstream: "http://172.27.0.3:2900/rest/bundles/0xe84d63706c3442109c2aaff9defef85f/contents/blob/?filename=imagenet.tar.gz&unpack=1&state_on_success=ready&finalize_on_success=1&use_azure_blob_beta=0&store=&finalize_on_failure=1", host: "worksheets.codalab.org"
152.3.43.43 - - [03/Jan/2023:22:33:49 +0000] "PUT https://worksheets.codalab.org/rest/bundles/0xe84d63706c3442109c2aaff9defef85f/contents/blob/?filename=imagenet.tar.gz&unpack=1&state_on_success=ready&finalize_on_success=1&use_azure_blob_beta=0&store=&finalize_on_failure=1 HTTP/1.1" 504 183 "-" "-"
Analysis:
Put simply, a 504 error occurs when two servers are involved in processing a request: the first server (here nginx, acting as the gateway) times out while waiting for a response from the second (the upstream server, here the REST server at 172.27.0.3:2900).
Can we set the connection timeout in the nginx conf (here) to a larger value, e.g. 3600s?
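To see which timeout is actually firing, it helps to pull the phase out of the nginx error line. A small sketch, assuming the error-log layout shown above: the sample line is the one from the log with the query string shortened, and `ERROR_RE` is an illustrative pattern, not something taken from the CodaLab repo.

```python
import re

# Illustrative pattern for nginx "upstream timed out" error lines.
ERROR_RE = re.compile(
    r'upstream timed out \((?P<errno>\d+): (?P<reason>[^)]+)\) '
    r'while (?P<phase>.+?), client: (?P<client>[^,]+), .*?'
    r'request: "(?P<method>\S+) (?P<url>\S+) [^"]*", '
    r'upstream: "(?P<upstream>[^"]+)"'
)

# The error line from the log above, with the query string shortened.
line = (
    '2023/01/03 22:33:49 [error] 9#9: *2220383 upstream timed out '
    '(110: Connection timed out) while reading response header from upstream, '
    'client: 152.3.43.43, server: localhost, '
    'request: "PUT https://worksheets.codalab.org/rest/bundles/'
    '0xe84d63706c3442109c2aaff9defef85f/contents/blob/?filename=imagenet.tar.gz '
    'HTTP/1.1", '
    'upstream: "http://172.27.0.3:2900/rest/bundles/'
    '0xe84d63706c3442109c2aaff9defef85f/contents/blob/?filename=imagenet.tar.gz", '
    'host: "worksheets.codalab.org"'
)

match = ERROR_RE.search(line)
print(match.group("phase"))     # what nginx was doing when it gave up
print(match.group("method"))    # HTTP method of the client request
print(match.group("upstream"))  # upstream address nginx was waiting on
```

The phase here is "reading response header from upstream", i.e. nginx was waiting for the upstream to start its reply. In nginx that wait is bounded by `proxy_read_timeout` rather than by the connect timeout, so that is probably the directive to raise. Whether the CodaLab nginx conf already sets it is not shown in this thread.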
More tests:
(local PC -> prod server), upload compressed version
(local PC -> prod server), upload uncompressed version
➜ imagenet-all time cl upload imagenet
Preparing upload archive...
Uploading imagenet.tar.gz (0x9298b7d8dbf0458694053df1ca0e7f72) to https://worksheets.codalab.org
Sent 155844.27MiB [10.81MiB/sec]
Traceback (most recent call last):
File "/Users/wangjiani/wjn/Code/codalab/codalab-worksheets/codalab/client/json_api_client.py", line 18, in wrapper
return f(*args, **kwargs)
File "/Users/wangjiani/wjn/Code/codalab/codalab-worksheets/codalab/client/json_api_client.py", line 660, in upload_contents_blob
progress_callback=progress_callback,
File "/Users/wangjiani/wjn/Code/codalab/codalab-worksheets/codalab/worker/rest_client.py", line 143, in _upload_with_chunked_encoding
progress_callback=progress_callback,
File "/Users/wangjiani/wjn/Code/codalab/codalab-worksheets/codalab/worker/upload_util.py", line 102, in upload_with_chunked_encoding
StringIO(response.read().decode()),
urllib.error.HTTPError: HTTP Error 504: Gateway Time-out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/wangjiani/.pyenv/versions/3.6.13/bin/cl", line 33, in
cl upload imagenet 7087.34s user 2466.23s system 56% cpu 4:43:53.60 total
- Step 2 (timing how long the REST server takes to unpack the file, to check whether that is what times out) is hard to test because each test run takes a long time.
More tests:
(/u/scr/nlp/eix/imagenet/) & uncompressed (/u/scr/nlp/imagenet)
(/juice/scr/nlp/imagenet)
Upload time:
What's the status here?
Tried uploading ImageNet. It appears that there's a gateway timeout because the socket dies when the server is creating the index for the file. The solution might be for the client to keep sending bytes down the socket to keep the connection alive.
https://worksheets-dev.codalab.org/bundles/0x287f77a8e91d4ed986e69dab6ece92c2
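A rough arithmetic check of that hypothesis, using only the figures printed in the logs in this thread. It assumes (not confirmed anywhere above) that the MiB/sec figure is the average rate over the whole transfer, so size divided by rate approximates when the last byte was sent. The rest of the wall-clock time is then time spent waiting on the server, plus any archive preparation. `idle_seconds` and the run labels are illustrative names.

```python
def idle_seconds(sent_mib: float, rate_mib_per_sec: float, wall_clock_sec: float) -> float:
    """Wall-clock time left over after the estimated transfer time."""
    transfer_sec = sent_mib / rate_mib_per_sec
    return wall_clock_sec - transfer_sec


# (MiB sent, reported MiB/sec, wall-clock seconds), copied from the logs in this thread.
runs = {
    "first log (86m51s)": (158324.62, 39.51, 86 * 60 + 51.066),
    "NLP server to prod (428m54s)": (164859.02, 6.76, 428 * 60 + 54.356),
    "local PC to prod (4:43:53)": (155844.27, 10.81, 4 * 3600 + 43 * 60 + 53.60),
}

for name, (sent, rate, wall) in runs.items():
    minutes = idle_seconds(sent, rate, wall) / 60
    print(f"{name}: about {minutes:.1f} min after the estimated last byte")
```

If the assumption holds, every run kept waiting for tens of minutes after the last byte was sent, which would be consistent with the timeout happening during server-side processing and not during the transfer itself.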