Closed: yarikoptic closed this issue 3 years ago.
I tried again just now; here is the output (times are BST):
2020-07-28 22:29:16,190 [ INFO] Found 3 files to consider
PATH SIZE ERRORS UPLOAD STATUS MESSAGE
dandiset.yaml 630 Bytes 0 skipped exists (same)
sub-001/sub-001_ses-20200727T094620_ecephys.nwb 29.5 GB 0 ERROR ('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))
sub-001/sub-001_ses-20200727T094534.nwb 8.4 MB 0 skipped exists (same)
Summary: 29.5 GB 2 skipped 2 exists (same)
1 ERROR 1 ('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))
It finished at Tue 28 Jul 22:49:12 BST 2020, so it ran for about 20 minutes. The dandi-cli version is 0.5.0, installed via pip. The system is Linux, an internal distribution based on Scientific Linux:
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: Scientific
Description: Scientific Linux release 7.6 (Nitrogen)
Release: 7.6
Codename: Nitrogen
Python is 3.7.2, in a miniconda env. Just let me know if there are any further tests I can do. Note that the smaller NWB file transferred without problems.
Which dandiset on https://gui.dandiarchive.org/ was it uploaded to? (I thought you did it directly in the girder web UI.) Or has it not been uploaded yet?
It was https://dandiarchive.org/dandiset/000028/draft
The nginx logs had a number of errors like this:
[28/Jul/2020:22:04:47 +0000] "POST /api/v1/file/chunk?offset=67108864&uploadId=5f20a02ef63d62e1dbd06959 HTTP/1.1" 408 0 "-" "python-requests/2.21.0"
The 408 status code means the client was not sending data before the server-side timeout expired.
From the Slack conversation: upload attempts directly from the browser resulted in a similar issue, but that one seems to relate directly to the connection with S3. As far as I see it:
From the browser, girder provides a direct upload URL to S3, but that path has a (similar?) choking point (does it use multipart?).
dandi-cli uses the girder Python client, which uploads chunks to the girder server (not directly to S3):
uploadObj = self.post(
    'file/chunk?offset=%d&uploadId=%s' % (offset, uploadId),
    data=_ProgressBytesIO(chunk, reporter=reporter))
and since we never saw similar choking before, I doubt it is a client "falling asleep" issue, although it could be. Unfortunately this code lives in girder-client and has no logging of any kind, so the only place I could add timing etc. information is probably within the progress reporting, if we think that might help.
The girder server then orchestrates the upload to S3, which may (or may not) be multipart; my guess is that at some point something chokes on the girder server side of the S3 upload, which causes the client connection to time out. Is girder "streaming" to S3, or does it first store the file on our girder instance (in memory? on disk?)? If it doesn't stream, does it have sufficient local disk space/memory to accommodate those big files?
Meanwhile, @mgrauer -- could the girder configuration be tuned to increase the timeout, e.g. 4x (to 8 minutes), or do you think that wouldn't help?
As for discussion of a more general "resolution", I have started https://github.com/dandi/dandi-publish/issues/110 .
@mhhennig just checking -- do you have admin access on any of the boxes from which the upload fails? Maybe we could craft a netstat (or similar) monitoring invocation so we would get a precisely timed record of interactions with our server.
No, I have no root access unfortunately. There's no good way to do monitoring at the moment; our building is still closed, and here I only have slow broadband and a laptop... the machines I use are in some server room. I could try to ask someone else in a country with fewer infections to do testing, though, if it would help.
Just to make it clear: I didn't mean any "physical" access. But since you have no root, that answers my question, thanks. I will quickly add some logging at the dandi-cli level, for as much as it can do by itself ;-) -- actually, since it is all Python, I could probably do LOTS! ;) I will buzz you back shortly...
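The kind of client-side logging meant here could look something like the following sketch: wrap a `sendRestRequest`-style callable so every request and its decoded response are emitted at DEBUG level. This is only an illustration of the idea behind the PR mentioned below, not the actual PR code; `with_rest_logging` is a hypothetical name.

```python
import logging

lgr = logging.getLogger("dandi")


def with_rest_logging(send_rest_request):
    """Wrap a sendRestRequest-style callable so that request arguments
    and the decoded response are logged at DEBUG level (sketch only)."""
    def wrapper(*args, **kwargs):
        lgr.debug("REST>: args=%r kwargs=%r", args, kwargs)
        result = send_rest_request(*args, **kwargs)
        lgr.debug("REST<: %r", result)
        return result
    return wrapper
```

In dandi-cli itself the natural place for this is the existing `sendRestRequest` override in `dandi/girder.py` (visible in the traceback above), which already sits between girder_client and requests.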
OK, now there is https://github.com/dandi/dandi-cli/pull/137 . Please let me know if the instructions below are too cumbersome etc.; then I will just issue a new dandi-cli release on PyPI (I would prefer to avoid that if possible -- it needs more testing etc.):
py=3; d=venvs/dev$py; virtualenv --python=python$py --system-site-packages $d && source $d/bin/activate && pip3 install git+https://github.com/dandi/dandi-cli.git@enh-logging
DANDI_LOG_GIRDER=1 DANDI_DEVEL=1 dandi -l DEBUG upload --devel-debug --existing force sub-001/sub-001_ses-20200727T094620_ecephys.nwb 2>&1 | tee /tmp/mydandi.log
Then share that /tmp/mydandi.log, either entirely or just a tail of it around the point of failure.
You should see records like the following (below I am uploading to a local instance, so do not copy all the options from it -- use the ones from above):
lena:/tmp/000027
$> DANDI_LOG_GIRDER=1 DANDI_DEVEL=1 dandi -l DEBUG upload --devel-debug --existing force --validation ignore -i local-docker sub-759711149 2>&1 | tee /tmp/mydandi.log
...
2020-07-29 11:20:28,824 [ DEBUG] PROGRESS[sub-759711149/sub-759711149.nwb]: 1409286144
2020-07-29 11:20:28,889 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=1409286144&uploadId=5f2193b582d2cb7363bd3131', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7fce0a12d7c0>, 'json': None, 'headers': None, 'jsonResp': True}
2020-07-29 11:20:29,188 [ DEBUG] REST<: {'_id': '5f2193b582d2cb7363bd3131', 'assetstoreId': '5e6ab9e6ed07da5d182247e2', 'created': '2020-07-29T15:20:21.605000+00:00', 'mimeType': 'application/octet-stream', 'name': 'sub-759711149.nwb', 'parentId': '5f2193b582d2cb7363bd3130', 'parentType': 'item', 'received': 1476395008, 'sha512state': "b'39326e76f590807e3811aa6b9ab45a46d6e03b28586c7a87ddd02403251a83ec57ce4a15176a823fc606f9ea58e1f9df1d964564f292ac3c328aeaf1a33558fb000000c002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000040000000'", 'size': 2251686372, 'tempFile': '/home/assetstore/temp/tmp147o7mil', 'updated': '2020-07-29T15:20:29.186537+00:00', 'userId': '5e6ab9e5ed07da5d182247de'}
2020-07-29 11:20:29,188 [ DEBUG] PROGRESS[sub-759711149/sub-759711149.nwb]: 1476395008
2020-07-29 11:20:29,239 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=1476395008&uploadId=5f2193b582d2cb7363bd3131', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7fce02b20040>, 'json': None, 'headers': None, 'jsonResp': True}
2020-07-29 11:20:29,527 [ DEBUG] REST<: {'_id': '5f2193b582d2cb7363bd3131', 'assetstoreId': '5e6ab9e6ed07da5d182247e2', 'created': '2020-07-29T15:20:21.605000+00:00', 'mimeType': 'application/octet-stream', 'name': 'sub-759711149.nwb', 'parentId': '5f2193b582d2cb7363bd3130', 'parentType': 'item', 'received': 1543503872, 'sha512state': "b'da48bee581e4ac6706578eac6a5d640cc58dff51f1d86b047bf7f9ac3d6a4e16a45e0dcbcc43bc1d58d07c3310a1c727dc2c5d4281425c54d4f3e15b9f3e436d000000e002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000040000000'", 'size': 2251686372, 'tempFile': '/home/assetstore/temp/tmp147o7mil', 'updated': '2020-07-29T15:20:29.525607+00:00', 'userId': '5e6ab9e5ed07da5d182247de'}
2020-07-29 11:20:29,527 [ DEBUG] PROGRESS[sub-759711149/sub-759711149.nwb]: 1543503872
So I expect us to get a better clue about when we are submitting a request vs. receiving the response to the previous one, etc.
Just ran this, here's the end of the debug log:
2020-07-29 16:36:57,785 [ DEBUG] REST<: {'_id': '5f2196b2f63d62e1dbd0695f', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-29T15:33:06.099000+00:00', 'mimeType': 'application/octet-stream', 'name': 'sub-001_ses-20200727T094620_ecephys.nwb', 'parentId': '5f219599f63d62e1dbd0695e', 'parentType': 'item', 'received': 201326592, 's3': {'chunkLength': 33554432, 'chunked': True, 'key': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'keyName': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'partNumber': 3, 'relpath': '/dandiarchive/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'request': {'headers': {'Content-Disposition': 'attachment; filename="sub-001_ses-20200727T094620_ecephys.nwb"', 'Content-Type': 'application/octet-stream', 'x-amz-acl': 'private', 'x-amz-meta-uploader-id': '5f1de355f63d62e1dbd06948', 'x-amz-meta-uploader-ip': '129.215.24.71'}, 'method': 'POST', 'url': 'https://dandiarchive.s3.amazonaws.com/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97?uploads=&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVECMGAIM2%2F20200729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200729T153306Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=content-disposition%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-uploader-id%3Bx-amz-meta-uploader-ip&X-Amz-Security-Token=FwoGZXIvYXdzEGEaDGsgV7CrjJVSz59JKyK%2FAU%2FUNxUv3xA48rHUvzjYSpq5auW3ZbnGlfDyn2A6g5zIWV%2Brok0nX0lSEVsyNZ4TDQsr%2BGswi7vOzxd%2B6pHCZ8CTAjs%2B67oQZHGbMvZcMMD4pmR%2F0w%2FHp4BB4%2BHdqLCHeGdw0GtEWT%2FXytlEhPhZBB4zdCj6iPgLfJ2X6bjl0t8HOMnrIZGGiCHzq2e55za2INruG%2FGigJcG8JRTN58oYDlyn5K67YWPH8WQ99o0Fs%2FJP4MG7b2G0ulUOiG93BITKLKthvkFMi2GGQVGOULBpIE7muawEba02mirETYvdYu21p%2Flz0%2BstFokKUCKRUWh7UfdEh0%3D&X-Amz-Signature=0c2f7c886ad500041f83ca70e12d7d02fac8b415eb5584f6a48a9d41d2e9478a'}, 'uploadId': 
'LWiP5rLWrTifay_RXeV6rJ19k1r4_gP1ONrIGv4sboDVoxxMlSiPTAI43No7EoglKlbqpVrzsQec8iqnM.pJa_sqjDs2G3480OOephWaXdE3sSkyOzc9uZPAy_me4Hq0nRgB_3EyEVDkkv7Issjagod.itpU.5HuQNs2EIcbTJs-'}, 'size': 29521959808, 'updated': '2020-07-29T15:36:57.711454+00:00', 'userId': '5f1de355f63d62e1dbd06948'}
2020-07-29 16:36:57,785 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: 201326592
2020-07-29 16:36:57,854 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=201326592&uploadId=5f2196b2f63d62e1dbd0695f', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7f752b8ffe08>, 'json': None, 'headers': None, 'jsonResp': True}
{'size': 29521959808}
{'status': 'checking girder'}
{'status': 'validating'}
{'errors': 0}
{'status': 'extracting metadata'}
{'status': 'digesting'}
{'status': 'uploading'}
{'upload': 0.22731845865400346}
{'upload': 0.4546369173080069}
{'upload': 0.6819553759620104}
Traceback (most recent call last):
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 340, in _send_until_done
return self.connection.send(data)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1757, in send
self._raise_ssl_error(self._ssl, result)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (110, 'ETIMEDOUT')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 1055, in _send_output
self.send(chunk)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 977, in send
self.sock.sendall(data)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 352, in sendall
data[total_sent : total_sent + SSL_WRITE_BLOCKSIZE]
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 346, in _send_until_done
raise SocketError(str(e))
OSError: (110, 'ETIMEDOUT')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/packages/six.py", line 734, in reraise
raise value.with_traceback(tb)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 1055, in _send_output
self.send(chunk)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/http/client.py", line 977, in send
self.sock.sendall(data)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 352, in sendall
data[total_sent : total_sent + SSL_WRITE_BLOCKSIZE]
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 346, in _send_until_done
raise SocketError(str(e))
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/bin/dandi", line 8, in <module>
sys.exit(main())
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/cli/command.py", line 118, in wrapper
return f(*args, **kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/cli/cmd_upload.py", line 602, in upload
for v in process_path(path, relpath):
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/cli/cmd_upload.py", line 472, in process_path
lambda c: client.uploadFileToItem(
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/support/generatorify.py", line 83, in __next__
return self.send(None)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/support/generatorify.py", line 77, in send
return self._put("send", value)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/support/generatorify.py", line 69, in _put
raise val
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/support/generatorify.py", line 53, in thread_func
ret = func(callback)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/cli/cmd_upload.py", line 473, in <lambda>
item_rec["_id"], path, progressCallback=c
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/girder_client/__init__.py", line 894, in uploadFileToItem
return self._uploadContents(obj, f, filesize, progressCallback=progressCallback)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/girder_client/__init__.py", line 1010, in _uploadContents
data=_ProgressBytesIO(chunk, reporter=reporter))
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/girder_client/__init__.py", line 495, in post
data=data, json=json, headers=headers, jsonResp=jsonResp)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/dandi/girder.py", line 112, in sendRestRequest
res = super().sendRestRequest(*args, **kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/girder_client/__init__.py", line 470, in sendRestRequest
**kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/requests/api.py", line 119, in post
return request('post', url, data=data, json=json, **kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/disk/scratch/mhennig/miniconda3/envs/spikesorting/lib/python3.7/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))
Great -- please edit and paste a few more lines on top, so we can see when the previous REST> was sent (we only see when the reply to it was received).
It seems that I might need to flush stdout to get a better depiction of the order of things (the log goes to stderr, and regular output goes to stdout), and also report the upload ratio in the progress log record so we could match them up... Will do.
Here's the whole thing: https://gist.github.com/mhhennig/f7ffec71fd6140f111fa2ee0cc59299d
Oh, it is blowing up really quickly! Here are those log lines prepended with the duration (in ms, until the next line), so we can see that it doesn't take abnormally long to submit the next request, which then never gets a reply:
541 2020-07-29 16:33:05,815 [ DEBUG] REST>: args=('POST', 'file', {'parentType': 'item', 'parentId': '5f219599f63d62e1dbd0695e', 'name': 'sub-001_ses-20200727T094620_ecephys.nwb', 'size': 29521959808, 'mimeType': None}) kwargs={'files': None, 'data': None, 'json': None, 'headers': None, 'jsonResp': True}
62 2020-07-29 16:33:06,356 [ DEBUG] REST<: {'_id': '5f2196b2f63d62e1dbd0695f', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-29T15:33:06.099026+00:00', 'mimeType': 'application/octet-stream', 'name': 'sub-001_ses-20200727T094620_ecephys.nwb', 'parentId': '5f219599f63d62e1dbd0695e', 'parentType': 'item', 'received': 0, 's3': {'chunkLength': 33554432, 'chunked': True, 'key': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'relpath': '/dandiarchive/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'request': {'headers': {'Content-Disposition': 'attachment; filename="sub-001_ses-20200727T094620_ecephys.nwb"', 'Content-Type': 'application/octet-stream', 'x-amz-acl': 'private', 'x-amz-meta-uploader-id': '5f1de355f63d62e1dbd06948', 'x-amz-meta-uploader-ip': '129.215.24.71'}, 'method': 'POST', 'url': 'https://dandiarchive.s3.amazonaws.com/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97?uploads=&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVECMGAIM2%2F20200729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200729T153306Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=content-disposition%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-uploader-id%3Bx-amz-meta-uploader-ip&X-Amz-Security-Token=FwoGZXIvYXdzEGEaDGsgV7CrjJVSz59JKyK%2FAU%2FUNxUv3xA48rHUvzjYSpq5auW3ZbnGlfDyn2A6g5zIWV%2Brok0nX0lSEVsyNZ4TDQsr%2BGswi7vOzxd%2B6pHCZ8CTAjs%2B67oQZHGbMvZcMMD4pmR%2F0w%2FHp4BB4%2BHdqLCHeGdw0GtEWT%2FXytlEhPhZBB4zdCj6iPgLfJ2X6bjl0t8HOMnrIZGGiCHzq2e55za2INruG%2FGigJcG8JRTN58oYDlyn5K67YWPH8WQ99o0Fs%2FJP4MG7b2G0ulUOiG93BITKLKthvkFMi2GGQVGOULBpIE7muawEba02mirETYvdYu21p%2Flz0%2BstFokKUCKRUWh7UfdEh0%3D&X-Amz-Signature=0c2f7c886ad500041f83ca70e12d7d02fac8b415eb5584f6a48a9d41d2e9478a'}}, 'size': 29521959808, 'updated': '2020-07-29T15:33:06.294875+00:00', 'userId': '5f1de355f63d62e1dbd06948'}
77271 2020-07-29 16:33:06,418 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=0&uploadId=5f2196b2f63d62e1dbd0695f', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7f752b8ffe60>, 'json': None, 'headers': None, 'jsonResp': True}
0 2020-07-29 16:34:23,689 [ DEBUG] REST<: {'_id': '5f2196b2f63d62e1dbd0695f', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-29T15:33:06.099000+00:00', 'mimeType': 'application/octet-stream', 'name': 'sub-001_ses-20200727T094620_ecephys.nwb', 'parentId': '5f219599f63d62e1dbd0695e', 'parentType': 'item', 'received': 67108864, 's3': {'chunkLength': 33554432, 'chunked': True, 'key': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'keyName': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'partNumber': 1, 'relpath': '/dandiarchive/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'request': {'headers': {'Content-Disposition': 'attachment; filename="sub-001_ses-20200727T094620_ecephys.nwb"', 'Content-Type': 'application/octet-stream', 'x-amz-acl': 'private', 'x-amz-meta-uploader-id': '5f1de355f63d62e1dbd06948', 'x-amz-meta-uploader-ip': '129.215.24.71'}, 'method': 'POST', 'url': 'https://dandiarchive.s3.amazonaws.com/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97?uploads=&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVECMGAIM2%2F20200729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200729T153306Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=content-disposition%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-uploader-id%3Bx-amz-meta-uploader-ip&X-Amz-Security-Token=FwoGZXIvYXdzEGEaDGsgV7CrjJVSz59JKyK%2FAU%2FUNxUv3xA48rHUvzjYSpq5auW3ZbnGlfDyn2A6g5zIWV%2Brok0nX0lSEVsyNZ4TDQsr%2BGswi7vOzxd%2B6pHCZ8CTAjs%2B67oQZHGbMvZcMMD4pmR%2F0w%2FHp4BB4%2BHdqLCHeGdw0GtEWT%2FXytlEhPhZBB4zdCj6iPgLfJ2X6bjl0t8HOMnrIZGGiCHzq2e55za2INruG%2FGigJcG8JRTN58oYDlyn5K67YWPH8WQ99o0Fs%2FJP4MG7b2G0ulUOiG93BITKLKthvkFMi2GGQVGOULBpIE7muawEba02mirETYvdYu21p%2Flz0%2BstFokKUCKRUWh7UfdEh0%3D&X-Amz-Signature=0c2f7c886ad500041f83ca70e12d7d02fac8b415eb5584f6a48a9d41d2e9478a'}, 'uploadId': 
'LWiP5rLWrTifay_RXeV6rJ19k1r4_gP1ONrIGv4sboDVoxxMlSiPTAI43No7EoglKlbqpVrzsQec8iqnM.pJa_sqjDs2G3480OOephWaXdE3sSkyOzc9uZPAy_me4Hq0nRgB_3EyEVDkkv7Issjagod.itpU.5HuQNs2EIcbTJs-'}, 'size': 29521959808, 'updated': '2020-07-29T15:34:23.617214+00:00', 'userId': '5f1de355f63d62e1dbd06948'}
69 2020-07-29 16:34:23,689 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: 67108864
77242 2020-07-29 16:34:23,758 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=67108864&uploadId=5f2196b2f63d62e1dbd0695f', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7f752b8ffe60>, 'json': None, 'headers': None, 'jsonResp': True}
0 2020-07-29 16:35:41,000 [ DEBUG] REST<: {'_id': '5f2196b2f63d62e1dbd0695f', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-29T15:33:06.099000+00:00', 'mimeType': 'application/octet-stream', 'name': 'sub-001_ses-20200727T094620_ecephys.nwb', 'parentId': '5f219599f63d62e1dbd0695e', 'parentType': 'item', 'received': 134217728, 's3': {'chunkLength': 33554432, 'chunked': True, 'key': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'keyName': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'partNumber': 2, 'relpath': '/dandiarchive/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'request': {'headers': {'Content-Disposition': 'attachment; filename="sub-001_ses-20200727T094620_ecephys.nwb"', 'Content-Type': 'application/octet-stream', 'x-amz-acl': 'private', 'x-amz-meta-uploader-id': '5f1de355f63d62e1dbd06948', 'x-amz-meta-uploader-ip': '129.215.24.71'}, 'method': 'POST', 'url': 'https://dandiarchive.s3.amazonaws.com/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97?uploads=&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVECMGAIM2%2F20200729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200729T153306Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=content-disposition%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-uploader-id%3Bx-amz-meta-uploader-ip&X-Amz-Security-Token=FwoGZXIvYXdzEGEaDGsgV7CrjJVSz59JKyK%2FAU%2FUNxUv3xA48rHUvzjYSpq5auW3ZbnGlfDyn2A6g5zIWV%2Brok0nX0lSEVsyNZ4TDQsr%2BGswi7vOzxd%2B6pHCZ8CTAjs%2B67oQZHGbMvZcMMD4pmR%2F0w%2FHp4BB4%2BHdqLCHeGdw0GtEWT%2FXytlEhPhZBB4zdCj6iPgLfJ2X6bjl0t8HOMnrIZGGiCHzq2e55za2INruG%2FGigJcG8JRTN58oYDlyn5K67YWPH8WQ99o0Fs%2FJP4MG7b2G0ulUOiG93BITKLKthvkFMi2GGQVGOULBpIE7muawEba02mirETYvdYu21p%2Flz0%2BstFokKUCKRUWh7UfdEh0%3D&X-Amz-Signature=0c2f7c886ad500041f83ca70e12d7d02fac8b415eb5584f6a48a9d41d2e9478a'}, 'uploadId': 
'LWiP5rLWrTifay_RXeV6rJ19k1r4_gP1ONrIGv4sboDVoxxMlSiPTAI43No7EoglKlbqpVrzsQec8iqnM.pJa_sqjDs2G3480OOephWaXdE3sSkyOzc9uZPAy_me4Hq0nRgB_3EyEVDkkv7Issjagod.itpU.5HuQNs2EIcbTJs-'}, 'size': 29521959808, 'updated': '2020-07-29T15:35:40.929474+00:00', 'userId': '5f1de355f63d62e1dbd06948'}
70 2020-07-29 16:35:41,000 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: 134217728
76715 2020-07-29 16:35:41,070 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=134217728&uploadId=5f2196b2f63d62e1dbd0695f', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7f752b8ffc50>, 'json': None, 'headers': None, 'jsonResp': True}
0 2020-07-29 16:36:57,785 [ DEBUG] REST<: {'_id': '5f2196b2f63d62e1dbd0695f', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-29T15:33:06.099000+00:00', 'mimeType': 'application/octet-stream', 'name': 'sub-001_ses-20200727T094620_ecephys.nwb', 'parentId': '5f219599f63d62e1dbd0695e', 'parentType': 'item', 'received': 201326592, 's3': {'chunkLength': 33554432, 'chunked': True, 'key': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'keyName': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'partNumber': 3, 'relpath': '/dandiarchive/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'request': {'headers': {'Content-Disposition': 'attachment; filename="sub-001_ses-20200727T094620_ecephys.nwb"', 'Content-Type': 'application/octet-stream', 'x-amz-acl': 'private', 'x-amz-meta-uploader-id': '5f1de355f63d62e1dbd06948', 'x-amz-meta-uploader-ip': '129.215.24.71'}, 'method': 'POST', 'url': 'https://dandiarchive.s3.amazonaws.com/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97?uploads=&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVECMGAIM2%2F20200729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200729T153306Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=content-disposition%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-uploader-id%3Bx-amz-meta-uploader-ip&X-Amz-Security-Token=FwoGZXIvYXdzEGEaDGsgV7CrjJVSz59JKyK%2FAU%2FUNxUv3xA48rHUvzjYSpq5auW3ZbnGlfDyn2A6g5zIWV%2Brok0nX0lSEVsyNZ4TDQsr%2BGswi7vOzxd%2B6pHCZ8CTAjs%2B67oQZHGbMvZcMMD4pmR%2F0w%2FHp4BB4%2BHdqLCHeGdw0GtEWT%2FXytlEhPhZBB4zdCj6iPgLfJ2X6bjl0t8HOMnrIZGGiCHzq2e55za2INruG%2FGigJcG8JRTN58oYDlyn5K67YWPH8WQ99o0Fs%2FJP4MG7b2G0ulUOiG93BITKLKthvkFMi2GGQVGOULBpIE7muawEba02mirETYvdYu21p%2Flz0%2BstFokKUCKRUWh7UfdEh0%3D&X-Amz-Signature=0c2f7c886ad500041f83ca70e12d7d02fac8b415eb5584f6a48a9d41d2e9478a'}, 'uploadId': 
'LWiP5rLWrTifay_RXeV6rJ19k1r4_gP1ONrIGv4sboDVoxxMlSiPTAI43No7EoglKlbqpVrzsQec8iqnM.pJa_sqjDs2G3480OOephWaXdE3sSkyOzc9uZPAy_me4Hq0nRgB_3EyEVDkkv7Issjagod.itpU.5HuQNs2EIcbTJs-'}, 'size': 29521959808, 'updated': '2020-07-29T15:36:57.711454+00:00', 'userId': '5f1de355f63d62e1dbd06948'}
69 2020-07-29 16:36:57,785 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: 201326592
----- 2020-07-29 16:36:57,854 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=201326592&uploadId=5f2196b2f63d62e1dbd0695f', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7f752b8ffe08>, 'json': None, 'headers': None, 'jsonResp': True}
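The duration prefixes above can be reproduced with a short script along these lines (a sketch; the regex assumes the dandi log timestamp format, and lines without a timestamp, such as interleaved access.log entries, pass through unchanged):

```python
import re
from datetime import datetime

# Matches e.g. "2020-07-29 16:33:05,815 [ DEBUG] ..."
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def ms(line):
    """Millisecond timestamp of a log line, or None if it has none."""
    m = TS.match(line)
    if not m:
        return None
    dt = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return dt.timestamp() * 1000 + int(m.group(2))


def prepend_durations(lines):
    """Prefix each timestamped line with the milliseconds until the next
    timestamped line ('----- ' for the last one)."""
    times = [ms(line) for line in lines]
    out = []
    for i, line in enumerate(lines):
        if times[i] is None:
            out.append(line)
            continue
        nxt = next((t for t in times[i + 1:] if t is not None), None)
        prefix = "%d " % (nxt - times[i]) if nxt is not None else "----- "
        out.append(prefix + line)
    return out
```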
@mgrauer could you please check how it looked on the server at that point in time?
Same behavior as before with this particular upload from the Python client: this is an error on the client side (not sending data, or not sending enough data in a short enough time), so nginx times out and responds with a 408. Girder never even sees the last request; only the next-to-last request makes it to Girder (these are nginx logs, and the last request here has no matching request in the Girder logs).
access.log:X.X.X.X - - [29/Jul/2020:15:35:40 +0000] "POST /api/v1/file/chunk?offset=67108864&uploadId=5f2196b2f63d62e1dbd0695f HTTP/1.1" 200 2100 "-" "python-requests/2.23.0"
access.log:X.X.X.X - - [29/Jul/2020:15:36:57 +0000] "POST /api/v1/file/chunk?offset=134217728&uploadId=5f2196b2f63d62e1dbd0695f HTTP/1.1" 200 2100 "-" "python-requests/2.23.0"
access.log:X.X.X.X - - [29/Jul/2020:15:38:57 +0000] "POST /api/v1/file/chunk?offset=201326592&uploadId=5f2196b2f63d62e1dbd0695f HTTP/1.1" 408 0 "-" "python-requests/2.23.0"
@mhhennig also got an error when using the direct to browser upload, which doesn't proxy through Girder (the Python client proxies chunked uploads through Girder) but rather talks directly to S3. That error was
Browser error is: Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://dandiarchive.s3.amazonaws.com/girder-assetstore/22/6…8d402c82d7deadce5184e7155aefce878b54b6ec8a08caf53b0510371462. (Reason: CORS request did not succeed).
@AlmightyYakob says he's seen CORS errors before on request timeouts, because the server never finishes receiving the request, so it cannot set the correct header on the response.
This would be consistent with what we are seeing through the Python upload path, that is, a client-side timeout (or falling asleep, as you put it @yarikoptic 😴) on the request. The browser allows a retry when an error happens during uploading, though it would be annoying for someone to retry a large number of errors.
@yarikoptic you could put a retry in the Python client, perhaps similar to the way the Girder CLI does retries by overriding sendRestRequest, maybe retrying 5 times? You could even expose this retry count as a command-line flag if you wanted users to be able to set it; it may be worth doing a hacky version first before exposing it as a CLI argument.
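A minimal sketch of that idea (the names here are stand-ins, not the actual girder_client API): wrap whatever performs the REST call in a retry loop that only retries a handful of times, with a short backoff between attempts.

```python
import time

# `HttpError` is a stand-in for girder_client.HttpError; the set of
# retryable status codes is an assumption, not dandi-cli's actual choice.
class HttpError(Exception):
    def __init__(self, status):
        super().__init__(f"HTTP error {status}")
        self.status = status

def send_with_retries(send, *args, retries=5, backoff=1.0,
                      retryable=(408, 500, 502, 503, 504), **kwargs):
    """Call `send` (e.g. a sendRestRequest-style method), retrying a few
    times on transient HTTP errors with a linear backoff."""
    for attempt in range(retries):
        try:
            return send(*args, **kwargs)
        except HttpError as exc:
            if exc.status not in retryable or attempt == retries - 1:
                raise  # non-transient error, or out of attempts
            time.sleep(backoff * (attempt + 1))
```

In a real client this would live in a GirderClient subclass overriding sendRestRequest, with `retries` plumbed through from a CLI flag.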
that is a client side timeout (or falling asleep as you put it @yarikoptic 😴 ) on the request.
I would not yet be 100% certain about that. Here is the joint sequence of events from both logs as I think it happened (the web server log has no sub-second timestamps), with some comments on # lines (though I have no clue how/what girder does internally):
76715 2020-07-29 16:35:41,070 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=134217728&uploadId=5f2196b2f63d62e1dbd0695f', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7f752b8ffc50>, 'json': None, 'headers': None, 'jsonResp': True}
# (for ~76 secs) data transmission is happening and then girder digests/forwards data to S3,
# receives some ack/nack and replies/logs in the log upon completion of the operation
access.log: [29/Jul/2020:15:36:57 +0000] "POST /api/v1/file/chunk?offset=134217728&uploadId=5f2196b2f63d62e1dbd0695f HTTP/1.1" 200 2100 "-" "python-requests/2.23.0"
0 2020-07-29 16:36:57,785 [ DEBUG] REST<: {'_id': '5f2196b2f63d62e1dbd0695f', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-29T15:33:06.099000+00:00', 'mimeType': 'application/octet-stream', 'name': 'sub-001_ses-20200727T094620_ecephys.nwb', 'parentId': '5f219599f63d62e1dbd0695e', 'parentType': 'item', 'received': 201326592, 's3': {'chunkLength': 33554432, 'chunked': True, 'key': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'keyName': 'girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'partNumber': 3, 'relpath': '/dandiarchive/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97', 'request': {'headers': {'Content-Disposition': 'attachment; filename="sub-001_ses-20200727T094620_ecephys.nwb"', 'Content-Type': 'application/octet-stream', 'x-amz-acl': 'private', 'x-amz-meta-uploader-id': '5f1de355f63d62e1dbd06948', 'x-amz-meta-uploader-ip': '129.215.24.71'}, 'method': 'POST', 'url': 'https://dandiarchive.s3.amazonaws.com/girder-assetstore/d8/27/d8278b9868af40419c4133d308eaea97?uploads=&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVECMGAIM2%2F20200729%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200729T153306Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=content-disposition%3Bcontent-type%3Bhost%3Bx-amz-acl%3Bx-amz-meta-uploader-id%3Bx-amz-meta-uploader-ip&X-Amz-Security-Token=FwoGZXIvYXdzEGEaDGsgV7CrjJVSz59JKyK%2FAU%2FUNxUv3xA48rHUvzjYSpq5auW3ZbnGlfDyn2A6g5zIWV%2Brok0nX0lSEVsyNZ4TDQsr%2BGswi7vOzxd%2B6pHCZ8CTAjs%2B67oQZHGbMvZcMMD4pmR%2F0w%2FHp4BB4%2BHdqLCHeGdw0GtEWT%2FXytlEhPhZBB4zdCj6iPgLfJ2X6bjl0t8HOMnrIZGGiCHzq2e55za2INruG%2FGigJcG8JRTN58oYDlyn5K67YWPH8WQ99o0Fs%2FJP4MG7b2G0ulUOiG93BITKLKthvkFMi2GGQVGOULBpIE7muawEba02mirETYvdYu21p%2Flz0%2BstFokKUCKRUWh7UfdEh0%3D&X-Amz-Signature=0c2f7c886ad500041f83ca70e12d7d02fac8b415eb5584f6a48a9d41d2e9478a'}, 'uploadId': 
'LWiP5rLWrTifay_RXeV6rJ19k1r4_gP1ONrIGv4sboDVoxxMlSiPTAI43No7EoglKlbqpVrzsQec8iqnM.pJa_sqjDs2G3480OOephWaXdE3sSkyOzc9uZPAy_me4Hq0nRgB_3EyEVDkkv7Issjagod.itpU.5HuQNs2EIcbTJs-'}, 'size': 29521959808, 'updated': '2020-07-29T15:36:57.711454+00:00', 'userId': '5f1de355f63d62e1dbd06948'}
69 2020-07-29 16:36:57,785 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: 201326592
# we are submitting a new request pretty much right away
----- 2020-07-29 16:36:57,854 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=201326592&uploadId=5f2196b2f63d62e1dbd0695f', None) kwargs={'files': None, 'data': <girder_client._ProgressBytesIO object at 0x7f752b8ffe08>, 'json': None, 'headers': None, 'jsonResp': True}
# but then at 2 min sharp cut off we are getting 408 response
access.log: [29/Jul/2020:15:38:57 +0000] "POST /api/v1/file/chunk?offset=201326592&uploadId=5f2196b2f63d62e1dbd0695f HTTP/1.1" 408 0 "-" "python-requests/2.23.0"
what I am thinking is:
# The current maximum chunk size for uploading file chunks
MAX_CHUNK_SIZE = 1024 * 1024 * 64
I don't think that retrying requests would be "productive" (if my "slow bandwidth" guess is correct), since they are likely to time out again, and we would just keep trying to re-transmit those 64MB at a time...
not sure what is our girder configuration upload_minimum_chunk_size
but it seems that the local one does not mind 64k chunk ;) so, @mhhennig would you be so kind to try this "version":
pip3 install git+https://github.com/dandi/dandi-cli.git@temp-small-chunksize
where I made chunk 8 times smaller. If this one works -- that means that we need to somehow mitigate for slow(er) connections. diff from master if interested.
Yes, I agree. Sorry, I miswrote:
client side timeout
I meant a client side error (slowness, which could be from bandwidth, latency, or other causes such as drowsiness) that results in a server side timeout; but you figured it out correctly regardless of what I said, so you were not sleeping :) . I like your solution of a smaller chunk size, because there is less data to send through in that window, so we are more likely to succeed.
You might also want some retries, but it makes sense to experiment with just the smaller chunk size first. Eight times smaller is a lot: it means eight times as many API requests, which may be inefficient. Overall I like this approach; it is a good start to see if it fixes the problem, and then we can optimize chunk sizes/retries/adaptive approaches if needed.
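To put a number on "8 times as many API requests" for the file in question, a back-of-the-envelope calculation using the ~29.5 GB size reported in the upload log:

```python
import math

# File size taken from the upload log earlier in this thread.
FILE_BYTES = 29_521_959_808

# Request counts at the two chunk sizes under discussion (64 MiB vs 8 MiB).
for chunk in (64 * 1024 * 1024, 8 * 1024 * 1024):
    n = math.ceil(FILE_BYTES / chunk)
    print(f"{chunk // (1024 * 1024):>2} MiB chunks -> {n} chunk requests")
# 64 MiB chunks ->  440 chunk requests
#  8 MiB chunks -> 3520 chunk requests
```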
Details on timeouts:
It isn't a 2 minute timeout, it just happens to look like that in this case. The below are nginx logs (above logs are also nginx, not girder) from the same (redacted) IP address, this time only 1:01 until the failure from the previous request.
X.X.X.X - - [28/Jul/2020:12:38:37 +0000] "POST /api/v1/file/chunk?offset=671088640&uploadId=5f201901f63d62e1dbd06953 HTTP/1.1" 200 2085 "-" "python-requests/2.23.0"
X.X.X.X - - [28/Jul/2020:12:39:54 +0000] "POST /api/v1/file/chunk?offset=738197504&uploadId=5f201901f63d62e1dbd06953 HTTP/1.1" 200 2085 "-" "python-requests/2.23.0"
X.X.X.X - - [28/Jul/2020:12:40:55 +0000] "POST /api/v1/file/chunk?offset=805306368&uploadId=5f201901f63d62e1dbd06953 HTTP/1.1" 408 0 "-" "python-requests/2.23.0"
Note that these are nginx timeouts. Girder never even sees the very last chunk request.
These nginx timeouts are likely culprits:
http://nginx.org/en/docs/http/ngx_http_core_module.html#client_body_timeout 60s
http://nginx.org/en/docs/http/ngx_http_core_module.html#client_header_timeout 60s
http://nginx.org/en/docs/http/ngx_http_core_module.html#lingering_timeout 5s
I am not suggesting that we change these timeouts, I'm just posting details for future reference. Changing these timeouts wouldn't help with the browser path anyway, because we can't control the timeouts on S3. It is a 4XX error, meaning client side, so fixing it on the client is the right approach.
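For a sense of scale, a back-of-the-envelope sketch (assumed numbers, not measurements) of how long one chunk takes at a given sustained rate relative to that 60 s body timeout. Note that nginx's client_body_timeout applies to the gap between two successive reads, not to the whole body, so a steadily trickling upload would not trip it; this only bounds how much total stall-plus-transfer time a chunk can absorb.

```python
MIB = 1024 * 1024

def chunk_seconds(chunk_bytes, rate_bytes_per_s):
    """Time to push one chunk at a given sustained upload rate."""
    return chunk_bytes / rate_bytes_per_s

# Compare the 64 MiB default chunk and the 8 MiB experiment at an
# assumed ~1 MiB/s upload rate.
for chunk in (64 * MIB, 8 * MIB):
    print(f"{chunk // MIB:>2} MiB chunk at 1 MiB/s: "
          f"{chunk_seconds(chunk, 1 * MIB):.0f} s")
# 64 MiB chunk at 1 MiB/s: 64 s
#  8 MiB chunk at 1 MiB/s: 8 s
```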
not sure what is our girder configuration
upload_minimum_chunk_size
but it seems that the local one does not mind 64k chunk ;) so, @mhhennig would you be so kind to try this "version":
pip3 install git+https://github.com/dandi/dandi-cli.git@temp-small-chunksize
I tried it, again it did not complete. The log is here: https://gist.github.com/mhhennig/f7ffec71fd6140f111fa2ee0cc59299d
I was about to scream "rate limiting", but then realized that I don't think you got the @temp-small-chunksize branch installed... I didn't know that silly pip will not upgrade if already installed this way:
Requirement already satisfied (use --upgrade to upgrade): dandi==0.0.0 from git+https://github.com/dandi/dandi-cli.git@temp-small-chunksize in ./venvs/dev3/lib/python3.8/site-packages
Please do
pip3 install --upgrade git+https://github.com/dandi/dandi-cli.git@temp-small-chunksize
in that "newer" version you should see PROGRESS indicators in a new format (with done=
and %done
):
2020-07-30 09:28:18,701 [ DEBUG] PROGRESS[sub-759711149/sub-759711149_probe-792602654_ecephys.nwb]: done=109051904 %done=6.54531524362948
thank you @mgrauer !
http://nginx.org/en/docs/http/ngx_http_core_module.html#client_body_timeout 60s
http://nginx.org/en/docs/http/ngx_http_core_module.html#client_header_timeout 60s
http://nginx.org/en/docs/http/ngx_http_core_module.html#lingering_timeout 5s
I am not suggesting that we change these timeouts, I'm just posting details for future reference. Changing these timeouts wouldn't help with the browser path anyway, because we can't control the timeouts on S3. It is a 4XX error, meaning client side, so fixing it on the client is the right approach.
I think the solution might end up adjusting both ends. A 60s timeout for a body is VERY reasonable for typical deployments (no PUTs of large files), but as we are a file archive and do expect large (megabytes) bodies, I think it would only be sensible to boost it up -- I would say 5 times (unless we see the server at times starting to consume too much RAM; do we have monitoring set up?). That would not be a complete solution, BUT it would allow clients to transmit larger chunk sizes (or slower clients to not time out), thus minimizing the number of API calls etc. I am not sure I would approach adjusting the girder client upload logic to try dynamic adjustment of the chunk size, but we should definitely keep it in mind while implementing upload through our API. Sure, changes in nginx would in no way help with direct upload to S3, but let's mitigate one path at a time (unless we then see similar errors on the girder -> S3 path).
It is a 4XX error, meaning client side,
Well -- even though they are "client errors", mitigating them is not always straightforward, I would say. A 403 could be due to incorrect file-level permissions on the server side (fixable only on the server side). A 404 -- the file was moved on the server, and there is nothing the client can really do (besides being provided a corrected address). A 405 (Method Not Allowed) -- I think I saw those appearing until I configured the web server correctly to fulfill the requests the client was supposed to be able to make, and so on.
I am not sure if I would approach adjusting girder client upload logic
Having said that, I think it was a good idea to retry sendRestRequest regardless, but probably only on a limited set of error codes (e.g. 408 Request Timeout, 409 Conflict) -- do any others come to mind? I will quickly add retry/logging there to see what we get.
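A sketch of what such a "retry only on some errors" predicate could look like (the helper name and the treatment of connection-level errors are assumptions, not the actual dandi-cli code):

```python
# Status codes from the discussion: 408 Request Timeout, 409 Conflict.
RETRYABLE_STATUS = {408, 409}

def should_retry(exc):
    """Decide whether a failed chunk upload is worth retrying."""
    # Connection-level failures, e.g. the ('Connection aborted.',
    # OSError(110, 'ETIMEDOUT')) seen earlier in this thread;
    # ConnectionError and TimeoutError are both OSError subclasses.
    if isinstance(exc, OSError):
        return True
    # girder_client.HttpError carries the HTTP status on the exception.
    return getattr(exc, "status", None) in RETRYABLE_STATUS
```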
Sorry, I actually had wondered if pip did upgrade, but did not check. Now with the "latest" version, still a problem: https://gist.github.com/mhhennig/f7ffec71fd6140f111fa2ee0cc59299d I'm getting the impression the problem may be on my side -- could this be the case? What causes these timeouts -- it's the client, right? Could this have to do with the network config we have here in Informatics?
ok, I have pushed 29bd59ac66cd58b6e3a4de292ee44316f70ee8e8 into the same branch, so whenever @mhhennig has a chance to try that branch (it will be in combination with smaller chunk size) we will see the effect!
Also, now (as of 1d00ad0ee246cff0bcf83699839a5ad98b091405) we will be back to 64M chunk size by default but you could set desired one with env var DANDI_MAX_CHUNK_SIZE=8388608 (for 8M chunk).
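Reading that override could look roughly like this (a sketch; the actual dandi-cli implementation may differ):

```python
import os

# 64 MiB, the default chunk size mentioned earlier in the thread.
DEFAULT_MAX_CHUNK_SIZE = 1024 * 1024 * 64

def get_max_chunk_size():
    """Return the chunk size, honoring the DANDI_MAX_CHUNK_SIZE
    environment variable when set (e.g. 8388608 for 8 MiB)."""
    value = os.environ.get("DANDI_MAX_CHUNK_SIZE")
    return int(value) if value else DEFAULT_MAX_CHUNK_SIZE
```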
I think those should be good in general to have so issued https://github.com/dandi/dandi-cli/pull/139 from that branch.
Re the most recent attempt by @mhhennig, here it is with dtimes (the leading number on each line is the delta, in ms, to the following log line):
10732 2020-07-30 14:43:37,505 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=8388608&uploadId=5f22ce7e385430deb68c9bb3', None) kwargs={'files': None, 'data': <girder_client._Progress
0 2020-07-30 14:43:48,237 [ DEBUG] REST<: {'_id': '5f22ce7e385430deb68c9bb3', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-30T13:43:26.526000+0
- {'upload': 0.056829614663500866}
11 2020-07-30 14:43:48,237 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: done=16777216 %done=0.056829614663500866
10733 2020-07-30 14:43:48,248 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=16777216&uploadId=5f22ce7e385430deb68c9bb3', None) kwargs={'files': None, 'data': <girder_client._Progres
1 2020-07-30 14:43:58,981 [ DEBUG] REST<: {'_id': '5f22ce7e385430deb68c9bb3', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-30T13:43:26.526000+0
- {'upload': 0.0852444219952513}
9 2020-07-30 14:43:58,982 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: done=25165824 %done=0.0852444219952513
11443 2020-07-30 14:43:58,991 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=25165824&uploadId=5f22ce7e385430deb68c9bb3', None) kwargs={'files': None, 'data': <girder_client._Progres
1 2020-07-30 14:44:10,434 [ DEBUG] REST<: {'_id': '5f22ce7e385430deb68c9bb3', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-30T13:43:26.526000+0
- {'upload': 0.11365922932700173}
4 2020-07-30 14:44:10,435 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: done=33554432 %done=0.11365922932700173
953781 2020-07-30 14:44:10,439 [ DEBUG] REST>: args=('POST', 'file/chunk?offset=33554432&uploadId=5f22ce7e385430deb68c9bb3', None) kwargs={'files': None, 'data': <girder_client._Progres
----- 2020-07-30 15:00:04,220 [ DEBUG] Caught exception ('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))
and yes -- most likely it is some rate limiting policy on your network. I would contact the admins and point them to this issue for more details. Meanwhile, if you would be so kind -- could you try to pip install --upgrade
again, set also DANDI_MAX_CHUNK_SIZE=8388608 (just in case or would be fun to see also without it) and see if up to 3 retries magically help us to go forward ;-)
Ok, here's a run with the latest changes, and the following command:
DANDI_LOG_GIRDER=1 DANDI_DEVEL=1 DANDI_MAX_CHUNK_SIZE=8388608 dandi -l DEBUG upload --devel-debug --existing force sub-001/sub-001_ses-20200727T094620_ecephys.nwb 2>&1 | tee /tmp/mydandi.log
Still no success: https://gist.github.com/mhhennig/f7ffec71fd6140f111fa2ee0cc59299d
I'll also check with our comp admin people...
I have pushed 2abe7fac29a2556e226bb1626c99bdd791c6dc47, with which we would retry on more exceptions. Would you be so kind, @mhhennig, to install/rerun again? I still hope that we will get to a workable solution (while the admin people react).
Just re-ran it, still breaks, but seems to have gotten a bit further this time (log in same gist).
Ha, this is a "new" one:
9230 2020-07-30 19:13:48,083 [ DEBUG] REST#1>: args=('POST', 'file/chunk?offset=285212672&uploadId=5f230859385430deb68c9bb8', None) kwargs={'files': None, 'data': <girder_client._Prog
1 2020-07-30 19:13:57,313 [ DEBUG] REST#1<: {'_id': '5f230859385430deb68c9bb8', 'assetstoreId': '5d9618588a600b0dcb25b4a6', 'behavior': 's3', 'created': '2020-07-30T17:50:17.064000
- {'upload': 0.9945182566112651}
4 2020-07-30 19:13:57,314 [ DEBUG] PROGRESS[sub-001/sub-001_ses-20200727T094620_ecephys.nwb]: done=293601280 %done=0.9945182566112651
10087 2020-07-30 19:13:57,318 [ DEBUG] REST#0>: args=('POST', 'file/chunk?offset=293601280&uploadId=5f230859385430deb68c9bb8', None) kwargs={'files': None, 'data': <girder_client._Prog
- Response text: {"message": "Server has received 292659200 bytes, but client sent offset 293601280.", "type": "rest"}
0 2020-07-30 19:14:07,405 [ DEBUG] REST#0: failed with HTTP error 400: POST https://girder.dandiarchive.org/api/v1/file/chunk?offset=293601280&uploadId=5f230859385430deb68c9bb8
----- 2020-07-30 19:14:07,405 [ DEBUG] Caught exception HTTP error 400: POST https://girder.dandiarchive.org/api/v1/file/chunk?offset=293601280&uploadId=5f230859385430deb68c9bb8
- Response text: {"message": "Server has received 292659200 bytes, but client sent offset 293601280.", "type": "rest"}
which says that the server did not receive a "part" (not the same as our chunk size) of the previous (retransmitted?) "body": 293601280 - 292659200 = 942080. I think I have provided a possible "fix" in 63e1350c67b6fb461a49fca3588a4b106616a39b, which I just pushed. If that one doesn't work -- something might need to be done on the girder side to accommodate retransmissions. @mhhennig -- would it be too much to ask to try again? (any word from the admins BTW?)
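The resynchronization idea could be sketched like this (hypothetical helper names; only the error-message format is taken from the log above): when the server reports how many bytes it actually received, restart the next chunk from the server's count instead of the client's bookkeeping.

```python
import re

# Message format copied from the 400 response in the log above.
_RECEIVED_RE = re.compile(r"Server has received (\d+) bytes")

def resync_offset(error_message, client_offset):
    """Return the offset to resume from after a chunk-offset mismatch.

    If the server tells us how many bytes it has, trust that count;
    otherwise keep the client's own offset unchanged.
    """
    m = _RECEIVED_RE.search(error_message)
    if m:
        return int(m.group(1))
    return client_offset
```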
I'll re-try as many times as it takes :) Looking promising now, it's at 1.7% and running, I think this far I never got it previously.
At 6% now, I'll keep it running overnight...
woooohooo! That is promising!
BTW, if you are keeping that log, please share it afterwards (you can just email it to debian@onerussian.com) -- I would like to see how many retries happened etc. Also, please do not upload other files right away -- we will need to check that integrity (checksum) matches (after download). Then let's try another big one without reducing chunk size and see how that goes... if not too bad -- then we could try without any special settings, but I am afraid that without an nginx tune-up for timeouts it would fail, since it will try to do multiple uploads in parallel.
So, something changed, it completed although I'm not sure the upload actually worked. Sending the log via email.
PATH SIZE ERRORS UPLOAD STATUS MESSAGE
sub-001/sub-001_ses-20200727T094620_ecephys.nwb skipped Expecting value: line 1 column 1 (char 0)
Summary: 0 Bytes 444.0 kB/s 1 skipped 1 Expecting value: line 1 column 1 (char 0)
Update: data rates from the machine were very low (below a MByte/s) during the transfer, so there may be a problem on our side. Computing support is investigating...
Question from support: Is it an IPv4 or IPv6 connection? If the latter they suggest to test IPv4.
Question from support: Is it an IPv4 or IPv6 connection? If the latter they suggest to test IPv4.
AFAIK just good old v4, I do not even see a record for IPv6 for any of our servers:
$> nslookup girder.dandiarchive.org
Server: 192.168.1.1
Address: 192.168.1.1#53
Non-authoritative answer:
Name: girder.dandiarchive.org
Address: 3.19.164.171
As for speed -- what speeds do you attain while uploading to any other server, e.g. by running the speedtest google provides (just search for "speedtest")? What did the admins say would be an "expected" speed for you to see?
re failed run... some bug somewhere. Would you be so kind to rerun while making sure you provide the --devel-debug option?
Actually, ideally, let's run it in debug mode, which would give us an interactive debugger, so we could just troubleshoot it interactively when/if that issue occurs. Do something like
DANDI_MAX_CHUNK_SIZE=8388608 DANDI_LOG_GIRDER=1 DANDI_DEVEL=1 dandi -l DEBUG --pdb upload -e force --validation skip --existing force --devel-debug PATHTOFILE
(no saving of the log into a file) so if an exception happens, you would end up in pdb and we could see details (better to work on that in slack more "interactively"), but the first commands to run would be l to see the surrounding code and p to print some of the variables involved.
edit: removed -i local-docker
FTR: I have tried to emulate slow connection locally with
sudo tc qdisc add dev vetha3482b2 root tbf rate 1024kbit latency 50ms burst 1540
for both interfaces related to dandi_client and dandi_girder local development containers... upload speed crawled 0.5-1 Mbps (bits, not bytes! ;)) but did not trigger any retransmission...
An interesting observation in the log you shared: it is only the initial REST upload request that seems to fail, and then it proceeds normally without retransmissions... Since the backtrace above shows that it is coming from the SSL layer, and unlike local testing (which uses http) the main server uses https, I suspect it might be somehow related to SSL... I will see if I can establish a dedicated "slow down" for the connection going from my laptop to the main server so I could possibly trigger the original issue myself...
FTR: https://github.com/thombashi/tcconfig seems to be the ultimate tool to assist with "emulation". I did sudo `which tcset` wlp82s0 --delay 100ms --network 3.19.164.171 --rate 8000Kbps (tcset is python based, installed in a virtualenv, thus the dance with `which`).
Unfortunately, even after adding --loss 0.01% I do not see the "desired" behavior of the REST request erroring out. Increased to --delay 200ms --network 3.19.164.171 --rate 8000Kbps --loss 1% and got the same...
So I am ready to blame (check) whether it might be the Python libraries etc. Would you be so kind to provide the output of conda list --explicit in your conda environment? Maybe that brings us "luck" ;)
Ok. Some unrelated packages here:
thanks! I edited your comment to make those collapsible. Hint for the future -- use html:
<details>
<summary></summary>
```shell
HERE GOES LOTS OF TEXT
```
</details>
FWIW, I am sorry to report @mhhennig that you are too special! ;) With network throttling, in a similar conda env (+ plain pip install of dandi from git, the same girder-client version verified) -- I see no initial retransmission.
Having said that, I think the last error we observed (Expecting value: line 1 column 1 (char 0)) is most likely our client code or a pyout issue. So if you run with --devel-debug and that --pdb -- if you crash similarly, it means it is our code; if not, it could have been pyout ;) Are you running that re-upload ATM? How is it going?
we are no longer using girder backend, and will stay hopeful that such an error no longer happens! ;)
From the slack
I have not analyzed it yet at all: is it something we could mitigate on the client side, or is it the girder server component? @mgrauer -- do we have access to some girder logs which could shine some more light on this particular case? (happened around 2020-07-28 14:28:50,777, timezone yet to clarify)