gmazzamuto opened this issue 1 year ago
also pinging @AlmightyYakob as there may be both server and cli interactions at play.
could you please share the log file which should have been mentioned at the end of the process? Maybe it would help to see what is not implemented and thus causing the 501.
I can't see any meaningful information in the log file besides the snippets I have pasted above, I'm attaching it here anyways! 20230327134937Z-3822266.log.gz
I see two types of errors in the log file:
@gmazzamuto What type of filesystem are you uploading from? Do the errors happen again if you upload again?
The files are not modified during the upload. I am uploading from an NFS share. When I reuploaded, there were no errors. The upload I am doing right now is showing two 400 errors and one 501 error.
NFS can be fun, depending on the enthusiasm (or lack of such) of admins ;) At some point, to reduce the "fun", I even made it all plain sync (read -- slow) to avoid the various kinds of surprises I had run into. How "recent" were the data files you are uploading, i.e. when were they created/modified relative to their time of upload? Maybe we should start recording the mtime of the files we upload, so that if smth goes wrong we could check whether the mtime changed, but I am not sure that is the whole situation here.
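A minimal sketch of that mtime-recording idea (hypothetical helper names; not existing dandi-cli code):

import os

def snapshot_mtime(path: str) -> int:
    # Hypothetical: record the file's modification time (ns) right before upload.
    return os.stat(path).st_mtime_ns

def assert_mtime_unchanged(path: str, mtime_ns: int) -> None:
    # Hypothetical: after the upload (or on error), verify the file was not
    # modified under our feet; if it was, the filesystem/user is to blame.
    if os.stat(path).st_mtime_ns != mtime_ns:
        raise RuntimeError(f"{path} was modified during upload")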
Hi Yarik, the files were created several hours before the upload. They are stored on a NAS server with RAID and accessed through an NFS share. Do you think the problems I am seeing are due to the network share? I've never noticed data corruption before.
During the last upload I got several errors, including a 501 but with a slightly different stack trace:
Traceback (most recent call last):
File "/opt/bin/dandi", line 8, in <module>
sys.exit(main())
File "/opt/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/lib/python3.8/site-packages/click/decorators.py", line 38, in new_func
return f(get_current_context().obj, *args, **kwargs)
File "/opt/lib/python3.8/site-packages/dandi/cli/base.py", line 102, in wrapper
return f(*args, **kwargs)
File "/opt/lib/python3.8/site-packages/dandi/cli/cmd_upload.py", line 98, in upload
upload(
File "/opt/lib/python3.8/site-packages/dandi/upload.py", line 343, in upload
raise upload_err
File "/opt/lib/python3.8/site-packages/dandi/upload.py", line 240, in process_path
for r in dfile.iter_upload(
File "/opt/lib/python3.8/site-packages/dandi/files/zarr.py", line 470, in iter_upload
size = fut.result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/lib/python3.8/site-packages/dandi/files/zarr.py", line 544, in _upload_zarr_file
storage_session.put(
File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 311, in put
return self.request("PUT", path, **kwargs)
File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 210, in request
for i, attempt in enumerate(
File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 347, in __iter__
do = self.iter(retry_state=retry_state)
File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 325, in iter
raise retry_exc.reraise()
File "/opt/lib/python3.8/site-packages/tenacity/__init__.py", line 158, in reraise
raise self.last_attempt.result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/opt/lib/python3.8/site-packages/dandi/dandiapi.py", line 240, in request
result.raise_for_status()
File "/opt/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/d5f67f3d-144c-447e-aae0-ca5b979c648d/0/1/22/58?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230330%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230330T134214Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=e65355bb3725ca73f8d313d9e907ddbbeb00c3e7868d392b233b8eaf2eb7b1b5
@jwodder what about those "not implemented" 501s coming from AWS -- what are those exactly about and isn't it us who submit those requests?
@yarikoptic As before, it's the same problem as #1033.
Hi Yarik, the files were created several hours before the upload. They are stored on a NAS server with RAID and accessed through an NFS share. Do you think the problems I am seeing are due to the network share?

In that #1033 it was also NFS (over beegfs), so so far there is some common aspect here. Do you know the underlying filesystem of your NAS server?
I've never noticed data corruption before.
it might not necessarily be "corruption" but some other odd behavior/delayed metadata propagation. E.g. in https://github.com/dandi/dandi-cli/issues/1033#issuecomment-1176447518 @jwodder hypothesizes that some inability to get the file size reported may force requests to
send a "Transfer-Encoding" (with value "chunked") when it's unable to determine the size of the request payload .. The only reason I can think of as to why getting the file size should fail would be due to some sort of filesystem hiccup.
During the last upload I got several errors, including a 501 but with a slightly different stack trace: ... requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/d5f67f3d-144c-447e-aae0-ca5b979c648d/0/1/22/58?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230330%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230330T134214Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=e65355bb3725ca73f8d313d9e907ddbbeb00c3e7868d392b233b8eaf2eb7b1b5
@jwodder believes (and I trust him on that) that it is likely the chunked-encoding issue described in https://github.com/dandi/dandi-cli/issues/1033#issuecomment-1176447518 and quoted above.
@jwodder - do you mean this https://github.com/psf/requests/blob/HEAD/requests/models.py#L549 or some other location? How could we instrument our code to discover more about the situation?
@yarikoptic That's one of the locations involved, yes. What exactly do you want to discover?
ultimately -- a resolution to this issue so that there are no errors during uploading. For that, IMHO, we need to troubleshoot further and instrument the code to gather more information.
Found another old NFS related issue: https://github.com/dandi/dandi-cli/issues/764
@yarikoptic Could you be more precise about exactly what sort of instrumentation you want? Your first bullet point sounds like you want to detect & report when requests fails to get a file's size, and your second bullet point also sounds like you want to detect & report when requests fails to get a file's size (except that the second bullet says to do this by inspecting some exception, but the only exception dandi-cli sees is raised in response to S3 returning a 501 error). Am I misinterpreting what you're saying, or are you asking for the exact same thing twice with completely different words?
Your first bullet point sounds like you want to detect & report when requests fails to get a file's size

correct

and your second bullet point also ... Am I misinterpreting what you're saying, or are you asking for the exact same thing twice with completely different words?
in the 2nd point I meant: upon the 501 error from AWS, check whether we fail to get the size the way requests fails to get it, and whether on retry requests also fails to get the size. If we do not fail to get the size, but requests again leads us to 501, then it is either ... or the requests code which causes it to fail there.

@yarikoptic Yes, we are retrying on the 501 errors. For example, grepping for one of the failed upload URLs from the logs in this comment gives:
2023-03-27T17:00:04+0200 [DEBUG ] dandi 3822266:140508050728704 PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:11+0200 [DEBUG ] urllib3.connectionpool 3822266:140508050728704 https://dandiarchive.s3.amazonaws.com:443 "PUT /zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228 HTTP/1.1" 500 None
2023-03-27T17:00:13+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:15+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:18+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:21+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:24+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:29+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:34+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:41+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:49+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:00:59+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:01:11+0200 [WARNING ] dandi 3822266:140508050728704 Retrying PUT https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
2023-03-27T17:01:12+0200 [DEBUG ] dandi 3822266:140509742302976 Error uploading zarr: HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: https://dandiarchive.s3.amazonaws.com/zarr/0205ef2e-71ba-4ba1-a4b3-489ee5efe02b/0/0/8/42?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUBRWC5GAEKH3223E%2F20230327%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20230327T145955Z&X-Amz-Expires=600&X-Amz-SignedHeaders=content-md5%3Bhost%3Bx-amz-acl&X-Amz-Signature=70c3f2c08fc29ea209c8a60fb366fa4619cb2d7190b9b5030c82552cad51a228
I continued arguing that if we get to that 501 exception handling after the initial error, we must be failing to get the size the way requests fails to get it, if on retry requests also fails to get the size.
So, if & when the 501 error occurs (only for the first time for a given uploaded file?), you want dandi-cli to try to get the filesize the same way requests does it, and then what?
if we can get the file size there, then it is one of those 3 cases I listed which could be at play here. Most likely, imho, it is some cause other than getting the size. But to entirely rule it out -- we would need to overload
❯ grep 'def super' requests/utils.py
def super_len(o):
which is what is used by requests, and instrument it to log the "body" (what file it is) and the returned size. This way we would discover for sure whether the size is somehow to blame. What if we start from there and add instrumentation of that function, triggered by e.g. a DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN env var?
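A minimal sketch of such instrumentation (assumed wrapper and logger names; not the actual dandi-cli code), which would emit debug lines like the super_len() reports seen later in this thread:

import functools
import logging
import os

import requests.models
import requests.utils

lgr = logging.getLogger("dandi")

def instrument_super_len() -> None:
    # Wrap requests.utils.super_len() so every call logs which "body" was
    # measured and what size came back.
    orig = requests.utils.super_len

    @functools.wraps(orig)
    def logging_super_len(o):
        size = orig(o)
        lgr.debug("requests.utils.super_len() reported %s for %r", size, o)
        return size

    # requests.models imports super_len by name, so patch both bindings.
    requests.utils.super_len = logging_super_len
    requests.models.super_len = logging_super_len

if os.environ.get("DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN"):
    instrument_super_len()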
@gmazzamuto would you be so kind as to try the fresh 0.52.0 release, which logs more information about failing requests -- maybe it would give us more clues about what is going on.
@gmazzamuto, if you could install from the GitHub master branch and run with the DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN=1 environment variable set, that would be even better, and the logs could help us figure out more.
Sure, I can give it a try. But I have completed my uploads for now; there are only some small files left (photos and JSON sidecar files). I guess it's more likely to happen with larger files, but I can try anyway.
any odd behaviors you were (un)lucky enough to run into, @gmazzamuto?
Hi Yarik, I haven't had the chance to upload new data yet, I will try in the coming days then I'll let you know how it goes!
@gmazzamuto we are holding our breath! ;-)
Hi Yarik, I haven't forgotten about this, but I am just waiting for a bunch of new data to upload to see if the error can be triggered again. It should be ready soon
Hi Yarik, I am getting ready for new uploads next week. Are the instructions still valid? Install from GitHub master branch and run with DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN=1
yes, besides that the PyPI release (0.53.0 or later) would now be just as good.
@gmazzamuto did you get a chance to gather more debugging information on this issue?
Yes, I have tried recently and I still got the same errors. I had to restart the upload a couple of times. I have the log file, but it's really huge, so I haven't uploaded it. I will try again soon, maybe with a smaller payload.
Hi folks,
I am still getting those errors. Attached is a heavily redacted log file with many lines deleted; I've left the lines around where the error happened. I've deleted all the lines of this kind:
The `/NIH/DANDI/000026/rawdata_biolab/I60/20220829_I60_49_LeftDet_638_RightDet_488/tiff_right/x_132.00000_y_099.60000_z_000.00000__cam_r.ome.tif` file was not matched by any regex schema entry.
Estimating digests for /NIH/DANDI/000026/sub-I59/ses-SPIM/micr/sub-I59_ses-SPIM_sample-BrocaAreaS27_stain-Calretinin_SPIM.ome.zarr/0/0/84/55
and the successful transfers.
I found the problem:
2023-11-22T17:44:16+0100 [DEBUG ] dandi 639852:140007712995072 requests.utils.super_len() reported 0 for <_io.BufferedReader name='/NIH/DANDI/000026/sub-I59/ses-SPIM/micr/sub-I59_ses-SPIM_sample-BrocaAreaS28_stain-Somatostatin_SPIM.ome.zarr/0/1/6/12'>
If super_len() returns 0, requests will use "chunked" encoding, which AWS apparently doesn't support. It's my understanding that uploading an empty file to AWS (and thus to Dandi Archive) isn't something that can be done in the first place, so (aside from the UX around the error message) the upload is correct to error.
@gmazzamuto can you confirm that /NIH/DANDI/000026/sub-I59/ses-SPIM/micr/sub-I59_ses-SPIM_sample-BrocaAreaS28_stain-Somatostatin_SPIM.ome.zarr/0/1/6/12 is an empty (0-length) file?
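For illustration, the mechanism described above can be reproduced without any server (hypothetical URL; a generator is just a convenient body whose size super_len() reports as 0):

import requests

# When requests cannot determine a non-zero body length, the prepared
# request falls back to chunked transfer encoding instead of Content-Length.
req = requests.Request("PUT", "https://example.com/object", data=iter([b"payload"]))
prepared = req.prepare()
print(prepared.headers.get("Transfer-Encoding"))  # -> chunked
print(prepared.headers.get("Content-Length"))     # -> None

A regular file for which super_len() (mis)reports 0 takes the same code path, which would explain S3 answering such a PUT with 501 Not Implemented.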
It's my understanding that uploading an empty file to AWS (and thus to Dandi Archive) isn't something that can be done in the first place
What do you mean exactly? For DANDI we indeed disallow empty blobs, but S3 does support empty files [*], so if zarr needs an empty file (yet to be checked) we should be able to store it on S3.
Since I am not an expert in zarr -- does having an empty file within a zarr make sense? @satra do you know? So far ChatGPT says "No" to me, but there could be empty folders! (for empty datasets; I am not sure if they could safely be omitted, but we can't have empty folders on S3)
if it does -- I guess we would need to make sure we could "upload" a 0-length file to S3 (might be some custom touch-like call without actual content transfer, since there is none to be done). If it doesn't -- we should add a check to zarr validation and fail informatively if such a file is found within a zarr.
[*] https://aws.amazon.com/s3/faqs/ says "Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 TB. "
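If zarr did turn out to need 0-length keys, a plain empty-bytes PUT should avoid the chunked path -- a sketch assuming a valid presigned URL (placeholder below):

import requests

presigned_url = "https://example-bucket.s3.amazonaws.com/key?..."  # placeholder

# An empty bytes body is not treated as a stream, so requests sends
# Content-Length: 0 rather than Transfer-Encoding: chunked.
resp = requests.put(presigned_url, data=b"")
resp.raise_for_status()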
Previous issue on uploading empty files to the Archive: https://github.com/dandi/dandi-archive/issues/168
@yarikoptic No, it's not an empty file. It's 965KiB. Indeed, I reuploaded the whole .ome.zarr that errored and now it worked.
Previous issue on uploading empty files to the Archive: dandi/dandi-archive#168
yeap, blobs of size 0 are still not allowed, although indeed it is somewhat of a superficial restriction. And apparently the case here is also that we do not have a 0-length file, but rather that requests.utils.super_len() managed to report 0 for a file which is not empty. Looking at the function, and at https://github.com/psf/requests/blob/main/src/requests/utils.py#L143:
try:
    fileno = o.fileno()
except (io.UnsupportedOperation, AttributeError):
    # AttributeError is a surprising exception, seeing as how we've just checked
    # that `hasattr(o, 'fileno')`. It happens for objects obtained via
    # `Tarfile.extractfile()`, per issue 5229.
    pass
else:
    total_length = os.fstat(fileno).st_size
it is not 100% clear whether it was some io.UnsupportedOperation for some reason, or whether os.fstat(fileno).st_size indeed returned 0.
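A tiny probe (hypothetical helper) could disambiguate the two branches:

import io
import os

def probe_size(fobj) -> str:
    # Report which branch of requests' super_len() applies to this file object.
    try:
        fileno = fobj.fileno()
    except (io.UnsupportedOperation, AttributeError) as exc:
        return f"fileno() unavailable: {exc!r}"  # super_len() would skip fstat()
    return f"os.fstat() reports {os.fstat(fileno).st_size} bytes"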
I found nothing in requests to mention NFS, but I do know that "NFS is special" and all kinds of oddities could happen there.
@gmazzamuto -- would you mind sharing the output of
stat /NIH/DANDI/000026/sub-I59/ses-SPIM/micr/sub-I59_ses-SPIM_sample-BrocaAreaS28_stain-Somatostatin_SPIM.ome.zarr/0/1/6/12
?
@jwodder -- any ideas? If not -- we should extend your instrumentation further to clarify where the gotcha is and whether there is some sensible workaround (e.g. a sleep). E.g. instead of just printing when super_len returns 0 -- do os.fstat on that .fileno, and then possibly even smth as calling stat on the .name from outside of Python and checking that information -- if both are 0s, then "all cool", the file must be empty on that round. But if not -- we caught NFS/Python interplay, and it might be worth reporting to requests depending on what we find. Also do loop a few times with a sleep to ensure that the value doesn't change within e.g. 5 seconds; if it changes -- just NFS to blame, and for us to work around unless we could unveil some indicator -- please log the full output of fstat in such cases; maybe mtime/ctime or some other attribute would also be degenerate.
Here is the output of stat:
File: /NIH/DANDI/000026/sub-I59/ses-SPIM/micr/sub-I59_ses-SPIM_sample-BrocaAreaS28_stain-Somatostatin_SPIM.ome.zarr/0/1/6/12
Size: 987680 Blocks: 1936 IO Block: 1048576 regular file
Device: 3dh/61d Inode: 244535389962 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/user) Gid: ( 1000/group)
Access: 2023-11-22 17:01:35.752336409 +0100
Modify: 2023-11-22 10:14:49.966501429 +0100
Change: 2023-11-22 10:14:49.966501429 +0100
Birth: -
@yarikoptic

possibly even smth as calling stat on the .name from outside of Python and checking that information

Do you mean invoking an external program that does stat? I don't believe there's a portable way to do that, and I'd rather not blindly try to figure out how to do such a thing in Windows.
But if not -- we caught NFS/Python interplay, and it might be worth reporting to requests depending on what we find.

I doubt the requests maintainers are going to be willing or able to do anything about NFS being a liar.
AFAIK there is no NFS on Windows that we need to troubleshoot here, so we should be fine. Sure thing, if you see a better, non-stat way, that is ok with me too. The idea was to get a "third-party, non-Python assessment" since, after all, Python is also not god-given and I personally do not know whether it would lead to the same stat syscall with no other conditioning.
I doubt the requests maintainers are going to be willing or able to do anything about NFS being a liar.
well, it might at least be worth raising their awareness that NFS is a liar and workarounds might be needed. But first we really need to catch it "red handed".
@yarikoptic Problem: I don't believe there's a portable way to call stat via an external program on Unix either (unless you want to invoke another programming language like Perl). The best candidate -- the stat(1) command -- isn't part of POSIX, so the GNU version (used on Linux) and the BSD version (used on macOS) emit different formats and take different options. Also, I strongly suspect that stat(1), Perl, and Python all call stat(2).
@yarikoptic Setting aside the external program issue for now, what exactly is the flow you want to be added around super_len()? Your comment implies the following:

- When super_len() returns 0 on a file-like object:
  - os.stat() on the file path
  - os.fstat() on the file object's fileno
  - ... requests?
- What if super_len() raises an exception?

Is this the sort of thing you want?
Since we do not expect 0-length files at all:

- When super_len() returns 0 on a file-like object, log:
  - super_len()
  - os.stat() on the file path
  - os.fstat() on the file object's fileno
  - an external stat call (looping until X is not 0 after Y seconds)
- What if super_len() raises an exception?
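Roughly, a sketch of that flow (hypothetical helper name and retry parameters; the actual instrumentation landed later via #1370):

import os
import time

from requests.utils import super_len

def checked_super_len(fobj, tries: int = 5, interval: float = 0.5) -> int:
    size = super_len(fobj)
    if size != 0:
        return size
    # A size of 0 is suspicious for our data; double-check that NFS isn't lying.
    for _ in range(tries):
        st_size = os.stat(fobj.name).st_size
        fst_size = os.fstat(fobj.fileno()).st_size
        if st_size == 0 == fst_size:
            return 0  # consistently empty; believe it
        size = super_len(fobj)
        if size != 0:
            return size  # the filesystem caught up; proceed with the new size
        time.sleep(interval)
    raise RuntimeError(
        f"requests.utils.super_len() reported size of 0 for {fobj.name!r}, "
        f"but os.stat() reported size {st_size} bytes {tries} tries later"
    )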
@gmazzamuto when will you be uploading again? We have now merged #1370, so if you use the master version of dandi-cli with DANDI_DEVEL_INSTRUMENT_REQUESTS_SUPERLEN=1, we should do even more checks/logging at the debug level to hopefully pin down the erroneous behavior of NFS here.
Here is the latest log (again, where I've deleted many lines) with several exceptions of different kinds.
@gmazzamuto Are you sure you were using commit b9a1099 of dandi-cli? I see a log message about a zero-length Zarr file but none of the messages added in #1370.
no sorry, my fault: I saw a new available version of dandi-cli and installed that without checking if it contained the relevant commit. I'll send the log using the correct version of the client
Hi all, here is the latest log:
I see the following problems in the log:
2023-12-02T02:15:59+0100 [DEBUG ] dandi 3030432:140064092833536 requests.utils.super_len() reported 0 for <_io.BufferedReader name='/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS13_stain-Calretinin_SPIM.ome.zarr/0/1/82/89'>
2023-12-02T02:15:59+0100 [DEBUG ] dandi 3030432:140064092833536 - Size of 0 is suspicious; double-checking that NFS isn't lying
2023-12-02T02:15:59+0100 [DEBUG ] dandi 3030432:140064092833536 - stat('/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS13_stain-Calretinin_SPIM.ome.zarr/0/1/82/89') = os.stat_result(st_mode=33188, st_ino=248398250076, st_dev=61, st_nlink=1, st_uid=5000, st_gid=7000, st_size=312461, st_atime=1701477007, st_mtime=1700979160, st_ctime=1700979160)
2023-12-02T02:15:59+0100 [DEBUG ] dandi 3030432:140064092833536 - fstat('/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS13_stain-Calretinin_SPIM.ome.zarr/0/1/82/89') = os.stat_result(st_mode=33188, st_ino=248398250076, st_dev=61, st_nlink=1, st_uid=5000, st_gid=7000, st_size=312461, st_atime=1701477007, st_mtime=1700979160, st_ctime=1700979160)
RuntimeError: requests.utils.super_len() reported size of 0 for '/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS13_stain-Calretinin_SPIM.ome.zarr/0/1/82/89', but os.stat() reported size 312461 bytes 1 tries later
2023-12-02T07:38:21+0100 [DEBUG ] dandi 3030432:140061148436224 requests.utils.super_len() reported 0 for <_io.BufferedReader name='/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS17_stain-NeuN_SPIM.ome.zarr/0/1/32/51'>
2023-12-02T07:38:21+0100 [DEBUG ] dandi 3030432:140061148436224 - Size of 0 is suspicious; double-checking that NFS isn't lying
2023-12-02T07:38:21+0100 [DEBUG ] dandi 3030432:140061148436224 - stat('/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS17_stain-NeuN_SPIM.ome.zarr/0/1/32/51') = os.stat_result(st_mode=33188, st_ino=248625074703, st_dev=61, st_nlink=1, st_uid=5000, st_gid=7000, st_size=1355361, st_atime=1701494494, st_mtime=1700989187, st_ctime=1700989187)
2023-12-02T07:38:21+0100 [DEBUG ] dandi 3030432:140061148436224 - fstat('/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS17_stain-NeuN_SPIM.ome.zarr/0/1/32/51') = os.stat_result(st_mode=33188, st_ino=248625074703, st_dev=61, st_nlink=1, st_uid=5000, st_gid=7000, st_size=1355361, st_atime=1701494494, st_mtime=1700989187, st_ctime=1700989187)
RuntimeError: requests.utils.super_len() reported size of 0 for '/NIH/DANDI/000026/sub-I62/ses-SPIM/micr/sub-I62_ses-SPIM_sample-BrocaAreaS17_stain-NeuN_SPIM.ome.zarr/0/1/32/51', but os.stat() reported size 1355361 bytes 1 tries later
Conclusion: NFS is a liar.
"coolio!" another question before we proceed: @jwodder do you think it is pertinent to only size 0 or might be for other occasions that NFS reports wrong size first -- would some check/code fail if file size changes during upload there from the moment of initial call to super_len
?
@yarikoptic
would some check/code fail if file size changes during upload there from the moment of initial call to super_len?
That depends on how the S3 server is implemented. The HTTP standard doesn't seem to specify a behavior for when a request has an inaccurate Content-Length, so the server could react however it likes, such as by storing a file with a size that matches the Content-Length but not the uploaded data.
@jwodder

- let's add (at the end of iter_upload?) a check that the file size didn't change from the point when the file was initially requested to be uploaded. If changed -- raise an informative exception and blame the filesystem and/or user for having the file changed under our "feet" (see the sketch below)
- for files of size 0 -- add a loop alike we have but re-checking at smaller intervals and up to e.g. 2 seconds and if changes, log a warning and proceed with that new size

@gmazzamuto could you share details of the kernel version (uname -a) and the mount options (output of mount grepped for that location), so we have a record of what scenario could potentially lead to such odd behavior? Ideally someone should report/clear it up with NFS (kernel) developers.
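A sketch of the first bullet (hypothetical helper; exactly where it would hook into iter_upload is left open):

import os

def verify_size_unchanged(path: str, size_at_start: int) -> None:
    # To run at the end of uploading each file: if the size changed while we
    # were uploading, raise an informative error instead of letting a
    # confusing HTTP failure surface.
    size_now = os.stat(path).st_size
    if size_now != size_at_start:
        raise RuntimeError(
            f"Size of {path!r} changed from {size_at_start} to {size_now} bytes "
            "during upload; the file was modified (or its size misreported by "
            "the filesystem) while it was being uploaded"
        )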
@yarikoptic

let's add (at the end of iter_upload?) a check that the file size didn't change from the point when the file was initially requested to be uploaded.

S3 already checks that the uploaded files match the MD5 checksums that the client provided before starting the upload. If a Content-Length were to get set to an inaccurate value and S3 honored it, this check would fail, and the upload would fail with an HTTP error.

for files of size 0 -- add a loop alike we have but re-checking at smaller intervals and up to e.g. 2 seconds and if changes, log a warning and proceed with that new size

Is this a change to the current loop when spying on super_len(), or is this a new loop to be added somewhere else (and, if so, where)?
Dear all,
I am seeing errors while uploading some Zarr files:
Here are some (slightly redacted) excerpts from the log:
Hope this helps! Giacomo