Closed. emilyasterjones closed this issue 2 years ago.
That doesn't sound good. What version of the dandi-cli? (0.31.0? Want to make sure that we aren't chasing ghosts.)
Did you provide the `-J`|`--jobs` argument?
May there be some thread-unsafe code, @jwodder?
Yep, I'm using the current dandi release (installed via pip). I didn't provide the `--jobs` argument, just followed the handbook exactly.
My initial guess is that `part["size"]` is invalid, which would mean the API server is returning bogus upload schematics, but such an error would likely result in a `ValueError` or `TypeError` instead of an `OSError`. The only thing I can think of that might cause an `OSError` there would be the file descriptor being prematurely closed, but I don't see how that could happen. Also note that any thread-safety risk should be minimal for the specific line in question, as it's wrapped in a thread lock.
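For illustration, the locked seek-and-read pattern described above can be sketched like this (a simplified stand-in, not dandi-cli's actual code; `read_part` is a hypothetical name):

```python
import tempfile
import threading

_lock = threading.Lock()

def read_part(f, offset, size):
    # Guard the seek+read pair so concurrent upload threads
    # cannot interleave operations on the shared file handle.
    with _lock:
        f.seek(offset)
        return f.read(size)

# Demo on a temporary file
with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    print(read_part(f, 2, 4))  # b'2345'
```

Without the lock, two threads could seek and read in an interleaved order and each receive the wrong part's bytes, which is why a bare `OSError` from this line is surprising.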
All the files with this error also generate a DEBUG line in the log saying it failed to compute the digest. An example:
[DEBUG ] pyout.interface 55364:66748 Duplicating line 31 with
{'status': 'skipped', 'message': 'failed to compute digest: [Errno 22] Invalid argument',
'path': 'sub-Aphid\\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb'}
This debug message also appears for several other files which failed to upload but didn't generate an ERROR line in the log. There are several other similar DEBUG lines in the log that all convey the same information (`[Errno 22] Invalid argument`), so I've just included this one as an example since it's the only one that lists a filename.
I think this might be related to #836. I was copying files from the original to the organized upload folder and renaming them to get around the `organize` bug, and decided to do the same for all the files that had this upload error as well (even though these files passed validation even after being organized). Those files are now uploading properly.
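The copy-and-rename workaround effectively replaces each `organize`-produced link with a regular file holding the real bytes; a minimal sketch (the helper name and paths are illustrative, not dandi-cli code):

```python
import shutil
import tempfile
from pathlib import Path

def materialize(src, dst):
    # Copy file contents to dst; shutil.copyfile follows symlinks,
    # so dst ends up as a regular file even when src is a link.
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)

# Demo in a temporary directory
with tempfile.TemporaryDirectory() as d:
    src = Path(d) / "orig.nwb"
    src.write_bytes(b"data")
    materialize(src, Path(d) / "organized" / "sub-X_ses-Y.nwb")
    print((Path(d) / "organized" / "sub-X_ses-Y.nwb").read_bytes())  # b'data'
```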
I wonder if Windows somehow doesn't tolerate `+` in the filename, or if it may be a matter of shortened vs. long filenames... but I have tried in an Anaconda shell and this worked fine:
(base) C:\Users\DataLad\tmp>python -c "print(open('sub-Aphid\\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb').read())"
"123"
Re `organize` -- did it use symlinks somehow? Can you open that file some other way otherwise, e.g.
python -c "print(open(r'sub-Aphid\\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb').read()[:2])"
?
Update from the most recent upload attempt: files are still failing to upload, but different subsets of files are failing with the original error or with the DEBUG failure I posted in my comment. The vast majority of files do appear to have this bug fixed by copying the files from the original folder to the organized one and renaming them, so this does seem to boil down to an `organize` issue.
I am using symlinks. Specifically, I have an Anaconda prompt running on a Windows 10 machine, `pwd` is a mapped network drive Z: which points to a server where my data is stored. So my workflow is:
Z:/DREADDs_NWB/000165> dandi validate Z:/DREADDs_NWB/000165
Z:/DREADDs_NWB/000165> dandi upload
> I am using symlinks.
Hm, on Windows... (might explain why copying resolves the issue). A mapped Windows drive or some other beast? (I still wonder how symlinks, rare in the Windows world, came to life there.)
BTW, `organize` has an option to copy files as well.
> Z:/DREADDs_NWB/000165> dandi validate Z:/DREADDs_NWB/000165
> Z:/DREADDs_NWB/000165> dandi upload
What does `dandi digest sub-Aphid\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb` say?
Running `dandi digest` outputs no errors, just returns a long hex value. The log output also shows no errors.
2021-11-17T11:08:36-0800 [INFO ] dandi 55028:41224 dandi v0.31.0, hdmf v3.1.1, pynwb v2.0.0, h5py v3.5.0
2021-11-17T11:08:36-0800 [INFO ] dandi 55028:41224 sys.argv = ['C:\\Users\\emily.jones\\Anaconda3\\envs\\nwb\\Scripts\\dandi', 'digest', 'sub-Honeybee/sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb']
2021-11-17T11:08:36-0800 [INFO ] dandi 55028:41224 os.getcwd() = Z:\DREADDs_NWB\000165
2021-11-17T11:08:36-0800 [DEBUG ] urllib3.connectionpool 55028:41224 Starting new HTTPS connection (1): rig.mit.edu:443
2021-11-17T11:08:37-0800 [DEBUG ] urllib3.connectionpool 55028:41224 https://rig.mit.edu:443 "GET /et/projects/dandi/dandi-cli HTTP/1.1" 200 471
2021-11-17T11:08:37-0800 [DEBUG ] dandi 55028:41224 Could not check dandi/dandi-cli for version updates: __init__() got an unexpected keyword argument 'encoding'
2021-11-17T11:08:38-0800 [DEBUG ] fscacher.cache 55028:41224 Calling memoized version of <function get_digest at 0x000001F0A2A2ECA0> for \\hub.gladstone.internal\huanglab-lfp\Emily\DREADDs_NWB\000165\sub-Honeybee\sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb
2021-11-17T11:08:38-0800 [DEBUG ] fscacher.cache 55028:41224 Running original <function get_digest at 0x000001F0A2A2ECA0> on '\\\\hub.gladstone.internal\\huanglab-lfp\\Emily\\DREADDs_NWB\\000165\\sub-Honeybee\\sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb'
2021-11-17T11:08:38-0800 [DEBUG ] fscacher.cache 55028:41224 Calling memoized version of <function get_dandietag at 0x000001F0A2A34280> for \\hub.gladstone.internal\huanglab-lfp\Emily\DREADDs_NWB\000165\sub-Honeybee\sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb
2021-11-17T11:08:38-0800 [DEBUG ] fscacher.cache 55028:41224 Running original <function get_dandietag at 0x000001F0A2A34280> on '\\\\hub.gladstone.internal\\huanglab-lfp\\Emily\\DREADDs_NWB\\000165\\sub-Honeybee\\sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb'
2021-11-17T11:09:49-0800 [INFO ] dandi 55028:41224 Logs saved in C:\Users\emily.jones\AppData\Local\dandi\dandi-cli\Logs\20211117190836Z-55028.log
Note that I ran this on a different session which had just failed to upload with the `failed to compute digest` error. For reasons I don't understand, Aphid07-HomeCage uploaded with no issues on this iteration. I didn't change anything; it just uploaded fine.
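`dandi digest` streams each file through a hash and prints the result. As a generic illustration of that streaming pattern (note: the real command computes a multipart "dandi-etag", not a plain SHA-256, and `stream_digest` is a hypothetical name):

```python
import hashlib
import tempfile

def stream_digest(path, chunk=1 << 20):
    # Feed the file through a hash in 1 MiB chunks so even multi-GB
    # files never require one huge read into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Demo on a small temporary file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
print(stream_digest(f.name) == hashlib.sha256(b"hello").hexdigest())  # True
```

That `dandi digest` succeeds while the upload's digest step fails suggests the problem is intermittent rather than tied to the file contents.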
heh, "mystery mystery". I wonder if there is a chance to blame the mount (filesystem)/unstable network? ;-)
BTW -- could it have been that there were two processes (possibly on some other box where this volume originates) modifying that dandiset?
I really do not throw away an idea that it might be a network issue since it gets saturated with both incoming (to fetch for the mount) and outgoing (to upload) traffic.
Another relevant(ish) hit is https://stackoverflow.com/questions/48122798/oserror-errno-22-invalid-argument-when-reading-a-huge-file -- but there it talks about reading one large chunk of data at once, as if `part['size']` here was "too large" (more than 2-4 GB according to the post). And IIRC our largest part size is 5 GB. How large are those files? (Hitting the above is unlikely, but who knows.)
That could maybe explain the original error.
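The failure mode described in that StackOverflow hit -- a single multi-gigabyte `read()` raising `OSError(EINVAL)` on some platforms -- can be sidestepped by reading in bounded chunks. A minimal sketch (the helper name and chunk size are illustrative, not dandi-cli's code):

```python
import io

def read_exact(f, size, chunk=64 * 1024 * 1024):
    # Accumulate `size` bytes via capped reads; a single f.read(size)
    # with size in the multi-GB range can raise OSError(EINVAL) on some
    # platforms, while bounded chunks stay within safe limits.
    buf = bytearray()
    while len(buf) < size:
        piece = f.read(min(chunk, size - len(buf)))
        if not piece:  # EOF before `size` bytes were available
            break
        buf += piece
    return bytes(buf)

print(read_exact(io.BytesIO(b"0123456789"), 6, chunk=4))  # b'012345'
```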
It's definitely possible it's a filesystem issue, though I've never had any issues with this filesystem over the 7 years I've used it. I ran all the NWB file generation on it without issue and have uploaded larger datasets to CRCNS directly from it without any problem.
There should only have been one process modifying the dataset at a time. I'm the only person working on this dataset, and I only have one Conda shell open with a single process running. No one else is modifying the dataset either through dandi or otherwise.
The files are 100-250 MB, so not very large.
Rerunning `dandi upload` without changing anything does upload a few more files, which could point to a network issue. I'm not having any issues with network connectivity outside of that dandi process. What I don't understand is why I'm getting 2 separate issues (`[Errno 22] Invalid argument` and `failed to compute digest`), neither of which says anything about a network connection.
Well, since it is a (network) filesystem, nothing can/should know about the network. But indeed, why the errors are not really about the filesystem remains a puzzle. We have also derailed from the "symlinks" aspect. What if you have proper copies -- does any of this still happen?
AFAIK symlink support on Windows is... peculiar (IIRC it requires admin privileges to enable, etc.), and I think you are the first person I've seen using them on Windows. So I wonder if it may be something "symlinks on Windows"-related?
A quick Google search gives me https://www.py4u.net/discuss/176524 . Wild guess: `/` in the symlink path? Is there a way to see where a symlink points on Windows?
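On the question of where a symlink points: Python's `os.readlink` and `Path.resolve` report the stored target and the fully resolved path, on Windows as well. A small sketch (run in a temp directory; on Windows, creating symlinks may require admin rights or Developer Mode):

```python
import os
import tempfile
from pathlib import Path

# Demo: create a link in a temp dir and inspect it.
with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "target.nwb"
    target.write_bytes(b"x")
    link = Path(d) / "link.nwb"
    os.symlink(target, link)
    print(os.readlink(link))  # raw target string stored in the link
    print(link.resolve())     # fully resolved absolute path
```

`os.readlink` would show whether the stored target uses `/` or `\` separators, which is exactly the "wild guess" above.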
Not sure what you mean by proper copies?
I eventually got everything to upload simply by running the upload command over and over, uploading a subset of the error-generating files each time, not changing anything between runs.
I can't replicate this bug anymore because everything is uploaded, and even when it was generating errors it was a random subset of files so I don't know what was causing the failures.
Here are all the steps I did: `dandi validate` on the original NWB files, then `dandi download <archive>`, cd to the newly created archive directory, then `dandi organize`, then `dandi upload`. Running `dandi upload` again would upload a subset of those files without anything changing between runs. After running `dandi upload` 6 times, I managed to get all the data uploaded.

Eh, so we would let the bugs prosper :-/ Would you be so kind to do something like `grep "setting files_mode" C:\Users\emily.jones\AppData\Local\dandi\dandi-cli\Logs\20211117*` -- I still desire confirmation that the mode was set to symlink.
That returns no results. None of my log files, from uploading or validating, contain anything resembling it (`setting file`, `setting file_mode`, and `files_mode` all return nothing).
Github won't allow me to upload the log files, so here they are on a Google drive:
I guess we do not save the log to a file for some other commands (like `organize`) -- filed https://github.com/dandi/dandi-cli/issues/838.
@emilyasterjones did you have a chance/need to upload since then, and if so, did you encounter similar issues?
I am afraid the mystery will remain for now, until/unless someone runs into similar odd behavior. Meanwhile I will close this issue, but feel free to reopen if the issue persists.
In attempting to upload files, a seemingly random subset (76 files of 510) are failing to upload. The files are not open in any other software nor are they empty. They all have this error message in the log file on upload:
Re-running the upload led to successfully uploading 11 of the 76 failed files, despite nothing changing between the two upload attempts. If it were simply a network connection error, I would expect more than ~14% of the failed files to upload successfully on the second attempt.
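The retry-until-done behavior described in this thread amounts to looping over the still-failing subset; a hypothetical sketch (`upload_one` is an assumed callable, not a dandi-cli API):

```python
def upload_rounds(paths, upload_one, max_rounds=6):
    # Re-attempt the still-failing files in successive rounds, mirroring
    # the workaround of rerunning `dandi upload` repeatedly.
    # upload_one(path) is assumed to return True on success.
    pending = list(paths)
    for _ in range(max_rounds):
        pending = [p for p in pending if not upload_one(p)]
        if not pending:
            break
    return pending  # whatever still fails after all rounds

# Demo: a flaky uploader that succeeds on the second attempt per file
attempts = {}
def flaky(p):
    attempts[p] = attempts.get(p, 0) + 1
    return attempts[p] >= 2

print(upload_rounds(["a.nwb", "b.nwb"], flaky))  # []
```

With truly random transient failures one would expect a large fraction to clear on each retry, which is why the observed ~14% success rate on the second attempt argues against a simple network flake.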