dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

`dandi upload` fails on random subset of files #837

Closed: emilyasterjones closed this issue 2 years ago

emilyasterjones commented 3 years ago

In attempting to upload files, a seemingly random subset (76 files of 510) are failing to upload. The files are not open in any other software nor are they empty. They all have this error message in the log file on upload:

2021-11-16T08:06:58-0800 [ERROR   ] dandi 60872:59432 Error uploading sub-Adobo/sub-Adobo_ses-Adobo02-LinearTrack_behavior+ecephys.nwb:
Traceback (most recent call last):
  File "C:\Users\emily.jones\Anaconda3\envs\nwb\lib\site-packages\dandi\upload.py", line 262, in process_path
    for r in remote_dandiset.iter_upload_raw_asset(
  File "C:\Users\emily.jones\Anaconda3\envs\nwb\lib\site-packages\dandi\dandiapi.py", line 1120, in iter_upload_raw_asset
    out_part = fut.result()
  File "C:\Users\emily.jones\Anaconda3\envs\nwb\lib\concurrent\futures\_base.py", line 438, in result
    return self.__get_result()
  File "C:\Users\emily.jones\Anaconda3\envs\nwb\lib\concurrent\futures\_base.py", line 390, in __get_result
    raise self._exception
  File "C:\Users\emily.jones\Anaconda3\envs\nwb\lib\concurrent\futures\thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Users\emily.jones\Anaconda3\envs\nwb\lib\site-packages\dandi\dandiapi.py", line 1484, in _upload_part
    chunk = fp.read(part["size"])
OSError: [Errno 22] Invalid argument

Re-running the upload led to successfully uploading 11 of the 76 failed files, despite nothing changing between the two upload attempts. If it were simply a network connection error, I would expect more than ~14% of the failed files to upload successfully on the second attempt.

yarikoptic commented 3 years ago

that doesn't sound good. What version of the dandi-cli? (0.31.0? want to make sure that we aren't chasing ghosts)

did you provide the -J|--jobs argument?

maybe some thread-unsafe code, @jwodder?

emilyasterjones commented 3 years ago

Yep, I'm using the current dandi release (installed via pip). I didn't provide the --jobs argument, just followed the handbook exactly.

jwodder commented 3 years ago

My initial guess is that part["size"] is invalid, which would mean the API server is returning bogus upload instructions, but such an error would more likely result in a ValueError or TypeError than an OSError. The only thing I can think of that might cause an OSError there is the file descriptor being prematurely closed, but I don't see how that could happen. Also note that thread-safety issues should be ruled out for the specific line in question, as it's wrapped in a thread lock.
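For illustration, the failing pattern from the traceback presumably looks roughly like this (a paraphrase, not the actual dandi source; the part["offset"] key and the exact lock placement are assumptions):

import threading

lock = threading.Lock()

def upload_part(fp, part):
    # Worker threads share a single file object, so each part read is
    # serialized behind the lock; fp.read(part["size"]) is the line that
    # raises OSError: [Errno 22] in the traceback above.
    with lock:
        fp.seek(part["offset"])  # "offset" is an assumed key name
        chunk = fp.read(part["size"])
    # ... the chunk would then be sent to the part's upload URL ...
    return chunk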

emilyasterjones commented 3 years ago

All the files with this error also generate a DEBUG line in the log that it failed to compute digest. An example:

[DEBUG   ] pyout.interface 55364:66748 Duplicating line 31 with
 {'status': 'skipped', 'message': 'failed to compute digest: [Errno 22] Invalid argument', 
'path': 'sub-Aphid\\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb'}

This debug message also appears for several other files which failed to upload but which didn't generate an ERROR line in the log. There are several other similar DEBUG lines in the log that all convey the same information ([Errno 22] Invalid argument); I've included this one as an example since it's the only one that lists a filename.

emilyasterjones commented 3 years ago

I think this might be related to #836. I was copying files from the original to the organized upload folder and renaming them to get around the organize bug, and decided to do the same for all the files that had this upload error as well (even though these files passed validation even after being organized). Those files are now uploading properly.

yarikoptic commented 3 years ago

I wonder if Windows somehow doesn't tolerate + in the filename, or if it may be a matter of shortened vs. long filenames... but I have tried in an anaconda shell and this worked fine:

(base) C:\Users\DataLad\tmp>python -c "print(open('sub-Aphid\\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb').read())" 
"123"   

re organize - did it use symlinks somehow? can you otherwise open that file at all - e.g.

python -c "print(open(r'sub-Aphid\\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb').read()[:2])"                                                                                                                         

?

emilyasterjones commented 2 years ago

Update from the most recent upload attempt: files are still failing to upload, but different subsets of files are failing, some with the original error and some with the DEBUG failure I posted in my comment. The vast majority of files appear to have this bug fixed by copying the files from the original folder to the organized one and renaming them, so this does seem to boil down to an organize issue.

emilyasterjones commented 2 years ago

I am using symlinks. Specifically, I have Anaconda prompt running on a Windows 10 machine, pwd is a mapped network drive Z: which points to a server where my data is stored. So my workflow is:

Z:/DREADDs_NWB/000165> dandi validate Z:/DREADDs_NWB/000165
Z:/DREADDs_NWB/000165> dandi upload

yarikoptic commented 2 years ago

I am using symlinks.

hm, on windows... (might explain why copying resolves the issue). Is it a mapped Windows drive or some other beast? (I still wonder how symlinks, rare in the Windows world, came to life there.) BTW, organize has an option to copy files as well.
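For reference, something like the following should produce real copies instead of symlinks (check dandi organize --help for the exact option name in your version):

Z:/DREADDs_NWB/000165> dandi organize --files-mode copy <path-to-original-nwbs>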

yarikoptic commented 2 years ago

Z:/DREADDs_NWB/000165> dandi validate Z:/DREADDs_NWB/000165
Z:/DREADDs_NWB/000165> dandi upload

what does dandi digest sub-Aphid\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb say?
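As a crude cross-check that exercises the same read path by hand (sha256 here is just for illustration; it is not the dandi-etag that dandi digest computes):

import hashlib

h = hashlib.sha256()
with open(r"sub-Aphid\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb", "rb") as f:
    for block in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
        h.update(block)
print(h.hexdigest())

If the mount or the symlink were at fault, the same OSError should reproduce here.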

emilyasterjones commented 2 years ago

Running dandi digest outputs no errors, just returns a long hex value. Log output also shows no errors.

2021-11-17T11:08:36-0800 [INFO    ] dandi 55028:41224 dandi v0.31.0, hdmf v3.1.1, pynwb v2.0.0, h5py v3.5.0
2021-11-17T11:08:36-0800 [INFO    ] dandi 55028:41224 sys.argv = ['C:\\Users\\emily.jones\\Anaconda3\\envs\\nwb\\Scripts\\dandi', 'digest', 'sub-Honeybee/sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb']
2021-11-17T11:08:36-0800 [INFO    ] dandi 55028:41224 os.getcwd() = Z:\DREADDs_NWB\000165
2021-11-17T11:08:36-0800 [DEBUG   ] urllib3.connectionpool 55028:41224 Starting new HTTPS connection (1): rig.mit.edu:443
2021-11-17T11:08:37-0800 [DEBUG   ] urllib3.connectionpool 55028:41224 https://rig.mit.edu:443 "GET /et/projects/dandi/dandi-cli HTTP/1.1" 200 471
2021-11-17T11:08:37-0800 [DEBUG   ] dandi 55028:41224 Could not check dandi/dandi-cli for version updates: __init__() got an unexpected keyword argument 'encoding'
2021-11-17T11:08:38-0800 [DEBUG   ] fscacher.cache 55028:41224 Calling memoized version of <function get_digest at 0x000001F0A2A2ECA0> for \\hub.gladstone.internal\huanglab-lfp\Emily\DREADDs_NWB\000165\sub-Honeybee\sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb
2021-11-17T11:08:38-0800 [DEBUG   ] fscacher.cache 55028:41224 Running original <function get_digest at 0x000001F0A2A2ECA0> on '\\\\hub.gladstone.internal\\huanglab-lfp\\Emily\\DREADDs_NWB\\000165\\sub-Honeybee\\sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb'
2021-11-17T11:08:38-0800 [DEBUG   ] fscacher.cache 55028:41224 Calling memoized version of <function get_dandietag at 0x000001F0A2A34280> for \\hub.gladstone.internal\huanglab-lfp\Emily\DREADDs_NWB\000165\sub-Honeybee\sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb
2021-11-17T11:08:38-0800 [DEBUG   ] fscacher.cache 55028:41224 Running original <function get_dandietag at 0x000001F0A2A34280> on '\\\\hub.gladstone.internal\\huanglab-lfp\\Emily\\DREADDs_NWB\\000165\\sub-Honeybee\\sub-Honeybee_ses-Honeybee01-HomeCage_behavior+ecephys.nwb'
2021-11-17T11:09:49-0800 [INFO    ] dandi 55028:41224 Logs saved in C:\Users\emily.jones\AppData\Local\dandi\dandi-cli\Logs\20211117190836Z-55028.log

Note that I ran this on a different session which had just failed to upload with the failed-to-compute-digest error. For reasons I don't understand, Aphid07-HomeCage uploaded with no issues on this iteration; I didn't change anything, it just uploaded fine.

yarikoptic commented 2 years ago

heh, "mystery mystery". I wonder if there is a chance to blame the mount (filesystem)/unstable network? ;-)

yarikoptic commented 2 years ago

BTW -- could it have been that there were two processes (possibly on some other box where this volume originates) modifying that dandiset?

I still haven't thrown away the idea that it might be a network issue, since the network gets saturated with both incoming (to read from the mount) and outgoing (to upload) traffic.

another relevant(ish) hit is https://stackoverflow.com/questions/48122798/oserror-errno-22-invalid-argument-when-reading-a-huge-file -- but there it talks about reading one large chunk of data at once, as if part['size'] here were "too large" (more than 2-4 GB according to the post). And IIRC our largest part size is 5GB. How large are those files? (hitting the above is unlikely, but who knows) That could maybe explain the original error.
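If that were the culprit, the usual workaround from that thread is to cap individual read() calls, roughly like this (READ_CAP is a made-up name/value for illustration):

READ_CAP = 1 << 30  # 1 GiB, well under the 2-4 GB limit reported in the post

def read_exact(fp, size):
    # Read `size` bytes as several capped sub-reads instead of one huge read().
    buf = bytearray()
    while len(buf) < size:
        piece = fp.read(min(READ_CAP, size - len(buf)))
        if not piece:
            break  # EOF
        buf += piece
    return bytes(buf)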

emilyasterjones commented 2 years ago

Definitely possible it's a filesystem issue, though I've never had any issues with this filesystem over the 7 years I've used it. I ran all the NWB file generation on it without issue and have uploaded larger datasets to CRCNS directly from it without any problem.

emilyasterjones commented 2 years ago

There should only have been one process modifying the dataset at a time. I'm the only person working on this dataset, and I only have one Conda shell open with a single process running. No one else is modifying the dataset, either through dandi or otherwise.

The files are 100-250 MB, so not very large.

emilyasterjones commented 2 years ago

Rerunning dandi upload without changing anything does upload a few more files, which could point to a network issue. I'm not having any issues with network connectivity outside of the dandi process. What I don't understand is why I'm getting two separate issues ([Errno 22] Invalid argument and failed to compute digest), neither of which says anything about a network connection.

yarikoptic commented 2 years ago

well, since it is a (network) filesystem, nothing above it can/should know about the network. But indeed, why the errors are not really about the filesystem remains a puzzle. We also got derailed from the "symlinks" aspect. What if you have proper copies -- does any of this still happen?

AFAIK symlink support on Windows is ... peculiar (IIRC it requires admin privileges to enable, etc.), and I think you are the first person I've seen using them on Windows. So I wonder if it may be something "symlinks on Windows"-related?

quick google gives me https://www.py4u.net/discuss/176524 . wild guess - a / in the symlink path? is there a way to see where a symlink points on windows?
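One way to check from Python (paths as in the logs above; no admin rights needed to inspect):

import os
from pathlib import Path

p = Path(r"sub-Aphid\sub-Aphid_ses-Aphid07-HomeCage_behavior+ecephys.nwb")
print(p.is_symlink())       # True only for a real symlink
if p.is_symlink():
    print(os.readlink(p))   # the raw target as stored in the link
print(p.resolve())          # fully resolved path

From cmd, dir also flags symlinks as <SYMLINK> entries and shows their targets.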

emilyasterjones commented 2 years ago

Not sure what you mean by proper copies?

I eventually got everything to upload simply by running the upload command over and over, uploading a subset of the error-generating files each time, not changing anything between runs.

I can't replicate this bug anymore because everything is uploaded, and even when it was generating errors it was a random subset of files so I don't know what was causing the failures.

Here are all the steps I did:

  1. Used MatNWB to generate the NWB files. Original data and subsequent NWB files were all on a server. Matlab was running on my local machine.
  2. Mounted the server as a mapped network drive.
  3. Generated the conda env with required packages and set the pwd to a directory inside the mapped drive.
  4. Ran dandi validate on the original NWB files, then dandi download <archive>, cd'd to the newly created archive directory, then ran dandi organize . there. This generated a copy of the original NWB files in the archive directory with the files renamed.
  5. Several files in the archive directory had validation errors that they didn't have in the original NWB directory, resolved by copying the files from the original to the archive directory and renaming them. Collectively, about a third of the NWB files wouldn't upload on the first attempt, due to either the error message or the debug message I posted above. Running dandi upload again would upload a subset of those files without anything changing between runs. After running dandi upload 6 times, I managed to get all the data uploaded.

yarikoptic commented 2 years ago

eh, so we would let the bugs prosper :-/ would you be so kind to do smth like grep "setting files_mode" C:\Users\emily.jones\AppData\Local\dandi\dandi-cli\Logs\20211117* -- I still desire confirmation that mode was set to symlink.
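In case grep isn't available in the Anaconda prompt, a plain-Python equivalent (same path assumptions as the grep above):

from pathlib import Path

log_dir = Path(r"C:\Users\emily.jones\AppData\Local\dandi\dandi-cli\Logs")
for log in log_dir.glob("20211117*"):
    for line in log.read_text(errors="replace").splitlines():
        if "files_mode" in line:
            print(log.name, line)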

emilyasterjones commented 2 years ago

That returns no results. None of my log files, from uploading or validating, contain anything resembling it (setting file, setting file_mode, files_mode all return nothing).

emilyasterjones commented 2 years ago

Github won't allow me to upload the log files, so here they are on a Google drive:

yarikoptic commented 2 years ago

That returns no results. None of my log files, from uploading or validating, contain anything resembling it (setting file, setting file_mode, files_mode all return nothing).

I guess we do not save the log to a file for some other commands (like organize) -- filed https://github.com/dandi/dandi-cli/issues/838.

yarikoptic commented 2 years ago

@emilyasterjones did you have a chance/need to upload since back then, and if so, did you encounter similar issues?

yarikoptic commented 2 years ago

I am afraid the mystery will remain for now, until/unless someone runs into similar odd behavior. Meanwhile I will close this issue, but feel free to reopen if the issue persists.