NDAR / nda-tools

Python package for interacting with NDA web services. Used to validate, submit, and download data to and from NDA.
MIT License

why does os.rename only sometimes fail on s3fs? #100

Open dmd opened 3 months ago

dmd commented 3 months ago

https://github.com/NDAR/nda-tools/blob/f459a028ff209484311f9e61303f39e21d4d448d/NDATools/Download.py#L464

I'm finding that this os.rename call only sometimes fails on an s3fs mountpoint. Any idea why that is?

liningpan commented 3 months ago

Are those failed files greater than 5 GB?

dmd commented 3 months ago

No, much smaller. However, I think this is actually a different issue. It looks like the file may be listed in the manifest but not actually exist. What is happening here?

$ downloadcmd --username bfrederick -dp 1184998 --file-regex Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv
Running NDATools Version 0.2.27
Using configuration file from /Users/dmd/.NDATools/settings.cfg
proceeding as nda user: bfrederick

No value specified for --workerThreads. Using the default option of 7
Important - You can configure the thread count setting using the --workerThreads argument to maximize your download speed.

Getting Package Information...

Package-id: 1184998
Name: HCPAgingAllFiles
Has associated files?: Yes
Number of files in package: 1414736
Total Package Size: 22.33TB

Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/package_file_metadata_1184998.txt.gz.partial
Completed download /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/package_file_metadata_1184998.txt.gz

S3 links for files that failed to download will be written out to /Users/dmd/NDA/nda-tools/downloadcmd/logs/failed_s3_links_file_20240613T1226584kewmcog.csv. You can attempt to download these files later by running:
    downloadcmd -dp 1184998 --file-regex Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv -u bfrederick -d /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998 -wt 7 -t "/Users/dmd/NDA/nda-tools/downloadcmd/logs/failed_s3_links_file_20240613T1226584kewmcog.csv"

Beginning download of 3 files (9.72MB) matching Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv to /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998 using 7 threads
Adding 5 files to download queue. Queue contains 5 files

Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial
Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial
Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial
Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/imagingcollection01/HCA6110138_V1_MR/unprocessed/rfMRI_REST2_PA/LINKED_DATA/PHYSIO/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial
Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/image03/HCA6110138_V1_MR/unprocessed/rfMRI_REST2_PA/LINKED_DATA/PHYSIO/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial
Completed download /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv
[Errno 2] No such file or directory: '/Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial' -> '/Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv'
Traceback (most recent call last):
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/Download.py", line 590, in download_from_s3link
    self.download_local(download_request, err_if_exists)
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/Download.py", line 465, in download_local
    os.rename(download_request.partial_download_abs_path, download_request.completed_download_abs_path)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial' -> '/Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv'

Completed download /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/imagingcollection01/HCA6110138_V1_MR/unprocessed/rfMRI_REST2_PA/LINKED_DATA/PHYSIO/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv
Completed download /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/image03/HCA6110138_V1_MR/unprocessed/rfMRI_REST2_PA/LINKED_DATA/PHYSIO/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv
[Errno 2] No such file or directory: '/Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial' -> '/Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv'
Traceback (most recent call last):
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/Download.py", line 590, in download_from_s3link
    self.download_local(download_request, err_if_exists)
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/Download.py", line 465, in download_local
    os.rename(download_request.partial_download_abs_path, download_request.completed_download_abs_path)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial' -> '/Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv'

Finished processing all download requests @ 2024-06-13 12:27:00.466081.
     Total download requests 5
     Total errors encountered: 2

 Exiting Program...

liningpan commented 3 months ago

There seem to be three separate threads trying to download the same file, which caused a race condition. I'm not sure whether the file appears multiple times in the manifest or something went wrong in the download tool.

Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial
Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial
Starting download: /Users/dmd/NDA/nda-tools/downloadcmd/packages/1184998/fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv.partial
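
The failure mode suggested by these log lines can be reproduced on an ordinary filesystem, without s3fs. Below is a minimal sketch, assuming several workers each write the same `.partial` path and then call `os.rename` on it. The `finalize` and `simulate_duplicate_requests` helpers and the file name are hypothetical illustrations, not nda-tools code, and the workers run one after another here so the outcome is deterministic.

```python
import os
import tempfile


def finalize(partial_path, final_path):
    """Rename partial -> final and report whether this caller did the rename."""
    try:
        os.rename(partial_path, final_path)
        return "renamed"
    except FileNotFoundError:
        # Another worker already moved the shared .partial file.
        if os.path.exists(final_path):
            return "already finalized"
        raise


def simulate_duplicate_requests(n_workers=3):
    with tempfile.TemporaryDirectory() as workdir:
        final_path = os.path.join(workdir, "Physio_combined.csv")
        partial_path = final_path + ".partial"
        # Every worker writes to the same .partial path.
        for _ in range(n_workers):
            with open(partial_path, "wb") as fh:
                fh.write(b"data")
        # Every worker then tries to finalize the same file.
        return [finalize(partial_path, final_path) for _ in range(n_workers)]


results = simulate_duplicate_requests()
print(results)  # ['renamed', 'already finalized', 'already finalized']
```

Only the first rename finds the source file. Without the `except` branch, the other two calls would raise `FileNotFoundError`, which is consistent with the two errors reported for three requests to the same path.
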

dmd commented 3 months ago

Yes, it turns out the manifest has several versions of the file that are all specified to be written to the same path:

$ grep Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f package_file_metadata_1184998.txt | csvcut  -c 4 | sort | uniq -c
   3 fmriresults01/HCA6110138_V1_MR/MNINonLinear/Results/rfMRI_REST2_PA/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv
   1 image03/HCA6110138_V1_MR/unprocessed/rfMRI_REST2_PA/LINKED_DATA/PHYSIO/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv
   1 imagingcollection01/HCA6110138_V1_MR/unprocessed/rfMRI_REST2_PA/LINKED_DATA/PHYSIO/Physio_combined_76a6ae9e-b032-42a0-a0be-a30e9cf6c52f.csv

What can we do about this? Or do we just ignore it?
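
The grep/csvcut check above could also be sketched in Python. This assumes the decompressed manifest is comma-separated with a header row and that the destination path is in the fourth column, as `csvcut -c 4` implies. The function name, the sample column names, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def duplicate_destinations(lines, column_index=3):
    """Return {path: count} for destination paths listed more than once.

    column_index=3 mirrors `csvcut -c 4` (csvcut columns are 1-based).
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row (assumed to be present)
    counts = Counter(
        row[column_index] for row in reader if len(row) > column_index
    )
    return {path: n for path, n in counts.items() if n > 1}


# Tiny made-up manifest; the real file has different columns and values.
sample = io.StringIO(
    "id,link,size,download_alias\n"
    "1,s3://bucket/a,10,fmriresults01/x.csv\n"
    "2,s3://bucket/b,10,fmriresults01/x.csv\n"
    "3,s3://bucket/c,10,image03/x.csv\n"
)
duplicates = duplicate_destinations(sample)
print(duplicates)  # {'fmriresults01/x.csv': 2}
```

For the real manifest, an open file handle can be passed in place of `sample`, for example `open("package_file_metadata_1184998.txt", newline="")`.
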

dmd commented 3 months ago

And to go back to the original question, what is the issue with os.rename on s3fs? In my testing, os.rename works just fine on s3fs (as long as there isn't some other reason the rename would fail on ANY fs).

liningpan commented 3 months ago

NDA should either fix their backend so that it does not generate manifests with duplicated files, or have the client do the de-duplication. @gregmagdits
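
As a rough idea of what client-side de-duplication might look like, here is a minimal sketch that keeps only the first request for each destination path. The `(destination, s3_link)` tuple shape and the function name are assumptions for illustration and do not reflect how nda-tools represents download requests.

```python
def dedupe_by_destination(requests):
    """Keep the first request per destination path, preserving order."""
    seen = set()
    unique = []
    for destination, s3_link in requests:
        if destination in seen:
            continue  # a request for this path is already queued
        seen.add(destination)
        unique.append((destination, s3_link))
    return unique


queue = [
    ("fmriresults01/x.csv", "s3://bucket/a"),
    ("fmriresults01/x.csv", "s3://bucket/b"),
    ("image03/x.csv", "s3://bucket/c"),
]
deduped = dedupe_by_destination(queue)
print(deduped)
```

Note that this silently drops the later entries, which is only appropriate if the duplicates really are the same content. If they are different versions of the file, renaming them (as a backend fix could) would be the safer choice.
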

As for s3fs: unlike on a real filesystem, a rename is implemented as a copy followed by a remove, so it is not atomic. For files larger than 5 GB, it has to use the multipart upload interface to do the server-side copy instead of a single object copy. This kind of operation is generally less efficient on S3 and more error-prone. For example, the official AWS S3 VFS mountpoint doesn't even want to support this operation: https://github.com/awslabs/mountpoint-s3/issues/506. I think the C++ s3fs-fuse handles large-file renames correctly.
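
To make the "copy, then remove" point concrete, here is a small local emulation. It is not s3fs code; `copy_then_remove` and its `observe` hook are hypothetical, and the local file copy only stands in for the server-side object copy.

```python
import os
import shutil
import tempfile


def copy_then_remove(src, dst, observe=None):
    """Emulate a rename as two separate steps instead of one atomic call."""
    shutil.copyfile(src, dst)  # step 1: copy (stands in for the S3 object copy)
    if observe is not None:
        observe()  # both names exist here; a failure now leaves two copies
    os.remove(src)  # step 2: delete the original


with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "file.csv.partial")
    dst = os.path.join(workdir, "file.csv")
    with open(src, "wb") as fh:
        fh.write(b"data")
    seen_midway = []
    copy_then_remove(
        src, dst,
        lambda: seen_midway.append((os.path.exists(src), os.path.exists(dst))),
    )
    after = (os.path.exists(src), os.path.exists(dst))

print(seen_midway, after)  # [(True, True)] (False, True)
```

Between the two steps both paths are visible, which is the window a true filesystem rename does not have.
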

May I ask: if your final destination is S3, why use s3fs? I think nda-tools technically supports downloading to S3 directly. (We ran into issues with the NDA implementation before, and they were supposed to be fixed.)

dmd commented 3 months ago

Because we're writing a fairly generic tool (rapidtide-cloud to run rapidtide in AWS Batch), and one of the things that I want to be generic is the backing of our data-ingest location. I suppose I could write two codepaths depending on whether the backing is s3fs or a traditional filesystem, but that would be annoying.

gregmagdits commented 3 months ago

There is already a back-end procedure that renames files when there is a name collision (related to https://github.com/NDAR/nda-tools/issues/88). We will look into the cause of the duplicated entries in package 1184998 and send an update.

gregmagdits commented 3 months ago

Package 1184998 is the original package that was created by HCP before the de-dupe procedure was in place. You can create a new package from the original in order to get rid of the duplicate entries. To do this, log in to NDA, navigate to the packages dashboard, select 'shared packages' from the drop-down, and click 'add to my data packages' from the actions menu. This will create a new package from the original and run the procedure that removes duplicates by appending unique suffixes to files as needed.