NDAR / nda-tools

Python package for interacting with NDA web services. Used to validate, submit, and download data to and from NDA.

S3-only download can't get metadata #82

Closed · dmd closed 11 months ago

dmd commented 1 year ago

I don't know if this is a dupe of #67, but it seems that if you want to download to S3, you have to download package_file_metadata.txt.gz locally first.

Is this intentional or a bug?

E.g.:

$ downloadcmd -u ddrucker -dp 1220860 -t onefile  -s3 s3://rapidtide-nda/test20231026
Running NDATools Version 0.2.25

No value specified for --workerThreads. Using the default option of 7
Important - You can configure the thread count setting using the --workerThreads argument to maximize your download speed.

Getting Package Information...

Package-id: 1220860
Name: HCPAgingAllFiles
Has associated files?: Yes
Number of files in package: 1414735
Total Package Size: 22.33TB

Starting download: s3://rapidtide-nda/test20231026/package_file_metadata.txt.gz
Traceback (most recent call last):
  File "/Users/dmd/venvs/apc-ve/bin/downloadcmd", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/clientscripts/downloadcmd.py", line 200, in main
    s3Download.start()
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/Download.py", line 198, in start
    self.download_package_metadata_file()
  File "/Users/dmd/venvs/apc-ve/lib/python3.11/site-packages/NDATools/Download.py", line 889, in download_package_metadata_file
    with gzip.open(download_location, 'rb') as f_in:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/gzip.py", line 174, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/dmd/cloud-brains/hcpage/package_file_metadata.txt.gz'
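
My reading of the traceback is that the S3 code path does something like the following; this is an illustrative sketch, not the actual Download.py internals:

import gzip
import os

def upload_to_s3(bucket_url, filename):
    # stand-in for the real S3 transfer; nothing is written locally
    pass

def download_package_metadata_file(s3_destination, local_download_dir):
    gz_path = os.path.join(local_download_dir, "package_file_metadata.txt.gz")
    if s3_destination:
        # with -s3, the metadata file is streamed straight to the bucket,
        # so no local copy ever lands at gz_path
        upload_to_s3(s3_destination, "package_file_metadata.txt.gz")
    # the decompression step still assumes a local copy exists, so an
    # S3-only run fails here with FileNotFoundError, as in the traceback
    with gzip.open(gz_path, "rb") as f_in:
        return f_in.read()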
dmd commented 1 year ago

Related question: assuming this is intentional, how can you download just the metadata? I know I can run

downloadcmd -u ddrucker -dp 1220860 -d .

and then Control-c out of it once it's downloaded package_file_metadata.txt.gz, but that can't be automated. How can I tell it "download the metadata, and nothing else" so I can then, in the next command, use -s3?

gregmagdits commented 1 year ago

Yes, the issue mentioned in the first comment is related to #67. We are planning to have this fixed in the next release.

If you want to download just the package-file-metadata file, you can run: downloadcmd -dp 1220860 --file-regex "package_file_metadata.txt.gz"

The tool will say it didn't find any matching files, because the metadata file doesn't contain a record for itself. However, the tool always downloads this file before downloading any other file in the package, so it will be downloaded (if it doesn't already exist locally). We can add a low-priority ticket to include a record for the metadata file in the metadata file itself, so that the program's output is accurate in this particular case.
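
To illustrate why the reported match count is zero: the filter is applied to the records listed inside the metadata file, and that list has no entry for the file itself. A rough sketch (the record names and filtering code are made up for illustration):

import re

# records parsed from package_file_metadata.txt.gz; the file carries no
# entry for itself, so a regex naming it matches nothing in this list
package_records = ["fmriresults01.txt", "fmriresults01/sub-01/T1w.nii.gz"]
pattern = re.compile("package_file_metadata.txt.gz")
print(sum(bool(pattern.search(r)) for r in package_records))  # prints 0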

gregmagdits commented 1 year ago

I don't think you were asking about the data-structure files, but in case you were, you can get those with the following regex: downloadcmd -dp 1220860 --file-regex "^[^/]+.txt"
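
A quick demonstration of what that pattern matches, assuming standard Python regex semantics (data-structure files sit at the package root, so their paths contain no slash):

import re

# note the unescaped dot matches any character; "^[^/]+\.txt" would be stricter
pattern = re.compile(r"^[^/]+.txt")
print(bool(pattern.search("fmriresults01.txt")))           # True: root-level data-structure file
print(bool(pattern.search("image03/sub-01_bold.nii.gz")))  # False: associated file in a subdirectory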

dmd commented 1 year ago

I wasn't, but that made me realize the solution to "download just the metadata" is:

downloadcmd -dp 1220860 --file-regex match-nothing

which will happily download the package_file_metadata.txt.gz, then download 0 files and exit, which is exactly what I want.

This still doesn't solve the underlying bug that if you're using S3, the metadata should live there too, though.
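
So the automated sequence becomes something like this, reusing my flags from above (assuming the second run looks for the metadata file in the same local location the first run used):

# step 1: downloads package_file_metadata.txt.gz, matches 0 files, and exits
downloadcmd -u ddrucker -dp 1220860 --file-regex match-nothing
# step 2: the actual S3-to-S3 transfer
downloadcmd -u ddrucker -dp 1220860 -t onefile -s3 s3://rapidtide-nda/test20231026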

gregmagdits commented 11 months ago

In all new packages the metadata file includes itself, which means you can run

downloadcmd -dp <package-id> --file-regex package_file_metadata_<package-id>.txt.gz

and there should be 1 file that meets the filter criteria.

dmd commented 11 months ago

Any reason to do that vs. match-nothing?

dmd commented 11 months ago

And what about what I said about S3? Or is the idea that you don't want the metadata remote? (In my case, it is anyway, because the "local" directory is S3-mounted.)

gregmagdits commented 11 months ago

> Any reason to do that vs. match-nothing?

The end result is the same, but I guess the user's intent is clearer if you actually specify the file you want to download.

> And what about what I said about S3? Or is the idea that you don't want the metadata remote? (In my case, it is anyway, because the "local" directory is S3-mounted.)

The metadata file is used extensively by the program, so for now we decided to always keep it local.

Regarding your use of s3fs: the way the program works now (when downloading locally) is to append a .partial extension to files as they are being downloaded, and then rename the file when the download is complete. The rename operation is not implemented by tools like s3fs, so we were under the impression that this would need to change before downloadcmd could work with S3-mounted directories. Have you not run into this situation? In any case, I think using the -s3 flag would be more efficient than downloading to an S3-mounted file system, because S3-to-S3 transfers don't leave Amazon's cloud.
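
The .partial pattern in question, sketched in Python (illustrative, not the actual Download.py code):

import os
import urllib.request

def download_with_partial(url, dest_path):
    # write to a temporary ".partial" name while the transfer is in flight
    tmp_path = dest_path + ".partial"
    with urllib.request.urlopen(url) as resp, open(tmp_path, "wb") as f:
        f.write(resp.read())
    # rename to the final name once the download completes; this is cheap and
    # atomic on a local file system, but it is the step we assumed FUSE
    # mounts like s3fs could not handle
    os.replace(tmp_path, dest_path)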

dmd commented 11 months ago

(a) s3fs does support rename (even deep directory rename), and has for many years now (almost a decade!)

(b) But regardless, we are in fact using the -s3 flag for the actual download. We just mount the bucket via s3fs for later use. (And we're doing all this from EC2, so nothing leaves AWS regardless.)

AHA! But in re-testing this just now, it appears you made an important change in 01a7b08 that changes all of this: you're now downloading package_file_metadata to nda-tools/downloadcmd/packages/<package-id>/ rather than to the supplied --directory.

So this is moot anyway!