Closed dmd closed 11 months ago
Related question - assuming this is intentional, how can you download just the metadata? I know I can run
downloadcmd -u ddrucker -dp 1220860 -d .
and then Ctrl-C out of it once it has downloaded package_file_metadata.txt.gz, but that can't be automated. How can I tell it "download the metadata, and nothing else" so that I can then use -s3 in the next command?
Yes, the issue mentioned in the first comment is related to #67. We are planning to have this fixed in the next release.
If you want to download just the package-file-metadata file, you can run: downloadcmd -dp 1220860 --file-regex "package_file_metadata.txt.gz"
The tool will report that it found no matching files, because the metadata file doesn't contain a record for itself. However, the tool always downloads this file before any other file in the package, so it will be downloaded (if it doesn't already exist locally). We can add a low-priority ticket to include a record for the metadata file in the metadata file itself, so that the program's output is accurate in this particular case.
I don't think you were asking about the data-structure files, but in case you were, you can get those with the following regex:
downloadcmd -dp 1220860 --file-regex "^[^/]+.txt"
I wasn't, but that made me realize the solution to "download just the metadata" is:
downloadcmd -dp 1220860 --file-regex match-nothing
which will happily download package_file_metadata.txt.gz, then download 0 files and exit, which is exactly what I want.
This still doesn't solve the underlying bug that if you're using S3, the metadata should live there too, though.
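For scripting, the match-nothing approach above could be wrapped in a small shell function. This is only a sketch: the function name `fetch_metadata_only` and the overridable `DOWNLOADCMD` variable are made up for illustration, and combining `-dp`, `-d`, and `--file-regex` in a single call is an assumption based on the commands quoted in this thread. Credentials (e.g. `-u`) are left out.

```shell
#!/bin/sh
# Hypothetical wrapper: fetch only the package metadata, non-interactively.
# DOWNLOADCMD is an illustrative variable (not part of the tool) so the
# command can be swapped out for a dry run.
DOWNLOADCMD="${DOWNLOADCMD:-downloadcmd}"

fetch_metadata_only() {
    package_id="$1"   # e.g. 1220860
    dest_dir="$2"     # e.g. .
    # A regex that matches no package file: per this thread, the tool still
    # fetches the metadata file first, then downloads 0 files and exits.
    # $DOWNLOADCMD is intentionally unquoted so it may contain several words.
    $DOWNLOADCMD -dp "$package_id" -d "$dest_dir" --file-regex match-nothing
}

# Example call (commented out so the sketch has no side effects):
# fetch_metadata_only 1220860 .
```

Setting `DOWNLOADCMD="echo downloadcmd"` prints the command instead of running it, which is handy for a dry run. The follow-up -s3 download would be a separate command; its exact arguments aren't shown in this thread, so they are omitted here.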
In all new packages the metadata file includes itself, which means you can run
downloadcmd -dp <package-id> --file-regex package_file_metadata_<package-id>.txt.gz
and there should be 1 file that meets the filter criteria.
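If the package ID is held in a variable, the filename pattern can be built rather than typed by hand. The sketch below assumes that --file-regex treats its argument as a regular expression in which an unescaped `.` matches any character, so it escapes the dots; `metadata_regex` is a hypothetical helper name, not something the tool provides.

```shell
#!/bin/sh
# Hypothetical helper: print a regex for the metadata file of a package.
metadata_regex() {
    # $1: package ID. In the printf format, "\\." produces a literal "\.",
    # i.e. an escaped dot in the resulting regex.
    printf 'package_file_metadata_%s\\.txt\\.gz\n' "$1"
}

# Example use (commented out; requires downloadcmd to be installed):
# id=1220860
# downloadcmd -dp "$id" --file-regex "$(metadata_regex "$id")"
```

Escaping the dots only makes the match stricter; the unescaped form from the comment above would match the same file.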
Any reason to do that vs. match-nothing?
And what about what I said about S3? Or is the idea that you don't want the metadata remote? (In my case, it is anyway, because the "local" directory is S3-mounted.)
> Any reason to do that vs. match-nothing?
The end result is the same. I guess the user's intent is clearer if you actually specify the file you want to download.
> And what about what I said about S3? Or is the idea that you don't want the metadata remote? (In my case, it is anyway, because the "local" directory is S3-mounted.)
The metadata file is used extensively by the program, so for now we decided to always keep it local. Regarding your use of s3fs: the way the program works now (when downloading locally) is to append a .partial extension to files as they are being downloaded, and then rename the file when the download is complete. The rename operation is not implemented by tools like s3fs, so we were under the impression that this needs to change before downloadcmd can work with S3-mounted directories. Have you not run into this situation? In any case, I think using the -s3 flag would be more efficient than downloading to an S3-mounted file system, because S3-to-S3 transfers don't leave Amazon's cloud.
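The .partial-then-rename pattern described here can be illustrated with a few lines of shell. This is not the tool's actual code, only a sketch of the pattern: `cp` stands in for the network transfer, and `download_with_partial` is a made-up name.

```shell
#!/bin/sh
# download_with_partial SRC DEST
# Stand-in for a download: copies SRC to DEST.partial, then renames it to
# DEST only if the copy succeeded. A failed copy leaves no DEST behind.
download_with_partial() {
    src="$1"
    dest="$2"
    if cp "$src" "$dest.partial"; then
        # The rename is the step that depends on filesystem support.
        mv "$dest.partial" "$dest"
    else
        rm -f "$dest.partial"
        return 1
    fi
}
```

On a mounted bucket, the `mv` step is where this pattern would succeed or fail, depending on whether the mount supports rename.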
(a) s3fs does support rename (even deep directory rename), and has for many years now (almost a decade!)
(b) But regardless, we are in fact using the -s3 flag for the actual download. We just mount the bucket for later use using s3fs. (And we're doing all this from EC2, so nothing leaves AWS regardless.)
AHA - but, in re-testing this just now, it appears you made an important change in 01a7b08 that changes all of this: you're downloading package_file_metadata to nda-tools/downloadcmd/packages/<package-id>/ rather than to the supplied --directory.
So this is moot anyway!
I don't know if this is a dupe of #67, but it seems that if you want to download to S3, you have to download package_file_metadata.txt.gz locally first. Is this intentional or a bug?
E.g.: