We should be able to address this, but I am wondering what the use case is that requires running multiple instances of the downloadcmd simultaneously. If the end goal is to increase the number of parallel downloads, I would suggest using the `-wt` option instead, which allows you to explicitly set the number of parallel downloads, since it uses threads instead of spawning new processes.
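For illustration, a single invocation along these lines (a sketch; `-dp`, `-t`, and `-d` are the package-ID, links-file, and directory flags in recent nda-tools releases, and the package ID and paths here are hypothetical, so check `downloadcmd --help` for your version):

```bash
# One downloadcmd process with 8 download threads,
# instead of 8 separately spawned downloadcmd processes.
downloadcmd -dp 1234567 -t s3_links.txt -wt 8 -d /scratch/mydownloads
```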
@gregmagdits Thanks for the fast reply!
`downloadcmd` is already integrated with the DCAN-Labs/abcd-dicom2bids software, where one `downloadcmd` call is made for each list of S3 links covering a single MRI session's DICOM-containing TGZ series, as uploaded by the DAIRC to collection 2573. This technique allows each list of S3 links (for TGZ files containing DICOMs) specific to one subject and session to be placed in a different destination folder. This logical file separation helps.
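A rough sketch of that per-session pattern (hypothetical paths and package ID; one links file per subject/session, each with its own `-d` destination):

```bash
#!/bin/bash
# Download each subject/session's TGZ series into its own folder.
# Assumes files like s3_links/sub-01_ses-1.txt, s3_links/sub-01_ses-2.txt exist.
for links in s3_links/sub-*_ses-*.txt; do
    session=$(basename "$links" .txt)
    downloadcmd -dp 1234567 -t "$links" -d "downloads/$session" -wt 4
done
```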
This has been fixed in the most recent release of the tools, 0.2.26
Excellent! Thanks! I'll try it out the next time I download. Is the documentation for the option in the `--help` flag to `downloadcmd`? I couldn't tell from here, or maybe I missed it in reading/searching...
Oh, wait! I see it now. I mistook the `-d` argument for an argument that already existed before...

https://github.com/NDAR/nda-tools/blob/master/NDATools/clientscripts/downloadcmd.py#L57-L58
Can I request the help text be updated to make it clearer what this is?
From:

> Enter an alternate full directory path where you would like your files to be saved. The default is ~/NDA/nda-tools/<package-id>

To:

> Enter an alternate full directory path where you would like your downloadcmd log files to be saved. The default is ~/NDA/nda-tools/<package-id>
@ericearl We didn't add an option to change the log directory as suggested in #49. Instead, we created the failed_s3_links file with a NamedTemporaryFile to guarantee a unique file name during file creation. I think we are going to start making more frequent releases of nda-tools. If you think that making the location of the log files configurable would still be helpful even after this change, we can try to include it in one of the upcoming releases, which we now plan on having about once per month.
@gregmagdits: +1 to the idea of being able to configure the location of the log files (along with the metadata text file download location, etc.).
We work off of our cluster and user home directories have limited space (10 GB). This gets taken up quickly, even by the metadata text file alone, but also by the log files and residual files in the .downloadprogress hidden folder. Before we realized that space was the issue, the metadata text *.gz would seemingly download and attempt to extract, but be broken and cause a pandas parsing error when attempting to find files matching the --regexp argument. For now, I've modified the `__init__.py` script to build this folder in a different location for us, which fixed the text file download issue.
@gregmagdits This issue is biting me again... Should I open a new GitHub issue as a feature request to "Allow configuration of the `NDA_ORGINIZATION_ROOT_FOLDER`", or do you want to manage it inside this GitHub issue here?

I'm thinking this issue could be resolved by allowing configurable or user-input global variables in https://github.com/NDAR/nda-tools/blob/main/NDATools/__init__.py#L57C1-L76. Anyway, my current plan feels convoluted because of this need (sketched below):

1. `pip install nda-tools -t /scratch/$SLURM_JOB_ID`
2. Set `PYTHONPATH` to prioritize use of this new `nda-tools` instance
3. Replace `os.path.expanduser('~')` with the new `/scratch/$SLURM_JOB_ID` in https://github.com/NDAR/nda-tools/blob/main/NDATools/__init__.py#L57
4. Run `downloadcmd` safely in parallel without all the default files going to my home directory

Is this basically what you did, @j-tseng?
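For concreteness, a minimal sketch of that plan, assuming a SLURM environment (the `sed` edit targets the home-directory expansion in `NDATools/__init__.py` and is illustrative only; it is likely fragile across nda-tools releases, and the package ID is hypothetical):

```bash
#!/bin/bash
# 1. Install a private copy of nda-tools into job-local scratch space.
pip install nda-tools -t "/scratch/$SLURM_JOB_ID"

# 2. Put that copy first on PYTHONPATH so the existing downloadcmd
#    entry point imports NDATools from scratch instead of the system install.
export PYTHONPATH="/scratch/$SLURM_JOB_ID:$PYTHONPATH"

# 3. Point the package root at scratch instead of the home directory
#    (hypothetical, brittle edit of the expanduser('~') call).
sed -i "s|os.path.expanduser('~')|'/scratch/$SLURM_JOB_ID'|" \
    "/scratch/$SLURM_JOB_ID/NDATools/__init__.py"

# 4. Parallel downloadcmd calls now write their default files under scratch.
downloadcmd -dp 1234567 -t s3_links.txt -d "/scratch/$SLURM_JOB_ID/downloads"
```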
@ericearl - yes please open a new issue since this seems unrelated to the original. I will try to get it into the next release.
@gregmagdits I'm having a problem similar to #49 with `downloadcmd`.

When two or more `downloadcmd` calls start in parallel, and more importantly start at the exact same second, the first to successfully download deletes the S3 links log file `~/NDA/nda-tools/downloadcmd/logs/failed_s3_links_file_YYYYMMDDThhmmss.txt` afterward (where `YYYYMMDDThhmmss` is the year, month, day, hour, minute, and second at which `downloadcmd` began). This leaves every other job started in the same second without a file to delete and therefore throws a `FileNotFoundError`.

This could be prevented with slower serial calls or by staggering jobs, but that is not as good a solution as allowing the output log file to be sent somewhere else via a command-line argument to `downloadcmd` or some input config file, especially considering the use of `downloadcmd` on an HPC cluster.

Thanks, and let me know if you need any logs or further information to reproduce this error yourselves.
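For reference, the staggering workaround mentioned above could look roughly like this (a sketch, assuming SLURM array jobs with hypothetical file names; the per-task sleep just guarantees no two `downloadcmd` calls begin in the same second):

```bash
#!/bin/bash
#SBATCH --array=0-9
# Offset each array task by a few seconds so parallel downloadcmd calls
# do not race on the timestamp-named failed_s3_links file.
sleep $((SLURM_ARRAY_TASK_ID * 5))
downloadcmd -dp 1234567 -t "s3_links_${SLURM_ARRAY_TASK_ID}.txt" \
    -d "downloads/${SLURM_ARRAY_TASK_ID}"
```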