NDAR / nda-tools

Python package for interacting with NDA web services. Used to validate, submit, and download data to and from NDA.

Parallel starting of downloadcmd causes file collision #84

Closed: ericearl closed this issue 7 months ago

ericearl commented 8 months ago

@gregmagdits I'm having a problem similar to #49 with downloadcmd.

When two or more downloadcmd calls start in parallel, and more importantly start within the exact same second, the first call to finish downloading deletes the S3 links log file ~/NDA/nda-tools/downloadcmd/logs/failed_s3_links_file_YYYYMMDDThhmmss.txt afterward (where YYYYMMDDThhmmss is the year, month, day, hour, minute, and second at which downloadcmd began). This leaves every other job started in that same second without a file to delete and therefore throws a FileNotFoundError.
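For illustration, here is a minimal sketch (not actual nda-tools code) of how a second-resolution, timestamp-derived log name leads to the collision described above when several processes start in the same second:

```python
# Illustrative sketch only; the path and naming follow the description above,
# not the real nda-tools implementation.
import os
from datetime import datetime

log_dir = os.path.expanduser('~/NDA/nda-tools/downloadcmd/logs')
os.makedirs(log_dir, exist_ok=True)

stamp = datetime.now().strftime('%Y%m%dT%H%M%S')  # second-level resolution: same-second starts share a name
failed_s3_links_file = os.path.join(log_dir, f'failed_s3_links_file_{stamp}.txt')

open(failed_s3_links_file, 'w').close()  # every process started this second opens the identical path

# ... downloads run ...

os.remove(failed_s3_links_file)  # the first process to finish deletes the file;
                                 # the others raise FileNotFoundError here
```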

This could be prevented with slower serial calls or by staggering jobs, but a better solution would be to allow the output log file to be sent somewhere else via a command-line argument to downloadcmd or an input config file, especially considering the use of downloadcmd on an HPC cluster.

Thanks and let me know if you need any logs or further information to reproduce this error yourselves.

gregmagdits commented 8 months ago

We should be able to address this, but I am wondering what the use case is that requires running multiple instances of downloadcmd simultaneously. If the end goal is to increase the number of parallel downloads, I would suggest using the -wt option instead, which allows you to explicitly set the number of parallel downloads, since it uses threads rather than spawning new processes.

ericearl commented 8 months ago

@gregmagdits Thanks for the fast reply!

downloadcmd is already integrated into the DCAN-Labs/abcd-dicom2bids software, where one downloadcmd call is made for each list of S3 links covering a single MRI session's DICOM-containing TGZ series, as uploaded by the DAIRC to collection 2573.

This technique allows each list of S3 links (for TGZ files containing DICOMs), specific to one subject and one session, to be downloaded into its own destination folder. That logical separation of files helps.

gregmagdits commented 7 months ago

This has been fixed in the most recent release of the tools, 0.2.26.

ericearl commented 7 months ago

Excellent! Thanks! I'll try it out the next time I download. Is the documentation for the option in the --help output of downloadcmd? I couldn't tell from here, or maybe I missed it while reading/searching...

ericearl commented 7 months ago

Oh, wait! I see it now. I mistook the -d argument for an argument that already existed before...

https://github.com/NDAR/nda-tools/blob/master/NDATools/clientscripts/downloadcmd.py#L57-L58

Can I request that the help text be updated to make it clearer what this does?

From:

Enter an alternate full directory path where you would like your files to be saved. The default is ~/NDA/nda-tools/<package-id>

To:

Enter an alternate full directory path where you would like your downloadcmd log files to be saved. The default is ~/NDA/nda-tools/<package-id>

gregmagdits commented 7 months ago

@ericearl We didn't add an option to change the log directory as suggested in #49. Instead we created the failed_s3_links file with a NamedTemporaryFile to guarantee a unique file name during file creation. I think we are going to start making more frequent releases for nda-tools, now planned at about once per month. If you think that making the location of the log files configurable would still be helpful even after this change, we can try to include it in one of the upcoming releases.
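For illustration, a minimal sketch of that approach (not the actual nda-tools implementation; the log directory is assumed from the path mentioned earlier in this thread):

```python
import os
import tempfile

log_dir = os.path.expanduser('~/NDA/nda-tools/downloadcmd/logs')
os.makedirs(log_dir, exist_ok=True)

# NamedTemporaryFile picks a unique name at creation time, so concurrent
# downloadcmd processes can no longer race on an identical timestamp-derived path.
failed_links = tempfile.NamedTemporaryFile(
    mode='w',
    prefix='failed_s3_links_file_',
    suffix='.txt',
    dir=log_dir,
    delete=False,  # keep the file so failed links can be inspected and retried later
)
failed_links.write('s3://example-bucket/example-object\n')  # hypothetical failed link
failed_links.close()
```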

j-tseng commented 7 months ago

@gregmagdits: +1 to the idea of being able to configure the location of the log files (along with the metadata text file download location, etc.).

We work off of our cluster, and user home directories have limited space (10 GB). That space gets taken up quickly, even by the metadata text file alone, but also by the log files and residual files in the .downloadprogress hidden folder. Before we realized that space was the issue, the metadata text *.gz would seemingly download and attempt to extract, but it would be broken and cause a pandas parsing error when attempting to find files matching the --regexp argument. For now, I've modified the __init__.py script to build this folder in a different location for us, which fixed the text file download issue.
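For reference, the kind of local change described above might look roughly like this (an illustrative sketch only; the variable name follows the later comment in this thread, and the destination path is a hypothetical example):

```python
# In a locally installed NDATools/__init__.py, point the root folder at a larger
# filesystem instead of the space-limited home directory (hypothetical path).
import os

NDA_ORGINIZATION_ROOT_FOLDER = os.path.join('/lab/scratch/nda-tools', 'NDA')
if not os.path.exists(NDA_ORGINIZATION_ROOT_FOLDER):
    os.makedirs(NDA_ORGINIZATION_ROOT_FOLDER)
```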

ericearl commented 5 days ago

@gregmagdits This issue is biting me again... Should I open a new GitHub issue as a feature request to "Allow configuration of the NDA_ORGINIZATION_ROOT_FOLDER" or do you want to manage it inside this GitHub issue here?

I'm thinking this issue could be resolved by allowing configurable or user-supplied global variables in https://github.com/NDAR/nda-tools/blob/main/NDATools/__init__.py#L57C1-L76. In the meantime, my current workaround plan feels convoluted:

  1. Get an HPC instance with a local scratch space
  2. pip install nda-tools -t /scratch/$SLURM_JOB_ID
  3. Update the PYTHONPATH to prioritize use of this new nda-tools instance
  4. Run a script to replace the os.path.expanduser('~') with the new /scratch/$SLURM_JOB_ID in https://github.com/NDAR/nda-tools/blob/main/NDATools/__init__.py#L57 (a sketch of such a script follows this list)
  5. Then I can run downloadcmd safely in parallel without all the default files going to my home directory
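A hypothetical one-off script for step 4 might look something like this (paths follow the steps above; it assumes the os.path.expanduser('~') text appears verbatim in the installed copy):

```python
# Patch the scratch-installed NDATools/__init__.py so default folders are built under
# the job's scratch space instead of the home directory (step 4 above).
import os
from pathlib import Path

scratch = Path('/scratch') / os.environ['SLURM_JOB_ID']   # matches the install target in step 2
init_py = scratch / 'NDATools' / '__init__.py'

source = init_py.read_text()
# repr(...) yields a quoted Python string literal to drop in place of the expanduser call
init_py.write_text(source.replace("os.path.expanduser('~')", repr(str(scratch))))
```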

Is this basically what you did, @j-tseng?

gregmagdits commented 8 hours ago

@ericearl - yes, please open a new issue, since this seems unrelated to the original one. I will try to get it into the next release.