NDAR / nda-tools

Python package for interacting with NDA web services. Used to validate, submit, and download data to and from NDA.

Allow configuration of the NDA_ORGINIZATION_ROOT_FOLDER #101

ericearl opened this issue 1 month ago

ericearl commented 1 month ago

I have an issue where multiple highly parallel downloadcmd jobs collide on temporary files in the ~/NDA/nda-tools/downloadcmd/packages/${PACKAGE_ID}/.download-progress folder. I think this could be resolved by making the global variables at https://github.com/NDAR/nda-tools/blob/main/NDATools/__init__.py#L57-L76 configurable or user-settable. As it stands, my current plan feels convoluted because of this limitation:

  1. Get an HPC instance with a local scratch space
  2. pip install nda-tools -t /scratch/$SLURM_JOB_ID
  3. Update the PATH environment variable to prioritize use of this new nda-tools instance
  4. Run a script to replace the os.path.expanduser('~') with the new /scratch/$SLURM_JOB_ID in https://github.com/NDAR/nda-tools/blob/main/NDATools/__init__.py#L57
  5. Copy back the /scratch/$SLURM_JOB_ID/bin/downloadcmd to /scratch/$SLURM_JOB_ID/downloadcmd (because of relative import problems)
  6. Then I can run downloadcmd safely in parallel without all the default files going to my home directory

I wonder if this is basically what you described having done in #84, @j-tseng?

P.S. Below is my blurb of code that does all that. It comes from https://github.com/nimh-dsst/abcd-fasttrack2bids/blob/main/swarm.sh#L70 and you can read my fix_downloadcmd.py script here: https://github.com/nimh-dsst/abcd-fasttrack2bids/blob/main/fix_downloadcmd.py.

```bash
# Install nda-tools into job-local scratch space
DOWNLOADCMD_PATH=/lscratch/${SLURM_JOB_ID}/pip_install
mkdir ${DOWNLOADCMD_PATH}
poetry run --directory ${CODE_DIR} python -m pip install nda-tools -t ${DOWNLOADCMD_PATH}
# Rewrite the hard-coded home directory to point at the scratch path
poetry run --directory ${CODE_DIR} python ${CODE_DIR}/fix_downloadcmd.py ${DOWNLOADCMD_PATH}
# Copy the entry point up a level to avoid relative-import problems
cp ${DOWNLOADCMD_PATH}/bin/downloadcmd ${DOWNLOADCMD_PATH}/
# Prioritize the patched install on PATH
export PATH=${DOWNLOADCMD_PATH}:${PATH}
```

j-tseng commented 1 month ago

Hi Eric,

We had a two-part issue: 1) our home folders are capped at 10 GB, so I changed the (~) to a different destination, and 2) we suspected that collisions were happening because the parallel download requests were firing too quickly, so we added random jitter to each download request (a sleep at the start of each job script, sampled from a uniform distribution over 10 minutes), and that seems to address it.
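
In case it's useful, here is roughly what that jitter looks like recast in Python (a minimal sketch; the package ID is a placeholder and we assume downloadcmd is on PATH):

```python
import random
import subprocess
import time

# Stagger job start times: sleep for a random offset sampled
# uniformly from [0, 10 minutes) before requesting the download.
time.sleep(random.uniform(0, 600))

# Placeholder invocation; 1234567 stands in for a real package ID.
subprocess.run(["downloadcmd", "-dp", "1234567"], check=True)
```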

However, I agree that an actual fix to the nda-tools repo to allow user-specified folders would be best. How come you're using this strategy (installing nda-tools on each job instance and running a script to replace the ~) vs. forking the repo and modifying it? Taking a quick look, adding a custom arg to change this path would probably involve: 1) adding the arg to downloadcmd.py so that it is available when the init is called at line 170, and 2) routing it through to init_and_create_configuration in __init__.py, since that's where the folder locations are specified. A rough sketch of that shape is below.
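
(The --nda-root argument name and the configure_root() helper here are hypothetical stand-ins for illustration, not actual nda-tools code.)

```python
import argparse
import os


def configure_root(root_folder):
    # Stand-in for the logic in NDATools/__init__.py that currently
    # hard-codes os.path.expanduser('~') as the organization root.
    return {
        "root": root_folder,
        "downloadcmd": os.path.join(root_folder, "nda-tools", "downloadcmd"),
    }


def main():
    parser = argparse.ArgumentParser(prog="downloadcmd")
    parser.add_argument(
        "--nda-root",
        default=os.path.expanduser("~/NDA"),
        help="override the default NDA root folder (default: ~/NDA)",
    )
    args = parser.parse_args()
    # The parsed value would then be threaded through to the
    # configuration init instead of the hard-coded home directory.
    config = configure_root(args.nda_root)
    print(config)


if __name__ == "__main__":
    main()
```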

Hope that helps! JT

P.S. Apologies for the horrendous email auto-formatting.

gregmagdits commented 1 month ago

Hi @ericearl / @j-tseng, I am looking into this now. I can see how it would be beneficial to change the location of the root directory when space in the user's home directory is limited; however, I do not see how this would help with file-name collisions, so please help me understand the trouble you are seeing. The only temporary files that the program produces are:

  1. .partial files: while each file is downloading it is temporarily given a .partial suffix, which is removed when the download completes.
  2. failed-s3-links files: the name starts with failed_s3_links_file, followed by a timestamp, a unique suffix, and a .csv extension. A typical file name is failed_s3_links_file_20240715T113932puj1d1yf.csv.

Is one of these two file types involved in the file-name collisions being observed? If not, please provide some more information about the errors you are seeing.

P.S. For the case where space on the ~ directory is limited, instead of modifying the codebase it should be possible to create a symbolic link from ~/NDA to wherever file-system space is not limited.
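
A sketch of that workaround in Python (assuming a hypothetical /data/big_volume location with ample quota; plain mv and ln -s would do the same):

```python
import os
import shutil

nda_home = os.path.expanduser("~/NDA")
target = "/data/big_volume/NDA"  # hypothetical roomy location

os.makedirs(target, exist_ok=True)
if os.path.isdir(nda_home) and not os.path.islink(nda_home):
    # Preserve anything already downloaded, then remove the real folder
    shutil.copytree(nda_home, target, dirs_exist_ok=True)
    shutil.rmtree(nda_home)
if not os.path.lexists(nda_home):
    os.symlink(target, nda_home)
```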

Thanks

j-tseng commented 1 month ago

Hi @gregmagdits - apologies, I'm a little fuzzy on this since the time delay strategy fixed our collision issue... I think when we investigated, it had something to do with the *.partial files, but @ericearl might have better feedback since he's actively facing this issue.

The symbolic link workaround is a clever temporary solution, but I would still love to see the ability to specify output paths for full transparency. Plus, it may not work for folks whose filesystems don't support symbolic links across quota domains.

Thanks for your help!

gregmagdits commented 1 month ago

@ericearl the next version of the tools is supposed to be released relatively soon. Can you provide more details about the file-name collisions you observed so that we can try to fit this into the next release?

ericearl commented 1 month ago

@gregmagdits I have since deleted the logs, but they were full of something like a CSV value error: one of the temporary NDA CSV files was being written to concurrently, and as soon as it was "double-written-to" once, every subsequent downloadcmd run failed with the same CSV issue. The fix was to rm -rf the .download-progress subfolder, roughly as sketched below.
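
(A sketch of that recovery step; the package ID is a placeholder.)

```python
import os
import shutil

PACKAGE_ID = "1234567"  # placeholder for the affected package
progress_dir = os.path.expanduser(
    f"~/NDA/nda-tools/downloadcmd/packages/{PACKAGE_ID}/.download-progress"
)
# Remove the corrupted download-progress state so fresh runs can proceed
shutil.rmtree(progress_dir, ignore_errors=True)
```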

gregmagdits commented 1 month ago

Right, that makes sense. @j-tseng found another open issue that might be a duplicate of this one (#94). I think we have enough information to get started.

ericearl commented 1 month ago

Yeah, just launch 1,000 or so concurrent downloadcmd jobs, each targeting different S3 links, and it will likely happen.