CCRGeneticsBranch / khanlab_ngs_pipeline


Review ${{LOCAL}} variable and remove - set to output directory #14

Closed (slsevilla closed this 1 year ago)

slsevilla commented 2 years ago

Currently the pipeline uses the ${{LOCAL}} variable within many of its rules. This location serves as the site for intermediary and temporary file storage. This is not the most efficient use of disk space, and the rules should be updated to use LSCRATCH when possible. More significantly, as related to issue #12, if this variable is set to a /vf/ directory, then intermediary files may still be affected by the truncation of text files, regardless of whether or not the output directory is a non-/vf/ location. This will need to be immediately addressed in all rules before re-runs or new runs can occur.
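
For context, a minimal sketch of what the proposed change could look like inside a rule's shell block, assuming the job is submitted with an lscratch allocation on Biowulf (e.g. --gres=lscratch:<N>); the directory and file names below are hypothetical and are not taken from the pipeline's actual rules:

# write intermediates to node-local scratch instead of ${LOCAL} on a /vf/ path
TMP="/lscratch/${SLURM_JOB_ID}"                # node-local scratch for this job
WORK="${TMP}/intermediate_files"               # hypothetical intermediate dir
mkdir -p "$WORK"
# ... run the tool so its intermediary/temporary files land under "$WORK" ...
cp "$WORK"/final_result.bam /path/to/output_dir/   # hypothetical final output, copied to the (possibly /vf/) output directory
rm -rf "$WORK"                                 # clean up node-local scratch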

slsevilla commented 1 year ago

I looked into this issue further.

It appears that ${{LOCAL}} is defined based on the {HOST} variable, which is in turn defined in the config file. If the config defines host=biowulf.nih.gov, we do not have an issue; ${{LOCAL}} is deployed under /lscratch/ (a temporary location local to the node the job is running on). If the config defines host=login01, then we would have an issue, as ${{LOCAL}} is then defined as "/projects/scratch/ngs_pipeline_{SAMPLESHEET}_{NOW}_${{SLURM_JOB_ID}}/".
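
A quick way to confirm what a given project's config has for host (a sketch; the config file name and key are assumptions and may differ):

# check which host the config defines (file name is hypothetical)
grep -n "host" ngs_pipeline_config.yaml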

Will ask Xinyu to confirm that the projects more recently run on non-/vf/ locations had the host variable set correctly in the config file. It appears that this code block is the only location where ${{LOCAL}} is being defined.

Code block defining ${{LOCAL}} is below.

shell.prefix("""
set -e -o pipefail
#module purge
sleep 20s
if [ {HOST} == 'biowulf.nih.gov' ]
    then
        MEM=`echo "${{SLURM_MEM_PER_NODE}} / 1024 "|bc`
        LOCAL="/lscratch/${{SLURM_JOBID}}/"
        THREADS=${{SLURM_CPUS_ON_NODE}}
elif [ {HOST} == 'login01' ]
    then
        module load slurm
        module load gcc/4.8.1
        MEM=`scontrol show job ${{SLURM_JOB_ID}} | grep "MinMemoryNode"| perl -n -e'/MinMemoryNode=(\d*)G/ && print $1'`
        mkdir -p /projects/scratch/ngs_pipeline_{SAMPLESHEET}_{NOW}_${{SLURM_JOB_ID}}/
        LOCAL="/projects/scratch/ngs_pipeline_{SAMPLESHEET}_{NOW}_${{SLURM_JOB_ID}}/"
        THREADS=`scontrol show job ${{SLURM_JOB_ID}} | grep  "MinCPUsNode" | perl -n -e'/MinCPUsNode=(\d*)/ && print $1'`
fi
""")
slsevilla commented 1 year ago

Per Xinyu, the current host is set to 'biowulf', so this is a non-issue for the recently run data.

See email below

Hello Sam,

The host login01 is our TGen server, not related to biowulf. Our pipeline only defines two hosts, the other one is biowulf. On this end we are fine with the pipeline configuration. Please let me know if this is not clear to you.

Thanks,

Xinyu