faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Running individual assemblies #76

Closed leonvarhan closed 7 years ago

leonvarhan commented 7 years ago

Hello, I am trying to run my data assembly on an HPC, but the job takes too long to get started and gets killed for exceeding its memory or time limits. I was wondering if I could run each species separately by creating individual "assembly.conf" files. Will there be any conflict if I run the "phyluce_assembly_assemblo_trinity" program for all 50 species simultaneously? Will it generate 50 separate "phyluce_assembly_assemblo_trinity.log" files? For example:

phyluce_assembly_assemblo_trinity \
    --conf assembly_species_01.conf \
    --output trinity_assemblies_species01 \
    --subfolder split-adapter-quality-trimmed \
    --clean \
    --cores 28

phyluce_assembly_assemblo_trinity \
    --conf assembly_species_02.conf \
    --output trinity_assemblies_species02 \
    --subfolder split-adapter-quality-trimmed \
    --clean \
    --cores 28

… etc

Thank you for your time, LV.

MikeWLloyd commented 7 years ago

@leonvarhan : As an aside to your question, I run my assemblies outside of the Phyluce pipeline when using HPC. The grid I use is similar to a Sun Grid Engine, so my statements are qsub etc. I'm posting the bash script I use as an example of how to get around running only one sample at a time. You will likely need to modify it for your HPC instance. Doing it this way might be overkill, but I found that running phyluce_assembly_assemblo_trinity was taking too long given how many CPUs I had access to and how many samples we process. I also wanted to avoid having to make hundreds of configuration files.

Also, take care, if you do run phyluce_assembly_assemblo_trinity on a large number of samples (100+), that the intermediate Trinity files are being cleaned up. I broke our grid system that way...

Attached are the BASH script and job file for use on a Sun Grid Engine style HPC.

These have been extensively tested, and function on the current platform as of 06.20.17

To use these scripts:

  1. Create the Bash file, and job file. These do not have to be in the same directory as the uce-clean folder.
  2. chmod +x [bash_file]
  3. Run the bash script.

--

Bash script file (2.trinity_submission.sh)

#!/bin/sh

me=`basename "$0"`

#check syntax
if [ $# -ne 3 ]; then
    echo "Script needs directory input."
    echo "Script usage: $me ./path/to/uce-clean ./path/to/output-dir ./path/to/job-file"
    exit 1
fi

#get variables
workdir=$(readlink -e "$1")
outputdir=$(readlink -f "$2")
jobfile=$(readlink -e "$3")

#check jobfile existence
if [ ! -f "$jobfile" ]; then 
    echo "Can't find job-file $jobfile; check that it exists."
    exit 1
fi

#check outdir existence
if [ -d "$outputdir" ]; then
    echo "Directory '$outputdir' exists, check output directory to avoid overwrite."
    exit 1
else
    mkdir $outputdir
fi

#check logfile existence
if [ -d $outputdir/job_logs ]; then
    echo "Log directory '$outputdir/job_logs' exists"
else
    mkdir $outputdir/job_logs
fi

#####

#loop over taxa
for ARQ in $workdir/*
do

taxon=`basename "$ARQ"`;

#check if taxon directory is present.
if [ -d $outputdir/$taxon ]; then
    echo "Directory '$outputdir/$taxon' exists, check output directory to avoid overwrite."
    exit 1
else
    mkdir $outputdir/$taxon
fi

#concat read1 and singletons if needed. 
#submit to cluster
if [ -e $ARQ/split-adapter-quality-trimmed/$taxon-READ1_cat.fastq.gz ]; then
    echo "Cat file for $taxon already exists. Submitting."
    qsub -q lThM.q -N trinity -S /bin/sh -cwd -o $outputdir/job_logs/job_$taxon.out -v TAXON=$taxon,LOC=$ARQ,OUTDIR=$outputdir -j y -l mres=50G,h_data=50G,h_vmem=50G,himem -pe mthread 4 $jobfile
else
    echo "Adding singletons onto R1 for $taxon, then submitting."
    cat $ARQ/split-adapter-quality-trimmed/$taxon-READ-singleton.fastq.gz $ARQ/split-adapter-quality-trimmed/$taxon-READ1.fastq.gz > $ARQ/split-adapter-quality-trimmed/$taxon-READ1_cat.fastq.gz
    #rm $ARQ/split-adapter-quality-trimmed/$taxon-READ1.fastq.gz $ARQ/split-adapter-quality-trimmed/$taxon-READ-singleton.fastq.gz
    #To avoid filling the drive with too many files, the job file deletes the '*_cat' file.  
    qsub -q lThM.q -N trinity -S /bin/sh -cwd -o $outputdir/job_logs/job_$taxon.out -v TAXON=$taxon,LOC=$ARQ,OUTDIR=$outputdir -j y -l mres=50G,h_data=50G,h_vmem=50G,himem -pe mthread 4 $jobfile
fi

done

--

Job File (trinity.job)

module load bioinformatics/trinity/r2013_2_25
Trinity --CPU $NSLOTS --seqType fq --JM 50G --left $LOC/split-adapter-quality-trimmed/$TAXON-READ1_cat.fastq.gz --right $LOC/split-adapter-quality-trimmed/$TAXON-READ2.fastq.gz --full_cleanup --min_kmer_cov 2 --output $OUTDIR/$TAXON
rm -r $OUTDIR/$TAXON
rm $LOC/split-adapter-quality-trimmed/$TAXON-READ1_cat.fastq.gz $LOC/split-adapter-quality-trimmed/$TAXON-READ1_cat.fastq  $LOC/split-adapter-quality-trimmed/$TAXON-READ2.fastq

--

Sample call

nohup ./2.trinity_submission.sh /path/to/clean-fastq /path/to/trinity_assemblies /path/to/trinity.job > trin_submission_nohup.out &

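If your cluster runs SLURM rather than a Sun Grid Engine variant, the qsub line in the bash script above could be translated roughly as follows. This is an untested sketch, not part of the original scripts: the partition name (`himem`), memory request, and log-path conventions are assumptions to adapt to your own scheduler.

```shell
# Hypothetical SLURM equivalent of the qsub call in the submission script.
# Partition name, memory request, and log path are assumptions.
sbatch --job-name=trinity \
    --partition=himem \
    --cpus-per-task=4 \
    --mem=50G \
    --output="$outputdir/job_logs/job_$taxon.out" \
    --export=TAXON="$taxon",LOC="$ARQ",OUTDIR="$outputdir" \
    "$jobfile"
```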
brantfaircloth commented 7 years ago

Yes, you could generate many assembly.conf files and just submit each as an individual job. Mike's suggestion is also a good one (using a dedicated script for HPC). On our HPC, we sometimes use GNU parallel to run the Trinity assemblies. You can also try the --dir option instead of the --conf option - the former simply takes a directory full of reads and creates an output folder holding the assemblies.
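The GNU parallel approach can look something like the sketch below. The directory layout (`clean-fastq/<taxon>/split-adapter-quality-trimmed/`), the 2 concurrent jobs × 14 CPUs split, and the Trinity flags (copied from the job file above) are all assumptions you would adjust for your own data and cluster:

```shell
# Hedged sketch: one Trinity run per taxon directory under clean-fastq/,
# two at a time with 14 CPUs each. Paths and resource split are assumptions.
ls clean-fastq | parallel -j 2 \
    'Trinity --CPU 14 --seqType fq --JM 50G \
        --left clean-fastq/{}/split-adapter-quality-trimmed/{}-READ1.fastq.gz \
        --right clean-fastq/{}/split-adapter-quality-trimmed/{}-READ2.fastq.gz \
        --full_cleanup --min_kmer_cov 2 \
        --output trinity-assemblies/{}'
```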

What you will find, depending on how you go about it, is that you may need to create your own contigs folder where all the assemblies from each individual run are symlinked, so you get something like:

my_symlink.contigs.fasta -> /path/to/my/individual/assembly/Trinity.fasta

This is basically one of the things that phyluce_assembly_assemblo_trinity does for you. You'll also have to deal with cleaning up intermediate files, which often cause problems on HPC systems.
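That symlinking step can be sketched as a short loop; this assumes one assembly directory per taxon, each containing a Trinity.fasta (the `trinity-assemblies` and `contigs` directory names are placeholders):

```shell
# Build a contigs/ folder of symlinks, one per taxon, mirroring the layout
# shown above (directory names are assumptions).
mkdir -p contigs
for assembly in trinity-assemblies/*/; do
    taxon=$(basename "$assembly")
    ln -sf "$(readlink -f "$assembly/Trinity.fasta")" "contigs/$taxon.contigs.fasta"
done
```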