Closed leonvarhan closed 7 years ago
@leonvarhan : As an aside to your question, I run my assemblies outside of the Phyluce pipeline when using HPC. The grid I use is similar to a Sun Grid Engine, so my statements use qsub, etc. I'm posting the bash script I use as an example of how to get around running only one sample at a time. You will likely need to modify it for your HPC instance. Doing it this way might be overkill, but I found that running phyluce_assembly_assemblo_trinity was taking too long given how many CPUs I had access to and how many samples we process. I also wanted to avoid having to make hundreds of configuration files.
Also, if you do run phyluce_assembly_assemblo_trinity on a large number of samples (100+), make sure the Trinity intermediate files are being cleaned up. I broke our grid system that way...
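If you want a quick way to spot assemblies that were not cleaned up, a per-directory file count works; a Trinity run that skipped cleanup can leave thousands of files behind. This is a minimal sketch that builds a toy layout under mktemp so it can run anywhere; point the loop at your real assembly output directory instead.

```shell
# Demo setup under a temp dir (replace ASSEMBLY_DIR with your real path).
ASSEMBLY_DIR=$(mktemp -d)
mkdir -p "$ASSEMBLY_DIR/taxon_A" "$ASSEMBLY_DIR/taxon_B"
touch "$ASSEMBLY_DIR/taxon_A/Trinity.fasta"
touch "$ASSEMBLY_DIR/taxon_B/Trinity.fasta" "$ASSEMBLY_DIR/taxon_B/leftover.reads"

# Report how many files each assembly directory holds; unusually large
# counts suggest intermediate files were not cleaned up.
for d in "$ASSEMBLY_DIR"/*/; do
    n=$(find "$d" -type f | wc -l)
    echo "$(basename "$d"): $n file(s)"
done
```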
chmod +x [bash_file]
--
#!/bin/sh
me=`basename "$0"`
# check syntax
if [ $# -ne 3 ]; then
    echo "Script needs directory input."
    echo "Script usage: $me ./path/to/uce-clean ./path/to/output-dir ./path/to/job-file"
    exit 1
fi
# get variables
workdir=$(readlink -e "$1")
outputdir=$(readlink -f "$2")
jobfile=$(readlink -e "$3")
# check job-file existence
if [ ! -f "$jobfile" ]; then
    echo "Can't find job file $jobfile; check that it exists."
    exit 1
fi
# check output-directory existence
if [ -d "$outputdir" ]; then
    echo "Directory '$outputdir' exists; check the output directory to avoid overwriting."
    exit 1
else
    mkdir "$outputdir"
fi
# check log-directory existence
if [ -d "$outputdir/job_logs" ]; then
    echo "Log directory '$outputdir/job_logs' exists"
else
    mkdir "$outputdir/job_logs"
fi
#####
# loop over taxa
for ARQ in "$workdir"/*
do
    taxon=`basename "$ARQ"`
    # check whether the taxon directory is already present
    if [ -d "$outputdir/$taxon" ]; then
        echo "Directory '$outputdir/$taxon' exists; check the output directory to avoid overwriting."
        exit 1
    else
        mkdir "$outputdir/$taxon"
    fi
    # concatenate READ1 and singletons if needed, then submit to the cluster
    if [ -e "$ARQ/split-adapter-quality-trimmed/$taxon-READ1_cat.fastq.gz" ]; then
        echo "Cat file for $taxon already exists. Submitting."
        qsub -q lThM.q -N trinity -S /bin/sh -cwd -o "$outputdir/job_logs/job_$taxon.out" -v TAXON=$taxon,LOC=$ARQ,OUTDIR=$outputdir -j y -l mres=50G,h_data=50G,h_vmem=50G,himem -pe mthread 4 "$jobfile"
    else
        echo "Adding singletons onto READ1 for $taxon, then submitting."
        cat "$ARQ/split-adapter-quality-trimmed/$taxon-READ-singleton.fastq.gz" "$ARQ/split-adapter-quality-trimmed/$taxon-READ1.fastq.gz" > "$ARQ/split-adapter-quality-trimmed/$taxon-READ1_cat.fastq.gz"
        #rm $ARQ/split-adapter-quality-trimmed/$taxon-READ1.fastq.gz $ARQ/split-adapter-quality-trimmed/$taxon-READ-singleton.fastq.gz
        # To avoid filling the drive with too many files, the job file deletes the '*_cat' file.
        qsub -q lThM.q -N trinity -S /bin/sh -cwd -o "$outputdir/job_logs/job_$taxon.out" -v TAXON=$taxon,LOC=$ARQ,OUTDIR=$outputdir -j y -l mres=50G,h_data=50G,h_vmem=50G,himem -pe mthread 4 "$jobfile"
    fi
done
--
module load bioinformatics/trinity/r2013_2_25
Trinity --CPU $NSLOTS --seqType fq --JM 50G --left $LOC/split-adapter-quality-trimmed/$TAXON-READ1_cat.fastq.gz --right $LOC/split-adapter-quality-trimmed/$TAXON-READ2.fastq.gz --full_cleanup --min_kmer_cov 2 --output $OUTDIR/$TAXON
rm -r $OUTDIR/$TAXON
rm $LOC/split-adapter-quality-trimmed/$TAXON-READ1_cat.fastq.gz $LOC/split-adapter-quality-trimmed/$TAXON-READ1_cat.fastq $LOC/split-adapter-quality-trimmed/$TAXON-READ2.fastq
--
nohup ./2.trinity_submission.sh /path/to/clean-fastq /path/to/trinity_assemblies /path/to/trinity.job > trin_submission_nohup.out &
Yes, you could generate many assembly.conf files and just submit each as an individual job. Mike's suggestion is also a good one (using a particular script for HPC). On our HPC, we sometimes use GNU parallel to run the Trinity assemblies. You can also try to use the --dir option instead of the --conf option - the former simply takes a directory full of reads and creates an output folder holding the assemblies.
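As a sketch of the GNU parallel approach (the directory layout and the per-sample wrapper script name here are assumptions, not Phyluce commands), you would write one script that assembles a single taxon directory and let parallel keep a fixed number of them running at once:

```shell
# Hypothetical wrapper: trinity_one_sample.sh takes one clean-read
# directory as $1 and runs Trinity on it. Run at most 4 at a time:
ls -d /path/to/clean-fastq/*/ | parallel -j 4 ./trinity_one_sample.sh {}
```

The -j flag caps concurrent jobs, so total CPU and memory use stays bounded even with 50+ samples queued.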
What you will find, depending on how you go about it, is that you may need to create your own contigs folder where all the assemblies from each individual run are symlinked, so you get something like:
my_symlink.contigs.fasta -> /path/to/my/individual/assembly/Trinity.fasta
This is basically one of the things that phyluce_assembly_assemblo_trinity does. You'll also have to deal with cleaning up intermediate files, which often cause problems on HPC systems.
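The symlinking step can be scripted. This is a minimal sketch: the directory names are made up for the demo, which builds a toy layout under mktemp; in practice you would loop over your real per-taxon assembly directories.

```shell
# Demo layout: two per-taxon assembly dirs, each holding a Trinity.fasta.
base=$(mktemp -d)
mkdir -p "$base/assemblies/taxon_A" "$base/assemblies/taxon_B" "$base/contigs"
echo ">contig1" > "$base/assemblies/taxon_A/Trinity.fasta"
echo ">contig1" > "$base/assemblies/taxon_B/Trinity.fasta"

# Symlink each Trinity.fasta into one contigs folder, named by taxon,
# mimicking the contigs directory that phyluce tools expect.
for asm in "$base/assemblies"/*/; do
    taxon=$(basename "$asm")
    ln -s "$asm/Trinity.fasta" "$base/contigs/$taxon.contigs.fasta"
done
ls -l "$base/contigs"
```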
Hello, I am trying to run my data assembly on an HPC, but it is taking too long to get started and the job gets killed due to not enough memory or not enough time. I was wondering if I could run each species separately by creating individual "assembly.conf" files. Will there be any conflict if I run the "phyluce_assembly_assemblo_trinity" program for all 50 species simultaneously? Will it generate 50 separate "phyluce_assembly_assemblo_trinity.log" files? For example:
phyluce_assembly_assemblo_trinity \
    --conf assembly_species_01.conf \
    --output trinity_assemblies_species01 \
    --subfolder split-adapter-quality-trimmed \
    --clean \
    --cores 28
phyluce_assembly_assemblo_trinity \
    --conf assembly_species_02.conf \
    --output trinity_assemblies_species02 \
    --subfolder split-adapter-quality-trimmed \
    --clean \
    --cores 28
… etc
Thank you for your time, LV.