faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Upscaling phyluce for large datasets #275

Open · karlyhiggins opened this issue 2 years ago

karlyhiggins commented 2 years ago

Hello,

I am following tutorial 1 to process reads from 170 individuals. Some individuals have upwards of 10 million reads, and I am stuck on the assembly step taking a long time: around a full day per individual. I am running phyluce on an HPC and everything appears to be working correctly. Are there any suggestions for speeding it up? My only thought so far has been to split the run so I can submit multiple individuals at once.

Here is an example of my submission script; I can submit up to 20 cores per node and up to 10 nodes.

#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH -p long.q
#SBATCH --mem=56G
#SBATCH --time=0-120:00:00
#SBATCH --job-name=sym_phyluce
#SBATCH --export=ALL

source /home/khiggins/miniconda3/etc/profile.d/conda.sh
conda activate phyluce-1.7.1

phyluce_assembly_assemblo_spades --conf assembly.conf --output spades-assemblies --memory 56 --cores 20

brantfaircloth commented 2 years ago

When I have lots of assemblies to run, I usually split the input files into batches of something like 10-20 taxa and then run those across different nodes in parallel. Since it looks like you are using Slurm, you could look into job arrays, which might work well for your use case. I also tend to randomly downsample (using seqtk) the R1 and R2 input files to ~2-3 million reads each (for the tetrapod bait set) prior to assembly, which makes things go MUCH faster (10 million reads is a lot). Minimal sketches of both ideas follow.
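For concreteness, here is a rough sketch of both suggestions; the file names, batch count, and output directories are placeholders for this thread, not anything phyluce requires. First, downsampling with seqtk (using the same seed for R1 and R2 keeps read pairs in sync):

# Hypothetical file names; adjust to your own naming scheme.
# -s sets the random seed; identical seeds keep R1/R2 mates paired.
seqtk sample -s100 sample1_R1.fastq.gz 2000000 | gzip > sub/sample1_R1.fastq.gz
seqtk sample -s100 sample1_R2.fastq.gz 2000000 | gzip > sub/sample1_R2.fastq.gz

And a Slurm job array that assembles one batch per array task, assuming you have split assembly.conf into per-batch conf files (e.g., assembly_batch_0.conf through assembly_batch_8.conf for 170 taxa in batches of ~19):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=20
#SBATCH -p long.q
#SBATCH --mem=56G
#SBATCH --array=0-8              # one task per batch (assumed batch count)
#SBATCH --job-name=sym_phyluce

source /home/khiggins/miniconda3/etc/profile.d/conda.sh
conda activate phyluce-1.7.1

# Each array task gets its own conf file and output directory.
phyluce_assembly_assemblo_spades \
    --conf assembly_batch_${SLURM_ARRAY_TASK_ID}.conf \
    --output spades-assemblies-batch-${SLURM_ARRAY_TASK_ID} \
    --memory 56 \
    --cores 20

Once all array tasks finish, you would gather the per-batch contigs into a single directory before moving on to the downstream matching steps.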

That said, we have found that inputting more reads than this can sometimes be beneficial for the assembly of UCE contigs from toepads and/or other historical sources.