bcgsc / transabyss

de novo assembly of RNA-seq data using ABySS

Assembly strategy for large datasets #24

Open · opened by mtmcgowan 3 years ago

mtmcgowan commented 3 years ago

Hi Trans-abyss team,

I have a large dataset consisting of 99 libraries of ~35 million reads each, for a species that does not yet have a reference genome, and I am interested in building a _de novo_ assembly. I have access to an HPC cluster (24 cores, 250 GB RAM) and have set up Trans-ABySS with Singularity.

I am unsure whether it would be better to assemble a single transcriptome using all libraries or assemble each library separately and then merge them.

Based on your experience with the assembler, can you recommend a strategy given my available computing resources?

kmnip commented 3 years ago

Hello @mtmcgowan ,

That is roughly 3.4 billion reads in total (99 × 35 million), which is probably too many to assemble at once in 250 GB of RAM. So, I suggest that you assemble each library individually and merge the assemblies together.
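For example, a per-library loop along these lines (the file names are hypothetical, and you should check `transabyss --help` for the exact options in your version):

```bash
# Assemble each library on its own.
# lib01_1.fq.gz / lib01_2.fq.gz etc. are hypothetical paired-end file names.
for r1 in lib*_1.fq.gz; do
    name=${r1%_1.fq.gz}
    transabyss --pe "$r1" "${name}_2.fq.gz" \
        --outdir "asm_${name}" --name "$name" --threads 24
done
```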

You can combine all assemblies into one FASTA file, then de-duplicate with BBMap's Dedupe: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/dedupe-guide/ and/or EvidentialGene: http://arthropods.eugenes.org/EvidentialGene/trassembly.html
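A minimal sketch of that combine-and-dedupe step (assuming each run's final contigs are named `<name>-final.fa`, Trans-ABySS's usual output naming; adjust paths to your layout):

```bash
# Pool all per-library assemblies into one FASTA file, then collapse
# duplicate and contained contigs with BBMap's Dedupe.
cat asm_*/*-final.fa > combined.fa
dedupe.sh in=combined.fa out=combined_nr.fa
```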

Hope that helps! Ka Ming

mtmcgowan commented 3 years ago

Hi Ka Ming,

I took your advice and independently assembled smaller subsets of the data. Since my experimental design consists of multiple genotypically distinct accessions with 3 biological replicates each, I opted to assemble at the accession level (combining replicates), for a total of 33 assemblies. At the default k-mer size (k = 32), this took roughly 9 days.

I have another assembly-strategy question about this dataset. The documentation recommends running Trans-ABySS with many different k-values and then combining the results into a single assembly. Based on your original paper and other papers that use your tool, the recommended range seems to run from the default (k = 32) up to the sequencing read length (150 in my case), with the most common step sizes being 1 or 2.

Since that step size would take a prohibitive amount of time and storage space (k = 32 to 150 in steps of 2 is 60 k-values per accession, or nearly 2,000 assemblies in total), I need to determine how best to incorporate varying k-values into my final meta-assembly. I have assumed that assembly time is similar across k-values (which may not be true), but here are two strategies that would seemingly take about the same amount of time:

  1. Select a single accession whose genotype is most representative of the entire population and assemble it with many k-values at a small step size (1-2). My guess is that this would bias the final assembly toward isoforms specific to that genotype, but would reduce the bias associated with k-mer-specific assembly artifacts.

  2. Use a significantly larger step size (say k = 32, 91, 150), but generate assemblies for all genotypes. My guess is that this would be the opposite: genotype-specific bias would be reduced at the cost of more k-specific artifacts.

Based on your experience, do you have any suggestions?

kmnip commented 3 years ago

Hi @mtmcgowan,

With Trans-ABySS version 2, you don't need many small increments in k. For each accession (i.e., the 3 replicates combined), you can probably just use 3 k-mer sizes (e.g., 32, 64, 96) and merge the 3 assemblies together. Finally, you can merge the 33 merged assemblies into a single FASTA file.
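For one accession, that might look like the sketch below (the `acc01_*` names are hypothetical; double-check the `transabyss` and `transabyss-merge` options against `--help` for your installed version):

```bash
# Pool the 3 replicates of one accession (concatenated gzip is valid gzip),
# then assemble at three k-mer sizes.
cat acc01_rep?_1.fq.gz > acc01_1.fq.gz
cat acc01_rep?_2.fq.gz > acc01_2.fq.gz
for k in 32 64 96; do
    transabyss --pe acc01_1.fq.gz acc01_2.fq.gz \
        --kmer "$k" --outdir "acc01_k${k}" --name acc01 --threads 24
done

# Merge the three k-specific assemblies for this accession.
transabyss-merge acc01_k32/acc01-final.fa acc01_k64/acc01-final.fa acc01_k96/acc01-final.fa \
    --mink 32 --maxk 96 --out acc01-merged.fa --prefixes k32. k64. k96.
```

The 33 per-accession `acc*-merged.fa` files can then be concatenated and de-duplicated as described earlier in this thread.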

ZehaoLi666 commented 1 year ago

Hi @kmnip,

Is it necessary to use 3 k-mer sizes before the merge step? Can I use just 1 or 2 k-mer sizes for assembly?

kmnip commented 1 year ago

You can use as many or as few k-mer sizes as you want.