large number of 1bp or small contigs

osilander commented 1 year ago

I'm trying to assemble a ~2.3 Gbp genome from ~60X ONT reads (latest chemistry and Dorado basecalls). I get a very large number of contigs (40K), but many are 1bp or otherwise quite short, although contig N50 is relatively good (6Mbp) and contig N80 is ~3Mbp. Contiguity falls off right after that, with N95 around 35Kbp, compared to 300-500Kbp N95 for other assemblers that otherwise behave similarly at the top end (raven and nextdenovo). If I filter out contigs shorter than 2Kbp there are still 25K contigs. For the most part, the contigs shorter than 2Kbp should not even exist, as the read set itself has no reads below 2Kbp. Thanks, I'm trying to get a handle on this output; it seems like a very good assembler.

lcoombe commented 1 year ago

Hi @osilander,

Thanks for reaching out! Those shorter contigs are likely primarily due to the tigmint-long step, which detects and cuts the 'goldtigs' (golden path reads, pre-scaffolding) at putative misassemblies/chimeric regions. Depending on where those cuts are made you can end up with these very short sequences, which can be safely filtered out of the assembly.

It is also possible to have sequences shorter than the read lengths because the initial GoldPath stage performs some trimming on reads while generating the goldtigs/golden path (~1X representation of the underlying genome).
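For readers who want to drop those very short sequences, here is a minimal sketch of a length filter in plain Python. It is not part of GoldRush; the function names `parse_fasta` and `filter_by_length` and the choice of threshold are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_len: int
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose sequence is at least min_len bases long."""
    for name, seq in records:
        if len(seq) >= min_len:
            yield name, seq
```

To use it on a real assembly, pass an open file handle to `parse_fasta` and write the surviving records back out; the cutoff (for example 1,000 or 2,000 bp) is a judgment call that depends on the downstream analysis.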

I hope that makes sense - just let us know if you have any other questions!

Thank you for your interest in GoldRush! Lauren

osilander commented 1 year ago

Thanks for the explanation. I was looking a little more into this and found that the contig length distribution seems quite odd. There are many contigs whose lengths are exactly at (or very close to) specific round numbers: 2,000bp, 3,000bp, etc.

This becomes very apparent when you look at the histogram or cumulative curves (see below). For example, I have 40,014 total contigs. 2,059 are between 1,001bp and 1,999bp in length, but 2,673 are exactly 2,000bp in length. Similarly, 3,025 are between 2,001bp and 2,999bp in length, 138 are exactly 3,000bp, and 316 are between 2,999bp and 3,001bp.

My read length distributions are very continuous (ONT 10.4.1, Dorado basecalls). This contig length pattern continues up to approximately 20,000bp: there are unexpected bumps in contig lengths at 4,000bp, 5,000bp, 6,000bp, 7,000bp, etc.
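A quick way to check for this pattern in any assembly is to count how many contig lengths fall exactly on a multiple of 1,000bp. The sketch below assumes the contig lengths have already been collected into a list; the function name `multiples_of_tile` and its `tile` parameter are made up for this illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def multiples_of_tile(lengths: Iterable[int], tile: int = 1000) -> Dict[int, int]:
    """Count contigs whose length is an exact positive multiple of `tile`.

    Returns a dict mapping length -> number of contigs, sorted by length.
    """
    counts: Counter = Counter()
    for n in lengths:
        if n > 0 and n % tile == 0:
            counts[n] += 1
    return dict(sorted(counts.items()))
```

Comparing these counts with the number of contigs in the neighbouring ranges (for example 1,001bp to 1,999bp) shows whether the round lengths are over-represented.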

There is also a strange drop-off in contigs that are greater than 1,000bp compared to less than 1,000bp (see attached: goldrush-hist.pdf).

Is this possibly something specific to my install (Ubuntu 20.04.5, goldrush v1.0.1)? I get no errors or warnings during assembly. Have you ever seen this before?

jwcodee commented 1 year ago

Hello. The reason you see a lot of contigs at those specific lengths is that the GoldPath module within GoldRush evaluates each read as a series of non-overlapping tiles, each 1,000 bp long by default, with the exception of the last tile. Part of the GoldPath module involves trimming reads based on overlap, and since GoldPath evaluates reads as a collection of tiles, trimming is done by removing whole tiles. With x tiles remaining, the trimmed read will therefore be of length either (x * 1,000 bp), or ((x - 1) * 1,000 bp + length of the last tile) if the last tile is kept.
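The arithmetic can be illustrated with a toy model. This is only a sketch of the length calculation described above, not GoldPath's actual code: the function names and the `drop_front` / `drop_back` parameters are hypothetical, and how GoldPath decides which tiles to remove (based on overlap) is not modelled here.

```python
from typing import List


def tile_lengths(read_len: int, tile: int = 1000) -> List[int]:
    """Split a read into non-overlapping tiles; only the last may be shorter."""
    full, rest = divmod(read_len, tile)
    return [tile] * full + ([rest] if rest else [])


def trimmed_length(
    read_len: int, drop_front: int = 0, drop_back: int = 0, tile: int = 1000
) -> int:
    """Length of a read after removing whole tiles from either end."""
    tiles = tile_lengths(read_len, tile)
    kept = tiles[drop_front : len(tiles) - drop_back]
    return sum(kept)
```

For a 5,300bp read, removing the short last tile leaves exactly 5,000bp, while removing two tiles from the start leaves 3,300bp. Any trim that drops the last tile gives an exact multiple of the tile size, which would be consistent with the bumps at round lengths.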

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your interest in GoldRush!

bcgsc / goldrush

large number of 1bp or small contigs #115