broadinstitute / pilon

Pilon is an automated genome assembly improvement and variant detection tool
GNU General Public License v2.0
340 stars 60 forks source link

Erroneous expansion of repeats? #71

Closed EarlyEvol closed 6 years ago

EarlyEvol commented 6 years ago

Hi all, I have three Hymenoptera assemblies I am working on with ~30X PacBio, and 50X PE and 40X MP Illumina for each. I generated hybrid assemblies with DBG2OLC and Masurca and an Illumina only assembly with SOAPdenovo (I also tried CANU but these assemblies were garbage probably because of the low coverage) with the intention of merging them. When using PE and MP alignments (bwa mem) to run Pilon on the SOAP or arrow corrected hybrid assemblies, the total assembly lengths jumped up quite a bit for all three (~1.25). From several methods (flow cytometry and kmer estimation), we have determined the genome size of Species 1 to be around 380MB. So the original assemblies are too large*, but after Pilon, they get even bigger. This inflation of assembly length is consistent across all runs of Pilon that I have done.

Species #1

DBG2OLC assembly | n50(kb) | Tot_len(mb) | #scafs backbone | 135 | 472 | 6271 arrow | 133 | 465 | 6271 post_Pilon | 182 | 597 | 6271

SOAP assembly | n50(kb) | Tot_len(mb) | #scafs Raw_scafs | 260 | 582 | 29736 post_Pilon |327 | 712 |  29736

The Pilon command I used was: java -Xmx300G -jar ~/jars/pilon-1.22.jar --fix all,breaks --genome $ref \ --bam ../PE_align.min250/$SAMPLE.$assembly.bwa.PE.srt.rmd.bam \ --jumps ../MP_align.min250/$SAMPLE.$assembly.bwa.MP.srt.rmd.bam \ --tracks --changes --threads $threads --output $SAMPLE.$assembly.$kmer_dir.pilon

I used "breaks" thinking that it could break scaffolds which are erroneous...I should have read the instructions :( but the total gap % goes down, so I dont think that's the culprit. I ran Pilon w/out the MP data and the same inflation occurred.

I have done homology searches and the vast majority of assembly length is from my species (680MB), which is 1.8X too large. When looking at a dot plot of before and after Pilon, it seems that the extra bases are from expansions of repeats.

When visualizing BWA alignments of PE data to the corrected assembly, the repetitive regions (mapping qual 0) have about 1/2 the coverage as the unique regions (mapping qual >40). I'm not sure about the behavior of BWA when MQ=0 but from what I understand it should randomly assign a position, right? Does this low coverage indicate that the repetitive regions are erroneously expanded?

Does Pilon take into account the length of Gaps when trying to assemble over them? The SOAP assembly is 30% gap. Should I remove Ns before running Pilon to reduce this behavior? This wont effect the expansions of the hybrid assemblies though.

This inflation is very consistent from every Pilon correction I have done. This is across three species with 3 different assembly methods. Am I missing something big here? Are there parameters I should use that would bias pilon to collapsing repeats? Was I supposed to suppress alignments with MQ=0? My species are very AT biased at about 28%GC. Would this reduction in uniqueness wreak havoc on the local reassemblies?

Thanks for Pilon and thanks in advance for any advice! Earl

*One interesting thing that has come of this: only assemblies of species 1 create files with too much sequence from the get go. The other two are about the correct size before Pilon, then inflate ~1.25 times. Does anyone know of genome characteristics which would lead to this outcome?

EarlyEvol commented 6 years ago

Ah, I believe I was mistaken.

Am I correct in thinking "fix break: Backbone_3543|arrow|arrow:26032-26709 25777 -1101 +5679 OpenedGap" is saying that the "breaks" function tried to fix some inconstancy by removing 1101 bases and adding 5679 bp, some of them Ns.

When looking at the change log, it seems like much of the extra sequence is from "fix breaks."

sum bases added by break resolution

$grep "fix break" change.log | grep -v "NoSolution" |awk -F " " '{sum +=$6} END {print sum}' 181239016

sum bases removed by break resolution

$grep "fix break" change.log | grep -v "NoSolution" |awk -F " " '{sum +=$5} END {print sum}' -41446364

This indicates that ~140 MB were added in the "fix breaks" process. I'm not sure if this is strictly true, but after inspecting a bunch of the file, it should at least be close. The assembly this is based on inflated by 132MB during pilon.

I'm running PIlon again without breaks now.

Earl

EarlyEvol commented 6 years ago

Ok, I solved the problem....and the answer is embarrassing. All the data was generated before I got to the lab, it turns out that my MP data was generated by the Nextera mate-pair protocol. This has a bunch of PE contamination and, more importantly, has a biotin stuffer (or junction) sequence in many reads. This indicates that the sequence is in fact from the correct side of the circle but needs to be trimmed. It should be a palindrome of the Nextera tagmentation seq. Having the full palindrome means (most likely) that two ends of a tagmentation fragment were ligated together, as opposed to another fragment or a stub during the second step of the library prep.

Anywho, this lead to having the same palindrome in about 1/2 of all my MP reads, which absolutely wreaked havoc on the local reassembles, especially during the experimental Break fixing.

Hope my pain and embarrassment help someone skip this problem.

Earl