Open kevfengler227 opened 3 years ago
above are the corrected reads from the p_ctg.noseq.gfa aligned to the assembly
Here is a view of a whole chromosome. 4 "bad" reads are causing most of the trouble
Does these reads consist of large number of homopolymer errors so hifiasm failed to correct them?
The portion of the "bad read" appears to have low complexity, but not necessarily just homopolymers. This portion is always at the beginning of the read and only spans a ~100-200 bp. The reads at these loci are not getting heavily corrected. It looks the same with HiFi or corrected. Here is a zoomed in example
Interestingly, the remaining consensus error appears to be associated with the beginning edge of a tiling read. Just looked at dozens of other examples that look like this (and INDEL immediately before a read)???
Maybe the "bad read" clusters are just extreme examples of this edge error effect?
There is a weird GA...TC thing going on at this sites?
I guess all these bad reads of assembly should be there but cannot be corrected, right?
yes, the contig assembly is correct, except for the consensus error brought in by the portion of the bad read
Thanks a lot. In the above examples, are these reads roughly aligned to their corresponding positions at p_ctg? For example, if we extract read A from the position 5000 of a contig and align A back to the contig, is the alignment position of A still around 5000?
Yes, the coordinates are the same.
May I ask is the last line of these two examples a read? If it is a read, looks like there are multiple reads that consent with each other, but hifiasm selects the only wrong one?
yes. All of these are reads from the p_ctg.gfa, and the bottom read is the one that agrees with consensus, but the rest of the reads from the tiling path do not. I think these are just extreme examples of other edge examples above. Let me see how this looks if I correct the consensus and re-align these reads
I guess the rationales of all these examples are similar. For now hifiasm does not have a consensus step when generating contigs. I will try to add it this week. Thanks for point out this problem, which is very helpful for us. BTW, is it possible that you can share some small example data with us? If not I can also find examples by myself. Thank you again for your great help.
OK, I'll gather up some small read sets that reproduce the problem.
All the examples above are cases where the purity of the SNP/INDEL is >95%. I am not even trying to address the cases where it is ambiguous, such as the length of a homopolymer.
Again, it is always adjacent to an edge read, so I don't think it is a generalized consensus issue but a contigging issue.
Thanks!
In the example above with 16bp deletion, none of the reads have that deletion, so how did arise? That's why I wondering if it is contigging issue, rather than consensus issue (at least for the other type of errors I am seeing).
It seem that 16-bp gap (TGTACAAGGGGAAGTC) occurs twice, I guess the overlap to the bottom read just has one copy, which is wrong.
Nice catch! This does appear to be a generalized feature of deletion errors, preceding a read edge. I can find a lot of examples of this.
However, I cannot see an obvious repeat unit in the case of an insertion prior to a read edge.
insertion = AGAAGCTCATAAAAGCCAATGAAGTACCTTTGAACTTCCAGT
Would it be helpful if I just send you pairs of reads that have INDELS adjacent to edge, including the "bad read" scenario that I was initially pursuing?
Yean, that's already very helpful. Thank you a lot! I will let you know when I fix this problem. For the new example, it is hard to say. I need to have a look at some examples to see what really happens, Low complexity introduces weird things.
@kevfengler227 We have made a new release v0.13 which is able to fix most such problems. Besides, the new version supports to output bad regions with low quality in BED format and their related reads (see '--pb-range'). I guess it might be useful for polishing.
Awesome. Thanks! I have polished dozens of assemblies over the past few weeks. I'll do a side-by-side comparison with the latest version and let you know how it goes. How was the issue arising?
Thanks. This is mostly because hifiasm doesn't have a polishing step, it just concatenates corrected reads. However, there are still a few reads that cannot be fully corrected. When hifiasm generates contig from reads, sometimes it randomly selects a bad read. If that happens, all other fully corrected reads do not support the contig. In this example, hifiasm selects the top read to generate contig, which is the only bad one. For now hifiasm does it very carefully, I guess most of these problems have been fixed (although it might be still a few). Here are the QVs of the nearly homologous CHM13 cell line:
v0.12: 54.159 v0.13: 54.726
I am not sure I fully understand. The issue seems to be limited to the ends of reads in the tiling path. This would imply that only the ends of reads cannot be fully corrected? I have not seen any of the above examples occurring in the middle of a read. Also, the reads in question look the same prior to correction. So perhaps the root issue is that large variations at the ends of reads are not being corrected? That is, the end of the read is just really bad.
Yes, the contig in this example should be generated following the blue line, but it follows the red line. So only the top bad read supports the contigs. The root issue is definitely that some reads cannot be fully corrected, and I agree with you that the ends of reads are more difficult to be corrected.
@kevfengler227 We have rereleased v0.13 to output BED files in default and changed the '--pb-range' to '--lowQ'. '--pb-range' is a little bit confused...
OK, thanks. I'll check it out. The latest release does indeed make things a lot better! My polishing days are over.
Great! Thanks for helping us to improve hifiasm : )
So I still see examples where an INDEL error is adjacent to the end of the read that. This seems to be a different type of error than you previous addressed. The deletion is not in any of the reads, so I am not sure how it arises.
Interestingly this consensus error represents a 24bp INDEL, so somehow an additional 24 bp from the start of the end of the read is being added upstream.
Thanks a lot. Does lowQ.bed contain this 24bp INDEL?
yes, the region adjacent to, but not including the INDEL has a score of "92" in the BED file. The read that is flanking the INDEL has a lot of mismatches, so it is definitely a poor read. But I still don't understand how such an INDEL could arise.
This particular example seems more like a case of an error in converting reads in the tiling path into a consensus??
Actually, it would make more sense to show the corrected reads aligned to the assembly
Here the lowQ region appears unrelated to the INDEL. It is associated with a long homopolymer of ambiguous length. The ambiguity of length does correspond to ~24 bp. When the tiling path of reads is converted to a consensus sequence are the actual coordinates used or are they estimated?
Note that the read that is upstream of the INDEL does not contain the ambiguity of the other reads.
OK, this does seem to be a generalized issue where ambiguity within a read creates an INDEL at the start of that read. Here is another example of 6 bp INDEL created.
here is the root cause of the INDEL offset
Interesting...Thanks again. Does v0.13 has more such errors in comparison to v0.12? I guess there might be fewer consensus errors in v0.13. To fix this type of errors, probably hifiasm needs to perform the real consensus in the next version. Let me have a look on Maize assembly first.
well, a consensus step wouldn't fix the ambiguous homopolymer issue. It is only the false INDEL offset I am hoping to fix because that seems like a bug. How could it arise? I don't know how that is being done.
The way I do the polishing takes longer than the assembly, so hopefully that can be avoided, or hopefully you have a faster way in mind (I do minimap2 alignment followed by SNP/INDEL calling from mpileup and then changing FASTA). It is probably not the worth the computation time, but for a few assemblies I try to see how far I can push the consensus accuracy.
No it has the same amount of this type of error in v13 and v12. But the false INDEL upstream of a tiling path read like this appear to be last systematic error I can find. Besides that there are just ambiguous homopolymers which cannot be fixed until HiFi improves.
I see. Let me have a try with maize to see how far we can go. By the way, is it useful if hifiasm output the problematic regions so that we can only polish those small regions, instead of the whole assembly? In this case, the polishing time might be minor.
maybe, but in this case the lowQ region appears to be the root cause of the INDEL offset, but it does not include the INDEL because it occurs upstream of the tiling path read. But if the underlying consensus issue cannot be fixed, then a localized polish sounds like a very interesting solution (if the region is expanded a bit beyond the lowQ region).
You are referring to the B73 HiFi sample that PacBio released, right? I'll assemble that too and point you to some examples.
Yeas, thanks a lot. Looking forward your examples.
OK, I have the examples. Do you want to take this discussion off-line? What is your email?
Thanks, KF
Of course. My email is: hcheng@jimmy.harvard.edu.
So the problem solved finally?
Best, Kun
I guess it is not too bad after v0.14.
It appears that much of the error in hifiasm assemblies is associated with unsupported reads integrated into the tiling path. They are found by aligning the original HiFi reads or corrected reads to the assembly and looking for clusters of SNPs/INDELs. How could this be addressed within hifiasm? How does this read survive correction?
Thanks, KF