bad reads in tiling path

kevfengler227 commented 3 years ago

It appears that much of the error in hifiasm assemblies is associated with unsupported reads integrated into the tiling path. They are found by aligning the original HiFi reads or corrected reads to the assembly and looking for clusters of SNPs/INDELs. How could this be addressed within hifiasm? How does this read survive correction?

Thanks, KF

kevfengler227 commented 3 years ago

above are the corrected reads from the p_ctg.noseq.gfa aligned to the assembly

Here is a view of a whole chromosome. 4 "bad" reads are causing most of the trouble

chhylp123 commented 3 years ago

Does these reads consist of large number of homopolymer errors so hifiasm failed to correct them?

kevfengler227 commented 3 years ago

The portion of the "bad read" appears to have low complexity, but not necessarily just homopolymers. This portion is always at the beginning of the read and only spans a ~100-200 bp. The reads at these loci are not getting heavily corrected. It looks the same with HiFi or corrected. Here is a zoomed in example

kevfengler227 commented 3 years ago

Interestingly, the remaining consensus error appears to be associated with the beginning edge of a tiling read. Just looked at dozens of other examples that look like this (and INDEL immediately before a read)???

kevfengler227 commented 3 years ago

Maybe the "bad read" clusters are just extreme examples of this edge error effect?

There is a weird GA...TC thing going on at this sites?

chhylp123 commented 3 years ago

I guess all these bad reads of assembly should be there but cannot be corrected, right?

kevfengler227 commented 3 years ago

yes, the contig assembly is correct, except for the consensus error brought in by the portion of the bad read

chhylp123 commented 3 years ago

Thanks a lot. In the above examples, are these reads roughly aligned to their corresponding positions at p_ctg? For example, if we extract read A from the position 5000 of a contig and align A back to the contig, is the alignment position of A still around 5000?

kevfengler227 commented 3 years ago

Yes, the coordinates are the same.

chhylp123 commented 3 years ago

May I ask is the last line of these two examples a read? If it is a read, looks like there are multiple reads that consent with each other, but hifiasm selects the only wrong one?

kevfengler227 commented 3 years ago

yes. All of these are reads from the p_ctg.gfa, and the bottom read is the one that agrees with consensus, but the rest of the reads from the tiling path do not. I think these are just extreme examples of other edge examples above. Let me see how this looks if I correct the consensus and re-align these reads

chhylp123 commented 3 years ago

I guess the rationales of all these examples are similar. For now hifiasm does not have a consensus step when generating contigs. I will try to add it this week. Thanks for point out this problem, which is very helpful for us. BTW, is it possible that you can share some small example data with us? If not I can also find examples by myself. Thank you again for your great help.

kevfengler227 commented 3 years ago

OK, I'll gather up some small read sets that reproduce the problem.

All the examples above are cases where the purity of the SNP/INDEL is >95%. I am not even trying to address the cases where it is ambiguous, such as the length of a homopolymer.

Again, it is always adjacent to an edge read, so I don't think it is a generalized consensus issue but a contigging issue.

Thanks!

kevfengler227 commented 3 years ago

In the example above with 16bp deletion, none of the reads have that deletion, so how did arise? That's why I wondering if it is contigging issue, rather than consensus issue (at least for the other type of errors I am seeing).

chhylp123 commented 3 years ago

It seem that 16-bp gap (TGTACAAGGGGAAGTC) occurs twice, I guess the overlap to the bottom read just has one copy, which is wrong.

kevfengler227 commented 3 years ago

Nice catch! This does appear to be a generalized feature of deletion errors, preceding a read edge. I can find a lot of examples of this.

However, I cannot see an obvious repeat unit in the case of an insertion prior to a read edge.

insertion = AGAAGCTCATAAAAGCCAATGAAGTACCTTTGAACTTCCAGT

Would it be helpful if I just send you pairs of reads that have INDELS adjacent to edge, including the "bad read" scenario that I was initially pursuing?

chhylp123 commented 3 years ago

Yean, that's already very helpful. Thank you a lot! I will let you know when I fix this problem. For the new example, it is hard to say. I need to have a look at some examples to see what really happens, Low complexity introduces weird things.

chhylp123 commented 3 years ago

@kevfengler227 We have made a new release v0.13 which is able to fix most such problems. Besides, the new version supports to output bad regions with low quality in BED format and their related reads (see '--pb-range'). I guess it might be useful for polishing.

kevfengler227 commented 3 years ago

Awesome. Thanks! I have polished dozens of assemblies over the past few weeks. I'll do a side-by-side comparison with the latest version and let you know how it goes. How was the issue arising?

chhylp123 commented 3 years ago

Thanks. This is mostly because hifiasm doesn't have a polishing step, it just concatenates corrected reads. However, there are still a few reads that cannot be fully corrected. When hifiasm generates contig from reads, sometimes it randomly selects a bad read. If that happens, all other fully corrected reads do not support the contig. In this example, hifiasm selects the top read to generate contig, which is the only bad one. For now hifiasm does it very carefully, I guess most of these problems have been fixed (although it might be still a few). Here are the QVs of the nearly homologous CHM13 cell line:

v0.12: 54.159 v0.13: 54.726

kevfengler227 commented 3 years ago

I am not sure I fully understand. The issue seems to be limited to the ends of reads in the tiling path. This would imply that only the ends of reads cannot be fully corrected? I have not seen any of the above examples occurring in the middle of a read. Also, the reads in question look the same prior to correction. So perhaps the root issue is that large variations at the ends of reads are not being corrected? That is, the end of the read is just really bad.

chhylp123 commented 3 years ago

Untitled Yes, the contig in this example should be generated following the blue line, but it follows the red line. So only the top bad read supports the contigs. The root issue is definitely that some reads cannot be fully corrected, and I agree with you that the ends of reads are more difficult to be corrected.

chhylp123 commented 3 years ago

@kevfengler227 We have rereleased v0.13 to output BED files in default and changed the '--pb-range' to '--lowQ'. '--pb-range' is a little bit confused...

kevfengler227 commented 3 years ago

OK, thanks. I'll check it out. The latest release does indeed make things a lot better! My polishing days are over.

chhylp123 commented 3 years ago

Great! Thanks for helping us to improve hifiasm : )

kevfengler227 commented 3 years ago

So I still see examples where an INDEL error is adjacent to the end of the read that. This seems to be a different type of error than you previous addressed. The deletion is not in any of the reads, so I am not sure how it arises.

Interestingly this consensus error represents a 24bp INDEL, so somehow an additional 24 bp from the start of the end of the read is being added upstream.

chhylp123 commented 3 years ago

Thanks a lot. Does lowQ.bed contain this 24bp INDEL?

kevfengler227 commented 3 years ago

yes, the region adjacent to, but not including the INDEL has a score of "92" in the BED file. The read that is flanking the INDEL has a lot of mismatches, so it is definitely a poor read. But I still don't understand how such an INDEL could arise.

This particular example seems more like a case of an error in converting reads in the tiling path into a consensus??

kevfengler227 commented 3 years ago

Actually, it would make more sense to show the corrected reads aligned to the assembly

Here the lowQ region appears unrelated to the INDEL. It is associated with a long homopolymer of ambiguous length. The ambiguity of length does correspond to ~24 bp. When the tiling path of reads is converted to a consensus sequence are the actual coordinates used or are they estimated?

Note that the read that is upstream of the INDEL does not contain the ambiguity of the other reads.

kevfengler227 commented 3 years ago

OK, this does seem to be a generalized issue where ambiguity within a read creates an INDEL at the start of that read. Here is another example of 6 bp INDEL created.

kevfengler227 commented 3 years ago

here is the root cause of the INDEL offset

chhylp123 commented 3 years ago

Interesting...Thanks again. Does v0.13 has more such errors in comparison to v0.12? I guess there might be fewer consensus errors in v0.13. To fix this type of errors, probably hifiasm needs to perform the real consensus in the next version. Let me have a look on Maize assembly first.

kevfengler227 commented 3 years ago

well, a consensus step wouldn't fix the ambiguous homopolymer issue. It is only the false INDEL offset I am hoping to fix because that seems like a bug. How could it arise? I don't know how that is being done.

The way I do the polishing takes longer than the assembly, so hopefully that can be avoided, or hopefully you have a faster way in mind (I do minimap2 alignment followed by SNP/INDEL calling from mpileup and then changing FASTA). It is probably not the worth the computation time, but for a few assemblies I try to see how far I can push the consensus accuracy.

No it has the same amount of this type of error in v13 and v12. But the false INDEL upstream of a tiling path read like this appear to be last systematic error I can find. Besides that there are just ambiguous homopolymers which cannot be fixed until HiFi improves.

chhylp123 commented 3 years ago

I see. Let me have a try with maize to see how far we can go. By the way, is it useful if hifiasm output the problematic regions so that we can only polish those small regions, instead of the whole assembly? In this case, the polishing time might be minor.

kevfengler227 commented 3 years ago

maybe, but in this case the lowQ region appears to be the root cause of the INDEL offset, but it does not include the INDEL because it occurs upstream of the tiling path read. But if the underlying consensus issue cannot be fixed, then a localized polish sounds like a very interesting solution (if the region is expanded a bit beyond the lowQ region).

kevfengler227 commented 3 years ago

You are referring to the B73 HiFi sample that PacBio released, right? I'll assemble that too and point you to some examples.

chhylp123 commented 3 years ago

Yeas, thanks a lot. Looking forward your examples.

kevfengler227 commented 3 years ago

OK, I have the examples. Do you want to take this discussion off-line? What is your email?

Thanks, KF

chhylp123 commented 3 years ago

Of course. My email is: hcheng@jimmy.harvard.edu.

xiekunwhy commented 3 years ago

So the problem solved finally?

Best, Kun

chhylp123 commented 3 years ago

I guess it is not too bad after v0.14.

chhylp123 / hifiasm

bad reads in tiling path #42