Closed tbrown91 closed 3 years ago
Just to make sure: do you mean there is a consensus error here, and the assembly has the base 'G'?
Yes, so in the assembly there is an extra G inserted in this position, which is only supported by a few reads
Just curious: why is non-G right and G wrong?
If we are to believe PacBio, then the remaining sequencing errors are mostly those in homopolymer stretches. In this case we have 28 reads with 7 G's and 5 reads with 8 G's, which is why we believe the assembly should have 7. On a biological level, the extra G creates a gene mutation that was not present in an Illumina assembly.
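The read-support argument above can be written out as a small tally. This is only a minimal sketch: the function name `homopolymer_support` is made up for illustration, and it assumes the homopolymer length seen in each read has already been extracted from the alignments by some other means.

```python
from collections import Counter

def homopolymer_support(run_lengths):
    """Tally per-read homopolymer lengths.

    Returns (majority_length, fraction_of_reads_supporting_it, counts).
    Hypothetical helper, not part of hifiasm or any PacBio tool.
    """
    counts = Counter(run_lengths)
    majority_length, majority_reads = counts.most_common(1)[0]
    return majority_length, majority_reads / len(run_lengths), counts

# Counts from this site: 28 reads with 7 G's, 5 reads with 8 G's.
observed = [7] * 28 + [8] * 5
length, fraction, counts = homopolymer_support(observed)
print(length, round(fraction, 3), dict(counts))  # 7 0.848 {7: 28, 8: 5}
```

With these numbers about 85% of the reads support 7 G's, which is the basis for expecting 7 in the assembly.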
I see, thanks a lot. It looks like a somatic mutation? Hifiasm is haplotype-aware and this mutation is supported by enough reads. In this case hifiasm thinks 7 G's is one haplotype and 8 G's is another. So I guess it is not a consensus error. Could you please show the corresponding subgraph of these reads in p_utg or r_utg?
I guess it looks pretty clean in the graphs
p_utg:
r_utg:
These look like consensus errors, but that is the best the current version of hifiasm can do. We will try to improve this, but it will take time. For now, you could in theory use Illumina reads to fix only long homopolymers in easy regions. In practice, this can be very tricky.
PS: you may also try to fix these based on HiFi alignments. Be careful with haplotype differences. This is also tricky.
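To illustrate why haplotype differences make alignment-based fixing tricky, here is a rough sketch of a per-site decision rule. Everything in it is an assumption for illustration: the function `classify_site`, the `het_fraction` threshold of 0.25, and the second set of counts are hypothetical, and this is not how hifiasm or any particular polisher decides.

```python
def classify_site(length_counts, assembly_length, het_fraction=0.25):
    """Classify a homopolymer site from per-read length counts.

    length_counts: dict mapping homopolymer length -> number of supporting reads.
    assembly_length: homopolymer length present in the assembly.
    het_fraction: assumed cutoff above which a minority length is treated
                  as a possible second haplotype rather than an error.
    """
    total = sum(length_counts.values())
    if total == 0:
        return "no_coverage"
    majority_length = max(length_counts, key=length_counts.get)
    if assembly_length == majority_length:
        return "keep"
    support = length_counts.get(assembly_length, 0) / total
    if support >= het_fraction:
        # Well-supported minority length: may be a real haplotype difference.
        return "possible_haplotype"
    return "candidate_fix"

# Site from this thread: assembly has 8 G's, reads are 28 x 7 G's and 5 x 8 G's.
print(classify_site({7: 28, 8: 5}, assembly_length=8))   # candidate_fix
# Hypothetical balanced site, where correcting could erase a true variant.
print(classify_site({7: 17, 8: 16}, assembly_length=8))  # possible_haplotype
```

The outcome depends entirely on the chosen threshold, which is part of why this kind of fix needs care.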
Thank you for the update. We will go forward with polishing to try and remove these errors. We will report back if these errors go away in future releases. Thank you for the tool and hard work!
Hello,
We have a collaborator who is working on gene annotation and gene loss and found a number of genes which were falsely annotated as lost or mutated in the hifiasm output. Mapping the CCS reads back to the assembly, we see that these are often single-bp insertions in homopolymer runs, which are only supported by a small number of reads. I ran hifiasm in Hi-C mode, so I have both haplotypes; the version was 0.15.1-r331.
I have attached two examples with single errors which are present in both haplotypes, and a third location where there are a number of insertions in one haplotype and not the other.
Is there a way to improve the consensus calling using hifiasm's built-in arguments?
First location hap1: hap2:
Second location hap1: hap2:
Finally, an example where there are a number of single-bp insertions in the reference, this time only present in hap1. Hap2 looks to have the correct consensus sequence. hap1: hap2:
Any advice would be much appreciated.
Many thanks,
Tom