chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Error in genome consensus #140

Closed tbrown91 closed 3 years ago

tbrown91 commented 3 years ago

Hello,

We have a collaborator who is working on gene annotation and gene loss, and they found a number of genes that were falsely annotated as lost or mutated in the hifiasm output. Mapping the CCS reads back to the assembly, we see that these are often single-bp insertions in homopolymer runs that are supported by only a small number of reads. I ran hifiasm in Hi-C mode, so I have both haplotypes; the version was 0.15.1-r331.

I have attached two examples with single errors which are present in both haplotypes, and a third location where there are a number of insertions in one haplotype and not the other.

Is there a way to improve the consensus calling using hifiasm's built-in arguments?

First location hap1: image hap2: image

Second location hap1: image hap2: image

Finally, an example where there are a number of single-bp insertions in the reference, this time present only in hap1. Hap2 appears to have the correct consensus sequence. hap1: image hap2: image

Any advice would be much appreciated.

Many thanks,

Tom

chhylp123 commented 3 years ago

Just to make sure: do you mean there is a consensus error here, and the assembly has base 'G'? image

tbrown91 commented 3 years ago

Yes, so in the assembly there is an extra G inserted at this position, which is supported by only a few reads.

chhylp123 commented 3 years ago

Just curious: why is non-G right and G wrong?

tbrown91 commented 3 years ago

If we are to believe PacBio, then the remaining sequencing errors are mostly those in homopolymer stretches. In this case we have 28 reads with 7 G's and 5 reads with 8 G's, which is why we believe the assembly should have 7. On a biological level, this created a gene mutation that was not present in an Illumina assembly.
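To make the tally above concrete, here is a minimal sketch of the majority-vote reasoning. It is only an illustration of the argument, not how hifiasm computes its consensus, and the function name `majority_run_length` is made up for this example.

```python
from collections import Counter

def majority_run_length(run_lengths):
    """Return the most common homopolymer length and the fraction of reads supporting it."""
    counts = Counter(run_lengths)
    length, support = counts.most_common(1)[0]
    return length, support / len(run_lengths)

# Read support reported in this thread: 28 reads with 7 G's, 5 reads with 8 G's.
observed = [7] * 28 + [8] * 5
length, fraction = majority_run_length(observed)
print(length, round(fraction, 2))  # 7 0.85
```

With 28 of 33 reads agreeing, the 7-G run is the majority call, while the assembly carries the 8-G minority length.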

chhylp123 commented 3 years ago

I see, thanks a lot. It looks like a somatic mutation? Hifiasm is haplotype-aware and this mutation is supported by enough reads. In this case hifiasm thinks 7 G's is one haplotype and 8 G's is another. So I guess it is not a consensus error. Could you please show the corresponding subgraph of these reads in p_utg or r_utg?

tbrown91 commented 3 years ago

I guess it looks pretty clean in the graphs

p_utg: image

r_utg: image

lh3 commented 3 years ago

These look like consensus errors, but that is the best the current version of hifiasm can do. We will try to improve, but it will take time. For now, you may consider using Illumina reads to fix only long homopolymers in easy regions. That works in theory; in practice, it can be very tricky.
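A rough sketch of how one might restrict polishing candidates to long homopolymers: scan a contig for runs of identical bases above a length cutoff. The cutoff `min_len=7` and the function name `long_homopolymers` are assumptions for illustration; neither comes from hifiasm or from this thread.

```python
from itertools import groupby

def long_homopolymers(seq, min_len=7):
    """Yield (start, base, length) for each run of identical bases of at least min_len."""
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            yield pos, base, length
        pos += length

# Toy contig with one 8-base G run starting at 0-based position 4.
contig = "ACGT" + "G" * 8 + "TTAC" + "A" * 3
print(list(long_homopolymers(contig)))  # [(4, 'G', 8)]
```

The resulting intervals could then be used to limit which polishing edits are accepted. Deciding what counts as an "easy region" is a separate problem that this sketch does not address.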

PS: you may also try to fix these based on HiFi alignment. Be careful of haplotype differences. This is also tricky.
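One hedged way to be careful about haplotype differences is to ask whether the minority read count is plausible for a heterozygous site before correcting it. The sketch below assumes reads from both haplotypes are pooled at the site and that a germline heterozygous allele would appear in about half of them. It does not cover somatic variants, and the function name `binom_lower_tail` is hypothetical.

```python
from math import comb

def binom_lower_tail(k, n, p=0.5):
    """Probability of observing k or fewer successes in n Bernoulli(p) trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Counts from this thread: 5 of 33 reads carry the extra G.
minor_reads, total_reads = 5, 33
p_value = binom_lower_tail(minor_reads, total_reads)
print(f"{p_value:.1e}")
```

A very small value means that 5 of 33 reads is unlikely under the 50/50 assumption, which argues against a simple heterozygous site. It does not by itself prove a consensus error.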

tbrown91 commented 3 years ago

Thank you for the update. We will go forward with polishing to try and remove these errors. We will report back if these errors go away in future releases. Thank you for the tool and hard work!