chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Old vs new version gives different result (HiFi and HiC) #194

Open B10inform opened 2 years ago

B10inform commented 2 years ago

Hi, there is a huge difference in the haplotype 1 and haplotype 2 output between versions.

Hifiasm version 0.15.4-r347: hifiasm -o xx.asm --primary --n-perturb 20000 --f-perturb 0.15 --seed 11 -l3 --n-weight 6 -s 0.55 -k 60 --h1 .fastq.gz --h2 .fastq.gz .fastq
HAP1: 362911 HAP2: 365427

Hifiasm version 0.16.1-r375: hifiasm -o xx.asm --primary --n-perturb 20000 --f-perturb 0.15 --seed 11 -l3 --n-weight 6 -s 0.55 -k 60 --h1 .fastq.gz --h2 .fastq.gz .fastq
HAP1: 344083 HAP2: 382213

What could be the reason? Could this be looked into?

Thanks
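To put numbers on the imbalance described above, here is a minimal sketch that compares the two reported haplotype sizes for each version. The sizes are copied from the post as given (the post does not state their units); the `imbalance` helper is a hypothetical name introduced only for this illustration.

```python
# Haplotype sizes exactly as reported in the post (units are not stated there).
reported = {
    "0.15.4-r347": {"hap1": 362911, "hap2": 365427},
    "0.16.1-r375": {"hap1": 344083, "hap2": 382213},
}


def imbalance(hap1: int, hap2: int) -> dict:
    """Return the combined size, the absolute gap, and the gap as a fraction of the total."""
    total = hap1 + hap2
    diff = abs(hap1 - hap2)
    return {"total": total, "diff": diff, "fraction": diff / total}


for version, sizes in reported.items():
    stats = imbalance(**sizes)
    print(
        f"{version}: total={stats['total']} diff={stats['diff']} "
        f"({stats['fraction']:.1%} of total)"
    )
```

The combined size barely changes between versions (728338 vs 726296), while the gap between the haplotypes grows from 2516 to 38130. That pattern would be consistent with sequence moving from one haplotype to the other rather than being gained or lost.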

chhylp123 commented 2 years ago

Does your sample have sex chromosomes? What are the expected sizes of the two haplotypes?

B10inform commented 2 years ago

It is a diploid genome, and the expected size is around 360 Mb. Hifiasm version 0.15.4-r347 seems to give good output, but the newer version does not.

chhylp123 commented 2 years ago

So there are no sex chromosomes that could lead to different sizes for the two haplotypes? Unbalanced haplotypes are always caused by mispositioned centromeres, which is very tricky. The changes are as follows:

(1) Version 0.16.0 introduces a new error correction method, so the contigs tend to be longer and resolve more repeats.
(2) Version 0.16.1 determines homologous pairs by both all-vs-all contig alignment and Hi-C weight, while previous versions only consider contig alignments.

Could you please check whether v0.16.0 gives two balanced haplotypes? I'd like to figure out which change leads to this issue. Of course, if you can share the bin files with us, it will be more helpful. Thank you in advance.

B10inform commented 2 years ago

Hi chhylp123,

Both v0.16.0 and v0.16.1 give me similar results. The .bin files are too big to send here. Can you give me your email? Best

chhylp123 commented 2 years ago

It's hcheng@jimmy.harvard.edu. Thank you so much.

B10inform commented 2 years ago

Hi,

Hope you got the bin files.

B10inform commented 2 years ago

Hi chhylp123,

Were you able to look into the bin files? I sent them through WeTransfer.

Best

chhylp123 commented 2 years ago

Sorry, I missed it. Could you please send me *.ec.bin, *.ovlp.reverse.bin, *.ovlp.source.bin, *.hic.lk.bin and *.hic.tlb.bin? I only got one bin file, so I cannot reproduce the results.
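Before uploading, it may help to confirm that all five requested files are present. The sketch below is only an illustration: the suffix list is taken from the request above, while the helper name `missing_bin_files`, the assumption that each file is named `<prefix>.<suffix>`, and the example prefix `xx.asm` (from the `-o xx.asm` option in the original command) are assumptions of this example.

```python
from pathlib import Path
from typing import List

# Suffixes taken from the request above (*.ec.bin, *.ovlp.reverse.bin, ...).
BIN_SUFFIXES = [
    "ec.bin",
    "ovlp.reverse.bin",
    "ovlp.source.bin",
    "hic.lk.bin",
    "hic.tlb.bin",
]


def missing_bin_files(prefix: str, directory: str = ".") -> List[str]:
    """Return the expected '<prefix>.<suffix>' files that are absent from 'directory'."""
    base = Path(directory)
    expected = [f"{prefix}.{suffix}" for suffix in BIN_SUFFIXES]
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    # 'xx.asm' mirrors the -o prefix used in the commands earlier in the thread.
    missing = missing_bin_files("xx.asm")
    print("missing:", missing if missing else "none")
```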

B10inform commented 2 years ago

I have sent the .bin files. Hope you got them.

chhylp123 commented 2 years ago

Thanks a lot. I will reach out to you soon.

B10inform commented 2 years ago

Hi chhylp123, any updates?

chhylp123 commented 2 years ago

Sorry for the late reply. I have checked the results, but it is tricky to say which one is right. I'm thinking of debugging in two ways: 1) Could you please check the Hi-C heatmap like this: https://github.com/baozg/phased-assembly-check? As your genome is not too large, it probably won't take too much time. 2) Another way is to look at the k-mer plot using KAT or Merqury. These tools can tell you if there are 2-copy regions, and where they are. 2-copy regions should be the redundancies that need to be fixed.

B10inform commented 2 years ago

"1) could you please check the Hi-C heatmap"

I need the assembled contigs and Hi-C reads for the Hi-C map.

Hap2 vs raw fastq image

Merqury plot

image

chhylp123 commented 2 years ago

@B10inform Sorry for the late reply. The k-mer plot does not look too bad. Could you share the bin files with me again? I can probably run purge_dups on each haplotype and find potential duplicated regions.

B10inform commented 2 years ago

Hi chhylp123,

Which bin files do you want me to send? There are .lk.bin, .tlb.bin, reverse.bin, source.bin and ec.bin. Or all of them?

chhylp123 commented 2 years ago

Could you please share all bin files? Sorry I just deleted them on my side.

B10inform commented 2 years ago

Hi chhylp123, I have sent them through wetransfer.

What do you think about building the Merqury hap-mer dbs (as used for trios) from read sequences extracted from the raw HiFi data (original .fastq files), using the Hap1 (HG:A:p) and Hap2 (HG:A:m) information from the GFA files?

Thanks

image
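For readers following along, the extraction step described in this comment could look roughly like the sketch below. It assumes that read placements are stored on tab-separated `A` lines of the GFA, with the read name in the fifth column and the haplotype label in an `HG:A:` tag; that layout is an assumption here and should be checked against your own GFA files. The function name `reads_by_haplotype` is hypothetical. The sketch only groups read names by tag; it says nothing about whether the resulting evaluation is meaningful.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def reads_by_haplotype(gfa_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Group read names by the value of their HG:A: tag (e.g. 'p', 'm', 'a')."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for line in gfa_lines:
        if not line.startswith("A\t"):
            continue  # only 'A' lines describe read placements (assumed layout)
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed line, skip it
        read_name = fields[4]  # assumed: fifth column holds the read name
        for field in fields[5:]:
            if field.startswith("HG:A:"):
                groups[field[len("HG:A:"):]].add(read_name)
                break
    return dict(groups)


if __name__ == "__main__":
    # Tiny made-up example; read and contig names are illustrative only.
    example = [
        "S\tutg000001l\t*\tLN:i:30000",
        "A\tutg000001l\t0\t+\tread_1\t0\t15000\tid:i:0\tHG:A:p",
        "A\tutg000001l\t9000\t-\tread_2\t0\t14000\tid:i:1\tHG:A:m",
    ]
    for hap, names in sorted(reads_by_haplotype(example).items()):
        print(hap, sorted(names))
```

The read names collected for each label could then be used to pull the matching records out of the original .fastq files with whatever FASTQ tool you prefer.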

chhylp123 commented 2 years ago

Thanks a lot. For the Merqury plot, it seems you are using hifiasm's phasing results to evaluate hifiasm's own phased assemblies, so it probably makes little sense.

B10inform commented 2 years ago

Since I don't have the parental reads, what would be the best reads to use?

Thanks

chhylp123 commented 2 years ago

I just have no idea if it makes sense in practice...

chhylp123 commented 2 years ago

Hi @B10inform, I was wondering if you could also share the bin files of v0.16.1 with me? It seems the weird assemblies were generated by v0.16.1. Thank you in advance.

B10inform commented 2 years ago

Hi chhylp123,
Were you able to run purge_dups on each haplotype and find potential duplicated regions?

I have sent the v0.16.1 bin files.

chhylp123 commented 2 years ago

@B10inform, may I ask how to decompress V0.16.1.asm.ec.bin*? I merged the parts with cat and then decompressed the merged file. However, I got the warning "extra bytes at beginning or within zipfile".

B10inform commented 2 years ago

Did you try zcat?

B10inform commented 2 years ago

Hi chhylp123,

Were you able to run purge_dups on the haplotypes (Hifiasm version 0.15.4-r347) and look at potential duplicated regions?

Thanks

chhylp123 commented 2 years ago

Sure. I will try it this weekend.

B10inform commented 2 years ago

If it is OK, could you share the software, protocol, etc. that you use to look at potential duplicated regions?

Thanks

B10inform commented 2 years ago

Hi chhylp123, these are the plots I get from purge_dups. They look weird to me. What do you think about them?

image image

chhylp123 commented 2 years ago

It looks OK. What I will do is find all potential overlaps between contigs, and then check these overlaps one by one to see if some of them are false duplications.

chhylp123 commented 2 years ago

So it is pretty tricky...

B10inform commented 2 years ago

Hi chhylp123,

Any updates? Thank you.

chhylp123 commented 2 years ago

Sorry, I have too many things going on... I will reach out to you on Thursday.

chhylp123 commented 2 years ago

@B10inform Sorry for the late reply... I think I found something. Let me put the results together.

B10inform commented 2 years ago

Hi chhylp123, I was wondering if you were able to find what it was.

Thanks

chhylp123 commented 2 years ago

Sorry for the late reply. For 0.15.4, it is OK. As for 0.16.1, hifiasm placed two homologous contigs in hap2, so hap2 is larger. To debug this, I find all-vs-all overlaps within the hap2 assembly using minimap2. In this case, you can find a long overlap between two contigs (one of these two contigs should be reassigned to the hap1 assembly). It is not easy to fix directly on my side since I don't have the Hi-C reads. If you follow the post here (https://github.com/baozg/phased-assembly-check), I think it should be easy to fix.
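As a rough illustration of the debugging step described above, the sketch below scans minimap2 output in PAF format for long overlaps between two different contigs of the same haplotype assembly. The exact minimap2 options used in the thread are not given; a self-alignment along the lines of `minimap2 -x asm5 -DP hap2.fa hap2.fa > hap2.self.paf` is an assumption here, as are the file names, the function name `long_overlaps`, and the 1 Mb default threshold.

```python
from typing import Iterable, List, NamedTuple


class Overlap(NamedTuple):
    query: str
    target: str
    query_len: int
    target_len: int
    aln_len: int  # alignment block length (PAF column 11)


def long_overlaps(paf_lines: Iterable[str], min_aln_len: int = 1_000_000) -> List[Overlap]:
    """Return overlaps between distinct contigs with at least 'min_aln_len' aligned bases.

    The threshold is an arbitrary illustrative default, not a recommended value.
    """
    hits: List[Overlap] = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        query, target = fields[0], fields[5]
        if query == target:
            continue  # ignore a contig aligned to itself
        aln_len = int(fields[10])
        if aln_len >= min_aln_len:
            hits.append(Overlap(query, target, int(fields[1]), int(fields[6]), aln_len))
    # Longest overlaps first: these are the candidates to inspect by hand.
    return sorted(hits, key=lambda o: o.aln_len, reverse=True)


if __name__ == "__main__":
    import sys

    # Usage (file name is hypothetical): python long_overlaps.py hap2.self.paf
    with open(sys.argv[1]) as handle:
        for hit in long_overlaps(handle):
            print(hit.query, hit.target, hit.aln_len, sep="\t")
```

A contig pair that shows up near the top of this list within a single haplotype is a candidate for the kind of misplacement described above. Which of the two contigs belongs in the other haplotype still has to be decided from independent evidence such as the Hi-C heatmap.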