chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads

So many duplications. #292

Open paul-bio opened 2 years ago

paul-bio commented 2 years ago

Hello, I recently performed a de novo genome assembly with hifiasm using HiFi sequencing data.

First, thank you for letting us use this wonderful tool. But when I ran it for the first time, I got a lot of duplicated genes in dip.p_ctg.fa (which seems to be the diploid assembly file).

Here are the results I got from the BUSCO analysis.

For hap1.p_ctg.fa (BUSCO results screenshot), with 22,298 contigs

For hap2.p_ctg.fa (BUSCO results screenshot), with 16,918 contigs

For dip.p_ctg.fa (BUSCO results screenshot), with 39,216 contigs

It seems like genes are lost in hap1.p_ctg.fa, but when I look at dip.p_ctg.fa, I get a reliable result.

So I ran purge_haplotigs on the dip.p_ctg.fa file and got the following (screenshot).

(P.S. Since my samples were small, I pooled tissue from 6 individuals for the HiFi sequencing.)

Do you have any suggestions? It would be best if I could use the result from dip.p_ctg.fa...

Since a related species has a heterozygosity rate of 2.41%, should I rerun with `-l3` and lower the `-s` value from 0.75 to 0.55?

Thanks anyway.

chhylp123 commented 2 years ago

Are these assemblies much larger than the estimated genome size?

paul-bio commented 2 years ago

Hi @chhylp123

It seems this species has a genome size of about 1.8 Gb.

Here is a summary of the assembly results.

For hap1.p_ctg.fa (assembly stats screenshot)

For hap2.p_ctg.fa (assembly stats screenshot)

And for dip.p_ctg.fa (assembly stats screenshot)

chhylp123 commented 2 years ago

Looks like the coverage information was inferred incorrectly by hifiasm. Could you please have a try with the methods listed here (https://hifiasm.readthedocs.io/en/latest/faq.html#why-the-size-of-primary-assembly-or-partially-phased-assembly-is-much-larger-than-the-estimated-genome-size)?
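For example, a rerun along the lines of that FAQ entry might look like the sketch below; the read file, output prefix, genome size, and thread count are placeholders, not values from this thread. `--hg-size` tells hifiasm the expected haploid genome size so coverage can be inferred correctly, and a lower `-s` purges duplicated haplotigs more aggressively.

```sh
# Hedged sketch of the FAQ's suggestion; asm, hifi_reads.fq.gz, 1.8g and
# -t 32 are placeholders. --hg-size fixes the inferred coverage; a lower
# -s makes haplotig purging more aggressive.
hifiasm -o asm -t 32 -s 0.45 --hg-size 1.8g hifi_reads.fq.gz
```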

paul-bio commented 2 years ago

Thanks @chhylp123 .

I will rerun hifiasm in two ways (`-s 0.5` and `-s 0.45`).

But since this species has a highly heterozygous genome, should I change the `-l` parameter as well?

And can you tell me what the default value of `-l` is?

Thanks again. From Paul.

chhylp123 commented 2 years ago

Sorry for the late reply. The default value is 1. You can try adjusting it, since rerunning hifiasm from the bin files should be very fast.
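For example, since the `asm.*.bin` files are reused whenever the `-o` prefix matches an earlier run, sweeping `-l` costs only the graph-construction step. A minimal sketch with placeholder names:

```sh
# Sketch only: asm, hifi_reads.fq.gz and -t 32 are placeholders.
# A matching -o prefix lets hifiasm load the existing asm.*.bin files and
# skip error correction and overlap detection, so each rerun is fast.
for l in 0 1 2 3; do
  hifiasm -o asm -l $l -t 32 hifi_reads.fq.gz
  mkdir -p run_l$l && cp asm.*.gfa run_l$l/  # keep each run's graphs before the next rerun overwrites them
done
```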

paul-bio commented 2 years ago

In the meantime, I ran it twice, with `-s 0.5` and `-s 0.45`, and here is the result:

(comparison screenshot)

When running with `-s 0.45`, I got haplotypes of relatively equal size. But this species is expected to have a genome size of 1.8~1.9 Gb, and it seems that not many duplications are merged when I compare the diploid stats with the haploid stats.

In this case, should I run with `-l 3`?

I also uploaded a log file just in case: nohup.out.txt.

chhylp123 commented 2 years ago

Sorry for the late reply. What about trying https://github.com/dfguan/purge_dups? The default value for `-l` is 3.
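For reference, the standard pipeline from the purge_dups README looks roughly like the sketch below. File names are placeholders, and the `map-hifi` preset assumes a recent minimap2.

```sh
# Hedged sketch of the purge_dups README pipeline; asm.fa and
# hifi_reads.fq.gz are placeholders.

# 1) Read-depth statistics from HiFi alignments.
minimap2 -x map-hifi -t 32 asm.fa hifi_reads.fq.gz | gzip -c - > reads.paf.gz
pbcstat reads.paf.gz                  # writes PB.base.cov and PB.stat
calcuts PB.stat > cutoffs 2> calcuts.log

# 2) Split the assembly and self-align it.
split_fa asm.fa > asm.split.fa
minimap2 -x asm5 -DP -t 32 asm.split.fa asm.split.fa | gzip -c - > asm.split.self.paf.gz

# 3) Purge haplotigs and overlaps, then extract the purged assembly.
purge_dups -2 -T cutoffs -c PB.base.cov asm.split.self.paf.gz > dups.bed 2> purge_dups.log
get_seqs -e dups.bed asm.fa           # writes purged.fa and hap.fa
```

The cutoffs that `calcuts` picks are worth checking against the coverage histogram before trusting the result, since bad cutoffs are a common cause of over- or under-purging.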

paul-bio commented 2 years ago

Previously you mentioned that the default value for -l is 1 (screenshot of the earlier comment).

It is a bit confusing: is the default value for `-l` 1 or 3?

Thanks for reply @chhylp123

chhylp123 commented 2 years ago

It's my bad. The default value for -l is 3.

paul-bio commented 2 years ago

Thanks,

(comparison screenshot)

In the meantime, I ran hifiasm 11 times with different parameters, and it seems it would be best to purge duplications from the p_ctg.fa file (the diploid type).

In this case, can I run purge_dups on p_ctg.fa (diploid) rather than hap1.p_ctg.fa (haplotype 1)?

chhylp123 commented 2 years ago

Yes. Please note that when you align HiFi reads, it would be better to utilize both p_ctg.gfa and a_ctg.gfa.
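In practice that means converting both GFAs to FASTA and aligning the reads against their concatenation, so that reads from the alternate haplotype do not pile onto the primary contigs and inflate their apparent coverage. A sketch with placeholder names; the `awk` one-liner is the usual GFA-to-FASTA conversion from the hifiasm docs:

```sh
# Placeholder names; the awk one-liner extracts sequences from GFA S-lines.
awk '/^S/{print ">"$2; print $3}' asm.p_ctg.gfa > asm.p_ctg.fa
awk '/^S/{print ">"$2; print $3}' asm.a_ctg.gfa > asm.a_ctg.fa
cat asm.p_ctg.fa asm.a_ctg.fa > asm.both.fa

# Align HiFi reads to the combined assembly for the purge_dups coverage step,
# then continue with pbcstat/calcuts as in the pipeline sketched above.
minimap2 -x map-hifi -t 32 asm.both.fa hifi_reads.fq.gz | gzip -c - > reads.paf.gz
```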

paul-bio commented 2 years ago

Thank you so much for your suggestions.

I will try it and let you know the results.

paul-bio commented 2 years ago

Hello @chhylp123 ,

When using purge_dups, I need an alternative assembly (hap_asm).

However, the alternative file is only produced when the --primary option is given. In issue #243, you said .p_utg.gfa is an alternative contig file.

Can I use .p_utg.fa as a_ctg.fa (the alternative assembly, hap_asm)?

Thanks.

chhylp123 commented 2 years ago

No, p_utg.gfa is the assembly graph. You should use a_ctg.gfa as the alternative assembly.

paul-bio commented 2 years ago

Hi @chhylp123, I got different results.

When I ran hifiasm using the command below: `hifiasm -l1 -s 0.5 --hg-size 1.9g --primary`

I got *.a_ctg.fa (alternative contigs) and *.p_ctg.fa (primary contigs), both of which were used for purge_dups.

And I got BUSCO values of C: 76.1%, S: 71.4%, D: 4.7%, F: 8.2%, M: 15.7%.

However, I reran hifiasm without the --primary option and used *.p_ctg.fa and *.p_utg.fa for purge_dups (just for testing purposes).

And I got C: 93.3%, S: 78.3%, D: 15.0%, F: 3.8%, M: 2.9%.

Is it okay for me to use *.p_utg.fa as the alternative contigs?

chhylp123 commented 2 years ago

purge_dups is a little tricky to run. There are many issues in the purge_dups repo discussing how to select appropriate parameters for it, so you need to take care with that. In addition, could you check the coordinates of those duplicated genes and manually filter out any false duplications? BUSCO is not very accurate in some cases.
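One way to do that check, assuming BUSCO v5's output layout (the run directory name below is a placeholder): `full_table.tsv` records each BUSCO's status together with its contig and coordinates, so duplicated hits that land on clearly overlapping haplotigs can be told apart from BUSCO artifacts.

```sh
# Hedged sketch, assuming BUSCO v5 output; busco_out/run_eukaryota_odb10 is
# a placeholder path. Columns: 1=BUSCO id, 2=status, 3=contig, 4=start, 5=end.
awk -F'\t' '$2 == "Duplicated" {print $1, $3, $4, $5}' \
    busco_out/run_eukaryota_odb10/full_table.tsv | sort -k1,1
```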

zhang144999 commented 2 years ago

Hello Paul,

I'm also about to assemble a species with a very large genome. This species has a 17 Gb genome, to be sequenced to 30×. So far the sequencing company has only returned 2 cells of data, about 4×, and I want to try assembling the genome with that. I am worried that my server is not up to it: it has 1 TB of RAM and 80 threads. hifiasm is running now, and this 4× data alone is already using 626 GB of memory, and I am not sure whether it will keep using more. I am worried that the server won't cope if I use the full 30× HiFi data. So, could you tell me how much memory was used when you ran your assembly?

Many thanks.

chhylp123 commented 2 years ago

4× coverage might take more RAM than 30×. 1 TB should be fine.

zhang144999 commented 2 years ago

Hi, chhylp123

Does this mean our server can assemble this 17 Gb species once all the 30× data is returned? What about adding Hi-C data? Our Hi-C data is 100×.

Many thanks.

chhylp123 commented 2 years ago

I think it should be fine. Low coverage confuses hifiasm so that it cannot identify the right parameters, making the memory requirement extremely large.