paul-bio opened 2 years ago
Are these assemblies much larger than the estimated genome size?
Hi @chhylp123
It seems this species has a genome size of about 1.8 Gb.
Here is a summary of the assembly results (summary statistics attached as screenshots):
- hap1.p_ctg.fa
- hap2.p_ctg.fa
- dip.p_ctg.fa
Looks like the coverage information was inferred incorrectly by hifiasm. Could you please have a try with the methods listed here (https://hifiasm.readthedocs.io/en/latest/faq.html#why-the-size-of-primary-assembly-or-partially-phased-assembly-is-much-larger-than-the-estimated-genome-size)?
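For reference, the fix in that FAQ amounts to telling hifiasm the homozygous coverage peak directly instead of letting it guess; a minimal sketch, where the read file, output prefix, and the peak value of 30 are placeholders:

```
# --hom-cov pins the homozygous read-depth peak (take it from the k-mer
# histogram, e.g. GenomeScope) so hifiasm's purging thresholds are sane
hifiasm -o asm -t 32 --hom-cov 30 hifi.fq.gz
```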
Thanks @chhylp123.
I will rerun hifiasm in two ways (-s 0.5 and -s 0.45).
But since this species has a highly heterozygous genome, should I change the -l parameter as well?
And can you tell me what the default value of -l is?
Thanks again. From Paul.
Sorry for the late reply. The default value is 1. You can have a try adjusting it, since rerunning hifiasm with the bin files should be very fast.
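Concretely, a rerun just means invoking hifiasm again with the same output prefix; a sketch, with the read file and the `asm` prefix being placeholders:

```
# hifiasm picks up asm.*.bin from the previous run and skips error
# correction/overlapping, so only the assembly stage is redone
hifiasm -o asm -t 32 -l3 -s 0.45 hifi.fq.gz
```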
Meanwhile, I tried twice (-s 0.5 and -s 0.45) and here are the results.
When running with -s 0.45, I got haplotigs that were relatively similar in size. But this species is expected to have a genome size of 1.8~1.9 Gb, and it seems duplications are not merged enough when comparing the diploid stats with the haploid stats.
In this case, should I run with -l 3?
I also uploaded a log file, just in case: nohup.out.txt.
Sorry for the late reply. Could you have a try with https://github.com/dfguan/purge_dups? The default value for `-l` is 3.
Previously you mentioned the default value for `-l` is 1.
It is a bit confusing: is the default value for `-l` 1 or 3?
Thanks for the reply, @chhylp123.
It's my bad. The default value for `-l` is 3.
Thanks.
Meanwhile, I ran hifiasm 11 times with different parameters, and it seems it would be best if I purge duplications from the p_ctg.fa file (the diploid assembly).
In this case, can I run purge_dups on p_ctg.fa (diploid) rather than hap1.p_ctg.fa (haplotype 1)?
Yes. Please note that when you align HiFi reads, it would be better to utilize both `p_ctg.gfa` and `a_ctg.gfa`.
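In other words, compute read depth against the combined primary plus alternate contigs before purging the primary set. A sketch of a standard purge_dups round under that setup, with all file names (hifi.fq.gz, the `asm` prefix) hypothetical:

```
# pull FASTA out of the GFAs (the S-lines carry the sequences)
awk '/^S/{print ">"$2"\n"$3}' asm.p_ctg.gfa > p_ctg.fa
awk '/^S/{print ">"$2"\n"$3}' asm.a_ctg.gfa > a_ctg.fa
cat p_ctg.fa a_ctg.fa > both.fa

# map HiFi reads to both assemblies so coverage is split between haplotypes
minimap2 -x map-hifi -t 32 both.fa hifi.fq.gz | gzip -c > reads.paf.gz
pbcstat reads.paf.gz                 # writes PB.base.cov and PB.stat
calcuts PB.stat > cutoffs

# self-align the primary contigs, then purge them
split_fa p_ctg.fa > p_ctg.split.fa
minimap2 -x asm5 -DP -t 32 p_ctg.split.fa p_ctg.split.fa | gzip -c > self.paf.gz
purge_dups -2 -T cutoffs -c PB.base.cov self.paf.gz > dups.bed
get_seqs -e dups.bed p_ctg.fa        # writes purged.fa and hap.fa
```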
Thank you so much for your suggestions.
I will try it and let you know the results.
Hello @chhylp123,
When using purge_dups, I need the alternative assembly (hap_asm). However, the alternative files are only produced when the --primary option is given.
In issue #243, you said `.p_utg.gfa` is an alternative contig file.
Can I use `.p_utg.fa` as `a_ctg.fa` (the alternative assembly hap_asm)?
Thanks.
No, `p_utg.gfa` is the assembly graph. You should use `a_ctg.gfa` as the alternative assembly.
Hi @chhylp123, I got different results.
When I ran hifiasm using the command below,
$ hifiasm -l1 -s 0.5 --hg-size 1.9g --primary
I got `*a_ctg.fa` (alternative contigs) and `*p_ctg.fa` (primary contigs), both of which were used for purge_dups.
I got BUSCO values of C:76.1%, S:71.4%, D:4.7%, F:8.2%, M:15.7%.
However, when I reran hifiasm without the --primary option and used `*p_ctg.fa` and `*p_utg.fa` for purge_dups (just for test purposes), I got C:93.3%, S:78.3%, D:15.0%, F:3.8%, M:2.9%.
Is it okay for me to use `*p_utg.fa` as the alternative contigs?
Purge_dups is a little bit tricky to run. There are a lot of issues in the purge_dups repo discussing how to select appropriate parameters, so you need to take care with that. In addition, could you check the coordinates of those duplicated genes and manually filter out the wrong duplications? BUSCO is not that accurate in some cases.
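One way to take care of the parameters is to read the depth cutoffs off the coverage histogram yourself rather than trusting calcuts blindly; a sketch, where the numeric cutoffs are placeholders to be replaced with values from your own histogram:

```
# plot the read-depth histogram from the pbcstat output
# (hist_plot.py ships in purge_dups/scripts)
hist_plot.py -c cutoffs PB.stat PB.cov.png

# override the automatic cutoffs: -l low depth bound, -m the
# haploid/diploid transition, -u high depth bound (values illustrative)
calcuts -l 5 -m 25 -u 90 PB.stat > cutoffs_manual
```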
Hello Paul,
I'm also about to assemble a species with a very large genome. This species has a 17 Gb genome, sequenced at 30×, but so far the sequencing company has returned only 2 cells of data, about 4×.
I want to try assembling this genome now, but I'm worried my server is not up to it. It has 1 TB of RAM and 80 threads.
hifiasm is currently running; with only this 4× data it is already using 626 GB of memory, and I'm not sure whether it will keep growing. I'm worried whether the server will cope once I use the full 30× HiFi data.
Could you tell me how much memory you used for your assembly?
Many thanks.
4x coverage might take more RAM than 30x. 1TB should be fine.
Hi chhylp123,
Does this mean our server can assemble this 17 Gb species once all the 30× data are returned? What about adding Hi-C data? Our Hi-C data is 100×.
Many thanks.
I think it should be fine. Low coverage confuses hifiasm so that it cannot identify the right parameters, making the memory requirement extremely large.
Hello, I recently performed a de novo genome assembly using hifiasm with HiFi sequencing data.
First, thank you for letting us use this wonderful tool. But when I ran it for the first time, I got a lot of duplicated genes in dip.p_ctg.fa (the diploid assembly file).
Here are the results I got from the BUSCO analysis:
- hap1.p_ctg.fa, with 22,298 contigs
- hap2.p_ctg.fa, with 16,918 contigs
- dip.p_ctg.fa, with 39,216 contigs
It seems genes are lost in hap1.p_ctg.fa, but when I look at dip.p_ctg.fa I get a reliable result.
So I ran purge_haplotigs on the dip.p_ctg.fa file, and this is what I got.
(P.S. since my samples were small, I pooled tissue from 6 individuals for HiFi sequencing.)
Do you have any suggestions? It would be best if I could use the result from dip.p_ctg.fa...
Since a closely related species has a heterozygosity rate of 2.41%, should I rerun with -l3 and lower the -s value from 0.75 to 0.55?
Thanks anyway.