chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

Assembling a social insect species that has only one set of chromosomes and lacks paternal chromosomes (like a gamete genome) #287

Open yshcai opened 2 years ago

yshcai commented 2 years ago

I assembled a haploid insect genome using hifiasm v0.16.1-r375. Note that my species has only one set of chromosomes and lacks paternal chromosomes, because it develops from an unfertilized egg, which gives rise to fewer than one hundred offspring. These offspring each have one set of chromosomes. This biological phenomenon is parthenogenesis, which is common in social insects such as bees. To reduce the heterozygosity rate, I sequenced the offspring that developed from one unfertilized egg, so I expect the sample to be nearly homozygous. I also performed k-mer analysis with jellyfish (k=17) and ran GenomeScope2 in both haploid mode and diploid mode (plots: haploidy_linear_plot, diploidy_linear_plot).

The results indicate that this sample has low heterozygosity, and the estimated genome size is 121 Mb.

However, k-mer analysis of the Illumina data gave an estimated genome size of 135 Mb, with a single peak at depth 64.

OK, I am not too concerned about the difference between the Illumina short-read and HiFi results. I assembled this haploid genome with hifiasm (hifiasm -o Mcin_WL302.asm --primary -t32 -l0 Mcin_WL302.ccs.fastq.gz), and the log file is hifiasm_asm.log.

I think the k-mer plot in the log file looks weird, and I have some questions:

1) I notice another, much smaller peak at 30, but I don't know whether it corresponds to heterozygous read coverage, because hifiasm prints [M::ha_pt_gen] peak_hom: 121; peak_het: -1.

2) Which homozygous read coverage should I select? The log file reports a new homozygous read coverage after each round of read correction: [M::ha_ft_gen] peak_hom: 117; peak_het: -1, [M::ha_pt_gen] peak_hom: 114; peak_het: 30, [M::ha_pt_gen] peak_hom: 115; peak_het: -1, [M::ha_pt_gen] peak_hom: 121; peak_het: -1, [M::ha_pt_gen] peak_hom: 121; peak_het: -1. I tried both --hom-cov 117 and --hom-cov 121, and the primary assembly size is ~150 Mb in both cases. The BUSCO evaluation also gave the same result for both: C:99.5%[S:97.0%,D:2.5%],F:0.1%,M:0.4%,n:1367. So how should I tune options such as -D and -l?
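For readers who want to compare the peak estimates across correction rounds, the peak lines quoted above can be pulled out of the log with a short script. This is only a sketch: the regular expression is inferred from the five lines quoted in this post, and `parse_peaks` and `example_log` are hypothetical names, not part of hifiasm.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official description of the hifiasm log format).
PEAK_RE = re.compile(
    r"\[M::(ha_\w+)\]\s+peak_hom:\s*(-?\d+);\s*peak_het:\s*(-?\d+)"
)


def parse_peaks(log_text):
    """Return a list of (stage, peak_hom, peak_het) tuples found in log_text."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in PEAK_RE.finditer(log_text)
    ]


# The five peak lines quoted in the question above.
example_log = """\
[M::ha_ft_gen] peak_hom: 117; peak_het: -1
[M::ha_pt_gen] peak_hom: 114; peak_het: 30
[M::ha_pt_gen] peak_hom: 115; peak_het: -1
[M::ha_pt_gen] peak_hom: 121; peak_het: -1
[M::ha_pt_gen] peak_hom: 121; peak_het: -1
"""

peaks = parse_peaks(example_log)
for round_index, (stage, hom, het) in enumerate(peaks):
    # peak_het == -1 means no heterozygous peak was reported on that line.
    het_label = "none reported" if het == -1 else str(het)
    print(f"line {round_index}: {stage} peak_hom={hom} peak_het={het_label}")
```

To run it on a real log, read the file (for example `open("hifiasm_asm.log").read()`) and pass the text to `parse_peaks`.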

I am quite confused. Any advice would be much appreciated.

chhylp123 commented 2 years ago

From my point of view, the assembly does not look bad. Could you please let me know which metrics you think are not good? The peak_hom selected by hifiasm itself should be correct, so you don't need to change it. As for the assembly size, the size estimated from reads often tends to be smaller. Based on the BUSCO scores, I guess 150 Mb might be correct.
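To make the BUSCO figures quoted in the question easier to reason about, the summary string can be converted into approximate gene counts. The sketch below assumes the short-summary layout exactly as it appears in this thread; `parse_busco` and `approx_count` are hypothetical helper names, and the counts are rounded from percentages, so they may differ by one from what BUSCO itself reports.

```python
import re

# Layout assumed from the summary string quoted in this thread.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Parse a BUSCO one-line summary into percentages plus the total n."""
    m = BUSCO_RE.search(summary)
    if m is None:
        raise ValueError("not a recognisable BUSCO summary: %r" % summary)
    result = {key: float(m.group(key)) for key in "CSDFM"}
    result["n"] = int(m.group("n"))
    return result


def approx_count(percent, n):
    """Approximate number of BUSCO genes behind a percentage (rounded)."""
    return round(percent * n / 100)


busco = parse_busco("C:99.5%[S:97.0%,D:2.5%],F:0.1%,M:0.4%,n:1367")
for key in "CSDFM":
    print(key, busco[key], "% ~", approx_count(busco[key], busco["n"]), "genes")
```

A low duplicated fraction (D) alongside a high complete fraction (C) is the kind of evidence the reply above relies on when judging that the ~150 Mb assembly is plausible.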

yshcai commented 2 years ago

Thank you for your reply! The assembly size is 152 Mb and the contig N50 is 8.8 Mb with option -l0, so the result looks good. What confuses me is that the k-mer plot in the log file has a small peak at 30. I don't know whether this is real heterozygous read coverage, because hifiasm does not actually identify it as a heterozygous peak (peak_het: -1). If this small peak is a heterozygous peak, I think I shouldn't use option -l0, which is suited to homozygous samples like CHM13. For homozygous samples, there should be a single peak around the read coverage. I just want to know what causes this small peak. Maybe it's an error.
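One way to check independently whether a secondary peak such as the one at 30 is present is to look for local maxima in a k-mer count histogram. The sketch below is a generic local-maximum search, not hifiasm's own peak-calling method, and the histogram is synthetic data built only to illustrate a large peak at 121 with a much smaller one at 30; `find_peaks`, `min_cov`, and `window` are hypothetical names.

```python
def find_peaks(hist, min_cov=5, window=3):
    """Return coverages that are strict local maxima of a k-mer histogram.

    hist maps coverage -> number of distinct k-mers with that coverage.
    A coverage counts as a peak if its value is positive and strictly greater
    than every neighbour within +/- window. Coverages below min_cov are
    skipped because very low coverage is usually dominated by error k-mers.
    """
    peaks = []
    for cov in sorted(hist):
        if cov < min_cov:
            continue
        count = hist[cov]
        neighbours = [
            hist.get(cov + d, 0) for d in range(-window, window + 1) if d != 0
        ]
        if count > 0 and all(count > n for n in neighbours):
            peaks.append(cov)
    return peaks


# Synthetic histogram (NOT real data): a tall triangular bump centred on 121
# and a bump ten times smaller centred on 30.
synthetic_hist = {
    c: max(0, 1000 - 50 * abs(c - 121)) + max(0, 100 - 10 * abs(c - 30))
    for c in range(1, 200)
}

print(find_peaks(synthetic_hist))
```

Real histograms are noisy, so a larger `window` or some smoothing would usually be needed before a small bump can be interpreted with any confidence.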

chhylp123 commented 2 years ago

It might be somatic mutations or remaining heterozygous regions. In most cases, genomes are not fully homozygous.

yshcai commented 2 years ago

I see. Thanks a lot!