Open bbalog87 opened 2 years ago
Did hifiasm successfully produce some contigs?
Yes @chhylp123 , hifiasm produced contigs but with a few issues making the assembly suboptimal:
Primary assembly statistics:
# of contigs: 2176
Total length (nt): 1367163925
Longest sequence (nt): 52058640
Shortest sequence (nt): 10324
Mean sequence length (nt): 628292
Median sequence length (nt): 96502
N50 sequence length (nt): 18173130
L50 sequence count: 24
Scores in BUSCO format: C:97.5%[S:71.3%,D:26.2%],F:0.5%,M:2.0%,n:3640
Hap1 assembly statistics:
# of contigs: 2742
Total length (nt): 1376558787
Longest sequence (nt): 49496490
Shortest sequence (nt): 10324
Mean sequence length (nt): 502027
Median sequence length (nt): 92239
N50 sequence length (nt): 9898874
L50 sequence count: 39
Scores in BUSCO format: C:97.2%[S:70.1%,D:27.1%],F:0.5%,M:2.3%,n:3640
Hap2 assembly statistics:
# of contigs: 1773
Total length (nt): 1015158721
Longest sequence (nt): 28026528
Shortest sequence (nt): 10324
Mean sequence length (nt): 572566
Median sequence length (nt): 102852
N50 sequence length (nt): 5249658
L50 sequence count: 48
Scores in BUSCO format: C:91.0%[S:82.1%,D:8.9%],F:0.7%,M:8.3%,n:3640
I think excluding low coverage k-mers (depth <26) from the assembly process could fix the issue of imbalanced haplotypes. However, I'm not figuring out which default hifiasm parameter/option I should set to that end.
The peak at depth 16 looks strange. It could be contamination. I would extract contigs with ~16X coverage and study what is happening. Alternatively, you can downsample input reads to ~40X coverage and reassemble.
Also note that k-mer based methods often underestimate the genome size. That is not ground truth.
Is the coverage information of contigs recorded in one of the hifiasm output files? Otherwise, I'll need to map HiFi-reads to the assembly to infer the contigs coverage.
Yes, the rd:i
in gfa is the coverage information (see https://hifiasm.readthedocs.io/en/latest/interpreting-output.html#output-file-formats).
I have a confusion. The coverage information represents the kmer coverage information. The kmer coverage does not equal the reads coverage information according to the genomescope paper. Could you extract the kmer coverage information directly depending on the reads coverage information?
Sorry for the late reply. By the coverage information, do you mean rd:i
? It is the local information of each unitig/contig, while the k-mer information is the global information at the whole genome-scale. I personally think the reads coverage information should be more accurate.
Dear developers of this great assembler,
I have a question concerning a weird k-mers histogram with 3 peaks for a definitely diploid fish genome. Please find the attached k-mers histogram as output by hifiasm log: hifiasm_kmers_histo.txt.
As you can see, it has 3 peaks, which shouldn't normally be for the diploid genome. The heterozygosity rate is estimated to be 2.2% as computed by KAT, and the heterozygous peak is at 63, while the homozygous peak is at 131. I have set the --hom-cov accordingly, given the high heterozygosity of this genome. The k-mers with a depth < 25 are most likely from contamination and/or erroneous k-mers. Is it somehow possible to exclude these k-mers through parameters tuning in hifiasm prior to assembly? Hifiasm is predicting the heterozygous k-mer peak at 17, which is wrong since the heterozygous peak is 63 according to jellyfish k-mer histogram: 27mers-histogram_clean_HiFi.pdf
The assembly command :
Best, Julien