Weird k-mers histogram : How to filter out /exclude k-mers under a depth threshold with hifiasm

bbalog87 commented 2 years ago

Dear developers of this great assembler,

I have a question concerning a weird k-mers histogram with 3 peaks for a definitely diploid fish genome. Please find the attached k-mers histogram as output by hifiasm log: hifiasm_kmers_histo.txt.

As you can see, it has 3 peaks, which shouldn't normally be for the diploid genome. The heterozygosity rate is estimated to be 2.2% as computed by KAT, and the heterozygous peak is at 63, while the homozygous peak is at 131. I have set the --hom-cov accordingly, given the high heterozygosity of this genome. The k-mers with a depth < 25 are most likely from contamination and/or erroneous k-mers. Is it somehow possible to exclude these k-mers through parameters tuning in hifiasm prior to assembly? Hifiasm is predicting the heterozygous k-mer peak at 17, which is wrong since the heterozygous peak is 63 according to jellyfish k-mer histogram: 27mers-histogram_clean_HiFi.pdf

The assembly command :

hifiasm -o AB.asm \
          -t 91 \
      --hg-size 1200m \
      --hom-cov 131 -l3  -s 0.35 \
      --n-perturb 75000 \
      --f-perturb 0.15 \
      --n-weight 5 \
      --h1 $HICreads1 \
      --h2 $HICreads2 \
      $AB

Best, Julien

chhylp123 commented 2 years ago

Did hifiasm successfully produce some contigs?

bbalog87 commented 2 years ago

Yes @chhylp123 , hifiasm produced contigs but with a few issues making the assembly suboptimal:

The assembly size is much larger (~200 Mb) than the estimated genome size (1200 Mb)
The two haplotypes are extremely unbalanced with more than a 25% length difference
High level (>25%) of duplication rate in Hap1 and Primary assemblies

Primary assembly statistics:


     # of contigs:    2176
        Total length (nt):    1367163925
        Longest sequence (nt):    52058640
        Shortest sequence (nt):    10324
        Mean sequence length (nt):    628292
        Median sequence length (nt):    96502
        N50 sequence length (nt):    18173130
        L50 sequence count:    24
  Scores in BUSCO format:    C:97.5%[S:71.3%,D:26.2%],F:0.5%,M:2.0%,n:3640

Hap1 assembly statistics:

  # of contigs:    2742
        Total length (nt):    1376558787
        Longest sequence (nt):    49496490
        Shortest sequence (nt):    10324
        Mean sequence length (nt):    502027
        Median sequence length (nt):    92239
        N50 sequence length (nt):    9898874
        L50 sequence count:   39
     Scores in BUSCO format:    C:97.2%[S:70.1%,D:27.1%],F:0.5%,M:2.3%,n:3640

Hap2 assembly statistics:

        # of contigs:    1773
        Total length (nt):    1015158721
        Longest sequence (nt):    28026528
        Shortest sequence (nt):    10324
        Mean sequence length (nt):    572566
        Median sequence length (nt):    102852
        N50 sequence length (nt):    5249658
        L50 sequence count:    48
     Scores in BUSCO format:    C:91.0%[S:82.1%,D:8.9%],F:0.7%,M:8.3%,n:3640

bbalog87 commented 2 years ago

I think excluding low coverage k-mers (depth <26) from the assembly process could fix the issue of imbalanced haplotypes. However, I'm not figuring out which default hifiasm parameter/option I should set to that end.

lh3 commented 2 years ago

The peak at depth 16 looks strange. It could be contamination. I would extract contigs with ~16X coverage and study what is happening. Alternatively, you can downsample input reads to ~40X coverage and reassemble.

Also note that k-mer based methods often underestimate the genome size. That is not ground truth.

bbalog87 commented 2 years ago

Is the coverage information of contigs recorded in one of the hifiasm output files? Otherwise, I'll need to map HiFi-reads to the assembly to infer the contigs coverage.

chhylp123 commented 2 years ago

Yes, the rd:i in gfa is the coverage information (see https://hifiasm.readthedocs.io/en/latest/interpreting-output.html#output-file-formats).

Jinke233 commented 2 years ago

I have a confusion. The coverage information represents the kmer coverage information. The kmer coverage does not equal the reads coverage information according to the genomescope paper. Could you extract the kmer coverage information directly depending on the reads coverage information?

chhylp123 commented 2 years ago

Sorry for the late reply. By the coverage information, do you mean rd:i? It is the local information of each unitig/contig, while the k-mer information is the global information at the whole genome-scale. I personally think the reads coverage information should be more accurate.

chhylp123 / hifiasm

Weird k-mers histogram : How to filter out /exclude k-mers under a depth threshold with hifiasm #235