chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
542 stars 87 forks source link

HG002 trio assembly: are 2-3% switching error rates for phased haplotypes too high #544

Open liu-xingliang opened 1 year ago

liu-xingliang commented 1 year ago

I applied hifiasm (v0.19.5-r592) to a HG002 30X dataset sequenced using PacBio Revio and got 2-3% switching error rate for primary contigs and two phased haplotypes contigs:

# switch/hamming error rate for primary contigs 
awk '/^S/{print ">"$2;print $3}' HG002.full.asm/HG002.full.asm.bp.p_ctg.gfa > HG002.full.asm/HG002.full.asm.bp.p_ctg.fa
~/softwares/yak/yak trioeval pat.yak mat.yak HG002.full.asm/HG002.full.asm.bp.p_ctg.fa -t 12 > HG002.full.asm/HG002.full.asm.bp.p_ctg.trioeval.out
# tail -n 3 HG002.full.asm/HG002.full.asm.bp.p_ctg.trioeval.out
# W 76954   3168677 0.024286
# H 720838  3168937 0.227470
# N 2110637 1058308 0.333962

# switch/hamming error rate for phased haplotypes:

awk '/^S/{print ">"$2;print $3}' HG002.full.asm/HG002.full.asm.dip.hap1.p_ctg.gfa > HG002.full.asm/HG002.full.asm.dip.hap1.p_ctg.gfa.fa
~/softwares/yak/yak trioeval pat.yak mat.yak HG002.full.asm/HG002.full.asm.dip.hap1.p_ctg.gfa.fa -t 12 > HG002.full.asm/HG002.full.asm.dip.hap1.p_ctg.trioeval.out
# tail -n 3 HG002.full.asm/HG002.full.asm.dip.hap1.p_ctg.trioeval.out
# W 86232   3612758 0.023869
# H 81960   3613135 0.022684
# N 3530950 82206   0.022752

awk '/^S/{print ">"$2;print $3}' HG002.full.asm/HG002.full.asm.dip.hap2.p_ctg.gfa > HG002.full.asm/HG002.full.asm.dip.hap2.p_ctg.gfa.fa
~/softwares/yak/yak trioeval pat.yak mat.yak HG002.full.asm/HG002.full.asm.dip.hap2.p_ctg.gfa.fa -t 12 > HG002.full.asm/HG002.full.asm.dip.hap2.p_ctg.trioeval.out
# tail -n 3 HG002.full.asm/HG002.full.asm.dip.hap2.p_ctg.trioeval.out
# W 61826   1904199 0.032468
# H 48191   1904521 0.025303
# N 48593   1855931 0.025515

My question is: are above switching error rates considered too high. Should they all be below 1% to be safe.

Thank you very much for your help!

Details of the trio-assembly:

# covert parental NGS bams into fastqs:
# paternal NGS data:
# https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/NIST_HiSeq_HG003_Homogeneity-12389378/HG003Run01-13262252/HG003Run01_S1.bam
# maternal NGS data:
# https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/NIST_HiSeq_HG004_Homogeneity-14572558/HG004Run01-15133132/HG004Run01_S1.bam

# samtools --version
# samtools 1.13
# Using htslib 1.13
# Copyright (C) 2021 Genome Research Ltd.
# 
# Samtools compilation details:
#     Features:       build=configure curses=no 
#     CC:             /var/lib/bamboo/bamboo-agent-home/xml-data/build-dir/RBS-KESTREL12THIRDPARTYALL-JOB1/artifacts/thirdparty.prebuilt/gcc/gcc_10.3.0/libc_2.17/gcc_10.3.0/bin/gcc
#     CPPFLAGS:       -I/var/lib/bamboo/bamboo-agent-home/xml-data/build-dir/RBS-KESTREL12THIRDPARTYALL-JOB1/_output/modulebuilds/repos/smrtlink-thirdparty-src/thirdparty.src/zlib/zlib_1.2.11/_output/install/include
#     CFLAGS:         -Wall -g -O2
#     LDFLAGS:        -L/var/lib/bamboo/bamboo-agent-home/xml-data/build-dir/RBS-KESTREL12THIRDPARTYALL-JOB1/_output/modulebuilds/repos/smrtlink-thirdparty-src/thirdparty.src/zlib/zlib_1.2.11/_output/install/lib
#     HTSDIR:         
#     LIBS:           -lz
#     CURSES_LIB:     
# 
# HTSlib compilation details:
#     Features:       build=configure plugins=no libcurl=no S3=no GCS=no libdeflate=no lzma=no bzip2=no htscodecs=1.1.1
#     CC:             /var/lib/bamboo/bamboo-agent-home/xml-data/build-dir/RBS-KESTREL12THIRDPARTYALL-JOB1/artifacts/thirdparty.prebuilt/gcc/gcc_10.3.0/libc_2.17/gcc_10.3.0/bin/gcc
#     CPPFLAGS:       -I/var/lib/bamboo/bamboo-agent-home/xml-data/build-dir/RBS-KESTREL12THIRDPARTYALL-JOB1/_output/modulebuilds/repos/smrtlink-thirdparty-src/thirdparty.src/zlib/zlib_1.2.11/_output/install/include
#     CFLAGS:         -Wall -g -O2 -fvisibility=hidden
#     LDFLAGS:        -L/var/lib/bamboo/bamboo-agent-home/xml-data/build-dir/RBS-KESTREL12THIRDPARTYALL-JOB1/_output/modulebuilds/repos/smrtlink-thirdparty-src/thirdparty.src/zlib/zlib_1.2.11/_output/install/lib -fvisibility=hidden 
# 
# HTSlib URL scheme handlers present:
#     built-in:  preload, data, file
#     crypt4gh-needed:   crypt4gh
#     mem:   mem
samtools sort -O BAM -n -m 8000M -@ 12 /home/UNIXHOME/xliu/datasets/hifiasm_HG002trio/HG003Run01_S1.bam > pat.sorted.bam
samtools sort -O BAM -n -m 8000M -@ 12 /home/UNIXHOME/xliu/datasets/hifiasm_HG002trio/HG004Run01_S1.bam > mat.sorted.bam

# bedtools --version
# bedtools v2.27.1
(bedtools bamtofastq -i pat.sorted.bam -fq pat.1.fq -fq2 pat.2.fq) 2>&1 | tee bamtofastq.pat.log
(bedtools bamtofastq -i mat.sorted.bam -fq mat.1.fq -fq2 mat.2.fq) 2>&1 | tee bamtofastq.mat.log

yak=~/softwares/yak/yak
$yak count -k31 -b37 -t16 -o pat.yak <(cat pat.1.fq) <(cat pat.2.fq)
$yak count -k31 -b37 -t16 -o mat.yak <(cat mat.1.fq) <(cat mat.2.fq)

# hifiasm -v
# 0.19.5-r592

# bam2fastq --version
# bam2fastq 1.0.0 (commit v1.0.0-1-g3f82eff)
# 
# Using:
#   pbbam     : 2.3.0 (commit v2.3.0)
#   pbcopper  : 2.2.0 (commit v2.2.0)
#   boost     : 1.76
#   htslib    : 1.13
#   zlib      : 1.2.11

# HG002 data HiFi data:
# https://downloads.pacbcloud.com/public/revio/2022Q4/HG002-rep1/m84011_220902_175841_s1.hifi_reads.bam 
bam2fastq m84011_220902_175841_s1.hifi_reads.bam -o HG002-rep1.30X -u -j 32
hifiasm -o HG002.30X.asm/HG002.30X.asm -t 32 HG002-rep1.30X.fastq
hifiasm -o HG002.30X.asm/HG002.30X.asm -t 32 -1 pat.yak -2 mat.yak
chhylp123 commented 1 year ago

Yes, it is too high and shouldn't happen. When you ran hifiasm with trio-binning, were you using the following command lines?

hifiasm -o HG002.30X.asm/HG002.30X.asm -t 32 -1 pat.yak -2 mat.yak

Looks like there is no HiFi reads specified for hifiasm?

liu-xingliang commented 1 year ago

Thank you @chhylp123

It's a two commands combo following instruction here to get "both primary/alternate assemblies and trio binning assemblies":

image

My scripts with comments:

hifiasm -o HG002.30X.asm/HG002.30X.asm -t 32 HG002-rep1.30X.fastq # only use HiFi reads to get primary/alternative assemblies
hifiasm -o HG002.30X.asm/HG002.30X.asm -t 32 -1 pat.yak -2 mat.yak # trio binning assemblies
chhylp123 commented 1 year ago

Could you please have a try with:

hifiasm -o HG002.30X.asm/HG002.30X.asm -t 32 -1 pat.yak -2 mat.yak HG002-rep1.30X.fastq

We always need to give a file to hifiasm to make it think there is a HiFi read file.

liu-xingliang commented 1 year ago

@chhylp123 , thank you for comments. I tried to use one-line hifiasm trio assembly below, but switch error rate is same, and there is no *.bp.*.gfa output comparing to previous run:

coverage="full" # 30X
hifiasm=~/softwares/hifiasm/hifiasm
mkdir HG002.$coverage.asm_hifi
$hifiasm -o HG002.$coverage.asm_hifi/HG002.$coverage.asm -t 32 -1 pat.yak -2 mat.yak HG002-rep1.$coverage.fastq /dev/null 2> HG002.$coverage.asm_hifi.trio.log

awk '/^S/{print ">"$2;print $3}' HG002.full.asm_hifi/HG002.full.asm.dip.hap1.p_ctg.gfa > HG002.full.asm_hifi/HG002.full.asm.dip.hap1.p_ctg.gfa.fa
~/softwares/yak/yak trioeval pat.yak mat.yak HG002.full.asm_hifi/HG002.full.asm.dip.hap1.p_ctg.gfa.fa -t 12 > HG002.full.asm_hifi/HG002.full.asm.dip.hap1.p_ctg.trioeval.out

awk '/^S/{print ">"$2;print $3}' HG002.full.asm_hifi/HG002.full.asm.dip.hap2.p_ctg.gfa > HG002.full.asm_hifi/HG002.full.asm.dip.hap2.p_ctg.gfa.fa
~/softwares/yak/yak trioeval pat.yak mat.yak HG002.full.asm_hifi/HG002.full.asm.dip.hap2.p_ctg.gfa.fa -t 12 > HG002.full.asm_hifi/HG002.full.asm.dip.hap2.p_ctg.trioeval.out

tail -n 3 HG002.full.asm_hifi/HG002.full.asm.dip.hap1.p_ctg.trioeval.out
# W 86232   3612758 0.023869
# H 81960   3613135 0.022684
# N 3530950 82206   0.022752

tail -n 3 HG002.full.asm_hifi/HG002.full.asm.dip.hap2.p_ctg.trioeval.out
# W 61826   1904199 0.032468
# H 48191   1904521 0.025303
# N 48593   1855931 0.025515
# previous two-commands run
# hifiasm -o HG002.30X.asm/HG002.30X.asm -t 32 HG002-rep1.30X.fastq
# hifiasm -o HG002.30X.asm/HG002.30X.asm -t 32 -1 pat.yak -2 mat.yak

ls ~/hifiasm_trio/HG002.30X.asm

#  24G Jul 22 08:19 HG002.full.asm.ec.bin
#  12G Jul 22 08:31 HG002.full.asm.ovlp.source.bin
# 9.2G Jul 22 08:37 HG002.full.asm.ovlp.reverse.bin
# 6.5G Jul 22 08:47 HG002.full.asm.bp.r_utg.gfa
# 129M Jul 22 08:47 HG002.full.asm.bp.r_utg.noseq.gfa
#  11M Jul 22 08:49 HG002.full.asm.bp.r_utg.lowQ.bed
# 6.4G Jul 22 08:50 HG002.full.asm.bp.p_utg.gfa
# 127M Jul 22 08:50 HG002.full.asm.bp.p_utg.noseq.gfa
# 9.9M Jul 22 08:52 HG002.full.asm.bp.p_utg.lowQ.bed
# 3.0G Jul 22 08:55 HG002.full.asm.bp.p_ctg.gfa
#  67M Jul 22 08:55 HG002.full.asm.bp.p_ctg.noseq.gfa
# 2.2M Jul 22 08:56 HG002.full.asm.bp.p_ctg.lowQ.bed
# 3.0G Jul 22 10:25 HG002.full.asm.bp.hap1.p_ctg.gfa
#  65M Jul 22 10:25 HG002.full.asm.bp.hap1.p_ctg.noseq.gfa
# 2.7M Jul 22 10:25 HG002.full.asm.bp.hap1.p_ctg.lowQ.bed
# 2.8G Jul 22 10:26 HG002.full.asm.bp.hap2.p_ctg.gfa
#  61M Jul 22 10:26 HG002.full.asm.bp.hap2.p_ctg.noseq.gfa
# 2.2M Jul 22 10:27 HG002.full.asm.bp.hap2.p_ctg.lowQ.bed
#  74M Jul 28 13:00 HG002.full.asm.hap1.phase.bin
#  80M Jul 28 13:00 HG002.full.asm.hap2.phase.bin
# 6.5G Jul 28 13:11 HG002.full.asm.dip.r_utg.gfa
# 128M Jul 28 13:11 HG002.full.asm.dip.r_utg.noseq.gfa
#  11M Jul 28 13:13 HG002.full.asm.dip.r_utg.lowQ.bed
# 6.4G Jul 28 13:14 HG002.full.asm.dip.p_utg.gfa
# 127M Jul 28 13:14 HG002.full.asm.dip.p_utg.noseq.gfa
# 9.9M Jul 28 13:16 HG002.full.asm.dip.p_utg.lowQ.bed
# 2.8G Jul 28 19:00 HG002.full.asm.dip.hap1.p_ctg.gfa
#  62M Jul 28 19:00 HG002.full.asm.dip.hap1.p_ctg.noseq.gfa
# 2.6M Jul 28 19:01 HG002.full.asm.dip.hap1.p_ctg.lowQ.bed
# 2.9G Jul 28 19:02 HG002.full.asm.dip.hap2.p_ctg.gfa
#  64M Jul 28 19:02 HG002.full.asm.dip.hap2.p_ctg.noseq.gfa
# 2.4M Jul 28 19:02 HG002.full.asm.dip.hap2.p_ctg.lowQ.bed

# one-command run:
# hifiasm -o HG002.30X.asm_hifi/HG002.30X.asm -t 32 -1 pat.yak -2 mat.yak HG002-rep1.30X.fastq

ls ~/hifiasm_trio/HG002.30X.asm_hifi
#  56K Oct 26 06:24 ..
#  24G Oct 26 23:25 HG002.full.asm.ec.bin
#  74M Oct 26 23:25 HG002.full.asm.hap1.phase.bin
#  80M Oct 26 23:25 HG002.full.asm.hap2.phase.bin
#  12G Oct 26 23:27 HG002.full.asm.ovlp.source.bin
# 9.2G Oct 26 23:28 HG002.full.asm.ovlp.reverse.bin
# 6.5G Oct 27 13:37 HG002.full.asm.dip.r_utg.gfa
# 128M Oct 27 13:37 HG002.full.asm.dip.r_utg.noseq.gfa
#  11M Oct 27 13:38 HG002.full.asm.dip.r_utg.lowQ.bed
# 6.4G Oct 27 13:39 HG002.full.asm.dip.p_utg.gfa
# 127M Oct 27 13:39 HG002.full.asm.dip.p_utg.noseq.gfa
# 9.9M Oct 27 13:40 HG002.full.asm.dip.p_utg.lowQ.bed
# 2.8G Oct 27 16:35 HG002.full.asm.dip.hap1.p_ctg.gfa
#  62M Oct 27 16:35 HG002.full.asm.dip.hap1.p_ctg.noseq.gfa
# 2.6M Oct 27 16:35 HG002.full.asm.dip.hap1.p_ctg.lowQ.bed
# 2.9G Oct 27 16:37 HG002.full.asm.dip.hap2.p_ctg.gfa
#  64M Oct 27 16:37 HG002.full.asm.dip.hap2.p_ctg.noseq.gfa
# 2.4M Oct 27 16:37 HG002.full.asm.dip.hap2.p_ctg.lowQ.bed
# 2.8G Oct 27 18:04 HG002.full.asm.dip.hap1.p_ctg.gfa.fa
# 3.0M Oct 27 18:10 HG002.full.asm.dip.hap1.p_ctg.trioeval.out
# 2.9G Oct 27 18:11 HG002.full.asm.dip.hap2.p_ctg.gfa.fa

Thank you.

chhylp123 commented 1 year ago

@liu-xingliang Could you please show me the Bandage screenshot for the assembly graph? Could you please evaluate the hamming error rate of p_utg.gfa? If everything is right, the hamming/switch error rate of p_utg.gfa should be very low. In theory, HG002 is a standard sample which shouldn’t have issues. If possible, could you please share the bin files and the yak indexes with us? I will check the details on my side. Thanks so much!