chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

high hamming error with OmniC results #506

Open LHG-GG opened 1 year ago

LHG-GG commented 1 year ago

Hi @chhylp123

I have two assemblies: one with HiFi data only (50x) and one with HiFi (60x) + OmniC (30x). I expected the addition of OmniC to improve the hamming error compared to the HiFi-only assembly, but the hamming errors of the two assemblies are very similar. There is no issue with the OmniC data, as >70% of reads are cis reads at >=10 kb. Adding ULONT (10x) to the HiFi + OmniC data improves the hamming error significantly, as expected.

Are the HiFi + OmniC results as expected?

Metric (%)      HiFi_Only   HiFi + OmniC   HiFi + ULONT + OmniC
SwitchError     5.439       5.538          4.8572
HammingError    31.646      31.301         5.164

The log files look fine: HiFI_Only.txt HiFi_OmniC.txt

Q2: I see that a newer version was released last week. I have completed an assembly with 0.19.5-r593 in trio mode; do you recommend rerunning with the latest version with --trio-dual enabled? This is the same sample as in the attached log files.

Thank you

chhylp123 commented 1 year ago

How do you calculate hamming errors? If the parental data is correct and hifiasm is wrong, the hamming error rate might be high, while the switch error rate should be smaller than 1%. However, the switch error rates of your assemblies are quite high, so I feel like there might be some other issue.

'--trio-dual' is designed for samples whose trio-binning assemblies are significantly bad. If your trio-binning assemblies are not bad, it is not necessary to rerun hifiasm.

LHG-GG commented 1 year ago

>>How do you calculate hamming errors?

They were calculated using yak trioeval with parental short reads. I multiplied the values from the yak output by 100 to express them as % error. For example, below is the result for HiFi_Only:

W       255837  4703599 0.054392
H       1488675 4704121 0.316462
N       2780149 1923981 0.408998
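For anyone reproducing the percentages in the table, here is a minimal Python sketch of the conversion described above. It assumes the W and H lines have the layout shown (tag, error count, total count, rate), which is inferred from the numbers in this post (the last column equals the second divided by the third) rather than taken from the yak documentation. The function name `parse_trioeval` is made up for illustration.

```python
def parse_trioeval(text):
    """Return percent error rates keyed by tag ('W' = switch, 'H' = hamming).

    Assumed layout per line: tag, error count, total count, rate.
    Other lines (for example 'N') are skipped.
    """
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[0] in ("W", "H"):
            errors, total = int(fields[1]), int(fields[2])
            # Recompute from the counts instead of trusting the rounded rate.
            rates[fields[0]] = 100.0 * errors / total
    return rates


# Summary lines copied from the HiFi_Only result in this thread.
summary = """W       255837  4703599 0.054392
H       1488675 4704121 0.316462
N       2780149 1923981 0.408998"""

rates = parse_trioeval(summary)
print(f"switch error:  {rates['W']:.3f}%")
print(f"hamming error: {rates['H']:.3f}%")
```

Recomputing from the counts, rather than multiplying the printed rate by 100, avoids carrying over the rounding in the last column.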

For the same sample, the switch and hamming errors of the trio-phased assembly are less than 1%.

I am trying to understand what went wrong with HiFi + OmniC results compared to HiFi only.

Thank you

chhylp123 commented 1 year ago

The Hi-C reads you were using for hifiasm are:

--h1 /mnt/IBM-Spectrum/Project/ARG/Prasad/Assembly_benchmark/HiFi_ULONT_OmniC/30/O_30_R1.fq.gz --h2 /mnt/IBM-Spectrum/Project/ARG/Prasad/Assembly_benchmark/HiFi_ULONT_OmniC/30/O_30_R1.fq.gz

I guess for --h2, you should take O_30_R2.fq.gz, instead of O_30_R1.fq.gz, right?

LHG-GG commented 1 year ago

That is correct. The parser that generated the hifiasm command needs to be corrected.
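A small pre-flight check in the command generator could catch this kind of mix-up. The following is only a sketch in Python: the function name `check_hic_inputs`, the output prefix `out`, and the HiFi input name `hifi.fq.gz` are hypothetical, and it only compares the values given to --h1 and --h2 as plain strings.

```python
import shlex


def check_hic_inputs(command):
    """Return a list of problems found in the --h1/--h2 arguments of a command string."""
    tokens = shlex.split(command)
    values = {}
    for flag in ("--h1", "--h2"):
        if flag in tokens:
            idx = tokens.index(flag)
            if idx + 1 < len(tokens):
                values[flag] = tokens[idx + 1]
    problems = []
    if len(values) == 1:
        problems.append("only one of --h1/--h2 is set")
    elif len(values) == 2 and values["--h1"] == values["--h2"]:
        problems.append(f"--h1 and --h2 point to the same file: {values['--h1']}")
    return problems


# Hypothetical commands; only the Hi-C file names come from this thread.
bad = "hifiasm -o out --h1 O_30_R1.fq.gz --h2 O_30_R1.fq.gz hifi.fq.gz"
good = "hifiasm -o out --h1 O_30_R1.fq.gz --h2 O_30_R2.fq.gz hifi.fq.gz"
for cmd in (bad, good):
    print(check_hic_inputs(cmd) or "Hi-C inputs look distinct")
```

Running the check before submitting the job would flag the duplicated R1 file without waiting for the assembly to finish.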

From your experience, what switch and hamming error rates should be expected when using HiFi (Revio) + Hi-C data versus parental data (HiFi Revio), evaluated with parental short reads (MGI), for human samples?

Thank you