chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

unbalanced trio binning assemblies #362

Open paulvhi opened 1 year ago

paulvhi commented 1 year ago

Are there any parameters I can adjust to improve my HiFi trio-binning assemblies? The hap1 assembly is too big (700 Mb vs. 530 Mb), while the hap2 assembly is too small (370 Mb vs. 530 Mb). Coverage is ~72X. BUSCO analysis indicates a lot of duplicates (C:98.9% [S:70.9%, D:28.0%], F:0.4%, M:0.7%, n:3285). I tried decreasing -s, but it didn't change much. The yak trioeval summary for the hap1 assembly is below, and I attached the out file. Two factors that you should be aware of:

  1. The read length distribution is broad (10k-50k) due to library construction with the low-input procedure (i.e., no size selection).
  2. The male parent was sequenced to a depth of ~20X, while the female parent was sequenced to a depth of ~10X.

W 406694 11288140 0.036028
H 670168 11288372 0.059368
N 10281439 1006971 0.089204

hifiasm1.txt

chhylp123 commented 1 year ago

The switch error rate is too high. If the parental data is right but hifiasm did something wrong, the hamming error rate might be higher, but the switch error rate should still be lower than 1%. Could you please pick a few long contigs and double-check the phasing?
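
To make the comparison concrete, here is a minimal Python sketch that re-derives the rates from the summary lines posted above and checks the switch error rate against the 1% ceiling. It assumes the usual yak trioeval convention that the W line reports switch errors and the H line reports hamming errors, each as error count, number of sites, and rate. The function name `parse_trioeval_summary` and the constant `SWITCH_LIMIT` are illustrative and not part of yak or hifiasm.

```python
def parse_trioeval_summary(text):
    """Parse W/H summary lines into {tag: (errors, sites, rate)}.

    Assumes each line is: tag, error count, site count, reported rate.
    The rate is recomputed from the counts rather than trusted as given.
    """
    summary = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] in ("W", "H"):
            errors, sites = int(fields[1]), int(fields[2])
            summary[fields[0]] = (errors, sites, errors / sites)
    return summary


# Summary lines copied from the first comment in this thread.
report = """W 406694 11288140 0.036028
H 670168 11288372 0.059368
N 10281439 1006971 0.089204"""

SWITCH_LIMIT = 0.01  # the 1% ceiling mentioned above (hypothetical constant name)

summary = parse_trioeval_summary(report)
for tag, label in (("W", "switch"), ("H", "hamming")):
    errors, sites, rate = summary[tag]
    print(f"{label} error rate: {errors}/{sites} = {rate:.6f}")

switch_too_high = summary["W"][2] > SWITCH_LIMIT
print("switch error above 1%:", switch_too_high)
```

With the numbers from this issue, the recomputed switch error rate is about 3.6%, well above the 1% ceiling, which is why the parental data and the phasing are worth checking before tuning assembler parameters.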

paulvhi commented 1 year ago

I attached a document with the length and switch frequency for each contig. I also included the trioeval output. This is a highly repetitive, AT-rich insect genome. I hope this helps.

switch error by contig.txt trioeval.txt

chhylp123 commented 1 year ago

Sorry for the late reply. Some contigs have a very large number of hamming errors, like h1tg000005l, h1tg000013l and h1tg000031l. Could you run yak trioeval on the p_utg.gfa and check whether the graph nodes corresponding to these contigs also have a large number of errors? Please see the FAQ here: https://hifiasm.readthedocs.io/en/latest/faq.html#p-hamming. There is also an example here: https://github.com/chhylp123/hifiasm/issues/130#issuecomment-862347943
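
Since yak trioeval reads sequences rather than a graph, the unitigs in the GFA first have to be written out as FASTA. The following is a minimal Python sketch of that conversion, assuming the standard GFA layout in which segment lines start with `S` and carry the name and sequence in the second and third tab-separated columns. The function name `gfa_segments_to_fasta` and the demo records are illustrative only.

```python
def gfa_segments_to_fasta(gfa_lines):
    """Return FASTA text built from the S (segment) lines of a GFA file.

    Assumes tab-separated GFA: column 2 is the segment name and
    column 3 is its sequence. All other line types are skipped.
    """
    records = []
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        records.append(f">{name}\n{seq}\n")
    return "".join(records)


# Tiny made-up GFA excerpt: two segments and one non-segment line.
demo = [
    "S\tutg000001l\tACGT\tLN:i:4\n",
    "A\tutg000001l\t0\t+\tread1\t0\t4\n",
    "S\tutg000002l\tGGCC\n",
]
fasta = gfa_segments_to_fasta(demo)
print(fasta, end="")
```

For a real run, the same function can be given an open file handle for the p_utg.gfa, and the resulting FASTA passed to yak trioeval together with the paternal and maternal k-mer databases. The per-unitig results can then be compared with the contigs listed above.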

paulvhi commented 1 year ago

I wanted to update you on the issues I ran into previously. It seems that combining the two PacBio cells caused the problem: it looks like two different individuals were sequenced. When we assembled the reads from each cell separately, the results were much better. In addition, we sequenced the parents much deeper, which also improved the assembly. See attached. I think it looks really good now.

C.mac_update.pdf

chhylp123 commented 1 year ago

Great! Thanks for letting us know.