VGP / vgp-assembly

VGP repository for the genome assembly working group

Clarification in documentation: Polishing phased assemblies #60

Closed DustinSokolowski closed 2 years ago

DustinSokolowski commented 2 years ago

Hello!

Thank you for this unparalleled pipeline and resource. I've been following the VGP pipeline for my own de novo trio-binned assembly with linked reads and Hi-C data, and I was hoping for some clarification on the genome polishing step.

I'm able to get the code working. However, after reading your documentation and papers from the last three years, I'm still unable to determine which 10X data you use in the longranger-freebayes step of polishing.

For trio-binning, I have long reads of the F1, and 10X linked reads for the F1, maternal, and paternal animals.

Do you:
1) Keep the maternal/paternal trio-binned genomes separate and polish the maternal genome with the maternal 10X reads?
2) Haplotype the F1's 10X reads and polish the maternal and paternal genomes separately?
3) Combine the maternal/paternal genomes in the final step and polish with freebayes H1 and H2 before getting a consensus?
4) Another combination I haven't thought of?

It would be hugely helpful if you could clarify this for me and perhaps add it to the documentation for others thinking of the same issue.

Thank you so much and I'm sorry if I missed it.

Best, Dustin

gf777 commented 2 years ago

Hello Dustin,

Thanks for reaching out. Even if you have parental data, they are only useful for contigging (to initially separate the haplotypes) or for evaluation purposes; they should NOT be used for any of the polishing. Note that one of the two haplotypes of each parent is NOT inherited, so polishing with parental reads would leave you with a hybrid of 4 haplotypes in total.

The way this works in VGP is that you combine the two haplotypes, map all your F1 reads to both simultaneously, and call variants. This reduces the chances of introducing haplotype switches. We also suggest checking out Merfin, which we recently introduced to improve the quality of the polishing process: https://github.com/arangrhie/merfin

DustinSokolowski commented 2 years ago

Thank you for the quick reply and for saving me a world of future headaches. One last question I have is about combining the haplotypes. Do you simply mean making sure the scaffolds have unique identifiers and catting them, or is there a more formal type of combining? I think it’s just catting, but I figured better safe than sorry.

I’ll definitely check out Merfin! Thank you for the suggestion.

gf777 commented 2 years ago

Hello Dustin, you are right: just append a suffix to the scaffold names so that your variants will be uniquely assigned to one scaffold per haplotype, and then cat them into a single FASTA. Good luck!
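
For anyone finding this later, here is a minimal Python sketch of that suffix-and-concatenate step. It is not part of the VGP pipeline; the function name `combine_haplotypes`, the file names, and the `_mat`/`_pat` suffixes are hypothetical choices for illustration. It assumes plain (uncompressed) FASTA input and treats the first whitespace-delimited token of each header as the scaffold ID.

```python
def combine_haplotypes(inputs, out_path):
    """Concatenate FASTA files, appending a per-haplotype suffix to each sequence ID.

    inputs: list of (fasta_path, suffix) pairs (hypothetical example:
            [("maternal.fasta", "_mat"), ("paternal.fasta", "_pat")]).
    Returns the renamed sequence IDs in the order they were written.
    """
    seen = set()
    renamed = []
    with open(out_path, "w") as out:
        for fasta_path, suffix in inputs:
            with open(fasta_path) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line.startswith(">"):
                        # Header line: the ID is the first whitespace-delimited token.
                        parts = line[1:].split(None, 1)
                        if not parts:
                            raise ValueError(f"empty FASTA header in {fasta_path}")
                        new_id = parts[0] + suffix
                        # Fail loudly if the renamed IDs are still not unique.
                        if new_id in seen:
                            raise ValueError(f"duplicate sequence ID after renaming: {new_id}")
                        seen.add(new_id)
                        renamed.append(new_id)
                        desc = " " + parts[1] if len(parts) > 1 else ""
                        out.write(f">{new_id}{desc}\n")
                    elif line:
                        # Sequence line: copied through unchanged.
                        out.write(line + "\n")
    return renamed
```

Calling `combine_haplotypes([("maternal.fasta", "_mat"), ("paternal.fasta", "_pat")], "combined.fasta")` would write one FASTA in which every scaffold name carries its haplotype suffix, so each variant called against the combined reference maps back to exactly one haplotype.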

DustinSokolowski commented 2 years ago

Thanks so much! I really appreciate it.