PacificBiosciences / FALCON_unzip

Making diploid assembly becomes common practice for genomic study
BSD 3-Clause Clear License

location of Haplotigs on Primary contigs and read alignment to both #70

Open conte1 opened 7 years ago

conte1 commented 7 years ago

Hi,

I'm trying to figure out the best way to use a final polished (4-quiver) assembly so that reads can be aligned to both Primary and Associated contigs. It seems that in Chin et al. 2016, haplotigs and primary contigs were combined, short reads were then aligned with bwa, and all read mappings were considered, since bwa outputs MAPQ=0 for ambiguously mapped reads.

I've been looking at bwa.kit, and it seems the haplotigs could be used as ALT contigs. However, this requires the location of each ALT contig on the primary assembly, so I'm trying to get this information from the final 4-quiver assembly.

It seems the output file 3-unzip/all_h_ctg_edges provides some of the information about where each haplotig diverges from the primary contigs. Is there an explanation of this file format somewhere? For example, what exactly does the following mean?

000000F_001 000792181:E 002163126:B N OP 46 0 46 0
000000F_001 002163126:B 001857450:E N OP 46 0 46 0

Finally, this information (haplotig to primary contig locations) seems to be lost after the final quiver polishing step, and the contig/haplotig coordinates change during polishing.

Has anyone else tried anything like this before?

Thanks, Matt

pb-jchin commented 7 years ago

Well, I have not tested whether BWA ALT contigs are suitable here, so I can't say for sure, but it is worth trying. The safest bet is to use nucmer or another whole-genome aligner to identify the relationship between haplotigs and primary contigs. You might need to filter the results. Check @gconcepcion's https://github.com/gconcepcion/chain_filter
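
As a rough illustration of the nucmer route, here is a minimal Python sketch that summarises where each haplotig lands on a primary contig. It assumes the polished primary contigs were used as the nucmer reference and the haplotigs as the query, and that the alignments were exported with `show-coords -T -H` (tab-separated, no header: S1, E1, S2, E2, LEN1, LEN2, %IDY, reference name, query name). The function name `haplotig_placements` and the example rows are hypothetical. This is not what chain_filter does; it simply takes, per haplotig, the primary contig with the most aligned reference bases and reports the outer extent of those alignments, so overlapping or spurious hits are not handled.

```python
from collections import defaultdict


def haplotig_placements(coords_lines):
    """Map each query (haplotig) to (reference, start, end, aligned_bases).

    Expects rows in the layout of `show-coords -T -H` (an assumption):
    S1, E1, S2, E2, LEN1, LEN2, %IDY, ref_name, query_name.
    """
    # (haplotig, primary) -> [min start, max end, summed reference bases]
    spans = defaultdict(lambda: [None, None, 0])
    for line in coords_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed rows
        s1, e1, len1 = int(fields[0]), int(fields[1]), int(fields[4])
        ref, qry = fields[7], fields[8]
        lo, hi = min(s1, e1), max(s1, e1)
        rec = spans[(qry, ref)]
        rec[0] = lo if rec[0] is None else min(rec[0], lo)
        rec[1] = hi if rec[1] is None else max(rec[1], hi)
        rec[2] += len1

    # Keep, for each haplotig, the reference with the most aligned bases.
    best = {}
    for (qry, ref), (lo, hi, bases) in spans.items():
        if qry not in best or bases > best[qry][3]:
            best[qry] = (ref, lo, hi, bases)
    return best


# Made-up rows for illustration only.
example_rows = [
    "1000\t5000\t1\t4000\t4001\t4000\t99.50\t000000F\t000000F_001",
    "5200\t9000\t4100\t7900\t3801\t3801\t99.10\t000000F\t000000F_001",
    "300\t800\t1\t500\t501\t500\t92.00\t000003F\t000000F_001",
]
placements = haplotig_placements(example_rows)
```

Because the coordinates come from aligning the polished sequences themselves, they stay valid after quiver, unlike positions taken from 3-unzip intermediate files.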

conte1 commented 7 years ago

Sounds good. I think I'll try to determine the differences between aligning with BWA ALT contigs versus aligning to combined haplotigs+primary contigs. Thanks.
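
One simple way to put a number on that comparison is the fraction of mapped reads that end up with MAPQ=0 under each strategy. The sketch below is only an illustration: it reads uncompressed SAM text (for example, the output of `samtools view -h`), the function name `mapq0_fraction` and the example records are hypothetical, and the choice to skip unmapped, secondary, and supplementary records is an assumption that may need adjusting depending on how the ALT-aware run reports its hits.

```python
def mapq0_fraction(sam_lines):
    """Fraction of mapped primary alignments whose MAPQ is 0."""
    total = 0
    zero = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        flag = int(fields[1])
        mapq = int(fields[4])
        # Skip unmapped (0x4), secondary (0x100), supplementary (0x800).
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if mapq == 0:
            zero += 1
    return zero / total if total else 0.0


# Made-up records for illustration only.
example_sam = [
    "@SQ\tSN:000000F\tLN:10000",
    "r1\t0\t000000F\t100\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\t000000F_001\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t256\t000000F\t100\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
fraction = mapq0_fraction(example_sam)
```

Running this on the SAM output of each alignment strategy gives two fractions that can be compared directly; a large drop in the MAPQ=0 fraction would suggest fewer reads are being treated as ambiguous between a haplotig and its primary contig.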