dfguan / purge_dups

haplotypic duplication identification tool
MIT License

Purge_dups for CanuTrio assemblies #21

Open pcarbone opened 4 years ago

pcarbone commented 4 years ago

Hi Dengfeng, Would you advise to run Purge_dups on haplotype-binned assemblies generated with CanuTrio?

I have independent CanuTrio assemblies of the same individual, generated with either PacBio or ONT reads. For the same haplotype, the structural variation (SV) between the PB and ONT assemblies increases after Purge_dups: the SV difference reported by Assemblytics is 0.2% of the genome before Purge_dups and 1.0% after. Would there be a way to control the haplotype of origin for the contigs that are deduplicated by Purge_dups?

Thanks.

dfguan commented 4 years ago

Hi,

> Would you advise to run Purge_dups on haplotype-binned assemblies generated with CanuTrio?

I think that should depend on the BUSCO scores and the KAT plot (if you have Illumina data) before purging. Do the haplotypes look unclean? After purging, it is also good to check whether purge_dups is overpurging by looking at the BUSCO scores and the KAT plot again.

> Would there be a way to control the haplotype of origin for the contigs that are deduplicated by Purge_dups?

Purge_dups can't deal with the original haplotypes at the moment; it only generates pseudohaplotypes.

> SV difference in Assemblytics before and after Purge_dups is 0.2% and 1.0% of the genome, respectively.

Is there a theoretical number for the SV difference?

Thanks for trying purge_dups, Dengfeng.

pcarbone commented 4 years ago

Thank you for the information, Dengfeng. BUSCO before purging indicated that the assemblies for each haplotype have 10-17% duplication. The KAT plot also supports the duplication.

I tried two strategies for running purge_dups on the trio-binned assemblies. The first was to merge the assemblies of the two haplotypes before purging and split them again by haplotype after purge_dups. The second was to purge the assembly of each haplotype independently, using the corresponding haplotype-binned reads.
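
The merge-then-split bookkeeping in the first strategy could be sketched roughly as follows. This is only an illustration, not part of purge_dups: the `hap|contig` naming scheme, the labels `pat` and `mat`, and the helper names are all hypothetical, and it assumes the purging step leaves the prefix at the start of each contig name intact.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (contig name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def tag_and_merge(assemblies: Dict[str, Iterable[Record]]) -> List[Record]:
    """Prefix each contig with its haplotype label (hypothetical 'hap|name'
    convention) and pool everything into one list for purging."""
    merged = []
    for hap, records in assemblies.items():
        for name, seq in records:
            merged.append((f"{hap}|{name}", seq))
    return merged


def split_by_haplotype(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Undo tag_and_merge: group purged contigs by their haplotype prefix."""
    out: Dict[str, List[Record]] = {}
    for name, seq in records:
        hap, _, original = name.partition("|")
        out.setdefault(hap, []).append((original, seq))
    return out
```

Tagging before the merge is what makes the later split unambiguous, since contig names from two independent Canu runs can collide.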

The first strategy decreased BUSCO duplication to 2.5% but turned out to be too aggressive for one of the haplotypes, whose BUSCO completeness decreased from 97% to 87%, while completeness for the other haplotype remained around 96%.
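
One way to see which haplotype is losing sequence in the merged run is to tally the purged intervals per haplotype. The sketch below makes two assumptions that should be checked against real output: that the purge_dups BED file is tab-separated with contig, start, end and a label (such as HAPLOTIG or OVLP) in the first four columns, and that contig names were prefixed with a haplotype tag like `pat|` or `mat|` before merging (a hypothetical convention, not something purge_dups does).

```python
from collections import defaultdict
from typing import Dict, Iterable


def purged_bases_by_haplotype(
    bed_lines: Iterable[str], sep: str = "|"
) -> Dict[str, Dict[str, int]]:
    """Sum purged bases per haplotype prefix and per label.

    Assumes BED-like lines: contig, start, end, label, [optional extras].
    """
    totals = defaultdict(lambda: defaultdict(int))
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments and anything that does not have the four columns.
        if line.startswith("#") or len(fields) < 4:
            continue
        contig, start, end, label = fields[0], int(fields[1]), int(fields[2]), fields[3]
        hap = contig.partition(sep)[0]  # text before the first separator
        totals[hap][label] += end - start
    return {hap: dict(labels) for hap, labels in totals.items()}
```

A strongly lopsided tally would be consistent with the drop in completeness being concentrated in one haplotype, although it would not by itself show that the wrong copy was removed.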

The second strategy was more conservative: duplication decreased to 5% on each haplotype and completeness stayed around 97%. The problem with this strategy is that some purged contigs might actually correspond to the real haplotype, while contigs from the other haplotype might have been kept instead. This is supported by the fact that the SV difference between the ONT and PacBio assemblies of the same haplotype increased after purging with this strategy. Ideally that difference would be 0%; it was 0.2% before purging and 1% after purge_dups with this second strategy.

Perhaps in this case the assemblies before purge_dups are more informative of the haplotypes despite their duplication.

Best, Pablo

dfguan commented 4 years ago

Hi Pablo,

The duplications seem to be real, since purge_dups brings duplication down from 10% to 5%, but purging causes a large difference. I am wondering whether polishing could fix the difference by pulling back the original sequence for the haplotype. I would suggest polishing the assemblies to see if it works.

Cheers,

Dengfeng

pcarbone commented 4 years ago

Hi Dengfeng,

I see what you mean. I will try polishing with haplotype-binned reads after purge_dups. This should help with small variants (SNVs and indels), but I am not so sure about longer-range SVs between haplotypes. I also agree with you that the duplication should be real; I am just concerned that, for some duplicated contigs, purge_dups removes the copy from the real haplotype and keeps the one from the other haplotype.

Thanks, Pablo