dfguan / purge_dups

haplotypic duplication identification tool
MIT License

RNA mapping rate reduce after purge_dups #24

Open baozg opened 4 years ago

baozg commented 4 years ago

Hi, Dengfeng

purge_dups is a super easy-to-use tool. Thanks for developing it.

I used Falcon-Unzip to assemble an outbred plant (270 Mb genome size, 1.8% heterozygosity, based on GenomeScope), but the Unzip cns_p_ctg.fasta is 450 Mb, which is more than 1.5x my estimated genome size. After assembly, I tried purge_dups and purge_haplotigs to remove haplotigs. The results are shown below.

Although the purge_haplotigs assembly is more contiguous, it still seems to contain some unpurged haplotigs, based on the duplicated BUSCOs (the KAT plot shows this too). The purge_dups assembly looks close to perfect based on the KAT plot and BUSCO. So I have two questions.

  1. How can I recover the RNA mapping rate? Would you give me some advice?
  2. Can the N gaps in the broken assembly be filled by Arrow or Pilon?

| Type | Falcon-Unzip | Purge_dups | Purge_haplotigs |
|---|---|---|---|
| Assembly size (bp) | 450,345,076 | 250,592,803 | 252,675,950 |
| Contig number | 436 | 242 | 101 |
| Contig N50 (bp) | 3,981,319 | 5,255,747 | 6,328,881 |
| Complete BUSCOs | 97.40% | 97.60% | 96.90% |
| Complete and single-copy BUSCOs | 26.40% | 95.10% | 84.70% |
| Complete and duplicated BUSCOs | 71% | 2.50% | 12.20% |
| Fragmented BUSCOs | 0.90% | 0.80% | 0.90% |
| Missing BUSCOs | 1.70% | 1.60% | 2.20% |
| IsoSeq mapping rate | 97.29% | 94.57% | 94.61% |
| RNA mapping rate | 95.13% | 91.50% | - |
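
As a quick check on the assembly sizes above, each one can be compared with the 270 Mb GenomeScope estimate. This is only a small arithmetic sketch; the helper name `size_ratio` is made up for illustration, and the estimate is taken as exactly 270,000,000 bp.

```python
# Genome size estimate from GenomeScope, taken here as exactly 270 Mb (assumption).
ESTIMATED_GENOME_SIZE = 270_000_000

# Assembly sizes in bp, copied from the table above.
assembly_sizes = {
    "Falcon-Unzip": 450_345_076,
    "Purge_dups": 250_592_803,
    "Purge_haplotigs": 252_675_950,
}


def size_ratio(assembly_size, genome_size=ESTIMATED_GENOME_SIZE):
    """Return assembly size as a multiple of the estimated genome size."""
    return assembly_size / genome_size


for name, size in assembly_sizes.items():
    print(f"{name}: {size_ratio(size):.2f}x of the estimated genome size")
```

The unpurged primary contigs come out well above 1x, while both purged assemblies land a little below 1x.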

For purge_base.cov, the cutoffs are:

5       48      78      94      157     282

[Figure: PB coverage]

[Figure: KAT plots for the three assemblies (Falcon_unzip, Purge_haplotigs, Purge_dups)]

Zhigui Bao

dfguan commented 4 years ago

Hi Zhigui,

Thanks for trying purge_dups. Very impressive results.

For your questions:

  1. How can I recover the RNA mapping rate? Would you give me some advice?

     One thing you can try is polishing with Arrow, Racon, or another tool. If that does not help, consider whether the drop is actually expected, since some genes may sit in the haplotigs rather than in the primary contigs.

  2. Can the N gaps be filled by Arrow or Pilon?

     The answer is yes for some gaps. We have some scaffolds where the haplotypic duplications were removed cleanly, and Arrow fills their gaps.

If you have any further questions, please feel free to ask.

Thanks for trying purge_dups.

Dengfeng.

baozg commented 4 years ago

Hi Dengfeng,

Thanks for the prompt response.

  1. I think polishing would bring few genes back: since the Falcon-Unzip version has the highest rate, the genes could be in the haplotigs. I will update the results when all polishing is done.

  2. Arrow / Racon could fill some gaps in the assembly. But I have some other questions about polishing.

    • Typically, polishing takes several rounds. My Falcon-Unzip assembly (polished with arrow*1) seems good enough for haplotype removal. An ONT assembly, however, has a higher error rate and worse BUSCO scores when the raw assembly finishes, so we typically polish for more rounds (racon*3 + medaka + pilon*2). At which step should I run purge_dups (racon and medaka use the ONT data, Pilon uses the Illumina data)? Could purging be done right after Canu finishes, since purge_dups only needs the raw ONT data and the assembly itself?

    • After purging, should the primary contig and haplotig FASTA files be polished together or separately?

      • Combine the primary contigs and haplotigs, align with minimap2, and polish.

      • Map the raw data to the primary contigs and haplotigs separately, and polish separately.

  3. Purging haplotigs in $hap_asm

    The README says:

    Step 4. Merge hap.fa and $hap_asm and redo the above steps to get a decent haplotig set.

    Just to be clear: I take the hap.fa purged from the Unzip primary contigs, cat it with $hap_asm (cns_h_ctg.fa), then run the purge_dups pipeline on the merged file to get a new purged.fa and hap.fa. So is this round's purged.fa my decent haplotig set?

Zhigui Bao

dfguan commented 4 years ago

Hi Zhigui,

For your questions:

At which step should I run purge_dups (racon and medaka use the ONT data, Pilon uses the Illumina data)? Could purging be done right after Canu finishes, since purge_dups only needs the raw ONT data and the assembly itself?

The first question is rather complex for me and I do not have a good answer; maybe run purge_dups after all polishing steps are done. As for the second question, you can run purge_dups on a Canu assembly, but the parameters are optimized for Falcon-Unzip contigs, so you may need to tune them. Tell me if it does not work well. And remember to change the minimap2 option for ONT data.
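
To make the minimap2 change concrete, here is a rough sketch that only builds the command lines for one purge_dups round and switches the read-mapping preset between PacBio and ONT. It assumes the step layout of the purge_dups README and the `map-pb` / `map-ont` / `asm5` presets from the minimap2 manual; the function name `purge_dups_commands` and the `reads.N.paf.gz` output names are made up for illustration, so please check every command against your installed versions before running anything.

```python
import shlex

# minimap2 presets for raw long reads: map-pb (PacBio CLR), map-ont (Nanopore).
READ_PRESETS = {"pacbio": "map-pb", "ont": "map-ont"}


def purge_dups_commands(assembly, reads, platform="ont"):
    """Return the shell commands for one purge_dups round as strings.

    Nothing is executed here; the commands are only assembled so they
    can be reviewed, printed, or written to a script.
    """
    preset = READ_PRESETS[platform]
    asm = shlex.quote(assembly)
    split = shlex.quote(assembly + ".split")
    self_paf = shlex.quote(assembly + ".split.self.paf.gz")

    commands = []
    # Map each read file to the assembly (hypothetical output names).
    for index, read_file in enumerate(reads):
        commands.append(
            f"minimap2 -x {preset} {asm} {shlex.quote(read_file)}"
            f" | gzip -c - > reads.{index}.paf.gz"
        )
    commands += [
        # Read depth statistics; writes PB.base.cov and PB.stat.
        "pbcstat reads.*.paf.gz",
        # Automatic cutoffs; these may need manual tuning for Canu contigs.
        "calcuts PB.stat > cutoffs 2> calcuts.log",
        # Split the assembly at Ns and align it to itself.
        f"split_fa {asm} > {split}",
        f"minimap2 -x asm5 -DP {split} {split} | gzip -c - > {self_paf}",
        # Call duplications, then write the purged and haplotig sequences.
        f"purge_dups -2 -T cutoffs -c PB.base.cov {self_paf}"
        " > dups.bed 2> purge_dups.log",
        f"get_seqs dups.bed {asm}",
    ]
    return commands


for command in purge_dups_commands("canu.contigs.fasta", ["ont_reads.fq.gz"]):
    print(command)
```

Only the first mapping step depends on the sequencing platform; the self-alignment of the split assembly keeps the `asm5` preset in both cases.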

After purging, should the primary contig and haplotig FASTA files be polished together or separately?

I think combining both files and polishing them together is the right way: reads from the primary contigs and from their corresponding haplotigs can then be assigned correctly to their own loci. If you polish separately using all reads, the reads from the primary contigs and the corresponding haplotigs will be mapped to the same place, which leads to wrong polishing results. Is that clear?

Step 4. Is purged.fa my decent haplotig set? Yes.
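
A minimal sketch of the "combine" step, assuming plain uncompressed FASTA files: it concatenates the primary contigs and the haplotigs into one reference and refuses to continue if a sequence name occurs twice, since duplicate names would make the later read assignment ambiguous. The function name `merge_for_polishing` and the file names in the usage comment are hypothetical.

```python
def merge_for_polishing(primary, haplotigs, merged):
    """Concatenate two FASTA files into one reference for polishing.

    Returns the sequence names in the order they were written and raises
    ValueError if the same name appears more than once.
    """
    names = []
    seen = set()
    with open(merged, "w") as out:
        for path in (primary, haplotigs):
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        # The name is the first word after '>'.
                        fields = line[1:].split()
                        name = fields[0] if fields else ""
                        if name in seen:
                            raise ValueError(f"duplicate sequence name: {name!r}")
                        seen.add(name)
                        names.append(name)
                    # Make sure a missing final newline cannot glue records together.
                    out.write(line if line.endswith("\n") else line + "\n")
    return names


# Example call (hypothetical file names):
# merge_for_polishing("purged.fa", "haplotigs.fa", "primary_plus_haplotigs.fa")
```

The merged file can then serve as the single reference for read mapping and polishing, so that each read is placed on either a primary contig or its haplotig.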

Cheers.

Dengfeng.