dfguan / purge_dups

haplotypic duplication identification tool
MIT License
202 stars 19 forks source link

Worse BUSCO results after purging #15

Open Pikayy opened 4 years ago

Pikayy commented 4 years ago

Hello again Dengfeng,

I am trying purge_dups on two genomes assembled by supernova. The raw assembly of the first genome is highly duplicated and purge_dups resulted in quite a big improvement of the quality (e.g., longer N50, less duplication from BUSCO results). But when I applied the same pipeline with the second genome, which is less duplicated from the first one, I got worse BUSCO results. Here I attached the BUSCO results before and after purge_dups. Is there a tendency of over-purging if the genome itself is not highly duplicated? Or have I missed anything? I have manually cutoffs file and the cutoff selected by the program looks good.

Before:

858     Complete BUSCOs
--
729     Complete and single-copy BUSCOs
129     Complete and duplicated BUSCOs
29      Fragmented BUSCOs
179     Missing BUSCOs
1066    Total BUSCO groups searched

After

629     Complete BUSCOs
--
575     Complete and single-copy   BUSCOs
54      Complete and duplicated   BUSCOs
29      Fragmented BUSCOs
408     Missing BUSCOs
1066    Total BUSCO groups   searched

Best regards, Xuyue

dfguan commented 4 years ago

Hi Xuyue,

Does the second one has smaller N50? If so, you might need to tune some of the parameters. But it would be good to know the assembly metrics first, such as assembly size, N50, average contig size, etc..

Best,

Dengfeng.

Pikayy commented 4 years ago

Hi Dengfeng,

Yes it is quite a fragmented genome with fairly small N50. Here I attached the metrics generated from bbmap stats.sh

A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.3310  0.1690  0.1690  0.3310  0.0681  0.0000  0.0000  0.3380  0.0330

Main genome scaffold total:             74531
Main genome contig total:               92505
Main genome scaffold sequence total:    705.352 MB
Main genome contig sequence total:      657.336 MB      6.807% gap
Main genome scaffold N/L50:             3876/25.545 KB
Main genome contig N/L50:               12373/14.004 KB
Main genome scaffold N/L90:             37798/3.432 KB
Main genome contig N/L90:               52413/3.011 KB
Max scaffold length:                    3.034 MB
Max contig length:                      223.512 KB
Number of scaffolds > 50 KB:            1958
% main genome in scaffolds > 50 KB:     40.61%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All             74,531          92,505     705,352,056     657,336,296    93.19%
    500             74,531          92,505     705,352,056     657,336,296    93.19%
   1 KB             74,531          92,505     705,352,056     657,336,296    93.19%
 2.5 KB             47,290          64,182     662,843,644     614,869,524    92.76%
   5 KB             27,708          43,337     592,985,370     545,057,460    91.92%
  10 KB             12,925          26,604     489,156,926     441,308,216    90.22%
  25 KB              3,991          15,368     355,607,761     308,400,821    86.72%
  50 KB              1,958          11,787     286,441,252     241,984,222    84.48%
 100 KB              1,007           8,829     220,378,340     184,047,280    83.51%
 250 KB                253           3,686     100,056,614      83,501,304    83.45%
 500 KB                 34             881      26,673,550      22,763,010    85.34%
   1 MB                  5             221       8,013,224       7,078,744    88.34%
 2.5 MB                  1              79       3,034,087       2,698,167    88.93%

HereMain genome scaffold N/L50: 3876/25.545 KB means the largest 3876 contigs accounts for 50% of the assembly length and those contigs are all ≥ 25.545kbp.

Cheers,

dfguan commented 4 years ago

Hi Xuyue,

The N50 is a little bit small, you can reduce -M (gap size for chaining, say 1000) and -E (maximum extension, say 1000) values to get a more reasonable result. I am not sure this would work, I didn't test on an assembly on this scale. So just replace purge_dups with "purge_dups -M1000 -E1000 -T cutoffs -c PB.base.cov split.paf > dups.bed 2> log" see if it works. Best, Dengfeng