Open Pikayy opened 4 years ago
Hi Xuyue,
Does the second one has smaller N50? If so, you might need to tune some of the parameters. But it would be good to know the assembly metrics first, such as assembly size, N50, average contig size, etc..
Best,
Dengfeng.
Hi Dengfeng,
Yes it is quite a fragmented genome with fairly small N50. Here I attached the metrics generated from bbmap stats.sh
A C G T N IUPAC Other GC GC_stdev
0.3310 0.1690 0.1690 0.3310 0.0681 0.0000 0.0000 0.3380 0.0330
Main genome scaffold total: 74531
Main genome contig total: 92505
Main genome scaffold sequence total: 705.352 MB
Main genome contig sequence total: 657.336 MB 6.807% gap
Main genome scaffold N/L50: 3876/25.545 KB
Main genome contig N/L50: 12373/14.004 KB
Main genome scaffold N/L90: 37798/3.432 KB
Main genome contig N/L90: 52413/3.011 KB
Max scaffold length: 3.034 MB
Max contig length: 223.512 KB
Number of scaffolds > 50 KB: 1958
% main genome in scaffolds > 50 KB: 40.61%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 74,531 92,505 705,352,056 657,336,296 93.19%
500 74,531 92,505 705,352,056 657,336,296 93.19%
1 KB 74,531 92,505 705,352,056 657,336,296 93.19%
2.5 KB 47,290 64,182 662,843,644 614,869,524 92.76%
5 KB 27,708 43,337 592,985,370 545,057,460 91.92%
10 KB 12,925 26,604 489,156,926 441,308,216 90.22%
25 KB 3,991 15,368 355,607,761 308,400,821 86.72%
50 KB 1,958 11,787 286,441,252 241,984,222 84.48%
100 KB 1,007 8,829 220,378,340 184,047,280 83.51%
250 KB 253 3,686 100,056,614 83,501,304 83.45%
500 KB 34 881 26,673,550 22,763,010 85.34%
1 MB 5 221 8,013,224 7,078,744 88.34%
2.5 MB 1 79 3,034,087 2,698,167 88.93%
HereMain genome scaffold N/L50: 3876/25.545 KB
means the largest 3876 contigs accounts for
50% of the assembly length and those contigs are all ≥ 25.545kbp.
Cheers,
Hi Xuyue,
The N50 is a little bit small, you can reduce -M (gap size for chaining, say 1000) and -E (maximum extension, say 1000) values to get a more reasonable result. I am not sure this would work, I didn't test on an assembly on this scale. So just replace purge_dups with "purge_dups -M1000 -E1000 -T cutoffs -c PB.base.cov split.paf > dups.bed 2> log" see if it works. Best, Dengfeng
Hello again Dengfeng,
I am trying purge_dups on two genomes assembled by supernova. The raw assembly of the first genome is highly duplicated and purge_dups resulted in quite a big improvement of the quality (e.g., longer N50, less duplication from BUSCO results). But when I applied the same pipeline with the second genome, which is less duplicated from the first one, I got worse BUSCO results. Here I attached the BUSCO results before and after purge_dups. Is there a tendency of over-purging if the genome itself is not highly duplicated? Or have I missed anything? I have manually cutoffs file and the cutoff selected by the program looks good.
Before:
After
Best regards, Xuyue