Open HenrivdGeest opened 5 years ago
Hi Henri, the results depend on the heterozygosity and repeat content of your assemblies, and my tests were all based on primary assemblies. I would recommend that you try it and see how the BUSCO scores change. You may need to change some parameters, but without seeing the results it's hard to say whether purge_dups will work. As for the self-to-self alignment, purge_dups has an option (-l) that can filter out alignments matching short repeats.
Hi, I tried it on a plant assembly that shows a high BUSCO duplicate score of >50%. The purged.fa shows a reduction to 10% duplicates, which is a good improvement, but my feeling is that it should be possible to purge more. However, I am now struggling to understand the parameters, mainly those of the purge_dups script. Can you elaborate, or point me to a more descriptive manual in case I missed it?
/bin/purge_dups
Usage: update [options] <PAF>
Options:
-c STR base-level coverage file [NULL]
-T STR cutoffs file [NULL]
**-f INT minimum fraction of haploid/diploid/bad/repetitive bases in a sequence [.8]**
-a INT minimum alignment score [50]
-b INT minimum max match score [200]
-2 BOOL 2 rounds chaining [FALSE]
-m INT minimum matching bases for chaining [500]
-M INT maximum gap size for chaining [20K]
-G INT maximum gap size for 2nd round chaining [50K]
-l INT minimum alignment block for an overlap [10K]
**-E INT maximum extension for contig ends [15K]**
-h help
For me, -f, -G and -E are the ones I want to change, but I do not fully understand them.
The coverage plot looks like:
and the cutoffs file:
5 7 49 50 60 165
Hi Henri, -f sets the threshold for calling a suspect haplotig: if 80% of a scaffold is high coverage (coverage > 165 in your case), it's a repetitive contig; if 80% is low coverage (coverage < 5), it's a junk contig; if 80% is above diploid coverage (50 in your case), it's a diploid contig; otherwise it's a suspect haplotig. -G is for the second round of chaining: in the first round, asset chains consistent alignments within 20 kb, and in the second round within 50 kb. -E is the match extension: if the chained alignment is within 15 kb of a contig end, it will be extended to the end.
-l controls the overlap size; you can also decrease its value to allow more overlaps.
How does purge_haplotigs work on your assembly?
Dengfeng
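The -f rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the reply, not the purge_dups source: the function name, the labels, the order of the checks, and the choice to count "diploid" bases as those between the diploid cutoff and the high cutoff are all assumptions. The default cutoffs 5, 50 and 165 are the values quoted for Henri's cutoffs file.

```python
def classify_contig(depths, low=5, diploid=50, high=165, min_frac=0.8):
    """Classify one contig from its per-base read depths (hypothetical helper).

    low/diploid/high are the coverage cutoffs quoted in the thread;
    min_frac plays the role of -f (default 0.8).
    """
    n = len(depths)
    if n == 0:
        raise ValueError("contig has no coverage values")

    # Fraction of bases falling in each coverage class.
    frac_high = sum(d > high for d in depths) / n
    frac_low = sum(d < low for d in depths) / n
    # Assumption: "above diploid coverage" excludes bases already counted as high.
    frac_diploid = sum(diploid <= d <= high for d in depths) / n

    if frac_high >= min_frac:
        return "repeat"
    if frac_low >= min_frac:
        return "junk"
    if frac_diploid >= min_frac:
        return "diploid"
    return "suspect_haplotig"


# A contig sitting mostly at haploid depth is only a *suspect* haplotig.
print(classify_contig([30] * 10))
```

Under this reading, lowering -f makes it easier for a contig to be called repeat, junk or diploid, so fewer contigs fall through to the suspect-haplotig class.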
It did remove many duplicated contigs, but I don't really like the fact that it also merges contigs. After aligning the old contigs and the long reads, I could not find evidence that the contigs should have been merged. It might be correct in the end, but I would rather not touch the contigs at all and just remove the redundant ones. We have now moved back to using purge_haplotigs.
Hi Henri, what do you mean by merging contigs? Could you please give me an example, say a dotplot showing the contig before and after merging? That would help me update purge_dups. Thanks. Dengfeng.
Hi Dengfeng,
I may be experiencing a similar issue here. I have a gap free assembly being used as input to the purge_dups pipeline. The assembly consists of primary contigs based on PacBio reads assembled with canu. The assembly stats look pretty good (e.g., NG50 >7Mb, LG50 24). I was able to run the pipeline pretty smoothly; the cutoffs chosen were very reasonable (I had two clear peaks with a valley between and haploid/diploid cutoff was in the bottom of the valley). When the whole process was done, I was surprised to see gaps in the final.purged.fa file. 16 gaps were generated across 12 sequences. Each gap was exactly 23 N's long. At first glance, my assumption was that two shorter contigs were merged together with a gap. I have investigated more closely and realized that this is not the case. I wonder if Henri saw the gaps and made the same assumption initially.
Regardless, I have determined that this wasn't a merging of sequence, but rather a deletion of a large chunk which was replaced by a small number of Ns. For example, in one contig which had 194,313 bases, 48,279 bases (109,301-157,580) were excised and replaced with 23 Ns. What is the justification for something like this happening?
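For anyone who wants to check their own output for this, a small Python sketch that lists runs of N in a FASTA file is given here. It is a generic scan, not part of purge_dups; the function names are made up for illustration, and the toy record stands in for a real final.purged.fa.

```python
import re


def find_n_runs(seq, min_len=1):
    """Return (start, end) 0-based half-open coordinates of each run of N/n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)
            if m.end() - m.start() >= min_len]


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Toy record with one 23-base gap; for a real file, pass an open file handle.
toy = [">contig1", "ACGTACGTACGT", "N" * 23, "ACGT"]
for name, seq in read_fasta(toy):
    for start, end in find_n_runs(seq):
        print(name, start, end, end - start)
```

If the excision were the only change to that contig, its purged length would be 194,313 - 48,279 + 23 = 146,057 bases, which is a quick sanity check to run alongside the gap scan.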
I took a closer look at things and discovered the following relevant lines in the dups.bed file:
tig00002458 0 64876 REPEAT tig00002442
tig00000284 0 188425 REPEAT tig00000287
tig00000287 109300 157579 OVLP tig00002442
tig00000297 0 78604 REPEAT tig00000287
tig00002702 0 79150 REPEAT tig00000287
The example I mentioned in my previous comment occurred in tig00000287.
The contigs in the lines I showed above are not present on any other line in the dups.bed file. Of the contigs listed, none are present in final.hap.fa. Only two, tig00000287 and tig00002442, are present in final.purged.fa. Here are the lengths of each of these contigs in the original contig fasta file:
contig length
----------------------
tig00000284 188425
tig00000287 194313
tig00000297 78604
tig00002442 575628
tig00002458 64876
tig00002702 79150
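Combining the dups.bed lines with these lengths makes it easy to spot which entries cut into the middle of a contig rather than removing a whole contig or trimming an end. The sketch below is a hypothetical helper, not part of purge_dups, and it assumes the BED coordinates are 0-based and half-open (which is consistent with the 48,279 bases reported above).

```python
def internal_intervals(bed_lines, lengths):
    """Return dups.bed entries that touch neither end of their contig."""
    flagged = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, start, end, kind = fields[0], int(fields[1]), int(fields[2]), fields[3]
        touches_start = start == 0
        touches_end = end == lengths[name]
        if not (touches_start or touches_end):
            flagged.append((name, start, end, kind, end - start))
    return flagged


# The five dups.bed lines and the contig lengths quoted in this comment.
bed = [
    "tig00002458 0 64876 REPEAT tig00002442",
    "tig00000284 0 188425 REPEAT tig00000287",
    "tig00000287 109300 157579 OVLP tig00002442",
    "tig00000297 0 78604 REPEAT tig00000287",
    "tig00002702 0 79150 REPEAT tig00000287",
]
lengths = {
    "tig00000284": 188425, "tig00000287": 194313, "tig00000297": 78604,
    "tig00002442": 575628, "tig00002458": 64876, "tig00002702": 79150,
}

for hit in internal_intervals(bed, lengths):
    print(*hit)
```

On these inputs only the OVLP entry on tig00000287 is flagged; the four REPEAT entries each span a whole contig.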
Note that in final.purged.fa, tig00002442 has no deletion. Here are the alignments from the paf file between contigs:
tig00000287:1-194313 194313 109300 127490 - tig00002442:1-575628 575628 33058 51370 12707 18322 0 tp:A:S cm:i:1083 s1:i:12702 dv:f:0.0217 rl:i:52972
tig00000287:1-194313 194313 112590 130985 - tig00002442:1-575628 575628 33058 51581 12026 18540 0 tp:A:S cm:i:1009 s1:i:12014 dv:f:0.0260 rl:i:52972
tig00000287:1-194313 194313 116096 134453 - tig00002442:1-575628 575628 33058 51581 11156 18548 0 tp:A:S cm:i:922 s1:i:11132 dv:f:0.0307 rl:i:52972
tig00000287:1-194313 194313 109300 124014 - tig00002442:1-575628 575628 33058 47847 10525 14799 0 tp:A:S cm:i:903 s1:i:10521 dv:f:0.0203 rl:i:52972
tig00000287:1-194313 194313 119600 134657 - tig00002442:1-575628 575628 36370 51581 8723 15234 0 tp:A:S cm:i:707 s1:i:8702 dv:f:0.0343 rl:i:52972
tig00000287:1-194313 194313 109300 120526 - tig00002442:1-575628 575628 33058 44332 8161 11279 0 tp:A:S cm:i:702 s1:i:8158 dv:f:0.0194 rl:i:52972
tig00000287:1-194313 194313 123096 134657 - tig00002442:1-575628 575628 39887 51581 6214 11716 0 tp:A:S cm:i:491 s1:i:6195 dv:f:0.0397 rl:i:52972
tig00000287:1-194313 194313 109300 117026 - tig00002442:1-575628 575628 33058 40814 5595 7764 0 tp:A:S cm:i:475 s1:i:5594 dv:f:0.0200 rl:i:52972
tig00000287:1-194313 194313 126566 134657 - tig00002442:1-575628 575628 43403 51581 4029 8198 0 tp:A:S cm:i:308 s1:i:4013 dv:f:0.0453 rl:i:52972
tig00000287:1-194313 194313 109300 113519 - tig00002442:1-575628 575628 33058 37299 2953 4245 0 tp:A:S cm:i:247 s1:i:2952 dv:f:0.0231 rl:i:52972
tig00000287:1-194313 194313 130131 134657 - tig00002442:1-575628 575628 46919 51521 2108 4609 0 tp:A:S cm:i:159 s1:i:2094 dv:f:0.0502 rl:i:52972
tig00000287:1-194313 194313 156569 157579 - tig00002442:1-575628 575628 605 1616 599 1011 0 tp:A:S cm:i:51 s1:i:599 dv:f:0.0128 rl:i:52972
tig00000287:1-194313 194313 156569 157579 - tig00002442:1-575628 575628 1727 2737 588 1011 0 tp:A:S cm:i:50 s1:i:588 dv:f:0.0138 rl:i:52972
tig00000287:1-194313 194313 109300 110015 - tig00002442:1-575628 575628 33058 33783 457 725 0 tp:A:S cm:i:38 s1:i:456 dv:f:0.0329 rl:i:52972
tig00000287:1-194313 194313 133616 134657 - tig00002442:1-575628 575628 50440 51521 402 1081 0 tp:A:S cm:i:29 s1:i:393 dv:f:0.0667 rl:i:52972
tig00000287:1-194313 194313 48402 49502 - tig00002442:1-575628 575628 57953 59035 362 1102 0 tp:A:S cm:i:31 s1:i:357 dv:f:0.0662 rl:i:52972
tig00000287:1-194313 194313 138276 141770 - tig00002442:1-575628 575628 52166 55689 217 3523 0 tp:A:S cm:i:14 s1:i:211 dv:f:0.1552 rl:i:52972
tig00000287:1-194313 194313 150981 154479 - tig00002442:1-575628 575628 52166 55689 205 3523 0 tp:A:S cm:i:13 s1:i:201 dv:f:0.1593 rl:i:52972
tig00000287:1-194313 194313 137710 141194 - tig00002442:1-575628 575628 52166 55677 205 3511 0 tp:A:S cm:i:13 s1:i:198 dv:f:0.1589 rl:i:52972
tig00000287:1-194313 194313 151559 155056 - tig00002442:1-575628 575628 52166 55689 193 3523 0 tp:A:S cm:i:12 s1:i:187 dv:f:0.1635 rl:i:52972
tig00000287:1-194313 194313 149826 153324 - tig00002442:1-575628 575628 52166 55689 186 3523 0 tp:A:S cm:i:12 s1:i:182 dv:f:0.1635 rl:i:52972
tig00000287:1-194313 194313 142892 146390 - tig00002442:1-575628 575628 52166 55689 186 3523 0 tp:A:S cm:i:12 s1:i:181 dv:f:0.1635 rl:i:52972
tig00000287:1-194313 194313 145203 148700 - tig00002442:1-575628 575628 52166 55689 186 3523 0 tp:A:S cm:i:12 s1:i:181 dv:f:0.1633 rl:i:52972
tig00000287:1-194313 194313 138853 141770 - tig00002442:1-575628 575628 52747 55689 186 2942 0 tp:A:S cm:i:12 s1:i:179 dv:f:0.1545 rl:i:52972
tig00000287:1-194313 194313 140586 144079 - tig00002442:1-575628 575628 52166 55689 186 3523 0 tp:A:S cm:i:12 s1:i:179 dv:f:0.1627 rl:i:52972
tig00000287:1-194313 194313 140008 143501 - tig00002442:1-575628 575628 52166 55689 186 3523 0 tp:A:S cm:i:12 s1:i:179 dv:f:0.1627 rl:i:52972
tig00000287:1-194313 194313 139431 142923 - tig00002442:1-575628 575628 52166 55689 186 3523 0 tp:A:S cm:i:12 s1:i:179 dv:f:0.1631 rl:i:52972
tig00000287:1-194313 194313 150981 153901 - tig00002442:1-575628 575628 52166 55107 174 2941 0 tp:A:S cm:i:11 s1:i:170 dv:f:0.1586 rl:i:52972
tig00000287:1-194313 194313 137710 140617 - tig00002442:1-575628 575628 52166 55095 174 2929 0 tp:A:S cm:i:11 s1:i:170 dv:f:0.1581 rl:i:52972
tig00000287:1-194313 194313 152137 155056 - tig00002442:1-575628 575628 52747 55689 162 2942 0 tp:A:S cm:i:10 s1:i:157 dv:f:0.1636 rl:i:52972
tig00000287:1-194313 194313 136559 140039 - tig00002442:1-575628 575628 52166 55677 162 3511 0 tp:A:S cm:i:10 s1:i:155 dv:f:0.1721 rl:i:52972
tig00000287:1-194313 194313 152715 156210 - tig00002442:1-575628 575628 52166 55689 162 3523 0 tp:A:S cm:i:10 s1:i:155 dv:f:0.1735 rl:i:52972
tig00000287:1-194313 194313 142892 145812 - tig00002442:1-575628 575628 52166 55107 155 2941 0 tp:A:S cm:i:10 s1:i:152 dv:f:0.1636 rl:i:52972
tig00000287:1-194313 194313 149826 152746 - tig00002442:1-575628 575628 52166 55107 155 2941 0 tp:A:S cm:i:10 s1:i:151 dv:f:0.1636 rl:i:52972
tig00000287:1-194313 194313 145203 148122 - tig00002442:1-575628 575628 52166 55107 155 2941 0 tp:A:S cm:i:10 s1:i:151 dv:f:0.1634 rl:i:52972
tig00000287:1-194313 194313 148669 152168 - tig00002442:1-575628 575628 52166 55689 155 3523 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1731 rl:i:52972
tig00000287:1-194313 194313 143470 146390 - tig00002442:1-575628 575628 52747 55689 155 2942 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1636 rl:i:52972
tig00000287:1-194313 194313 144048 147544 - tig00002442:1-575628 575628 52166 55689 155 3523 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1729 rl:i:52972
tig00000287:1-194313 194313 148091 151590 - tig00002442:1-575628 575628 52166 55689 155 3523 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1731 rl:i:52972
tig00000287:1-194313 194313 145781 148700 - tig00002442:1-575628 575628 52747 55689 155 2942 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1634 rl:i:52972
tig00000287:1-194313 194313 141739 145234 - tig00002442:1-575628 575628 52166 55689 155 3523 0 tp:A:S cm:i:10 s1:i:149 dv:f:0.1723 rl:i:52972
tig00000287:1-194313 194313 147513 151012 - tig00002442:1-575628 575628 52166 55689 155 3523 0 tp:A:S cm:i:10 s1:i:149 dv:f:0.1731 rl:i:52972
tig00000287:1-194313 194313 146359 149857 - tig00002442:1-575628 575628 52166 55689 155 3523 0 tp:A:S cm:i:10 s1:i:149 dv:f:0.1729 rl:i:52972
tig00000287:1-194313 194313 141163 144079 - tig00002442:1-575628 575628 52747 55689 155 2942 0 tp:A:S cm:i:10 s1:i:148 dv:f:0.1627 rl:i:52972
tig00000287:1-194313 194313 136559 139462 - tig00002442:1-575628 575628 52166 55095 131 2929 0 tp:A:S cm:i:8 s1:i:126 dv:f:0.1737 rl:i:52972
tig00000287:1-194313 194313 153293 156210 - tig00002442:1-575628 575628 52747 55689 131 2942 0 tp:A:S cm:i:8 s1:i:125 dv:f:0.1758 rl:i:52972
tig00000287:1-194313 194313 147513 149857 - tig00002442:1-575628 575628 52747 55107 124 2360 0 tp:A:S cm:i:8 s1:i:121 dv:f:0.1639 rl:i:52972
tig00000287:1-194313 194313 153882 156210 - tig00002442:1-575628 575628 53329 55677 100 2348 0 tp:A:S cm:i:6 s1:i:95 dv:f:0.1793 rl:i:52972
tig00000287:1-194313 194313 136559 138884 - tig00002442:1-575628 575628 52166 54512 100 2346 0 tp:A:S cm:i:6 s1:i:95 dv:f:0.1769 rl:i:52972
tig00000287:1-194313 194313 154448 156210 - tig00002442:1-575628 575628 53912 55689 81 1777 0 tp:A:S cm:i:5 s1:i:78 dv:f:0.1746 rl:i:52972
tig00000287:1-194313 194313 136559 138307 - tig00002442:1-575628 575628 52166 53931 69 1765 0 tp:A:S cm:i:4 s1:i:66 dv:f:0.1840 rl:i:52972
tig00000287:1-194313 194313 155037 156210 - tig00002442:1-575628 575628 54493 55677 50 1184 0 tp:A:S cm:i:3 s1:i:47 dv:f:0.1807 rl:i:52972
tig00002442:1-575628 575628 33058 51370 - tig00000287:1-194313 194313 109300 127490 12707 18322 0 tp:A:S cm:i:1083 s1:i:12702 dv:f:0.0203 rl:i:53209
tig00002442:1-575628 575628 33058 51581 - tig00000287:1-194313 194313 112590 130985 12026 18540 0 tp:A:S cm:i:1009 s1:i:12014 dv:f:0.0248 rl:i:53209
tig00002442:1-575628 575628 33058 51581 - tig00000287:1-194313 194313 116096 134453 11156 18548 0 tp:A:S cm:i:922 s1:i:11132 dv:f:0.0295 rl:i:53209
tig00002442:1-575628 575628 33058 47847 - tig00000287:1-194313 194313 109300 124014 10525 14799 0 tp:A:S cm:i:903 s1:i:10521 dv:f:0.0186 rl:i:53209
tig00002442:1-575628 575628 36370 51581 - tig00000287:1-194313 194313 119600 134657 8723 15234 0 tp:A:S cm:i:707 s1:i:8702 dv:f:0.0332 rl:i:53209
tig00002442:1-575628 575628 33058 44332 - tig00000287:1-194313 194313 109300 120526 8161 11279 0 tp:A:S cm:i:702 s1:i:8158 dv:f:0.0176 rl:i:53209
tig00002442:1-575628 575628 39887 51581 - tig00000287:1-194313 194313 123096 134657 6214 11716 0 tp:A:S cm:i:491 s1:i:6195 dv:f:0.0385 rl:i:53209
tig00002442:1-575628 575628 33058 40814 - tig00000287:1-194313 194313 109300 117026 5595 7764 0 tp:A:S cm:i:475 s1:i:5594 dv:f:0.0187 rl:i:53209
tig00002442:1-575628 575628 43403 51581 - tig00000287:1-194313 194313 126566 134657 4029 8198 0 tp:A:S cm:i:308 s1:i:4013 dv:f:0.0445 rl:i:53209
tig00002442:1-575628 575628 33058 37299 - tig00000287:1-194313 194313 109300 113519 2953 4245 0 tp:A:S cm:i:247 s1:i:2952 dv:f:0.0214 rl:i:53209
tig00002442:1-575628 575628 46919 51521 - tig00000287:1-194313 194313 130131 134657 2108 4609 0 tp:A:S cm:i:159 s1:i:2094 dv:f:0.0492 rl:i:53209
tig00002442:1-575628 575628 605 1616 - tig00000287:1-194313 194313 156569 157579 599 1011 0 tp:A:S cm:i:51 s1:i:599 dv:f:0.0128 rl:i:53209
tig00002442:1-575628 575628 1727 2737 - tig00000287:1-194313 194313 156569 157579 588 1011 0 tp:A:S cm:i:50 s1:i:588 dv:f:0.0138 rl:i:53209
tig00002442:1-575628 575628 33058 33783 - tig00000287:1-194313 194313 109300 110015 457 725 0 tp:A:S cm:i:38 s1:i:456 dv:f:0.0291 rl:i:53209
tig00002442:1-575628 575628 50440 51521 - tig00000287:1-194313 194313 133616 134657 402 1081 0 tp:A:S cm:i:29 s1:i:393 dv:f:0.0641 rl:i:53209
tig00002442:1-575628 575628 57953 59035 - tig00000287:1-194313 194313 48402 49502 362 1102 0 tp:A:S cm:i:31 s1:i:357 dv:f:0.0712 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 138276 141770 217 3523 0 tp:A:S cm:i:14 s1:i:211 dv:f:0.1649 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 150981 154479 205 3523 0 tp:A:S cm:i:13 s1:i:201 dv:f:0.1688 rl:i:53209
tig00002442:1-575628 575628 52166 55677 - tig00000287:1-194313 194313 137710 141194 205 3511 0 tp:A:S cm:i:13 s1:i:198 dv:f:0.1686 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 151559 155056 193 3523 0 tp:A:S cm:i:12 s1:i:187 dv:f:0.1730 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 149826 153324 186 3523 0 tp:A:S cm:i:12 s1:i:182 dv:f:0.1730 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 145203 148700 186 3523 0 tp:A:S cm:i:12 s1:i:181 dv:f:0.1730 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 142892 146390 186 3523 0 tp:A:S cm:i:12 s1:i:181 dv:f:0.1730 rl:i:53209
tig00002442:1-575628 575628 52747 55689 - tig00000287:1-194313 194313 138853 141770 186 2942 0 tp:A:S cm:i:12 s1:i:179 dv:f:0.1635 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 140586 144079 186 3523 0 tp:A:S cm:i:12 s1:i:179 dv:f:0.1730 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 140008 143501 186 3523 0 tp:A:S cm:i:12 s1:i:179 dv:f:0.1730 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 139431 142923 186 3523 0 tp:A:S cm:i:12 s1:i:179 dv:f:0.1730 rl:i:53209
tig00002442:1-575628 575628 52166 55107 - tig00000287:1-194313 194313 150981 153901 174 2941 0 tp:A:S cm:i:11 s1:i:170 dv:f:0.1683 rl:i:53209
tig00002442:1-575628 575628 52166 55095 - tig00000287:1-194313 194313 137710 140617 174 2929 0 tp:A:S cm:i:11 s1:i:170 dv:f:0.1681 rl:i:53209
tig00002442:1-575628 575628 52747 55689 - tig00000287:1-194313 194313 152137 155056 162 2942 0 tp:A:S cm:i:10 s1:i:157 dv:f:0.1731 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 152715 156210 162 3523 0 tp:A:S cm:i:10 s1:i:155 dv:f:0.1826 rl:i:53209
tig00002442:1-575628 575628 52166 55677 - tig00000287:1-194313 194313 136559 140039 162 3511 0 tp:A:S cm:i:10 s1:i:155 dv:f:0.1824 rl:i:53209
tig00002442:1-575628 575628 52166 55107 - tig00000287:1-194313 194313 142892 145812 155 2941 0 tp:A:S cm:i:10 s1:i:152 dv:f:0.1733 rl:i:53209
tig00002442:1-575628 575628 52166 55107 - tig00000287:1-194313 194313 149826 152746 155 2941 0 tp:A:S cm:i:10 s1:i:151 dv:f:0.1733 rl:i:53209
tig00002442:1-575628 575628 52166 55107 - tig00000287:1-194313 194313 145203 148122 155 2941 0 tp:A:S cm:i:10 s1:i:151 dv:f:0.1733 rl:i:53209
tig00002442:1-575628 575628 52747 55689 - tig00000287:1-194313 194313 145781 148700 155 2942 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1731 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 148091 151590 155 3523 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1826 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 148669 152168 155 3523 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1826 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 144048 147544 155 3523 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1826 rl:i:53209
tig00002442:1-575628 575628 52747 55689 - tig00000287:1-194313 194313 143470 146390 155 2942 0 tp:A:S cm:i:10 s1:i:150 dv:f:0.1731 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 141739 145234 155 3523 0 tp:A:S cm:i:10 s1:i:149 dv:f:0.1826 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 147513 151012 155 3523 0 tp:A:S cm:i:10 s1:i:149 dv:f:0.1826 rl:i:53209
tig00002442:1-575628 575628 52166 55689 - tig00000287:1-194313 194313 146359 149857 155 3523 0 tp:A:S cm:i:10 s1:i:149 dv:f:0.1826 rl:i:53209
tig00002442:1-575628 575628 52747 55689 - tig00000287:1-194313 194313 141163 144079 155 2942 0 tp:A:S cm:i:10 s1:i:148 dv:f:0.1731 rl:i:53209
tig00002442:1-575628 575628 52166 55095 - tig00000287:1-194313 194313 136559 139462 131 2929 0 tp:A:S cm:i:8 s1:i:126 dv:f:0.1848 rl:i:53209
tig00002442:1-575628 575628 52747 55689 - tig00000287:1-194313 194313 153293 156210 131 2942 0 tp:A:S cm:i:8 s1:i:125 dv:f:0.1848 rl:i:53209
tig00002442:1-575628 575628 52747 55107 - tig00000287:1-194313 194313 147513 149857 124 2360 0 tp:A:S cm:i:8 s1:i:121 dv:f:0.1735 rl:i:53209
tig00002442:1-575628 575628 53329 55677 - tig00000287:1-194313 194313 153882 156210 100 2348 0 tp:A:S cm:i:6 s1:i:95 dv:f:0.1884 rl:i:53209
tig00002442:1-575628 575628 52166 54512 - tig00000287:1-194313 194313 136559 138884 100 2346 0 tp:A:S cm:i:6 s1:i:95 dv:f:0.1884 rl:i:53209
tig00002442:1-575628 575628 53912 55689 - tig00000287:1-194313 194313 154448 156210 81 1777 0 tp:A:S cm:i:5 s1:i:78 dv:f:0.1834 rl:i:53209
tig00002442:1-575628 575628 52166 53931 - tig00000287:1-194313 194313 136559 138307 69 1765 0 tp:A:S cm:i:4 s1:i:66 dv:f:0.1945 rl:i:53209
tig00002442:1-575628 575628 54493 55677 - tig00000287:1-194313 194313 155037 156210 50 1184 0 tp:A:S cm:i:3 s1:i:47 dv:f:0.1886 rl:i:53209
Can you speak to what is happening here? Also, do you have a sense of how the order of the input contigs might affect things? In the example I've provided here, I did not assign any particular order. I tried again after sorting in descending order based on contig length and got slightly different results: more contigs, more gaps across more contigs. Do you think this is due to the order of the input contigs or simply a function of inherent variability in the decision-making process or something else?
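To see which stretch of a contig the alignments actually cover, the PAF records can be reduced to merged query intervals. The Python sketch below does a plain interval union over the standard PAF columns (query name, length, start, end in columns 1 to 4); it is not the chaining that purge_dups performs, and the function name is made up. The four sample records are the first 12 columns of lines copied from the listing above.

```python
from collections import defaultdict


def merged_query_intervals(paf_lines):
    """Merge overlapping query intervals per query sequence (simple union)."""
    by_query = defaultdict(list)
    for line in paf_lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # not a complete PAF record
        by_query[fields[0]].append((int(fields[2]), int(fields[3])))

    merged = {}
    for query, intervals in by_query.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)  # extend the current block
            else:
                out.append([start, end])           # start a new block
        merged[query] = [tuple(iv) for iv in out]
    return merged


records = [
    "tig00000287:1-194313 194313 109300 127490 - tig00002442:1-575628 575628 33058 51370 12707 18322 0",
    "tig00000287:1-194313 194313 112590 130985 - tig00002442:1-575628 575628 33058 51581 12026 18540 0",
    "tig00000287:1-194313 194313 156569 157579 - tig00002442:1-575628 575628 605 1616 599 1011 0",
    "tig00000287:1-194313 194313 48402 49502 - tig00002442:1-575628 575628 57953 59035 362 1102 0",
]
for query, blocks in merged_query_intervals(records).items():
    print(query, blocks)
```

Running this over the full listing rather than four records would show how much of the 109300 to 157579 region is supported by individual alignments before any chaining or end extension is applied.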
Hi Dengfeng,
I noticed that commit cb3721f4 provides the -e option, which would eliminate this problem for the user. Also, commit 28a60101 adds an FAQ about this subject, which is helpful. I just wanted to point out for other people experiencing the unexpected behavior of removing the middle of contigs that they can download the latest commit instead of the latest release and use the -e flag.
As I took a closer look at your commit messages, I noticed that you have been updating the minor and patch level values and specifying the version in your commit message. I should have realized you had been doing that for a while now, and I should have read through your full commit messages instead of just assuming the latest release would be the best option. I suspect other people would benefit from having an official release/tag added to changes that are made, especially if they provide new functionality that fixes issues raised on GitHub. It would be particularly unfortunate if someone didn't realize that contig midsections were removed as that could adversely affect downstream analysis if not planned for. If I may make a friendly request, would you make a tag for at least one commit that has been pushed since these changes were implemented?
I think this tool is awesome. Thank you for your time developing and supporting it!
From the article I see that this tool is made for running on the primary assembly to get rid of the redundant haplotigs, which are sometimes still present even after falcon-unzip. I am wondering whether purge_dups runs as well on an almost doubled genome assembly resulting from a normal Canu or Falcon assembly. We have assemblies with BUSCO duplicate scores of >50%, and we are looking into ways to remove redundancy. A possibly related question: did you try running minimap for the self-to-self alignment with a higher scoring cutoff to avoid matching short repeats?