dfguan / purge_dups

haplotypic duplication identification tool
MIT License

can purge_dups be run on non primary assemblies, ie normal falcon assemblies #7

Open HenrivdGeest opened 5 years ago

HenrivdGeest commented 5 years ago

From the article I see that this tool is designed to run on the primary assembly to get rid of the redundant haplotigs, which are sometimes still present even after falcon-unzip. I am wondering whether purge_dups runs as well on an almost doubled genome assembly resulting from a normal CANU or falcon assembly. We have assemblies with BUSCO duplicate scores of >50%, and we are looking into ways to remove redundancy. A possibly related question: did you try running minimap for the self-to-self alignment with a higher scoring cutoff to avoid matching short repeats?

dfguan commented 5 years ago

Hi Henri, the results depend on the heterozygosity and repeat content of your assemblies, and my tests are based only on primary assemblies. I would recommend that you try it and see how the BUSCO scores change. You may need to change some parameters, but without seeing the results, it's hard to say whether purge_dups will work. As for the self-to-self alignment, purge_dups has an option (-l) that can filter out matches to short repeats.

HenrivdGeest commented 5 years ago

Hi, I tried it on a plant assembly which shows a high BUSCO duplicate score of >50%. In purged.fa the duplicate score drops to 10%, which is a good reduction, but my feeling is that it should be possible to purge more. However, I am now struggling to understand the parameters, mainly those of the purge_dups scripts. Can you elaborate, or point me to a more descriptive manual (in case I missed it)?

/bin/purge_dups
Usage: update [options] <PAF>
Options:
         -c    STR      base-level coverage file [NULL]
         -T    STR      cutoffs file [NULL]
         -f    INT      minimum fraction of haploid/diploid/bad/repetitive bases in a sequence [.8]
         -a    INT      minimum alignment score [50]
         -b    INT      minimum max match score [200]
         -2    BOOL     2 rounds chaining [FALSE]
         -m    INT      minimum matching bases for chaining [500]
         -M    INT      maximum gap size for chaining [20K]
         -G    INT      maximum gap size for 2nd round chaining [50K]
         -l    INT      minimum alignment block for an overlap [10K]
         -E    INT      maximum extension for contig ends [15K]
         -h             help

For me, -f, -G and -E are the ones I want to change, but I do not fully understand them. The coverage plot looks like this: [image], and the cutoffs file contains: 5 7 49 50 60 165

dfguan commented 5 years ago

Hi Henri, -f is the threshold used to flag suspect haplotigs. If 80% of a scaffold is high coverage (coverage > 165 in your case), it's a repetitive contig; if 80% is low coverage (coverage < 5), it's a junk contig; if 80% is above diploid coverage (50 in your case), it's a diploid contig; otherwise it's a suspect haplotig. -G is for second-round chaining: in the first round, asset chains consistent alignments within 20 kb; in the second round, within 50 kb. -E is the match extension: if a chained alignment is within 15 kb of a contig end, it will be extended to the end.

-l controls the overlap size; you can also decrease its value to allow more overlaps.
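
To make the -f rule concrete, here is a minimal Python sketch of the classification as described in this comment. It is not taken from the purge_dups source: the function name, the labels, and the order in which the checks are applied are assumptions for illustration, and the thresholds (5, 50, 165) are simply the values quoted from the cutoffs file in this thread.

```python
from typing import Sequence


def classify_contig(
    coverage: Sequence[int],
    low: int = 5,          # below this depth a base counts as "low coverage"
    diploid: int = 50,     # above this depth a base counts as "diploid"
    high: int = 165,       # above this depth a base counts as "high coverage"
    min_frac: float = 0.8  # the -f fraction
) -> str:
    """Label one contig from its per-base read depth (illustrative only)."""
    n = len(coverage)
    if n == 0:
        raise ValueError("empty coverage vector")

    frac_high = sum(c > high for c in coverage) / n
    frac_low = sum(c < low for c in coverage) / n
    frac_diploid = sum(c > diploid for c in coverage) / n

    # Assumed order: repeat, then junk, then diploid, else suspect haplotig.
    if frac_high >= min_frac:
        return "repeat"
    if frac_low >= min_frac:
        return "junk"
    if frac_diploid >= min_frac:
        return "diploid"
    return "suspect_haplotig"


if __name__ == "__main__":
    # A contig sitting at haploid depth (about 30x) along its whole length.
    print(classify_contig([30] * 1000))
```

Under this reading, lowering -f makes it easier for a contig to be labelled repeat, junk or diploid, so fewer contigs are left as suspect haplotigs.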

How does purge_haplotigs work on your assembly?

Dengfeng

HenrivdGeest commented 5 years ago

It did remove many duplicated contigs, but I don't really like the fact that it also merges contigs. After I aligned the old contigs and the long reads, I could not find evidence that the contigs should have been merged. It might be correct in the end, but I would rather not touch the contigs at all and just remove the redundant ones. We have now moved back to using purge_haplotigs.

dfguan commented 5 years ago

Hi Henri, what do you mean by merging contigs? Could you please give me an example, say, a dotplot showing the contig before and after merging? That would help me update purge_dups. Thanks. Dengfeng.

pickettbd commented 3 years ago

Hi Dengfeng,

I may be experiencing a similar issue here. I have a gap-free assembly being used as input to the purge_dups pipeline. The assembly consists of primary contigs based on PacBio reads assembled with canu. The assembly stats look pretty good (e.g., NG50 >7Mb, LG50 24). I was able to run the pipeline pretty smoothly; the cutoffs chosen were very reasonable (I had two clear peaks with a valley between them, and the haploid/diploid cutoff was at the bottom of the valley). When the whole process was done, I was surprised to see gaps in the final.purged.fa file. 16 gaps were generated across 12 sequences. Each gap was exactly 23 N's long. At first glance, my assumption was that two shorter contigs were merged together with a gap. I have investigated more closely and realized that this is not the case. I wonder if Henri saw the gaps and made the same assumption initially.

Regardless, I have determined that this wasn't a merging of sequence, but rather a deletion of a large chunk which was replaced by a small number of Ns. For example, in one contig which had 194,313 bases, 48,279 bases (109,301-157,580) were excised and replaced with 23 Ns. What is the justification for something like this happening?

pickettbd commented 3 years ago

I took a closer look at things and discovered the following relevant lines in the dups.bed file:

tig00002458 0   64876   REPEAT  tig00002442
tig00000284 0   188425  REPEAT  tig00000287
tig00000287 109300  157579  OVLP    tig00002442
tig00000297 0   78604   REPEAT  tig00000287
tig00002702 0   79150   REPEAT  tig00000287

The example I mentioned in my previous comment occurred in tig00000287.
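
As a quick sanity check on the numbers, the OVLP record can be turned into a span length and an expected post-purge contig length. This is only arithmetic on the values quoted in this thread. It assumes the usual BED convention (0-based start, end exclusive) and that this single excision, replaced by 23 Ns, is the only edit made to tig00000287.

```python
def bed_span(start: int, end: int) -> int:
    """Length of a BED interval, assuming 0-based start and exclusive end."""
    return end - start


# The OVLP line for tig00000287 as reported in dups.bed.
record = "tig00000287 109300 157579 OVLP tig00002442".split()
start, end = int(record[1]), int(record[2])

original_length = 194313  # length of tig00000287 before purging
gap_length = 23           # Ns observed in place of the excised block

removed = bed_span(start, end)
expected_length = original_length - removed + gap_length

print(f"removed bases: {removed}")
print(f"expected purged length: {expected_length}")
```

The span comes out at 48,279 bases, which matches the size of the excision I reported above.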

The contigs in the lines I showed above are not present on any other line in the dups.bed file. Of the contigs listed, none are present in final.hap.fa. Only two, tig00000287 and tig00002442, are present in final.purged.fa. Here are the lengths of each of these contigs in the original contig fasta file:

     contig length
----------------------
tig00000284 188425
tig00000287 194313
tig00000297 78604
tig00002442 575628
tig00002458 64876
tig00002702 79150
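
Combining the dups.bed lines with these lengths shows which records cover a whole contig and which fall strictly inside one. The sketch below is plain bookkeeping in Python, not purge_dups code; the helper name and the three labels are made up for illustration.

```python
# Contig lengths from the original fasta, as listed above.
lengths = {
    "tig00000284": 188425,
    "tig00000287": 194313,
    "tig00000297": 78604,
    "tig00002442": 575628,
    "tig00002458": 64876,
    "tig00002702": 79150,
}

# (contig, start, end, type) taken from the dups.bed lines above.
dups = [
    ("tig00002458", 0, 64876, "REPEAT"),
    ("tig00000284", 0, 188425, "REPEAT"),
    ("tig00000287", 109300, 157579, "OVLP"),
    ("tig00000297", 0, 78604, "REPEAT"),
    ("tig00002702", 0, 79150, "REPEAT"),
]


def interval_position(start: int, end: int, length: int) -> str:
    """Say where an interval sits on its contig (hypothetical labels)."""
    if start == 0 and end == length:
        return "whole"
    if start == 0 or end == length:
        return "end"
    return "internal"


positions = {
    name: interval_position(start, end, lengths[name])
    for name, start, end, _type in dups
}

for name, where in positions.items():
    print(name, where)
```

Only the tig00000287 record is internal; the four REPEAT records each span an entire contig.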

Note that in final.purged.fa, tig00002442 has no deletion. Here are the alignments from the paf file between contigs:

tig00000287:1-194313    194313  109300  127490  -   tig00002442:1-575628    575628  33058   51370   12707   18322   0   tp:A:S  cm:i:1083   s1:i:12702  dv:f:0.0217 rl:i:52972
tig00000287:1-194313    194313  112590  130985  -   tig00002442:1-575628    575628  33058   51581   12026   18540   0   tp:A:S  cm:i:1009   s1:i:12014  dv:f:0.0260 rl:i:52972
tig00000287:1-194313    194313  116096  134453  -   tig00002442:1-575628    575628  33058   51581   11156   18548   0   tp:A:S  cm:i:922    s1:i:11132  dv:f:0.0307 rl:i:52972
tig00000287:1-194313    194313  109300  124014  -   tig00002442:1-575628    575628  33058   47847   10525   14799   0   tp:A:S  cm:i:903    s1:i:10521  dv:f:0.0203 rl:i:52972
tig00000287:1-194313    194313  119600  134657  -   tig00002442:1-575628    575628  36370   51581   8723    15234   0   tp:A:S  cm:i:707    s1:i:8702   dv:f:0.0343 rl:i:52972
tig00000287:1-194313    194313  109300  120526  -   tig00002442:1-575628    575628  33058   44332   8161    11279   0   tp:A:S  cm:i:702    s1:i:8158   dv:f:0.0194 rl:i:52972
tig00000287:1-194313    194313  123096  134657  -   tig00002442:1-575628    575628  39887   51581   6214    11716   0   tp:A:S  cm:i:491    s1:i:6195   dv:f:0.0397 rl:i:52972
tig00000287:1-194313    194313  109300  117026  -   tig00002442:1-575628    575628  33058   40814   5595    7764    0   tp:A:S  cm:i:475    s1:i:5594   dv:f:0.0200 rl:i:52972
tig00000287:1-194313    194313  126566  134657  -   tig00002442:1-575628    575628  43403   51581   4029    8198    0   tp:A:S  cm:i:308    s1:i:4013   dv:f:0.0453 rl:i:52972
tig00000287:1-194313    194313  109300  113519  -   tig00002442:1-575628    575628  33058   37299   2953    4245    0   tp:A:S  cm:i:247    s1:i:2952   dv:f:0.0231 rl:i:52972
tig00000287:1-194313    194313  130131  134657  -   tig00002442:1-575628    575628  46919   51521   2108    4609    0   tp:A:S  cm:i:159    s1:i:2094   dv:f:0.0502 rl:i:52972
tig00000287:1-194313    194313  156569  157579  -   tig00002442:1-575628    575628  605 1616    599 1011    0   tp:A:S  cm:i:51 s1:i:599    dv:f:0.0128 rl:i:52972
tig00000287:1-194313    194313  156569  157579  -   tig00002442:1-575628    575628  1727    2737    588 1011    0   tp:A:S  cm:i:50 s1:i:588    dv:f:0.0138 rl:i:52972
tig00000287:1-194313    194313  109300  110015  -   tig00002442:1-575628    575628  33058   33783   457 725 0   tp:A:S  cm:i:38 s1:i:456    dv:f:0.0329 rl:i:52972
tig00000287:1-194313    194313  133616  134657  -   tig00002442:1-575628    575628  50440   51521   402 1081    0   tp:A:S  cm:i:29 s1:i:393    dv:f:0.0667 rl:i:52972
tig00000287:1-194313    194313  48402   49502   -   tig00002442:1-575628    575628  57953   59035   362 1102    0   tp:A:S  cm:i:31 s1:i:357    dv:f:0.0662 rl:i:52972
tig00000287:1-194313    194313  138276  141770  -   tig00002442:1-575628    575628  52166   55689   217 3523    0   tp:A:S  cm:i:14 s1:i:211    dv:f:0.1552 rl:i:52972
tig00000287:1-194313    194313  150981  154479  -   tig00002442:1-575628    575628  52166   55689   205 3523    0   tp:A:S  cm:i:13 s1:i:201    dv:f:0.1593 rl:i:52972
tig00000287:1-194313    194313  137710  141194  -   tig00002442:1-575628    575628  52166   55677   205 3511    0   tp:A:S  cm:i:13 s1:i:198    dv:f:0.1589 rl:i:52972
tig00000287:1-194313    194313  151559  155056  -   tig00002442:1-575628    575628  52166   55689   193 3523    0   tp:A:S  cm:i:12 s1:i:187    dv:f:0.1635 rl:i:52972
tig00000287:1-194313    194313  149826  153324  -   tig00002442:1-575628    575628  52166   55689   186 3523    0   tp:A:S  cm:i:12 s1:i:182    dv:f:0.1635 rl:i:52972
tig00000287:1-194313    194313  142892  146390  -   tig00002442:1-575628    575628  52166   55689   186 3523    0   tp:A:S  cm:i:12 s1:i:181    dv:f:0.1635 rl:i:52972
tig00000287:1-194313    194313  145203  148700  -   tig00002442:1-575628    575628  52166   55689   186 3523    0   tp:A:S  cm:i:12 s1:i:181    dv:f:0.1633 rl:i:52972
tig00000287:1-194313    194313  138853  141770  -   tig00002442:1-575628    575628  52747   55689   186 2942    0   tp:A:S  cm:i:12 s1:i:179    dv:f:0.1545 rl:i:52972
tig00000287:1-194313    194313  140586  144079  -   tig00002442:1-575628    575628  52166   55689   186 3523    0   tp:A:S  cm:i:12 s1:i:179    dv:f:0.1627 rl:i:52972
tig00000287:1-194313    194313  140008  143501  -   tig00002442:1-575628    575628  52166   55689   186 3523    0   tp:A:S  cm:i:12 s1:i:179    dv:f:0.1627 rl:i:52972
tig00000287:1-194313    194313  139431  142923  -   tig00002442:1-575628    575628  52166   55689   186 3523    0   tp:A:S  cm:i:12 s1:i:179    dv:f:0.1631 rl:i:52972
tig00000287:1-194313    194313  150981  153901  -   tig00002442:1-575628    575628  52166   55107   174 2941    0   tp:A:S  cm:i:11 s1:i:170    dv:f:0.1586 rl:i:52972
tig00000287:1-194313    194313  137710  140617  -   tig00002442:1-575628    575628  52166   55095   174 2929    0   tp:A:S  cm:i:11 s1:i:170    dv:f:0.1581 rl:i:52972
tig00000287:1-194313    194313  152137  155056  -   tig00002442:1-575628    575628  52747   55689   162 2942    0   tp:A:S  cm:i:10 s1:i:157    dv:f:0.1636 rl:i:52972
tig00000287:1-194313    194313  136559  140039  -   tig00002442:1-575628    575628  52166   55677   162 3511    0   tp:A:S  cm:i:10 s1:i:155    dv:f:0.1721 rl:i:52972
tig00000287:1-194313    194313  152715  156210  -   tig00002442:1-575628    575628  52166   55689   162 3523    0   tp:A:S  cm:i:10 s1:i:155    dv:f:0.1735 rl:i:52972
tig00000287:1-194313    194313  142892  145812  -   tig00002442:1-575628    575628  52166   55107   155 2941    0   tp:A:S  cm:i:10 s1:i:152    dv:f:0.1636 rl:i:52972
tig00000287:1-194313    194313  149826  152746  -   tig00002442:1-575628    575628  52166   55107   155 2941    0   tp:A:S  cm:i:10 s1:i:151    dv:f:0.1636 rl:i:52972
tig00000287:1-194313    194313  145203  148122  -   tig00002442:1-575628    575628  52166   55107   155 2941    0   tp:A:S  cm:i:10 s1:i:151    dv:f:0.1634 rl:i:52972
tig00000287:1-194313    194313  148669  152168  -   tig00002442:1-575628    575628  52166   55689   155 3523    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1731 rl:i:52972
tig00000287:1-194313    194313  143470  146390  -   tig00002442:1-575628    575628  52747   55689   155 2942    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1636 rl:i:52972
tig00000287:1-194313    194313  144048  147544  -   tig00002442:1-575628    575628  52166   55689   155 3523    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1729 rl:i:52972
tig00000287:1-194313    194313  148091  151590  -   tig00002442:1-575628    575628  52166   55689   155 3523    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1731 rl:i:52972
tig00000287:1-194313    194313  145781  148700  -   tig00002442:1-575628    575628  52747   55689   155 2942    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1634 rl:i:52972
tig00000287:1-194313    194313  141739  145234  -   tig00002442:1-575628    575628  52166   55689   155 3523    0   tp:A:S  cm:i:10 s1:i:149    dv:f:0.1723 rl:i:52972
tig00000287:1-194313    194313  147513  151012  -   tig00002442:1-575628    575628  52166   55689   155 3523    0   tp:A:S  cm:i:10 s1:i:149    dv:f:0.1731 rl:i:52972
tig00000287:1-194313    194313  146359  149857  -   tig00002442:1-575628    575628  52166   55689   155 3523    0   tp:A:S  cm:i:10 s1:i:149    dv:f:0.1729 rl:i:52972
tig00000287:1-194313    194313  141163  144079  -   tig00002442:1-575628    575628  52747   55689   155 2942    0   tp:A:S  cm:i:10 s1:i:148    dv:f:0.1627 rl:i:52972
tig00000287:1-194313    194313  136559  139462  -   tig00002442:1-575628    575628  52166   55095   131 2929    0   tp:A:S  cm:i:8  s1:i:126    dv:f:0.1737 rl:i:52972
tig00000287:1-194313    194313  153293  156210  -   tig00002442:1-575628    575628  52747   55689   131 2942    0   tp:A:S  cm:i:8  s1:i:125    dv:f:0.1758 rl:i:52972
tig00000287:1-194313    194313  147513  149857  -   tig00002442:1-575628    575628  52747   55107   124 2360    0   tp:A:S  cm:i:8  s1:i:121    dv:f:0.1639 rl:i:52972
tig00000287:1-194313    194313  153882  156210  -   tig00002442:1-575628    575628  53329   55677   100 2348    0   tp:A:S  cm:i:6  s1:i:95 dv:f:0.1793 rl:i:52972
tig00000287:1-194313    194313  136559  138884  -   tig00002442:1-575628    575628  52166   54512   100 2346    0   tp:A:S  cm:i:6  s1:i:95 dv:f:0.1769 rl:i:52972
tig00000287:1-194313    194313  154448  156210  -   tig00002442:1-575628    575628  53912   55689   81  1777    0   tp:A:S  cm:i:5  s1:i:78 dv:f:0.1746 rl:i:52972
tig00000287:1-194313    194313  136559  138307  -   tig00002442:1-575628    575628  52166   53931   69  1765    0   tp:A:S  cm:i:4  s1:i:66 dv:f:0.1840 rl:i:52972
tig00000287:1-194313    194313  155037  156210  -   tig00002442:1-575628    575628  54493   55677   50  1184    0   tp:A:S  cm:i:3  s1:i:47 dv:f:0.1807 rl:i:52972
tig00002442:1-575628    575628  33058   51370   -   tig00000287:1-194313    194313  109300  127490  12707   18322   0   tp:A:S  cm:i:1083   s1:i:12702  dv:f:0.0203 rl:i:53209
tig00002442:1-575628    575628  33058   51581   -   tig00000287:1-194313    194313  112590  130985  12026   18540   0   tp:A:S  cm:i:1009   s1:i:12014  dv:f:0.0248 rl:i:53209
tig00002442:1-575628    575628  33058   51581   -   tig00000287:1-194313    194313  116096  134453  11156   18548   0   tp:A:S  cm:i:922    s1:i:11132  dv:f:0.0295 rl:i:53209
tig00002442:1-575628    575628  33058   47847   -   tig00000287:1-194313    194313  109300  124014  10525   14799   0   tp:A:S  cm:i:903    s1:i:10521  dv:f:0.0186 rl:i:53209
tig00002442:1-575628    575628  36370   51581   -   tig00000287:1-194313    194313  119600  134657  8723    15234   0   tp:A:S  cm:i:707    s1:i:8702   dv:f:0.0332 rl:i:53209
tig00002442:1-575628    575628  33058   44332   -   tig00000287:1-194313    194313  109300  120526  8161    11279   0   tp:A:S  cm:i:702    s1:i:8158   dv:f:0.0176 rl:i:53209
tig00002442:1-575628    575628  39887   51581   -   tig00000287:1-194313    194313  123096  134657  6214    11716   0   tp:A:S  cm:i:491    s1:i:6195   dv:f:0.0385 rl:i:53209
tig00002442:1-575628    575628  33058   40814   -   tig00000287:1-194313    194313  109300  117026  5595    7764    0   tp:A:S  cm:i:475    s1:i:5594   dv:f:0.0187 rl:i:53209
tig00002442:1-575628    575628  43403   51581   -   tig00000287:1-194313    194313  126566  134657  4029    8198    0   tp:A:S  cm:i:308    s1:i:4013   dv:f:0.0445 rl:i:53209
tig00002442:1-575628    575628  33058   37299   -   tig00000287:1-194313    194313  109300  113519  2953    4245    0   tp:A:S  cm:i:247    s1:i:2952   dv:f:0.0214 rl:i:53209
tig00002442:1-575628    575628  46919   51521   -   tig00000287:1-194313    194313  130131  134657  2108    4609    0   tp:A:S  cm:i:159    s1:i:2094   dv:f:0.0492 rl:i:53209
tig00002442:1-575628    575628  605 1616    -   tig00000287:1-194313    194313  156569  157579  599 1011    0   tp:A:S  cm:i:51 s1:i:599    dv:f:0.0128 rl:i:53209
tig00002442:1-575628    575628  1727    2737    -   tig00000287:1-194313    194313  156569  157579  588 1011    0   tp:A:S  cm:i:50 s1:i:588    dv:f:0.0138 rl:i:53209
tig00002442:1-575628    575628  33058   33783   -   tig00000287:1-194313    194313  109300  110015  457 725 0   tp:A:S  cm:i:38 s1:i:456    dv:f:0.0291 rl:i:53209
tig00002442:1-575628    575628  50440   51521   -   tig00000287:1-194313    194313  133616  134657  402 1081    0   tp:A:S  cm:i:29 s1:i:393    dv:f:0.0641 rl:i:53209
tig00002442:1-575628    575628  57953   59035   -   tig00000287:1-194313    194313  48402   49502   362 1102    0   tp:A:S  cm:i:31 s1:i:357    dv:f:0.0712 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  138276  141770  217 3523    0   tp:A:S  cm:i:14 s1:i:211    dv:f:0.1649 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  150981  154479  205 3523    0   tp:A:S  cm:i:13 s1:i:201    dv:f:0.1688 rl:i:53209
tig00002442:1-575628    575628  52166   55677   -   tig00000287:1-194313    194313  137710  141194  205 3511    0   tp:A:S  cm:i:13 s1:i:198    dv:f:0.1686 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  151559  155056  193 3523    0   tp:A:S  cm:i:12 s1:i:187    dv:f:0.1730 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  149826  153324  186 3523    0   tp:A:S  cm:i:12 s1:i:182    dv:f:0.1730 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  145203  148700  186 3523    0   tp:A:S  cm:i:12 s1:i:181    dv:f:0.1730 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  142892  146390  186 3523    0   tp:A:S  cm:i:12 s1:i:181    dv:f:0.1730 rl:i:53209
tig00002442:1-575628    575628  52747   55689   -   tig00000287:1-194313    194313  138853  141770  186 2942    0   tp:A:S  cm:i:12 s1:i:179    dv:f:0.1635 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  140586  144079  186 3523    0   tp:A:S  cm:i:12 s1:i:179    dv:f:0.1730 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  140008  143501  186 3523    0   tp:A:S  cm:i:12 s1:i:179    dv:f:0.1730 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  139431  142923  186 3523    0   tp:A:S  cm:i:12 s1:i:179    dv:f:0.1730 rl:i:53209
tig00002442:1-575628    575628  52166   55107   -   tig00000287:1-194313    194313  150981  153901  174 2941    0   tp:A:S  cm:i:11 s1:i:170    dv:f:0.1683 rl:i:53209
tig00002442:1-575628    575628  52166   55095   -   tig00000287:1-194313    194313  137710  140617  174 2929    0   tp:A:S  cm:i:11 s1:i:170    dv:f:0.1681 rl:i:53209
tig00002442:1-575628    575628  52747   55689   -   tig00000287:1-194313    194313  152137  155056  162 2942    0   tp:A:S  cm:i:10 s1:i:157    dv:f:0.1731 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  152715  156210  162 3523    0   tp:A:S  cm:i:10 s1:i:155    dv:f:0.1826 rl:i:53209
tig00002442:1-575628    575628  52166   55677   -   tig00000287:1-194313    194313  136559  140039  162 3511    0   tp:A:S  cm:i:10 s1:i:155    dv:f:0.1824 rl:i:53209
tig00002442:1-575628    575628  52166   55107   -   tig00000287:1-194313    194313  142892  145812  155 2941    0   tp:A:S  cm:i:10 s1:i:152    dv:f:0.1733 rl:i:53209
tig00002442:1-575628    575628  52166   55107   -   tig00000287:1-194313    194313  149826  152746  155 2941    0   tp:A:S  cm:i:10 s1:i:151    dv:f:0.1733 rl:i:53209
tig00002442:1-575628    575628  52166   55107   -   tig00000287:1-194313    194313  145203  148122  155 2941    0   tp:A:S  cm:i:10 s1:i:151    dv:f:0.1733 rl:i:53209
tig00002442:1-575628    575628  52747   55689   -   tig00000287:1-194313    194313  145781  148700  155 2942    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1731 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  148091  151590  155 3523    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1826 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  148669  152168  155 3523    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1826 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  144048  147544  155 3523    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1826 rl:i:53209
tig00002442:1-575628    575628  52747   55689   -   tig00000287:1-194313    194313  143470  146390  155 2942    0   tp:A:S  cm:i:10 s1:i:150    dv:f:0.1731 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  141739  145234  155 3523    0   tp:A:S  cm:i:10 s1:i:149    dv:f:0.1826 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  147513  151012  155 3523    0   tp:A:S  cm:i:10 s1:i:149    dv:f:0.1826 rl:i:53209
tig00002442:1-575628    575628  52166   55689   -   tig00000287:1-194313    194313  146359  149857  155 3523    0   tp:A:S  cm:i:10 s1:i:149    dv:f:0.1826 rl:i:53209
tig00002442:1-575628    575628  52747   55689   -   tig00000287:1-194313    194313  141163  144079  155 2942    0   tp:A:S  cm:i:10 s1:i:148    dv:f:0.1731 rl:i:53209
tig00002442:1-575628    575628  52166   55095   -   tig00000287:1-194313    194313  136559  139462  131 2929    0   tp:A:S  cm:i:8  s1:i:126    dv:f:0.1848 rl:i:53209
tig00002442:1-575628    575628  52747   55689   -   tig00000287:1-194313    194313  153293  156210  131 2942    0   tp:A:S  cm:i:8  s1:i:125    dv:f:0.1848 rl:i:53209
tig00002442:1-575628    575628  52747   55107   -   tig00000287:1-194313    194313  147513  149857  124 2360    0   tp:A:S  cm:i:8  s1:i:121    dv:f:0.1735 rl:i:53209
tig00002442:1-575628    575628  53329   55677   -   tig00000287:1-194313    194313  153882  156210  100 2348    0   tp:A:S  cm:i:6  s1:i:95 dv:f:0.1884 rl:i:53209
tig00002442:1-575628    575628  52166   54512   -   tig00000287:1-194313    194313  136559  138884  100 2346    0   tp:A:S  cm:i:6  s1:i:95 dv:f:0.1884 rl:i:53209
tig00002442:1-575628    575628  53912   55689   -   tig00000287:1-194313    194313  154448  156210  81  1777    0   tp:A:S  cm:i:5  s1:i:78 dv:f:0.1834 rl:i:53209
tig00002442:1-575628    575628  52166   53931   -   tig00000287:1-194313    194313  136559  138307  69  1765    0   tp:A:S  cm:i:4  s1:i:66 dv:f:0.1945 rl:i:53209
tig00002442:1-575628    575628  54493   55677   -   tig00000287:1-194313    194313  155037  156210  50  1184    0   tp:A:S  cm:i:3  s1:i:47 dv:f:0.1886 rl:i:53209
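
To see which parts of tig00000287 these alignments touch, the query intervals can be merged. The following Python sketch reads the first twelve standard PAF columns and merges overlapping query ranges. It uses four of the records above as sample input, and the helper names are hypothetical. It only summarises the alignments and does not reproduce purge_dups' chaining.

```python
from typing import List, Tuple

SAMPLE_PAF = """\
tig00000287:1-194313 194313 109300 127490 - tig00002442:1-575628 575628 33058 51370 12707 18322 0 tp:A:S
tig00000287:1-194313 194313 112590 130985 - tig00002442:1-575628 575628 33058 51581 12026 18540 0 tp:A:S
tig00000287:1-194313 194313 156569 157579 - tig00002442:1-575628 575628 605 1616 599 1011 0 tp:A:S
tig00000287:1-194313 194313 48402 49502 - tig00002442:1-575628 575628 57953 59035 362 1102 0 tp:A:S
"""


def query_interval(line: str) -> Tuple[int, int]:
    """Return (query start, query end) from one PAF record."""
    fields = line.split()
    # PAF columns 3 and 4 hold the query start and end.
    return int(fields[2]), int(fields[3])


def merge_intervals(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching intervals into a sorted list."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


intervals = [query_interval(line) for line in SAMPLE_PAF.splitlines() if line]
blocks = merge_intervals(intervals)

for start, end in blocks:
    print(start, end, end - start)
```

Run over the full listing rather than this four-line sample, the same merge would show how much of the 109300 to 157579 region is covered by aligned blocks and how much by the gaps between them.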

Can you speak to what is happening here? Also, do you have a sense of how the order of the input contigs might affect things? In the example I've provided here, I did not assign any particular order. I tried again after sorting in descending order based on contig length and got slightly different results: more contigs, more gaps across more contigs. Do you think this is due to the order of the input contigs or simply a function of inherent variability in the decision-making process or something else?

pickettbd commented 3 years ago

Hi Dengfeng,

I noticed that commit cb3721f4 provides the -e option, which would eliminate this problem for the user. Also, commit 28a60101 adds an FAQ about this subject, which is helpful. I just wanted to point out for other people experiencing the unexpected behavior of removing the middle of contigs that they can download the latest commit instead of the latest release and use the -e flag.

As I took a closer look at your commit messages, I noticed that you have been updating the minor and patch level values and specifying the version in your commit message. I should have realized you had been doing that for a while now, and I should have read through your full commit messages instead of just assuming the latest release would be the best option. I suspect other people would benefit from having an official release/tag added to changes that are made, especially if they provide new functionality that fixes issues raised on GitHub. It would be particularly unfortunate if someone didn't realize that contig midsections were removed as that could adversely affect downstream analysis if not planned for. If I may make a friendly request, would you make a tag for at least one commit that has been pushed since these changes were implemented?

I think this tool is awesome. Thank you for your time developing and supporting it!