ctSkennerton / minced

Mining CRISPRs in Environmental Datasets
GNU General Public License v3.0

High copy number repeats only found with --minNR 2 #24

Closed: iimog closed this issue 5 years ago

iimog commented 5 years ago

Running minced on a complete Prochlorococcus genome with default parameters produces no results. With --minNR 2, four arrays are reported, with repeat counts of 4, 6, 4, and 11. From my reading of the --minNR parameter, I expected all of these to be found with the default (--minNR 3) as well. I'm using minced 0.3.2 with openjdk version "1.8.0_152-release" on NixOS 18.9 with Linux 4.14.91.

$ minced CP007754.1.fasta # no results
$ minced --minNR 2 CP007754.1.fasta
Sequence 'CP007754.1' (1929203 bp)

CRISPR 1   Range: 16844 - 17082
POSITION    REPEAT              SPACER
--------    ----------------------- -------------------------------------------------
16844       GGCTATGGCGGTGGCGGTCAAGG CGGCTACGGCGGTGGCGGTCAAGGCGGCTACGGCGGTGGCGGTCAAGGC   [ 23, 49 ]
16916       GGCTATGGCGGTGGCGGTCAAGG TGGCTACGGCGGTGGCGGTCAAGGTGGCTACGGCGGTGGCGGTCAAGGT   [ 23, 49 ]
16988       GGCTACGGCGGTGGCGGTCAAGG CGGCTACGGCGGTGGCGGTCAAGGTGGCTACGGCGGTGGCGGTCAAGGC   [ 23, 49 ]
17060       GGCTACGGCGGTGGCGGCTATGG 
--------    ----------------------- -------------------------------------------------
Repeats: 4  Average Length: 23      Average Length: 49

CRISPR 2   Range: 1305091 - 1305440
POSITION    REPEAT              SPACER
--------    --------------------------- -------------------------------------
1305091     GAGGTAATTCCTGAACCTGAGGTAATT CCTGAACCTGAGGTAATTTCTGAACCTGAGGCAACTCCTGAACCT   [ 27, 45 ]
1305163     GAGGCAACTCCTGAACCTGAGGCAACT CCTGAACCTGAGGCAACTCCTGAACCT [ 27, 27 ]
1305217     GAGGCAACTCCTGAACCTGAGGCAACT CCTGAACCTGAGGAACTCCTGAACCTGAGGTAACTTCTGAACCTGAGGTAACTTCTGAACCT  [ 27, 62 ]
1305306     GAGGCAACTCCTGAACCTGAGGCAACT CCTGAACCTGAGGTAATTTCTGAACCT [ 27, 27 ]
1305360     GAGGCAACTCCTGAACCTGAGGTAATT TCTGAACCTGAGGCAACTCCTGAACCT [ 27, 27 ]
1305414     GAGGTAATGCCTGAACCTGAGGTAATT 
--------    --------------------------- -------------------------------------
Repeats: 6  Average Length: 27      Average Length: 37

CRISPR 3   Range: 1372942 - 1373205
POSITION    REPEAT              SPACER
--------    ---------------------------------   --------------------------------------------
1372942     AGGCTTATCTTCCTCAATAGGCTTATCTTCCTC   AATAGGCTTATCTTCCTCAATAGGCTTATCTTCCTC    [ 33, 36 ]
1373011     AGGCTTATCTTCCTCAACAGGCTTATCTTCCTC   AACAGGCTTATCTTCCTCAACAGGCTTATCTTCCTCAAC [ 33, 39 ]
1373083     AGGCTTATCTTCCTCAACAGGCTTATCTTCCTC   AACAGGCTTATCTTCCTCAACAGGCTTATCTTCCTCAATAGGCTTATCTTCCTCAAT   [ 33, 57 ]
1373173     AGGCTTATCTTCCTCAACAGGCTTATCTTCCTC   
--------    ---------------------------------   --------------------------------------------
Repeats: 4  Average Length: 33      Average Length: 44

CRISPR 4   Range: 1666327 - 1667110
POSITION    REPEAT              SPACER
--------    ----------------------------    -----------------------------------------------
1666327     CTTTCTTCTCTGCTGCAGCTTTTTTATC    TGCTGCGGCTTTCTTATTAGCTGCGG  [ 28, 26 ]
1666381     CTTTCTTATCTGCTGCGGCTTTCTTATC    CGCTAAAGCTTTCTTCTCTGCTGCGGCTTTCTTATCTGCTGCAG    [ 28, 44 ]
1666453     CTTTTTTATCTGCTGCGGCTTTCTTATT    AGCTGCGGCTTTTTTATCTGCTGCGGCTTTCTTATCCGCTAAAG    [ 28, 44 ]
1666525     CTTTCTTATCTGCTGCGGCTTTCTTATC    TGCTGCGGCTTTCTTATCCGCTAAAG  [ 28, 26 ]
1666579     CTTTCTTATCTGCTGCGGCTTTCTTATC    CGCTAAAGCTTTCTTCTCTGCTGCGGCTTTCTTCTCTGCTGCAG    [ 28, 44 ]
1666651     CTTTTTTATCTGCTGCGGCTTTCTTATC    CGCTAAAGCTTTCTTCTCTGCTGCAGCTTTTTTATCTGCTGCGGCTTTCTTCTCTGCTGCAG  [ 28, 62 ]
1666741     CTTTTTTATCTGCTGCGGCTTTCTTATC    CGCTAAAGCTTTCTTCTCTGCTGCGGCTTTCTTATCCGCTAACG    [ 28, 44 ]
1666813     CTTTCTTCTCTGCTGCGGCTTTCTTATT    AGCTGCGGCTTTCTTATCTGCTGCAGCTTTTTTATCTGCTGCGGCTTTCTTATCTGCTGCGGCTTTCTTATCCGCTAAAG    [ 28, 80 ]
1666921     CTTTCTTCTCTGCTGCGGCTTTCTTATC    TGCTGCGGCTTTCTTATCCGCTAAAGCTTTCTTCTCTGCTGCGGCTTTCTTATCCGCTAACG  [ 28, 62 ]
1667011     CTTTCTTCTCTGCTGCGGCTTTCTTCTC    TGCTAACGCTTTCTTCTCTGCTAAAGCTTTCTTATCTGCTAAAG    [ 28, 44 ]
1667083     CTTTCTTCTCTGCAGCAGCTTTCTTATC    
--------    ----------------------------    -----------------------------------------------
Repeats: 11 Average Length: 28      Average Length: 47

Time to find repeats: 166 ms

ctSkennerton commented 5 years ago

I'm not convinced these are CRISPRs. The "spacer" sequences are suspiciously similar to each other and contain many small repeating motifs, which suggests that --minNR 2 is too permissive here. As an independent check, CRISPRdb also reports no CRISPRs. I think the large number of apparent spacers comes from a part of the algorithm that tries to find additional degenerate repeat and spacer pairs at the sides of the array; combined with the reduced number of initial repeats, this is causing some false positives.

I'm sorry for the misleading output; for best results in genomes stick with --minNR 3 or higher.
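
The observation that the "spacers" resemble each other can be checked with a short script. The sketch below is not part of minced; it is a rough sanity check, assuming Python's standard `difflib` similarity ratio as the measure. The function name `mean_pairwise_similarity` is hypothetical, and the sequences are copied from "CRISPR 1" in the output above. Genuine CRISPR spacers are normally unrelated to one another and do not contain the repeat itself, so a high mean similarity, or the repeat appearing inside the spacers, points towards a tandem repeat region instead.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Repeat from the third row of "CRISPR 1" in the minced output above.
repeat = "GGCTACGGCGGTGGCGGTCAAGG"

# The three "spacers" reported for "CRISPR 1".
spacers = [
    "CGGCTACGGCGGTGGCGGTCAAGGCGGCTACGGCGGTGGCGGTCAAGGC",
    "TGGCTACGGCGGTGGCGGTCAAGGTGGCTACGGCGGTGGCGGTCAAGGT",
    "CGGCTACGGCGGTGGCGGTCAAGGTGGCTACGGCGGTGGCGGTCAAGGC",
]


def mean_pairwise_similarity(seqs):
    """Average difflib ratio (0.0 to 1.0) over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    ratios = [
        # autojunk=False disables difflib's heuristic for very long inputs.
        SequenceMatcher(None, a, b, autojunk=False).ratio()
        for a, b in pairs
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    print("mean spacer similarity:", round(mean_pairwise_similarity(spacers), 3))
    # Count how many "spacers" contain the repeat sequence verbatim.
    hits = sum(repeat in s for s in spacers)
    print("spacers containing the repeat:", hits, "of", len(spacers))
```

Any cut-off applied to the similarity value would be a judgment call; the script is meant only to make the visual impression from the table quantifiable.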

iimog commented 5 years ago

Thanks for the quick response. I'll stick to --minNR 3 from now on. Thanks again for your nice tool and helpful response.