ctSkennerton / minced

Mining CRISPRs in Environmental Datasets

minced doesn't find a repeat if it is at the start of the fasta file #35

Open Alan-Collins opened 2 years ago

Alan-Collins commented 2 years ago

Hi,

It seems that MinCED is unable to identify a repeat when it sits right at the beginning of the FASTA sequence. Adding any nucleotide before the first repeat fixes this.
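Until the bug is fixed, the padding workaround described here could be scripted. The following is only a minimal sketch: `pad_fasta` and `unpad_position` are hypothetical helper names, not part of minced, and the choice of `A` as the padding base is an assumption taken from the attached example.

```python
def pad_fasta(fasta_text, pad="A"):
    """Return FASTA text with `pad` inserted before the first base of every record."""
    out = []
    need_pad = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header line: the next non-empty line is the start of the sequence.
            out.append(line)
            need_pad = True
        elif need_pad and line.strip():
            out.append(pad + line)
            need_pad = False
        else:
            out.append(line)
    return "".join(line + "\n" for line in out)


def unpad_position(position, pad="A"):
    """Map a 1-based position reported on the padded sequence back to the original."""
    return position - len(pad)


if __name__ == "__main__":
    print(pad_fasta(">array\ngtct\nACGT\n"), end="")
    print(unpad_position(85))
```

Positions that minced reports for the padded file are shifted by the length of the pad, so they need to be mapped back with something like `unpad_position` before being compared against the original sequence.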

Attached are two FASTA files containing the same array. Repeats are lowercase and spacers are uppercase. example_array_plusA.txt has a single A added to the beginning of the sequence.

Using minced 0.4.2 installed from bioconda on the first file (example_array.txt), only 7 repeats are found:

$ minced example_array.fna
Sequence 'array' (571 bp)

CRISPR 1   Range: 84 - 571
POSITION        REPEAT                          SPACER
--------        -------------------------------------   --------------------------------------
84              GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   TATGTGCCGTGACTTCGATGCTGAGTTCAAACAT      [ 37, 34 ]
155             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   TTACCCTGTCCGACGCTGACCTGTCCGGCGCTGATCTGTC        [ 37, 40 ]
232             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   CGCCGGCGAATCGTTCATGCTCACCCGCGCGGATT     [ 37, 35 ]
304             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   AAGCGTCTTTACGGGAGTCGTGGACGACCTGGTCCCGACC        [ 37, 40 ]
381             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   CTGACGTGATACCGACCGACATCCTCATGGCGATTCCC  [ 37, 38 ]
456             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   ACGCCGCAAAACAGTGCGCCTATAAAGACGATTTTCGTCCCG      [ 37, 42 ]
535             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC
--------        -------------------------------------   --------------------------------------
Repeats: 7      Average Length: 37              Average Length: 38

Time to find repeats: 2 ms

With a single nucleotide added before the first repeat, all 8 repeats are found:

$ minced example_array_plusA.fna
Sequence 'array' (572 bp)

CRISPR 1   Range: 2 - 572
POSITION        REPEAT                          SPACER
--------        -------------------------------------   ---------------------------------------
2               GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   AAACAATACAAACTACATCTACTGTAACACTTTCACTTGATAGCAA  [ 37, 46 ]
85              GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   TATGTGCCGTGACTTCGATGCTGAGTTCAAACAT      [ 37, 34 ]
156             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   TTACCCTGTCCGACGCTGACCTGTCCGGCGCTGATCTGTC        [ 37, 40 ]
233             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   CGCCGGCGAATCGTTCATGCTCACCCGCGCGGATT     [ 37, 35 ]
305             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   AAGCGTCTTTACGGGAGTCGTGGACGACCTGGTCCCGACC        [ 37, 40 ]
382             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   CTGACGTGATACCGACCGACATCCTCATGGCGATTCCC  [ 37, 38 ]
457             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC   ACGCCGCAAAACAGTGCGCCTATAAAGACGATTTTCGTCCCG      [ 37, 42 ]
536             GTCTCAATCCCCCTTACTCAATCGGGTCTGTCTACAC
--------        -------------------------------------   ---------------------------------------
Repeats: 8      Average Length: 37              Average Length: 39

Time to find repeats: 2 ms
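The two outputs can be cross-checked with simple position arithmetic: each repeat start plus its repeat length plus its spacer length should equal the start of the next repeat. The sketch below copies the rows from the padded run; the function names are hypothetical and only illustrate the check.

```python
# Rows copied from the padded run above: (repeat start, repeat length, spacer length).
PADDED_ROWS = [
    (2, 37, 46),
    (85, 37, 34),
    (156, 37, 40),
    (233, 37, 35),
    (305, 37, 40),
    (382, 37, 38),
    (457, 37, 42),
]
LAST_REPEAT_START = 536  # final repeat, which has no trailing spacer


def next_repeat_start(start, repeat_len, spacer_len):
    """1-based start of the repeat that follows a repeat/spacer pair."""
    return start + repeat_len + spacer_len


def is_consistent(rows, last_start):
    """True if each row's repeat and spacer lead exactly to the next reported start."""
    starts = [row[0] for row in rows[1:]] + [last_start]
    return all(next_repeat_start(*row) == nxt for row, nxt in zip(rows, starts))


print(is_consistent(PADDED_ROWS, LAST_REPEAT_START))

# Remove the one-base pad: the first repeat then starts at position 1 of the
# original sequence, and the repeat after it starts at the first position that
# minced reported for the unpadded file.
first_start = PADDED_ROWS[0][0] - 1
print(first_start, next_repeat_start(first_start, 37, 46))
```

This prints `True` and then `1 84`: the repeat missing from the first run starts at position 1, and the next repeat starts at 84, which is where the reported range of the unpadded file begins.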

Thanks! Alan

example_array.txt example_array_plusA.txt

ctSkennerton commented 2 years ago

Thank you for the bug report; this is indeed a problem. Unfortunately, I don't have time to look into this issue right now.