dotnetbio / bio

Bioinformatics library for .NET
Apache License 2.0
143 stars 49 forks source link

BoyerMoore Updated with example #24

Open agento2 opened 7 years ago

agento2 commented 7 years ago

In testing pattern searches with BoyerMoore i can't get it to return any matches until the search string starts with a *. This also results in position 0 always being the match location. I'm going to try and do further testing but ever the simple example from the cookbook doesn't behave as i would expect.

evolvedmicrobe commented 7 years ago

Eck, that's no good. I haven't used that aspect of the library, will see if I can recreate your example

evolvedmicrobe commented 7 years ago

I didn't dig into the algorithm, but could verify that I didn't need the search string to start with a * or that it always matches at the start. See these examples:

        ISequence sequence = new Sequence (Alphabets.AmbiguousDNA, "GCTAGGTAGCTCAAAAAAGGG");
        IPatternFinder searcher = new BoyerMoore () {
            IgnoreCase = true
        };

        foreach (var pos in searcher.FindMatch (sequence, "GCTA")) {
            Console.WriteLine ("Found Match at: {0}", pos); //0 correct 
        }

        foreach (var pos in searcher.FindMatch (sequence, "GCTCA")) {
            Console.WriteLine ("Found Match at: {0}", pos); // 8 correct
        }
evolvedmicrobe commented 7 years ago

Okay, I can't reproduce this problem after trying a few times so will close this. Feel free to reopen if you can upload an example though.

The wildcard indicates not just one character but 0-infinity of them, so if your search string is within the sequence, you will always have a 0 returned if you lead off with a *, as everything from the start of the sequence on is part of the pattern match.

agento2 commented 7 years ago

I can't reopen, but I did get your example to work but then poking at it since my code still wasn't working i found this

ISequence sequence = new Sequence(Alphabets.AmbiguousDNA, "gcTAGGTAGCTCAAAAAAGGG");
            searcher = new BoyerMoore()
            {
                IgnoreCase = true
            };

            foreach (var pos in searcher.FindMatch(sequence, "GCTA"))
            {
                Console.WriteLine("Found Match at: {0}", pos); //0 correct // Now no match
            }

            foreach (var pos in searcher.FindMatch(sequence, "gCTCA*GGG"))
            {
                Console.WriteLine("Found Match at: {0}", pos); // 8 correct
            }

appears there's some issue with the sequence constructor having lower case letters in it. It is really kind of weird the first one stops matching but the second one is fine. If I make the sequence with a toLower on your string neither matches anymore. And if I toUpper mine it matches. So is this just something that needs documented or is there actually an issue.

agento2 commented 7 years ago

here's an example where it fails. I generated 2500bp of random then a couple short bits and inserted them and it will find the individuals but the two combined with a wild card in the middle fails.

namespace matchTest
{
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    using Bio;
    using Bio.Algorithms.StringSearch;
    using Bio.Extensions;

    class Program
    {
        static void Main(string[] args)
        {
            var sample = new Sequence(Alphabets.DNA, Seq.ToUpper());
            IPatternFinder searcher = new BoyerMoore()
            {
                IgnoreCase = true
            };
            var boyerMatches = searcher.FindMatch(sample, search1);
            foreach (var pos in boyerMatches)
            {
                Console.WriteLine("Found Match at: {0}", pos); // 11
            }
            Console.WriteLine("end 1");

            boyerMatches = searcher.FindMatch(sample, search2);
            foreach (var pos in boyerMatches)
            {
                Console.WriteLine("Found Match at: {0}", pos); // 36
            }
            Console.WriteLine("end 2");

            boyerMatches = searcher.FindMatch(sample, string.Format("{0}*{1}", search1, search2)); 
            foreach (var pos in boyerMatches)
            {
                Console.WriteLine("Found Match at: {0}", pos); //Expect 11 get nothing
            }
            Console.WriteLine("end combo");

            Console.ReadLine();
        }

        private static string search1 = "ATGTCCCACGTGAAACGTTGCT";
        private static string search2 = "AAACCCTCAGGTTTCTGAGCGACA";

        private static string Seq =
            "AAGCTTTAAACATGTCCCACGTGAAACGTTGCTGGGAAACCCTCAGGTTTCTGAGCGACAAGTTCGCGCTCATAACTTGGTCCGAATGCGGGTTCTTGCATCGTTCGACTGAGTTTGTTTCATGTAGAACGGGCGCAAAGTATACTTAGTTCAATCTTCAATACCTCGTATCATTGTACACCTGCCGGTCACCACCCAACGATGTGGGGACGGCGTTGCAACTTCGAGGACCTAATCTGACCGACCTAGATTCGGCACTGTGGGCAATATGAGGTATTGGCAGACACCCAGTGCCGAACAACACCTGACCTAACGGTAAGAGAGTCTCATAATGCGTCCGGCCGCGTGCCCAGGGTATATTTGGACAGTATCGAATGGACTGAGATGAACCTTTACACCGATCCGGAAACGGGTGCGTGGATTAGCCAGGAGCAAACGAAAAATCCTGGGCTACTTGATGTCTTGTGACGTTCTTAGAGATGGACGAAATGTTTCACGACCTAGGATAAGGTCGCCCTACAAAATAGATTTGTGCTACTCTCCTCATAAGCAGTCCGGTGTATCGAAAGAACAAGACTAGCCTTGCTAGCAACCGCGGGCTGGGGGGCTAAGGTATCACTCAAGAAGCAGGCTCGGTAACATACGGTCTAGCTATCTGACTATCGCCTACGTCATATAGGGACCTATGATATCTGCGTGTCCAACCTTAGAATTCACTTCAGCGCGCAGGTTTGGGTCGAGATAAAATCACCAGTACCCAAGACCACGGGCGCTCGGCGCCTTGGCTAATCCCGGTACATGTTGTTATAAATAATCAGTAGAAAATCTGTGTTAGAGGGTCGAGTCACCATATATCAAGAACGATATTAATCGGTGGGAGTATTCAACGTGATGAAGACGCTGGGTTTACGTGGGAATGGTGCTTCTGTCCTAACAGGCTAGGGTATAATGCTGAAACCGTCCCCCAAGCGTTCAGGGTGGGGTTTGCTACGACTTCCGAGTCCAAAGTGTCCGTGTTTTTGATATATACGCTCAAGGGCGAGAATTGGACCTGGCTTACGTCTTAGTACGTAGCATGGTGACACAAGCACAGTAGATCCTGCCCGCGTTTCCTATATATTAAGTTAAATCTTATGGAATATAATAACATGTGGATGGCCAGTGGTCGGTTGTTACACGCCTACCGCAATGCTGAAAGACCCGGACTAGAGTGGCGAGATCTATGGCGTGTGACCCGTTATGCTCCATTTCGGTCAGTGGGTCACAGCTAGTTGTGGATTGGATTGCCATTCTCCGAGTGTTTTAGCGTGACAGCCGCAGGGATCCCATAAAATGCAATCGTAGTCCACCTGATCGTACTTAGAAATGAGGGTCCGCTTTTGCCCACGCACCTGATCGCTCCTCGTTTGCTTTTAAGAACCGGACGAACCACAGAGCATAAGGAGAACCTCTAGCTGCTTTACAAAGTACTGGTTCCCTTTCCAGCGGGATGCTTTATCTAAACGCAATGAGAGAGGTATTCCTCAGGCCACATCGCTTCCTAGTTCCGCTGGGATCCATCGTTGGCGGCCGAAGCCGCCATTCCATAGTGAGTTCTTCGTCTGTGTCATTCTGTGCCAGATCGTCTGGCAAATAGCCGATCCAGTTTATCTCTCGAAACTATAGTCGTACAGATCGAAATCTTAAGTCAAATCACGCGACTAGACTCAGCTCTATTTTAGTGGTCATGGGTTTTGGTCCCCCCGAGCGGTGCAACCGATTAGGACCATGTAGAACATTAGTTATAAGTCTTCTTTTAAACACAATCTTCCTGCTCAGTGGTACATGGTTATCGTTATTGCTAGCCAGCCTGATAAGTAACACCACCACTGCGACCCTAATGCGCCCTTTCCACGAACACAGGGCTGTCCGATCCTATATTACGACTCCGGGAAGGGGTTCGCAAGTCGCACCCTAAACGATGTTGAAGGCTCAGGATGTACACGCACTAGTACAATACATACGTGTTCCGGCTCTTATCCTGCATCGGAAGCTCAATCATGCATCGCACCAGCGTGTTCGTGTCATCTAGGAGGGGCGCGTAGGATAAATAATTCAATTAAGATATCGTTATGCTAGTATACGCCTACCCGTCACCGGCCAACAGTGTGCAGATGGCGCCACGAGTTACTGGCCCTGATTTCTCCGCTTCTAATACCGCACACTGGGCAATACGAGCTCAAGCCAGTCTCGCAGTAACGCTCATCAGCTAACGAAAGAGTTAGAGGCTCGCTAAATCGCACTGTCGGGGTCCCTTGGGTATTTTACACTAGCGTCAGGTAGGCTAGCATGTGTCTTTCCTTCCAGGGGTATGCGGGTGCGTGGACAAATGAGCAGCATACGTATTTACTCGGCGTGCTTGGTCTCTCGTATTTCTCCTGGAGATCAAGGAAATGTTTCATGTCCAAGCGAAAAGCCGCTCTACGGAATGGATCTACGTTACTGCCTGCATAAGGAGACCGGTGTAGCCAAGGACGAAAGCGACCCTAGGTTCTAACCGTCGACTT";
    }
}