fclaude / libcds

Compact Data Structures Library
http://libcds.recoded.cl
GNU Lesser General Public License v2.1
124 stars 24 forks source link

Incorrect suffix array range returned by SuffixTreeY.Child() method #2

Closed skuruppu closed 13 years ago

skuruppu commented 13 years ago

I constructed a suffix tree for the string in the following file www.csse.unimelb.edu.au/~kuruppu/testdata/REF.raw.gz

Then I was searching for the pattern 'tctcactaagata', repeatedly using the Child() method, and when it reached the last symbol, the suffix array range was [9879882, 9879885]. The correct suffix array range for this pattern is [9879882, 9879892]. I looked at how the Child() method was implemented and narrowed it down to the call to FChild() as the problem. FChild() should've returned the range [9879882, 9879892] but it returns [9879882, 9879885]. I didn't understand how the npr->find_RMQ() method was implemented to figure out what went wrong.

I constructed the suffix tree as follows:

SuffixTreeY *st = new SuffixTreeY(sequence, seqlen, DAC, CN_NPR, 32);

And I searched for the pattern as follows:

char p[] = "tctcactaagata";
size_t pl, pr, cl, cr;

st->Root(&pl, &pr);
size_t i = 0;
while (p[i])
{
    st->Child(pl, pr, p[i], &cl, &cr);
    if (cl == (size_t)-1 || cr == (size_t)-1)
        break;
    pl = cl;
    pr = cr;
    i++;
}

I hope this gives you enough information.

rcanovas commented 13 years ago

Maybe is just this problem. Child() method receive an unsigned char not a char value.

On Thu, Aug 11, 2011 at 5:52 PM, skuruppu < reply@reply.github.com>wrote:

I constructed a suffix tree for the string in the following file www.csse.unimelb.edu.au/~kuruppu/testdata/REF.raw.gz

Then I was searching for the pattern 'tctcactaagata', repeatedly using the Child() method, and when it reached the last symbol, the suffix array range was [9879882, 9879885]. The correct suffix array range for this pattern is [9879882, 9879892]. I looked at how the Child() method was implemented and narrowed it down to the call to FChild() as the problem. FChild() should've returned the range [9879882, 9879892] but it returns [9879882, 9879885]. I didn't understand how the npr->find_RMQ() method was implemented to figure out what went wrong.

I constructed the suffix tree as follows:

SuffixTreeY *st = new SuffixTreeY(sequence, seqlen, DAC, CN_NPR, 32);

And I searched for the pattern as follows:

char p[] = "tctcactaagata"; size_t pl, pr, cl, cr;

st->Root(&pl, &pr); size_t i = 0; while (p[i]) { st->Child(pl, pr, p[i], &cl, &cr); if (cl == (size_t)-1 || cr == (size_t)-1) break; pl = cl; pr = cr; i++; }

I hope this gives you enough information.

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2

skuruppu commented 13 years ago

Sorry, I didn't write it here, but I do cast the char to an unsigned char before I send it to Child(). I called it like this:

st->Child(pl, pr, (unsigned char)p[i], &cl, &cr);

So I don't think that's the problem.

2011/8/12 rcanovas < reply@reply.github.com>

Maybe is just this problem. Child() method receive an unsigned char not a char value.

On Thu, Aug 11, 2011 at 5:52 PM, skuruppu < reply@reply.github.com>wrote:

I constructed a suffix tree for the string in the following file www.csse.unimelb.edu.au/~kuruppu/testdata/REF.raw.gz

Then I was searching for the pattern 'tctcactaagata', repeatedly using the Child() method, and when it reached the last symbol, the suffix array range was [9879882, 9879885]. The correct suffix array range for this pattern is [9879882, 9879892]. I looked at how the Child() method was implemented and narrowed it down to the call to FChild() as the problem. FChild() should've returned the range [9879882, 9879892] but it returns [9879882, 9879885]. I didn't understand how the npr->find_RMQ() method was implemented to figure out what went wrong.

I constructed the suffix tree as follows:

SuffixTreeY *st = new SuffixTreeY(sequence, seqlen, DAC, CN_NPR, 32);

And I searched for the pattern as follows:

char p[] = "tctcactaagata"; size_t pl, pr, cl, cr;

st->Root(&pl, &pr); size_t i = 0; while (p[i]) { st->Child(pl, pr, p[i], &cl, &cr); if (cl == (size_t)-1 || cr == (size_t)-1) break; pl = cl; pr = cr; i++; }

I hope this gives you enough information.

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2#issuecomment-1786860

rcanovas commented 13 years ago

Ok I finally find what is the problem. The result that your are getting are not bad, the problem is that what you are doing is not what you need. What the method child return is the child of the node [vl,vr] which label path start with the char a * (Child(vl, vr, a, child_l, *child_r)) note that in your case the path label between [9879881, 9879892] and [9879882, 9879892] is of length "len>1", so when so call the method Child(9879882, 9879892, 'a', ...) is actually looking for the child wich label path start with 'a', (so we are looking for this path label "tctcactaagat" + (len-1)chars + 'a').

Sorry if is not too clear. Feel free to come to my office and I could explained to you better =).

On Fri, Aug 12, 2011 at 9:54 AM, skuruppu < reply@reply.github.com>wrote:

Sorry, I didn't write it here, but I do cast the char to an unsigned char before I send it to Child(). I called it like this:

st->Child(pl, pr, (unsigned char)p[i], &cl, &cr);

So I don't think that's the problem.

2011/8/12 rcanovas < reply@reply.github.com>

Maybe is just this problem. Child() method receive an unsigned char not a char value.

On Thu, Aug 11, 2011 at 5:52 PM, skuruppu < reply@reply.github.com>wrote:

I constructed a suffix tree for the string in the following file www.csse.unimelb.edu.au/~kuruppu/testdata/REF.raw.gz

Then I was searching for the pattern 'tctcactaagata', repeatedly using the Child() method, and when it reached the last symbol, the suffix array range was [9879882, 9879885]. The correct suffix array range for this pattern is [9879882, 9879892]. I looked at how the Child() method was implemented and narrowed it down to the call to FChild() as the problem. FChild() should've returned the range [9879882, 9879892] but it returns [9879882, 9879885]. I didn't understand how the npr->find_RMQ() method was implemented to figure out what went wrong.

I constructed the suffix tree as follows:

SuffixTreeY *st = new SuffixTreeY(sequence, seqlen, DAC, CN_NPR, 32);

And I searched for the pattern as follows:

char p[] = "tctcactaagata"; size_t pl, pr, cl, cr;

st->Root(&pl, &pr); size_t i = 0; while (p[i]) { st->Child(pl, pr, p[i], &cl, &cr); if (cl == (size_t)-1 || cr == (size_t)-1) break; pl = cl; pr = cr; i++; }

I hope this gives you enough information.

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2#issuecomment-1786860

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2#issuecomment-1787104

skuruppu commented 13 years ago

Ah yes, I get it. It's because when you take a branch in a suffix tree, that branch may represent a string rather than just a single symbol. I've fixed my code to use SDepth() to check if I need to get a new child node or to search for a string within the node that I'm currently in. Now I get the correct suffix array range :)

Thanks very much for your help and sorry for the trouble.

2011/8/12 rcanovas < reply@reply.github.com>

Ok I finally find what is the problem. The result that your are getting are not bad, the problem is that what you are doing is not what you need. What the method child return is the child of the node [vl,vr] which label path start with the char a * (Child(vl, vr, a, child_l, *child_r)) note that in your case the path label between [9879881, 9879892] and [9879882, 9879892] is of length "len>1", so when so call the method Child(9879882, 9879892, 'a', ...) is actually looking for the child wich label path start with 'a', (so we are looking for this path label "tctcactaagat" + (len-1)chars + 'a').

Sorry if is not too clear. Feel free to come to my office and I could explained to you better =).

On Fri, Aug 12, 2011 at 9:54 AM, skuruppu < reply@reply.github.com>wrote:

Sorry, I didn't write it here, but I do cast the char to an unsigned char before I send it to Child(). I called it like this:

st->Child(pl, pr, (unsigned char)p[i], &cl, &cr);

So I don't think that's the problem.

2011/8/12 rcanovas < reply@reply.github.com>

Maybe is just this problem. Child() method receive an unsigned char not a char value.

On Thu, Aug 11, 2011 at 5:52 PM, skuruppu < reply@reply.github.com>wrote:

I constructed a suffix tree for the string in the following file www.csse.unimelb.edu.au/~kuruppu/testdata/REF.raw.gz

Then I was searching for the pattern 'tctcactaagata', repeatedly using the Child() method, and when it reached the last symbol, the suffix array range was [9879882, 9879885]. The correct suffix array range for this pattern is [9879882, 9879892]. I looked at how the Child() method was implemented and narrowed it down to the call to FChild() as the problem. FChild() should've returned the range [9879882, 9879892] but it returns [9879882, 9879885]. I didn't understand how the npr->find_RMQ() method was implemented to figure out what went wrong.

I constructed the suffix tree as follows:

SuffixTreeY *st = new SuffixTreeY(sequence, seqlen, DAC, CN_NPR, 32);

And I searched for the pattern as follows:

char p[] = "tctcactaagata"; size_t pl, pr, cl, cr;

st->Root(&pl, &pr); size_t i = 0; while (p[i]) { st->Child(pl, pr, p[i], &cl, &cr); if (cl == (size_t)-1 || cr == (size_t)-1) break; pl = cl; pr = cr; i++; }

I hope this gives you enough information.

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2#issuecomment-1786860

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2#issuecomment-1787104

Reply to this email directly or view it on GitHub: https://github.com/fclaude/libcds/issues/2#issuecomment-1787978