LiJiefei / codesearch

Automatically exported from code.google.com/p/codesearch
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

regex->trigram has wrong behaviour when trigrams are question-marked #17

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
bash$ csearch -verbose "foo_(bar_)?" >/dev/null
2012/03/05 10:44:37 query: "_ba" "foo" "o_b" "oo_"
2012/03/05 10:44:37 post query identified 11 possible files
bash$ csearch -verbose "foo_b(ar_)?" >/dev/null
2012/03/05 10:44:44 query: "foo" "o_b" "oo_"
2012/03/05 10:44:44 post query identified 12 possible files

In the first example, "_ba" and "o_b"' should not be required trigrams for the 
file.  In the second example, "ar_" is correctly _not_ included in the list of 
required trigrams.

I'll take a look at this later tonight.  It looks like an issue with precedence 
in index/regexp.go.  (I.e., the question mark is only applying to the final 
trigram, and not all trigrams included in the grouping. ).

Original issue reported on code.google.com by dgryski on 5 Mar 2012 at 10:07

GoogleCodeExporter commented 9 years ago
A bit of investigation has narrowed my focus to concat(x,y)'s handling when x 
or y has canEmpty=true and a prefix/suffix list rather than a list of exact 
matches.  The hunt continues...

Original comment by dgryski on 5 Mar 2012 at 6:29

GoogleCodeExporter commented 9 years ago
I'm stumped.  I think this bug is beyond my current familiarity with the 
codebase.  I'm certain the issue is with concat(), but after many hours of 
staring at this I really have no idea what the fix should look like.

Original comment by dgryski on 6 Mar 2012 at 8:40

GoogleCodeExporter commented 9 years ago
So, it looks like this particular issue is actually in alternate().  The code 
assumes that either x and y are both exact or both have prefix/suffix lists.  
This obviously isn't the case.

The above bug is caused by the fact that emptyString() has an exact match but 
no prefix/suffix lists, so when being combined we were unioning with a nil 
list.  This didn't crash, it just didn't have the expected behaviour.

This also fixes the case where, when searching for '(123|4567)q' , we would get 
"67q" as the only required trigram, instead of the correct ("23q"|"67q").

Original comment by dgryski on 7 Mar 2012 at 9:37

Attachments:

GoogleCodeExporter commented 9 years ago
After writing that last sentence, I realized that my "corrected" trigrams were 
still missing some.  My implementation of alternate ignored the exact matches.

We now call addExact() before combining the matches.  This allows the (this 
time for sure ;) correct trigram set: 
    ("123")|("456" "567")  ("23q"|"67q")
to be generated.

Original comment by dgryski on 7 Mar 2012 at 11:07

Attachments:

GoogleCodeExporter commented 9 years ago
This issue was closed by revision 56b76ffbf8bb.

Original comment by dgryski on 2 May 2012 at 8:26