BackupGGCode / dataparksearch

An open source search engine for Internet and Intranet sites
GNU General Public License v2.0

Regex expressions in stopword file #22

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This is more of a question or feature request.

We have many words in our database (non-cached mode) that are irrelevant to
the search engine and we would like an easy mechanism to exclude them from
the index. For example, dict16 and dict32 have thousands of words, with
multiple occurrences, that begin with "$##" and then a string of numbers.
For us, words of this pattern are irrelevant and we would like to not index
them. Is there a way to use regular expressions in the stopwords file? Any
other way to achieve the same result without brute forcing the stopwords
file with every combination we find?

Thanks!

Original issue reported on code.google.com by Imlbr...@gmail.com on 6 Nov 2009 at 9:35

GoogleCodeExporter commented 9 years ago
It's a good idea. I'll implement the StopMatch command in the next snapshot
(within a few days).
Thanks for the suggestion.

Original comment by dp.max...@gmail.com on 9 Nov 2009 at 11:19

GoogleCodeExporter commented 9 years ago
Sorry, it took more time than I first expected.
Try the fresh snapshot:
http://dataparksearch.googlecode.com/files/dpsearch-4.53-13122009.tar.bz2

You can use the Match: command in a stopword file to specify regular expressions
for stopwords. NB: the regex support is very primitive, but you can use any
charset supported by DataparkSearch to specify them.

E.g. for your case the command is:
Match: regex ^\$##

Original comment by dp.max...@gmail.com on 13 Dec 2009 at 2:33
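
As a rough illustration of which words the pattern ^\$## would filter out, here is a minimal sketch using Python's re module as a stand-in. DataparkSearch's own stopword regex engine is described above as very primitive, so its behavior may differ. The sample words and the helper name is_stopword are hypothetical.

```python
import re

# Pattern from the comment above: a literal "$##" at the start of the word.
STOP_PATTERN = re.compile(r"^\$##")


def is_stopword(word: str) -> bool:
    """Return True if the word would be dropped under this pattern."""
    return STOP_PATTERN.search(word) is not None


# Hypothetical sample words, not taken from a real index.
words = ["$##12345", "$##7", "search", "price$##1", "$#9"]
kept = [w for w in words if not is_stopword(w)]
print(kept)  # ['search', 'price$##1', '$#9']
```

Only words that begin with "$##" are dropped; a word that merely contains "$##" later on is kept because of the ^ anchor.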

GoogleCodeExporter commented 9 years ago
Hi Maxime,

Got the new version and we are trying it out. You mentioned the regex support
was very primitive, so we don't know if it will handle what we are trying to
do. For example, can we use "Match: regex ^[^a-z0-9A-Z]+$" to make a stopword
of any word that contains a character other than a letter or a number? If not,
can we use the NoMatch keyword to accomplish the same, with the expression
being "NoMatch: regex ^[a-z0-9A-Z]"? If we can get this to work, we think the
database will shrink considerably, by upwards of 50%.

Thanks!

Original comment by Imlbr...@gmail.com on 5 Jan 2010 at 8:09
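
One detail in the question above is worth checking: as written, ^[^a-z0-9A-Z]+$ matches only words made up entirely of non-alphanumeric characters, while "contains a character other than a letter or a number" corresponds to the unanchored [^a-z0-9A-Z]. The sketch below shows the difference using Python's re module, which is only a stand-in and says nothing about what DataparkSearch's stopword regex supports. The sample words are hypothetical.

```python
import re

# Anchored: every character in the word must be non-alphanumeric.
ALL_NON_ALNUM = re.compile(r"^[^a-z0-9A-Z]+$")
# Unanchored: at least one non-alphanumeric character anywhere in the word.
ANY_NON_ALNUM = re.compile(r"[^a-z0-9A-Z]")

# Hypothetical sample words.
samples = ["$##123", "###", "abc123", "foo-bar"]

results = {
    word: (bool(ALL_NON_ALNUM.search(word)), bool(ANY_NON_ALNUM.search(word)))
    for word in samples
}

for word, (all_match, any_match) in results.items():
    print(f"{word!r}: all non-alnum={all_match}, any non-alnum={any_match}")
```

Here "$##123" matches only the unanchored pattern, because it also contains digits.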

GoogleCodeExporter commented 9 years ago
Unfortunately, intervals (character ranges such as a-z) aren't supported in
stopword regex, though you can use the nomatch option with it, so your commands
could be:
Match: nomatch regex ^[0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]
Match: nomatch regex [0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]$

which eliminate all "words" that don't start or end with a digit or a letter.

Original comment by dp.max...@gmail.com on 5 Jan 2010 at 9:34
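
Since ranges have to be written out in full, the explicit character lists can be generated rather than typed by hand. The sketch below builds the two command lines in the syntax shown in the comment above and then simulates their combined effect with Python's re module. The simulation rests on an assumption: that with nomatch a word becomes a stopword when the pattern does NOT match. The helper name is_stopword is hypothetical.

```python
import re
import string

# Digits, then lowercase, then uppercase: the same order as in the comment above.
ALNUM = string.digits + string.ascii_lowercase + string.ascii_uppercase

starts_with_alnum = f"^[{ALNUM}]"
ends_with_alnum = f"[{ALNUM}]$"

# Lines that could be pasted into the stopword file.
print(f"Match: nomatch regex {starts_with_alnum}")
print(f"Match: nomatch regex {ends_with_alnum}")


def is_stopword(word: str) -> bool:
    """Assumed nomatch semantics: a word is a stopword if either pattern fails."""
    return (
        re.search(starts_with_alnum, word) is None
        or re.search(ends_with_alnum, word) is None
    )
```

Under that assumption, "$##123" and "abc-" would be dropped, while "abc123" and "foo-bar" would be kept, since only the first and last characters are checked.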