Gadreel / divconq

File Transfer Server Framework
Apache License 2.0

db full text search #101

Open Gadreel opened 9 years ago

Gadreel commented 9 years ago

need indexing - consider: http://java.dzone.com/news/lucenes-fuzzyquery-100-times

Gadreel commented 9 years ago

just use full Lucene in parallel with RocksDB. Need to figure out how to do this efficiently and flexibly:

http://blog.mikemccandless.com/2011/06/primary-key-lookups-are-28x-faster-with.html

Gadreel commented 9 years ago

up to date tutorial:

http://oak.cs.ucla.edu/cs144/projects/lucene/

nope, not recent

Gadreel commented 9 years ago

http://tika.apache.org/

Tika: The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

http://opennlp.apache.org/

toolkit for the processing of natural language text.

It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.

http://nutch.apache.org/

Nutch - crawler, self crawling?

Gadreel commented 9 years ago

we might want only the analysis part?

http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/analysis/package-summary.html#package_description

though what about snowball and such?

Gadreel commented 9 years ago

each string field in dcTables will get a TextFilter setting: TextFilter="Standard|Title|None". The default, Standard, assumes regular paragraph-like text and uses the full snowball/stem/stop approach. With dcTiny string and related types we automatically use Title rather than Standard. With email we use None, although even None trims the ends and lower-cases the value. Title removes less than Standard but more than None.

On String data types in dcSchema one can indicate the DbTextFilterDefault="Standard|Title|None"
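A minimal sketch of how the three filter levels could behave. The stop-word lists and `naive_stem` are hypothetical stand-ins for a real snowball analyzer, and it is only an assumption that Title skips stemming; the function name `apply_text_filter` is not from divconq.

```python
import re

# Hypothetical, deliberately tiny stop-word lists; a real setup would use a
# proper analyzer (for example a Snowball stemmer with its stop lists).
STANDARD_STOPS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}
TITLE_STOPS = {"a", "an", "the"}  # Title removes less than Standard


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def apply_text_filter(value, text_filter="Standard"):
    # Every filter, even None, trims the ends and lower-cases.
    cleaned = value.strip().lower()
    if text_filter == "None":
        return cleaned
    words = re.findall(r"[a-z0-9]+", cleaned)
    if text_filter == "Title":
        # Assumption: Title drops only a few stop words and does not stem.
        return " ".join(w for w in words if w not in TITLE_STOPS)
    if text_filter == "Standard":
        return " ".join(naive_stem(w) for w in words if w not in STANDARD_STOPS)
    raise ValueError(f"unknown TextFilter: {text_filter}")
```

With this sketch an email value only gets trimmed and lower-cased under None, while the same title text loses progressively more under Title and then Standard.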

Search Request:

{
   Sources: [
       {
          Type: 'table|script',
          Fields: [  { Name: 'x', Importance: 0, Sids: [ 'aaa' ]  } ],
          Title: 'fname',
          Body: 'fname'
       }
   ],
   Phrase: 'free text of required, allowed, exact, prohibited'
}

Gadreel commented 9 years ago

Full value stored:

^dcRecord(did, table, id, fname, stamp, "Data") = value

filtered value stored:

^dcRecord(did, table, id, fname, stamp, "Index") = filtered value

Regardless of the filter type, we also apply the None filter to the value and store the first 1024 characters of the result in the regular index:

^dcIndex1(did, table, fname, none value, id) = null

With Standard and Title filters we also have:

^dcRecord(did, table, id, fname, stamp, "Analyzed") = |word:pos,pos,...|word:pos|

where pos is relative to original text
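A rough sketch of producing the "Analyzed" value. It assumes pos is a character offset into the original text (a word index would also fit the description) and leaves out stemming and stop words to keep it short. `build_analyzed` is a hypothetical name.

```python
import re
from collections import defaultdict


def build_analyzed(text):
    """Return '|word:pos,pos,...|word:pos|' with words in alphabetical order.

    Positions are character offsets into the original, unfiltered text.
    """
    positions = defaultdict(list)
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        positions[match.group().lower()].append(match.start())
    if not positions:
        return ""
    entries = (
        f"{word}:{','.join(str(p) for p in positions[word])}"
        for word in sorted(positions)
    )
    return "|" + "|".join(entries) + "|"
```

For the input "Ball, doctor, ball" this yields "|ball:0,14|doctor:6|".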

Now in ^dcTextIndex we have

^dcTextIndex1(table,field,word) = n
^dcTextIndex1(table,field,word,id) = null

^dcTextIndex2(table,field,word) = n
^dcTextIndex2(table,field,word,id,sid) = null

to save CPU effort we can count the commas in the byte array instead of changing to string first
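A small sketch of the comma-counting idea, assuming the count we want is the number of positions per word (commas in the entry plus one). Only the word itself is decoded here; the position list stays as raw bytes. `occurrences_per_word` is a hypothetical helper.

```python
def occurrences_per_word(analyzed: bytes):
    """Map each word to its number of positions without decoding the
    position list: occurrences = commas in the entry + 1."""
    counts = {}
    for entry in analyzed.strip(b"|").split(b"|"):
        if not entry:
            continue  # empty Analyzed value
        word, _, positions = entry.partition(b":")
        counts[word.decode("utf-8")] = positions.count(b",") + 1
    return counts
```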

updates

Analyzed words are stored in alpha order

Thus when updating from old stamp to new stamp

^dcRecord(did, table, id, fname, stamp, "Analyzed") = |airplane:pos|ball:pos|doctor:pos|

^dcRecord(did, table, id, fname, stamp, "Analyzed") = |airplane:pos|bike:pos|doctor:pos|

ball is removed and bike is added, so just do

k ^dcTextIndex1(table,field,"ball",id) - and dec ball

s ^dcTextIndex1(table,field,"bike",id) = null - and inc bike
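The update step could be sketched like this, with plain Python containers standing in for the ^dcTextIndex1 global (`counts` for the per-word n, `postings` for the per-id nodes). For brevity it uses a set difference rather than a merge walk over the two alphabetically sorted lists; all names here are hypothetical.

```python
def analyzed_words(analyzed):
    """Extract the set of words from '|word:pos,...|word:pos|'."""
    return {
        entry.split(":", 1)[0]
        for entry in analyzed.strip("|").split("|")
        if entry
    }


def update_text_index(counts, postings, table, field, rec_id, old, new):
    """Apply only the difference between the old and new Analyzed values."""
    old_words, new_words = analyzed_words(old), analyzed_words(new)
    for word in old_words - new_words:  # like "k ..." plus decrement
        postings.discard((table, field, word, rec_id))
        counts[(table, field, word)] = counts.get((table, field, word), 0) - 1
    for word in new_words - old_words:  # like "s ... = null" plus increment
        postings.add((table, field, word, rec_id))
        counts[(table, field, word)] = counts.get((table, field, word), 0) + 1
```

Words present in both versions (airplane and doctor in the example) are never touched, which is the point of diffing instead of reindexing the whole field.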

Gadreel commented 9 years ago

searching thus looks for the best fields first:

if we cannot narrow to fewer than 1000 hits then just dump a bunch and ask the user to narrow the search

after assigning a field to index off of, calculate the scores for the entire record - track the list of records/scores

try the next field, and then the next

give up on a field after 100 tries that add nothing to the score list: the field is generating low matches, so try the next one. Do not rescore a record if a score already exists for it. DO count a repeat record id as a miss; thus if we keep finding the SAME records, we skip to the next field once we reach 100 tries.

calculate on

return only the top 100, with highlighting and scoring info
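A rough sketch of the field-by-field loop described above. `candidates_for` and `score_record` are hypothetical callbacks standing in for the index lookup and the whole-record scoring. It assumes "100 tries" means 100 consecutive misses, so a newly scored record resets the counter. The 1000-hit narrowing check and highlighting are left out.

```python
def search(fields, candidates_for, score_record, max_misses=100, top_n=100):
    """fields: field names, best first.
    candidates_for(field): iterable of record ids matching on that field.
    score_record(rec_id): score computed for the entire record.
    """
    scores = {}
    for field in fields:
        misses = 0
        for rec_id in candidates_for(field):
            if rec_id in scores:
                misses += 1  # repeat record: counts as a miss, not rescored
            else:
                scores[rec_id] = score_record(rec_id)
                misses = 0  # assumption: a new hit resets the miss streak
            if misses >= max_misses:
                break  # this field adds nothing new, try the next one
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

The result is a list of (record id, score) pairs, highest score first, capped at top_n.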