Just use full Lucene in parallel to RocksDB. Figure out how to do this efficiently and flexibly:
http://blog.mikemccandless.com/2011/06/primary-key-lookups-are-28x-faster-with.html
Tika: The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
OpenNLP: a machine learning based toolkit for the processing of natural language text.
It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.
Nutch: crawler. Self-crawling?
We might want only the analysis part of Lucene?
Though what about Snowball (stemming) and such?
Each string field in dcTables will get a TextFilter setting: TextFilter="Standard|Title|None". By default this assumes regular, paragraph-like text and uses the full snowball/stem/stop approach (Standard). With dcTiny strings and related types we automatically use Title rather than Standard; with email we use None. Even None trims the ends and lower-cases. Title removes less than Standard but more than None.
On String data types in dcSchema one can indicate the default: DbTextFilterDefault="Standard|Title|None"
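A minimal Java sketch of those three levels, under the assumptions above; the class name, the abbreviated stop list, and the stemming stub are all hypothetical (a real Standard filter would call into a Snowball stemmer):

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of the three filter levels described above.
public enum TextFilter {
    NONE, TITLE, STANDARD;

    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{Nd}\\s]");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // abbreviated stand-in for a real stop list
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "and", "or");

    public String filter(String value) {
        // even None trims ends and lower cases
        String v = value.trim().toLowerCase(Locale.ROOT);

        if (this == NONE)
            return v;

        // Title strips punctuation and collapses whitespace, but keeps every word
        String title = SPACES.matcher(NON_WORD.matcher(v).replaceAll(" ")).replaceAll(" ").trim();

        if (this == TITLE)
            return title;

        // Standard additionally drops stop words and stems what remains
        StringBuilder out = new StringBuilder();
        for (String word : title.split(" ")) {
            if (word.isEmpty() || STOP_WORDS.contains(word))
                continue;
            if (out.length() > 0)
                out.append(' ');
            out.append(stem(word));
        }
        return out.toString();
    }

    private static String stem(String word) {
        return word; // placeholder; plug a Snowball stemmer in here
    }
}
```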
Search Request:
{
    Sources: [
        {
            Type: 'table|script',
            Fields: [ { Name: 'x', Importance: 0, Sids: [ 'aaa' ] } ],
            Title: 'fname',
            Body: 'fname'
        }
    ],
    Phrase: 'free text of required, allowed, exact, prohibited'
}
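For example, a filled-in request might look like the following; the table and field names are invented, and the +/-/quote markers are only an assumed syntax for required, prohibited, and exact terms:

{
    Sources: [
        {
            Type: 'table',
            Fields: [ { Name: 'Summary', Importance: 2 }, { Name: 'Notes', Importance: 1 } ],
            Title: 'FullName',
            Body: 'Summary'
        }
    ],
    Phrase: '+rocket launch "model kit" -stomp'
}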
Full value stored:
^dcRecord(did, table, id, fname, stamp, "Data") = value
Filtered value stored:
^dcRecord(did, table, id, fname, stamp, "Index") = filtered value
Regardless of which filter is configured, we also run the None filter and store the first 1024 characters in the regular index:
^dcIndex1(did, table, fname, none value, id) = null
With Standard and Title filters we also have:
^dcRecord(did, table, id, fname, stamp, "Analyzed") = |word:pos,pos,...|word:pos|
where pos is relative to the original text.
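A sketch of producing that Analyzed value in Java, assuming the filter hands back each word with its character offset into the original text (the Token record and class name are made up):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AnalyzedFormat {
    // One filtered word plus its character offset in the original text.
    public record Token(String word, int pos) { }

    // Build |word:pos,pos,...|word:pos| with words in alpha order.
    public static String format(List<Token> tokens) {
        // TreeMap keeps the words sorted alphabetically
        Map<String, StringBuilder> positions = new TreeMap<>();

        for (Token t : tokens) {
            StringBuilder sb = positions.computeIfAbsent(t.word(), w -> new StringBuilder());
            if (sb.length() > 0)
                sb.append(',');
            sb.append(t.pos());
        }

        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, StringBuilder> e : positions.entrySet())
            out.append('|').append(e.getKey()).append(':').append(e.getValue());

        return out.append('|').toString();
    }
}
```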
Now in ^dcTextIndex we have:

^dcTextIndex1(table, field, word) = n
^dcTextIndex1(table, field, word, id) = null

^dcTextIndex2(table, field, word) = n
^dcTextIndex2(table, field, word, id, sid) = null
To save CPU effort we can count the commas in the byte array instead of converting to a string first.
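For instance, something along these lines, assuming the delimiters are single-byte in the stored encoding:

```java
// Count the positions in one word segment of an Analyzed value by
// counting commas in the raw bytes, with no String decoding.
final class AnalyzedBytes {
    static int countPositions(byte[] value, int start, int end) {
        int commas = 0;
        for (int i = start; i < end; i++)
            if (value[i] == ',')
                commas++;
        return commas + 1; // n positions are separated by n-1 commas
    }
}
```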
Analyzed words are stored in alpha order
Thus when updating from an old stamp to a new stamp:

old: ^dcRecord(did, table, id, fname, stamp, "Analyzed") = |airplane:pos|ball:pos|doctor:pos|
new: ^dcRecord(did, table, id, fname, stamp, "Analyzed") = |airplane:pos|bike:pos|doctor:pos|
ball is removed and bike is added, so just do:

k ^dcTextIndex1(table, field, "ball", id) and decrement the count for "ball"
s ^dcTextIndex1(table, field, "bike", id) = null and increment the count for "bike"
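Since both Analyzed values keep their words in alpha order, comparing the two word lists yields exactly the kills and sets needed. A sketch, where wordsOf, kill, set, decCount, and incCount are hypothetical stand-ins for the parsing and global operations:

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class TextIndexUpdate {
    // Diff two Analyzed values and apply only the changes to ^dcTextIndex1.
    public static void update(String oldAnalyzed, String newAnalyzed,
            String table, String field, String id) {
        SortedSet<String> oldWords = wordsOf(oldAnalyzed);
        SortedSet<String> newWords = wordsOf(newAnalyzed);

        for (String w : oldWords)
            if (!newWords.contains(w)) {
                kill(table, field, w, id);  // k ^dcTextIndex1(table,field,w,id)
                decCount(table, field, w);  // n = n - 1
            }

        for (String w : newWords)
            if (!oldWords.contains(w)) {
                set(table, field, w, id);   // s ^dcTextIndex1(table,field,w,id)=null
                incCount(table, field, w);  // n = n + 1
            }
    }

    // Parse |word:pos,...|... into its (already sorted) word set.
    private static SortedSet<String> wordsOf(String analyzed) {
        SortedSet<String> words = new TreeSet<>();
        for (String seg : analyzed.split("\\|"))
            if (!seg.isEmpty())
                words.add(seg.substring(0, seg.indexOf(':')));
        return words;
    }

    private static void kill(String t, String f, String w, String id) { /* global op */ }
    private static void set(String t, String f, String w, String id) { /* global op */ }
    private static void decCount(String t, String f, String w) { /* global op */ }
    private static void incCount(String t, String f, String w) { /* global op */ }
}
```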
Searching thus looks at the best fields first:

If we cannot narrow to fewer than 1000 hits, just dump a bunch and ask the user to narrow the search.
After assigning a field to index off of, calculate the scores for the entire record and track the list of records/scores.
Try the next field, and then the next.
Give up on a field after 100 tries with no additions to the score list; that field is generating low matches, so try the next. Do not try a record if a score already exists, but DO count a repeat record id as a miss; if we keep finding the SAME records, skip to the next field once we hit 100.
calculate on
return only the top 100, with highlighting and scoring info
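A sketch of that loop, leaving out the initial 1000-hit narrowing check; candidates, scoreWholeRecord, and the other names are hypothetical:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SearchLoop {
    static final int MAX_MISSES_PER_FIELD = 100;
    static final int TOP_N = 100;

    // fieldsByImportance: the best field first; candidates(field) would walk
    // ^dcTextIndex1 for the search words and return matching record ids.
    public Map<String, Double> search(List<String> fieldsByImportance) {
        Map<String, Double> scores = new HashMap<>();

        for (String field : fieldsByImportance) {
            int misses = 0;

            for (String id : candidates(field)) {
                // do not re-score, but DO count a repeat id as a miss, so a
                // field that only re-finds known records gets skipped
                if (scores.containsKey(id)) {
                    if (++misses >= MAX_MISSES_PER_FIELD)
                        break;
                    continue;
                }

                // score across the entire record, not just this field
                double score = scoreWholeRecord(id);
                if (score <= 0) {
                    if (++misses >= MAX_MISSES_PER_FIELD)
                        break; // low-match field; move on to the next one
                    continue;
                }

                scores.put(id, score);
            }
        }

        return topN(scores, TOP_N);
    }

    private static Map<String, Double> topN(Map<String, Double> scores, int n) {
        Map<String, Double> top = new LinkedHashMap<>();
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
              .limit(n)
              .forEach(e -> top.put(e.getKey(), e.getValue()));
        return top;
    }

    private List<String> candidates(String field) { return List.of(); } // stub
    private double scoreWholeRecord(String id) { return 0; }            // stub
}
```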
Need indexing support for fuzzy queries; consider: http://java.dzone.com/news/lucenes-fuzzyquery-100-times