dave2wave opened 1 year ago
There is a .NET version (Lucene.NET) if that excites David Dieruf - https://lucenenet.apache.org/docs/4.8.0-beta00016/
Also - this class may be helpful - https://solr.apache.org/docs/9_3_0/core/org/apache/solr/analysis/TokenizerChain.html
@dave2wave
These pointers are helpful. I see that Lucene utilities could help with removing stop words and normalising the text (e.g. dealing with contractions, "can't" -> "cannot").
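To make that concrete, here is a minimal sketch (not SGA code) of running text through Lucene's EnglishAnalyzer and collecting the normalised tokens. It covers lowercasing, stop-word removal and stemming out of the box; contraction expansion would need an extra filter. The class name, field name and sample text are placeholders.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NormalizeText {

    // Runs the text through EnglishAnalyzer (lowercasing, stop-word removal,
    // Porter stemming) and collects the surviving tokens.
    static List<String> analyze(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Stop words are removed, remaining terms are lowercased and stemmed.
        System.out.println(analyze("The quick brown foxes jumped over the lazy dogs"));
    }
}
```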
Initially we are targeting LLMs, so the main problem is preparing the data to pass as "context" to the LLMs. The main issue is that the number of "tokens" is limited, so big texts must be split into smaller chunks. It is important that we use the same tokenization algorithms that the LLMs use (like https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
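For the JVM side, here is a rough sketch of token-aware chunking, assuming a Java port of tiktoken such as jtokkit. The library choice, the paragraph-based splitting and the `chunk()` helper are illustrative assumptions, not a decided design.

```java
import java.util.ArrayList;
import java.util.List;

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingType;

public class Chunker {

    // cl100k_base is the encoding used by gpt-3.5/gpt-4 style models.
    private static final Encoding ENC =
            Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);

    // Greedily packs paragraphs into chunks that stay under maxTokens.
    // (A real splitter would also handle single paragraphs longer than the budget.)
    static List<String> chunk(String text, int maxTokens) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : text.split("\n\n")) {
            String candidate = current.length() == 0
                    ? paragraph
                    : current.toString() + "\n\n" + paragraph;
            if (ENC.countTokens(candidate) > maxTokens && current.length() > 0) {
                chunks.add(current.toString());
                current = new StringBuilder(paragraph);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Counting tokens with the same encoding the target model uses (cl100k_base here) is what keeps the chunk sizes aligned with the model's actual context limit.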
In the future we can add more text processing tools based on Lucene in order to refine and clean the documents before sending them to the LLM.
Lucene provides a robust set of tools to build search indexes and then find documents. In fact, Jonathan used Lucene's vector similarity as the basis for VectorSearch.
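Purely as an illustration of that vector support (the field names and the tiny hand-written embeddings are made up, and this is not the actual VectorSearch code), indexing a float vector and running a k-nearest-neighbours query in Lucene 9.x might look like:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class VectorSimilarityDemo {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {

            // Index a chunk together with its embedding
            // (the embedding would normally come from an embedding model).
            Document doc = new Document();
            doc.add(new StoredField("text", "some chunk of a larger document"));
            doc.add(new KnnFloatVectorField("embedding",
                    new float[] {0.1f, 0.2f, 0.7f}, VectorSimilarityFunction.COSINE));
            writer.addDocument(doc);
            writer.commit();

            // Find the k nearest documents to a query embedding.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(
                        new KnnFloatVectorQuery("embedding", new float[] {0.1f, 0.25f, 0.65f}, 5), 5);
                System.out.println("matches: " + hits.scoreDocs.length);
            }
        }
    }
}
```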
For our use cases we can take advantage of Lucene's rich set of Analyzers and Filters - https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/package-summary.html
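For example, a tokenizer-plus-filters chain can be assembled by SPI name with CustomAnalyzer; the particular filters below are just one plausible combination, not a proposed pipeline:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class AnalyzerChain {

    // Composes a tokenizer plus a chain of token filters by their SPI names.
    static Analyzer buildAnalyzer() throws IOException {
        return CustomAnalyzer.builder()
                .withTokenizer("standard")       // StandardTokenizer
                .addTokenFilter("lowercase")     // lowercase all terms
                .addTokenFilter("asciiFolding")  // fold accented characters to ASCII
                .addTokenFilter("stop")          // drop default English stop words
                .build();
    }
}
```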
Solr makes use of Lucene, and here is Solr's description of how it uses these features: https://solr.apache.org/guide/solr/latest/indexing-guide/document-analysis.html
This page ties Solr configuration back to Lucene and presents a pattern similar to what we are doing in SGA: https://solr.apache.org/guide/solr/latest/indexing-guide/analyzers.html