dave2wave opened 1 year ago
There is a .NET version (Lucene.NET) if that excites David Dieruf - https://lucenenet.apache.org/docs/4.8.0-beta00016/
Also - this class may be helpful - https://solr.apache.org/docs/9_3_0/core/org/apache/solr/analysis/TokenizerChain.html
@dave2wave
These pointers are helpful. I see that Lucene utilities could help with removing stop words and normalising the text (e.g. dealing with contractions, "can't" -> "cannot").
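To make that concrete, here is a minimal sketch (not SGA code) of running text through Lucene's EnglishAnalyzer and collecting the normalised tokens. It covers lowercasing, stop-word removal and stemming out of the box; contraction expansion would need an extra filter. The class name, field name and sample text are placeholders.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NormalizeText {

    // Runs the text through EnglishAnalyzer (lowercasing, stop-word removal,
    // Porter stemming) and collects the surviving tokens.
    static List<String> analyze(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Stop words are removed, remaining terms are lowercased and stemmed.
        System.out.println(analyze("The quick brown foxes jumped over the lazy dogs"));
    }
}
```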
Initially we are targeting LLMs, so the main problem is preparing the data to pass as "context" to the LLMs. The main issue is that the number of "tokens" is limited, so big texts must be split into smaller chunks. It is important that we use the same tokenization algorithms that the LLMs use (like https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
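For the JVM side, here is a rough sketch of token-aware chunking, assuming a Java port of tiktoken such as jtokkit. The library choice, the paragraph-based splitting and the `chunk()` helper are illustrative assumptions, not a decided design.

```java
import java.util.ArrayList;
import java.util.List;

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingType;

public class Chunker {

    // cl100k_base is the encoding used by gpt-3.5/gpt-4 style models.
    private static final Encoding ENC =
            Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE);

    // Greedily packs paragraphs into chunks that stay under maxTokens.
    // (A real splitter would also handle single paragraphs longer than the budget.)
    static List<String> chunk(String text, int maxTokens) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : text.split("\n\n")) {
            String candidate = current.length() == 0
                    ? paragraph
                    : current.toString() + "\n\n" + paragraph;
            if (ENC.countTokens(candidate) > maxTokens && current.length() > 0) {
                chunks.add(current.toString());
                current = new StringBuilder(paragraph);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

Counting tokens with the same encoding the target model uses (cl100k_base here) is what keeps the chunk sizes aligned with the model's actual context limit.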
In the future we can add more text processing tools based on Lucene in order to refine and clean the documents before sending them to the LLM.
Lucene provides a robust set of tools to build search indexes and then find documents. In fact, Jonathan used Lucene's vector similarity as the basis for VectorSearch.
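Purely as an illustration of that vector support (the field names and the tiny hand-written embeddings are made up, and this is not the actual VectorSearch code), indexing a float vector and running a k-nearest-neighbours query in Lucene 9.x might look like:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class VectorSimilarityDemo {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {

            // Index a chunk together with its embedding
            // (the embedding would normally come from an embedding model).
            Document doc = new Document();
            doc.add(new StoredField("text", "some chunk of a larger document"));
            doc.add(new KnnFloatVectorField("embedding",
                    new float[] {0.1f, 0.2f, 0.7f}, VectorSimilarityFunction.COSINE));
            writer.addDocument(doc);
            writer.commit();

            // Find the k nearest documents to a query embedding.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(
                        new KnnFloatVectorQuery("embedding", new float[] {0.1f, 0.25f, 0.65f}, 5), 5);
                System.out.println("matches: " + hits.scoreDocs.length);
            }
        }
    }
}
```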
For our use cases we can take advantage of Lucene's rich set of Analyzers and Filters - https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/package-summary.html
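For example, a tokenizer-plus-filters chain can be assembled by SPI name with CustomAnalyzer; the particular filters below are just one plausible combination, not a proposed pipeline:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class AnalyzerChain {

    // Composes a tokenizer plus a chain of token filters by their SPI names.
    static Analyzer buildAnalyzer() throws IOException {
        return CustomAnalyzer.builder()
                .withTokenizer("standard")       // StandardTokenizer
                .addTokenFilter("lowercase")     // lowercase all terms
                .addTokenFilter("asciiFolding")  // fold accented characters to ASCII
                .addTokenFilter("stop")          // drop default English stop words
                .build();
    }
}
```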
Solr makes use of Lucene, and here is Solr's description of how it uses these features: https://solr.apache.org/guide/solr/latest/indexing-guide/document-analysis.html
This page ties Solr configuration back to Lucene and presents a pattern similar to what we are doing in SGA: https://solr.apache.org/guide/solr/latest/indexing-guide/analyzers.html