EveripediaNetwork / issues


update splitter mechanism in ingester to improve embeddings #1962

Closed. Softdev1 closed this issue 11 months ago.

Softdev1 commented 11 months ago

Description

We need to update the current splitting mechanism and the LangChain splitter in the IQ GPT ingester to improve its answering capability.

Goal

Update the LangChain splitter.

Yadheedhya06 commented 11 months ago

cc: @s-1-n-t-h

s-1-n-t-h commented 11 months ago
  • Sentence Transformer models work well on individual sentences, while text-embedding-ada-002 performs better on chunks of 256 or 512 tokens. We are currently using text-embedding-ada-002 with 150 characters per chunk, which is only about 38 tokens and not ideal; this count has to change (see the token-based sketch after this list).
  • A shorter query is better matched against sentence-level embeddings, while a longer query is better matched against paragraph-level embeddings. IQ GPT users mostly ask short questions today, so sentence-level embeddings are useful.
  • The retrieved chunks are fed into another LLM with its own token limit, so chunk size has to be bounded by the number of chunks we want to fit into the request. The current limit is 4096 tokens, so a sensible default is to retrieve 7 chunks of 256 tokens each (7 × 256 = 1792, approximately 1800 tokens), leaving room for the rest of the prompt.
  • Most important: we should not remove newline characters in the preprocessing/cleaning step. Text splitters rely on them to keep sentences intact; if we remove them, words get sliced instead.
  • What works for one source indexer may not work for another, so we need proper evaluation techniques and metrics with test queries to decide on the optimal setup.
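As a minimal sketch under the assumptions above (256-token chunks and ada-002's `cl100k_base` tokenizer; the overlap value, variable names, and sample text are illustrative, not decided), the splitter could count tokens instead of characters with LangChain JS's `TokenTextSplitter`:

```ts
import { TokenTextSplitter } from "langchain/text_splitter";

// Split on token boundaries with the tokenizer that text-embedding-ada-002
// actually uses, instead of a fixed 150-character window.
const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base", // tokenizer of text-embedding-ada-002
  chunkSize: 256,              // tokens per chunk, per the analysis above
  chunkOverlap: 20,            // small overlap so sentences are not sliced (illustrative)
});

// documentText stands in for the cleaned article text from the ingester.
const documentText =
  "IQ GPT answers questions about crypto.\n\nEach article is split into chunks before embedding.";
const chunks = await splitter.splitText(documentText);
console.log(chunks.length);
```

Budget check: 7 chunks × 256 tokens = 1792 (≈ 1800) tokens of retrieved context, which fits in the 4096-token window with roughly 2300 tokens left for the prompt template, the question, and the answer.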


Yadheedhya06 commented 11 months ago

Oh, good catch. In our cleaning step we are actually removing "\n", which the recursive text splitter needs; that loses the grouping of contextually similar words. Starting from here would be the optimal first step, I think.

yeah
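A sketch of what a newline-preserving cleaning step could look like (the `cleanText` name and the exact normalisation rules are assumptions, not the ingester's actual code). The point is to collapse spaces and tabs without touching the "\n\n" and "\n" separators that `RecursiveCharacterTextSplitter` falls back through:

```ts
// Hypothetical cleaning step: normalise whitespace but keep the newlines
// the recursive splitter needs ("\n\n" -> "\n" -> " " -> "" by default).
function cleanText(raw: string): string {
  return raw
    .replace(/\r\n/g, "\n")     // normalise Windows line endings
    .replace(/[ \t]+/g, " ")    // collapse runs of spaces/tabs only
    .replace(/\n{3,}/g, "\n\n") // cap blank lines, keeping paragraph breaks
    .trim();
}
```

With newlines intact, the splitter degrades gracefully from paragraphs to lines to words instead of slicing mid-word.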

Yadheedhya06 commented 11 months ago

PR: https://github.com/EveripediaNetwork/iq-gpt-ingester-js/pull/44