Capstone-S17 / DuRAG


better error messages for reranker #10

Closed Dhanushkmr closed 3 months ago

Dhanushkmr commented 3 months ago

[Attached image: photo_2024-03-28 20 15 39]

To replicate this:

from DuRAG import Reranker, SentenceWindowRetriever, Generator
from rag_swr import swr_pipeline
import weaviate.classes as wvc
import weaviate

# Restrict retrieval to a single PDF.
filters = ['SG230712OTHRNR1F_Old Chang Kee Ltd._20230712174222_00_AR_4Q_20230331.2.pdf']
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "
# `question` must be defined beforehand; its value is not given in this report.
query = question
client = weaviate.connect_to_local()
reranker = Reranker()
swr_engine = SentenceWindowRetriever(client)
# The BGE prefix is added for retrieval only; the reranker gets the raw query.
bge_query = BGE_QUERY_PREFIX + query
filter_params = swr_engine._get_filter_param(filters, mode="or", property_name="pdf_name")
retrieval_response = swr_engine.hybrid_search(bge_query, limit=10, filter_params=filter_params)
sentence_windows = swr_engine.get_sentence_windows(retrieval_response.objects)
results = swr_engine.get_rerank_format(query, sentence_windows)

print(results)
# Requests the top 5 results; this is the call that fails.
reranked_results = reranker.rerank_top_k(results, 5)

Dhanushkmr commented 3 months ago

The root cause is extremely poor PDF extraction quality. Only 1 page seems to have been extracted, so there is only 1 chunk in Weaviate for SWR. Either way, it's good that this exposed the issue.
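
For illustration, a clearer error could come from a guard that runs before reranking. This is only a minimal sketch, not DuRAG's actual code: `check_rerank_input` is a hypothetical helper, and it assumes the value returned by `get_rerank_format` behaves like a list of candidate passages, which the report does not confirm.

```python
def check_rerank_input(results, top_k):
    """Validate reranker input and raise a descriptive error if it is unusable.

    Hypothetical helper: `results` is assumed to be a list-like collection of
    candidate passages, and `top_k` the number of passages the caller wants.
    """
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError(f"top_k must be a positive integer, got {top_k!r}")
    if not results:
        raise ValueError(
            "No passages to rerank: retrieval returned 0 results. "
            "Check that the document was extracted and indexed correctly."
        )
    if len(results) < top_k:
        raise ValueError(
            f"Cannot rerank top {top_k}: only {len(results)} passage(s) were "
            "retrieved. This often means the source PDF was extracted poorly "
            "and produced too few chunks."
        )
    return results
```

With the single chunk from this report, `check_rerank_input(results, 5)` would raise a `ValueError` that names the shortfall and points at extraction quality. Another option would be to clamp `top_k` to `len(results)` and log a warning, but that would have hidden the extraction problem that this failure exposed.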