Closed Dhanushkmr closed 3 months ago
To replicate this:

from DuRAG import Reranker, SentenceWindowRetriever, Generator
from rag_swr import swr_pipeline
import weaviate.classes as wvc
import weaviate

# Restrict retrieval to the single PDF that shows the problem.
filters = ['SG230712OTHRNR1F_Old Chang Kee Ltd._20230712174222_00_AR_4Q_20230331.2.pdf']
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "
query = question  # `question` is the user query that triggered the issue; it is not given here

client = weaviate.connect_to_local()
reranker = Reranker()
swr_engine = SentenceWindowRetriever(client)

# Hybrid search uses the BGE-prefixed query; reranking uses the raw query.
bge_query = BGE_QUERY_PREFIX + query
filter_params = swr_engine._get_filter_param(filters, mode="or", property_name="pdf_name")
retrieval_response = swr_engine.hybrid_search(bge_query, limit=10, filter_params=filter_params)
sentence_windows = swr_engine.get_sentence_windows(retrieval_response.objects)
results = swr_engine.get_rerank_format(query, sentence_windows)
print(results)
reranked_results = reranker.rerank_top_k(results, 5)
The root cause is extremely poor PDF extraction quality. Only 1 page seems to have been extracted, so there is only 1 chunk in Weaviate for SWR for this file. Either way, it's good that this exposed the issue.
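One way to catch this kind of extraction failure earlier is to count chunks per source PDF and flag any file with suspiciously few. The following is only a minimal sketch: `find_sparse_pdfs` and the `min_chunks` threshold are hypothetical and not part of DuRAG or Weaviate, and it assumes each stored chunk can be reduced to a dict with a `pdf_name` key (the same property name used in the filter above).

```python
from collections import Counter


def find_sparse_pdfs(chunks, min_chunks=2):
    """Return {pdf_name: chunk_count} for PDFs with fewer than min_chunks chunks.

    `chunks` is assumed to be an iterable of dicts with a "pdf_name" key.
    """
    counts = Counter(chunk["pdf_name"] for chunk in chunks)
    return {name: n for name, n in counts.items() if n < min_chunks}


# Illustrative data only, not taken from the real index.
sample_chunks = [
    {"pdf_name": "a.pdf", "text": "..."},
    {"pdf_name": "b.pdf", "text": "..."},
    {"pdf_name": "b.pdf", "text": "..."},
]

print(find_sparse_pdfs(sample_chunks))  # {'a.pdf': 1}
```

A PDF whose extraction produced no chunks at all will not show up in this result, so the output would still need to be compared against the list of files that were meant to be ingested.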