BUAADreamer / EasyRAG

Easy-to-Use RAG Framework; CCF AIOps International Challenge 2024 Top3 Solution (third place)
https://arxiv.org/abs/2410.10315
MIT License
182 stars 23 forks

How do I add a custom dataset? #7

Closed iriszhan closed 4 days ago

iriszhan commented 4 days ago

I am using several .docx and .doc files as the knowledge base for retrieval. I removed the .txt restriction in the read_data function in ingestion.py, put the Word files in a single folder, and changed the data path in easyrag.yaml to point to that folder. Running the program then fails with the following error:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.634 seconds.
Prefix dict has been built successfully.
Traceback (most recent call last):
  File "/data/EasyRAG-master/src/apitest.py", line 38, in <module>
    easyrag = EasyRAGPipeline(config)
  File "/data/EasyRAG-master/src/easyrag/pipeline/pipeline.py", line 57, in __init__
    asyncio.get_event_loop().run_until_complete(self.async_init())
  File "/data/anaconda/envs/easyrag/lib/python3.10/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "/data/anaconda/envs/easyrag/lib/python3.10/asyncio/futures.py", line 201, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/data/anaconda/envs/easyrag/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/data/EasyRAG-master/src/easyrag/pipeline/pipeline.py", line 207, in async_init
    self.path_retriever = BM25Retriever.from_defaults(
  File "/data/EasyRAG-master/src/easyrag/custom/retrievers.py", line 181, in from_defaults
    return cls(
  File "/data/EasyRAG-master/src/easyrag/custom/retrievers.py", line 113, in __init__
    self.bm25 = BM25Okapi(
  File "/data/anaconda/envs/easyrag/lib/python3.10/site-packages/rank_bm25.py", line 83, in __init__
    super().__init__(corpus, tokenizer)
  File "/data/anaconda/envs/easyrag/lib/python3.10/site-packages/rank_bm25.py", line 28, in __init__
    self._calc_idf(nd)
  File "/data/anaconda/envs/easyrag/lib/python3.10/site-packages/rank_bm25.py", line 101, in _calc_idf
    self.average_idf = idf_sum / len(self.idf)
ZeroDivisionError: division by zero

iriszhan commented 4 days ago

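I think I figured it out: my data has no knowledge paths. After I removed the path-based coarse ranking step, the program runs normally.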
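
For anyone hitting the same traceback: the failing line divides by len(self.idf), which is zero when no terms were indexed. The sketch below is a simplified, pure-Python stand-in for that averaging step, not the actual rank_bm25 or EasyRAG code. The function names average_idf and has_indexable_tokens are hypothetical, and it assumes that documents without a knowledge path end up as empty token lists in the path retriever's corpus.

```python
import math
from typing import Dict, List


def average_idf(tokenized_corpus: List[List[str]]) -> float:
    """Simplified stand-in for the averaging step of a BM25 IDF computation."""
    # Count in how many documents each term appears.
    doc_freq: Dict[str, int] = {}
    for tokens in tokenized_corpus:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    n_docs = len(tokenized_corpus)
    idf = {
        term: math.log(n_docs - freq + 0.5) - math.log(freq + 0.5)
        for term, freq in doc_freq.items()
    }
    # With no terms at all, len(idf) is 0 and this raises ZeroDivisionError.
    return sum(idf.values()) / len(idf)


def has_indexable_tokens(tokenized_corpus: List[List[str]]) -> bool:
    """Hypothetical guard: True if at least one document has a token."""
    return any(len(tokens) > 0 for tokens in tokenized_corpus)


if __name__ == "__main__":
    empty_paths = [[], [], []]  # e.g. three documents with no knowledge path
    if has_indexable_tokens(empty_paths):
        print(average_idf(empty_paths))
    else:
        print("No path tokens found; skip the path retriever for this dataset.")
```

A check like this before building the path index would turn the ZeroDivisionError into a clear message, which is roughly what disabling the path coarse ranking achieves by hand.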

BUAADreamer commented 4 days ago

Oh, right. You only need to change the config settings, or you can define a new pipeline of your own.