RUC-NLPIR / FlashRAG

⚡FlashRAG: A Python Toolkit for Efficient RAG Research
https://arxiv.org/abs/2405.13576
MIT License

I ran into an error while building the retrieval index #20

lwj2001 closed this issue 3 weeks ago

lwj2001 commented 1 month ago

I followed the steps in docs/process-wiki.md to build the index one step at a time, and it eventually failed at step 3. Error log:

Start chunking...
  0%|                                                                                                                                                                                                            | 0/5514591 [00:00<?, ?it/s]
/Data1/home/fanziqi/.conda/envs/flashrag/lib/python3.9/site-packages/spacy/language.py:1734: UserWarning: [W127] Not all `Language.pipe` worker processes completed successfully
  warnings.warn(Warnings.W127)
  0%|                                                                                                                                                                                                            | 0/5514591 [20:28<?, ?it/s]
Traceback (most recent call last):
  File "/Data1/home/fanziqi/func_eval/FlashRAG/scripts/preprocess_wiki.py", line 224, in <module>
    for doc in tqdm(nlp.pipe(all_text, n_process=args.num_workers, batch_size=2000), total=len(all_text)):
  File "/Data1/home/fanziqi/.conda/envs/flashrag/lib/python3.9/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/Data1/home/fanziqi/.conda/envs/flashrag/lib/python3.9/site-packages/spacy/language.py", line 1618, in pipe
    for doc in docs:
  File "/Data1/home/fanziqi/.conda/envs/flashrag/lib/python3.9/site-packages/spacy/language.py", line 1698, in _multiprocessing_pipe
    for i, (_, (byte_doc, context, byte_error)) in enumerate(
  File "/Data1/home/fanziqi/.conda/envs/flashrag/lib/python3.9/site-packages/spacy/language.py", line 1695, in <genexpr>
    recv.recv() for recv in cycle(bytedocs_recv_ch)
  File "/Data1/home/fanziqi/.conda/envs/flashrag/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/Data1/home/fanziqi/.conda/envs/flashrag/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/Data1/home/fanziqi/.conda/envs/flashrag/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

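The error seems to occur right after chunking starts. I have two questions:

  1. Do warnings such as WARNING: Template errors in article 'Jarosław Olech' (56462895): title(1) recursion(0, 0, 0) have any effect on building the index?
  2. How can I fix this bug? (The first run failed at 7% of chunking; the second failed at 0%.)

ignorejjj commented 1 month ago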

Your error appears to be caused by insufficient memory. Try reducing the batch size and the number of worker processes (args.num_workers in your traceback; you can set it to 1 for testing).
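One way to act on this suggestion before rerunning the full job is a quick smoke test over a small sample. The sketch below is only an illustration: `smoke_test_pipe`, `sample_size`, and the default values are hypothetical names and numbers, not part of FlashRAG. It relies only on spaCy's `Language.pipe` accepting `n_process` and `batch_size`, as the call in the traceback does.

```python
from itertools import islice
from typing import Any, Iterable


def smoke_test_pipe(
    nlp: Any,
    texts: Iterable[str],
    sample_size: int = 1000,
    n_process: int = 1,
    batch_size: int = 100,
) -> int:
    """Run nlp.pipe over the first `sample_size` texts with conservative settings.

    `nlp` is assumed to be a spaCy Language object (anything exposing a
    compatible `pipe` method works). Returns the number of docs processed.
    """
    sample = islice(texts, sample_size)  # only look at a small prefix
    processed = 0
    # A single worker and a small batch keep peak memory low; the traceback
    # above used batch_size=2000 with args.num_workers processes.
    for _doc in nlp.pipe(sample, n_process=n_process, batch_size=batch_size):
        processed += 1
    return processed
```

If the sample completes with one worker and a small batch, the values can be raised step by step on the full corpus while watching memory usage, until a combination is found that finishes without the EOFError.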

The warning you see while building the wiki corpus is raised by wikiextractor and is normal (see a similar issue in wikiextractor: https://github.com/attardi/wikiextractor/issues/33). My guess is that wikiextractor uses templates to parse wiki pages and some pages fail to match. Very few pages are affected, so it shouldn't matter.
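To check for yourself that only a few pages are affected, you could count the distinct articles named in these warnings. This is a minimal sketch that assumes the extraction output was saved as text lines and that every warning follows the format quoted in the question; `affected_article_ids` is a hypothetical helper, not part of wikiextractor or FlashRAG.

```python
import re
from typing import Iterable, Set

# Matches lines like:
# WARNING: Template errors in article 'Jarosław Olech' (56462895): title(1) recursion(0, 0, 0)
TEMPLATE_WARNING = re.compile(
    r"WARNING: Template errors in article '(?P<title>.*)' \((?P<id>\d+)\)"
)


def affected_article_ids(log_lines: Iterable[str]) -> Set[str]:
    """Collect the distinct article ids mentioned in template-error warnings."""
    ids: Set[str] = set()
    for line in log_lines:
        match = TEMPLATE_WARNING.search(line)
        if match:
            ids.add(match.group("id"))
    return ids
```

Comparing the size of the returned set with the total number of extracted articles gives a rough measure of how much of the corpus the warnings touch.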