Open golubovic opened 11 months ago
The dpr_all_documents global is defined here and it works on my side:
https://github.com/facebookresearch/DPR/blob/a31212dc0a54dfa85d8bfa01e1669f149ac832b7/dpr/data/qa_validation.py#L56C1-L57
The issue with dpr_all_documents arises when running dense_retriever.py with a small input dataset. The log in this issue comes from a run with an input dataset of six questions in total. When the dataset is very small, the issue surfaces: dpr_all_documents is not available to all of the processes that try to access it.
A simple (though not optimal) workaround is to pass the variable as an additional parameter in the calculate_matches function (please see below). The downside is inefficient use of memory.
get_score_partial = partial(check_answer, match_type=match_type, tokenizer=tokenizer, dpr_all_documents=dpr_all_documents)
Hi,
I am experiencing an issue with the global variable ‘dpr_all_documents’ involving tokenizer parallelism; please see the logs below. This issue has been raised before in the DPR repo.
Note that:
Please let me know if you have any questions.
Thanks, Mladen
Logs:
[2023-09-21 07:35:40,997][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:43,260][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:44,405][root][INFO] - Loading saved model state ...
[2023-09-21 07:35:44,611][root][INFO] - Selecting standard question encoder
[2023-09-21 07:35:44,677][root][INFO] - Encoder vector_size=768
[2023-09-21 07:35:44,677][root][INFO] - qa_dataset: dpr_ds_retreiving_questions
[2023-09-21 07:35:44,680][root][INFO] - questions len 6
[2023-09-21 07:35:44,680][root][INFO] - questions_text len 0
[2023-09-21 07:35:44,680][root][INFO] - Local Index class <class 'dpr.indexer.faiss_indexers.DenseFlatIndexer'>
[2023-09-21 07:35:44,680][root][INFO] - Using special token None
[2023-09-21 07:35:45,875][root][INFO] - Total encoded queries tensor torch.Size([6, 768])
[2023-09-21 07:35:45,877][root][INFO] - ctx_sources: <class 'dpr.data.retriever_data.CsvCtxSrc'>
[2023-09-21 07:35:45,877][root][INFO] - id_prefixes per dataset: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,877][root][INFO] - ctx_files_patterns: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Embeddings files id prefixes: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,878][root][INFO] - Reading all passages data from files: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Reading file /Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0
[2023-09-21 07:35:45,880][root][INFO] - data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Total data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Data indexing completed.
[2023-09-21 07:35:45,880][root][INFO] - Serializing index to /Users/directory/Developer/DPR-main/checkpoints/faiss_index_ctx
[2023-09-21 07:35:45,883][root][INFO] - index search time: 0.002260 sec.
[2023-09-21 07:35:45,884][dpr.data.retriever_data][INFO] - Reading file /Users/directory/Developer/DPR-main/dpr/downloads/data/wikipedia_split/psgs_w100-s.tsv
[2023-09-21 07:35:45,885][root][INFO] - Loaded ctx data: 5
[2023-09-21 07:35:45,885][root][INFO] - validating passages. size=5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - all_docs size 5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - dpr_all_documents size 5
[2023-09-21 07:35:45,925][dpr.data.qa_validation][INFO] - Matching answers in top docs...
2023-09-21 07:35:49,689 [INFO] faiss.loader: Loading faiss with AVX2 support.
2023-09-21 07:35:49,717 [INFO] faiss.loader: Successfully loaded faiss with AVX2 support.
/Users/directory/Developer/DPR-main/dense_retriever.py:472: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
  @hydra.main(config_path="conf", config_name="dense_retriever")
Error executing job with overrides: []
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 127, in check_answer
    doc = dpr_all_documents[doc_id]
NameError: name 'dpr_all_documents' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 628, in main
    questions_doc_hits = validate(
  File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 309, in validate
    match_stats = calculate_matches(passages, answers, result_ctx_ids, workers_num, match_type)
  File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 68, in calculate_matches
    scores = processes.map(get_score_partial, questions_answers_docs)
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
NameError: name 'dpr_all_documents' is not defined