facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for open-domain Q&A tasks.

'dpr_all_documents' is not defined #249

Open golubovic opened 11 months ago

golubovic commented 11 months ago

Hi,

I am experiencing an issue with the global variable ‘dpr_all_documents’ that appears to involve tokenizer parallelism; please see the logs below. This issue has been raised before for the DPR repo.

Note that:

  1. all_docs size has the expected value (I use a test document of 5 entries for testing purposes rather than the wiki dataset; please see the log below).
  2. validation_workers is set to 1 in dense_retriever.yaml (that setting isn't the problem; I set it to 1 just as a safety measure).
  3. I have tried setting TOKENIZERS_PARALLELISM=false (it makes no difference). NOTE: Transformers library "0.8.0rc4" currently has an issue with this setting not taking effect.
  4. I have tried downgrading the transformers and tokenizers libraries to previous versions (no success). A good comment by [Allohvk] on what is going on with the Rust tokenizers used by Hugging Face can be found here: https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning
  5. I have tried refactoring dpr_all_documents so that it is passed as a regular function parameter, and removing the ‘global’ declaration. That, however, results in a ‘KeyError’ exception for the id_prefix of the datasource defined in default_sources.yaml.
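For reference, a minimal sketch of how the TOKENIZERS_PARALLELISM setting from item 3 can be applied from inside Python rather than from the shell. The helper name `tokenizers_parallelism_setting` is hypothetical and only illustrates a sanity check; the one assumption that matters is that the variable is set before the tokenizers library is first used in the process.

```python
import os

# Set the variable before any tokenizer is created or used in this process.
# Setting it after tokenizers have already run may be too late to have an effect.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def tokenizers_parallelism_setting() -> str:
    """Hypothetical helper: return the value that child processes will inherit."""
    return os.environ.get("TOKENIZERS_PARALLELISM", "<unset>")


if __name__ == "__main__":
    print("TOKENIZERS_PARALLELISM =", tokenizers_parallelism_setting())
```

Because worker processes inherit the parent's environment, setting the variable at the top of the entry script covers the multiprocessing pool as well.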

Please let me know if you have any questions.

Thanks, Mladen

Logs:
[2023-09-21 07:35:40,997][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:43,260][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:44,405][root][INFO] - Loading saved model state ...
[2023-09-21 07:35:44,611][root][INFO] - Selecting standard question encoder
[2023-09-21 07:35:44,677][root][INFO] - Encoder vector_size=768
[2023-09-21 07:35:44,677][root][INFO] - qa_dataset: dpr_ds_retreiving_questions
[2023-09-21 07:35:44,680][root][INFO] - questions len 6
[2023-09-21 07:35:44,680][root][INFO] - questions_text len 0
[2023-09-21 07:35:44,680][root][INFO] - Local Index class <class 'dpr.indexer.faiss_indexers.DenseFlatIndexer'>
[2023-09-21 07:35:44,680][root][INFO] - Using special token None
[2023-09-21 07:35:45,875][root][INFO] - Total encoded queries tensor torch.Size([6, 768])
[2023-09-21 07:35:45,877][root][INFO] - ctx_sources: <class 'dpr.data.retriever_data.CsvCtxSrc'>
[2023-09-21 07:35:45,877][root][INFO] - id_prefixes per dataset: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,877][root][INFO] - ctx_files_patterns: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Embeddings files id prefixes: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,878][root][INFO] - Reading all passages data from files: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Reading file /Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0
[2023-09-21 07:35:45,880][root][INFO] - data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Total data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Data indexing completed.
[2023-09-21 07:35:45,880][root][INFO] - Serializing index to /Users/directory/Developer/DPR-main/checkpoints/faiss_index_ctx
[2023-09-21 07:35:45,883][root][INFO] - index search time: 0.002260 sec.
[2023-09-21 07:35:45,884][dpr.data.retriever_data][INFO] - Reading file /Users/directory/Developer/DPR-main/dpr/downloads/data/wikipedia_split/psgs_w100-s.tsv
[2023-09-21 07:35:45,885][root][INFO] - Loaded ctx data: 5
[2023-09-21 07:35:45,885][root][INFO] - validating passages. size=5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - all_docs size 5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - dpr_all_documents size 5
[2023-09-21 07:35:45,925][dpr.data.qa_validation][INFO] - Matching answers in top docs...
2023-09-21 07:35:49,689 [INFO] faiss.loader: Loading faiss with AVX2 support.
2023-09-21 07:35:49,717 [INFO] faiss.loader: Successfully loaded faiss with AVX2 support.
/Users/directory/Developer/DPR-main/dense_retriever.py:472: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
  @hydra.main(config_path="conf", config_name="dense_retriever")
Error executing job with overrides: []
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 127, in check_answer
    doc = dpr_all_documents[doc_id]
NameError: name 'dpr_all_documents' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 628, in main
    questions_doc_hits = validate(
  File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 309, in validate
    match_stats = calculate_matches(passages, answers, result_ctx_ids, workers_num, match_type)
  File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 68, in calculate_matches
    scores = processes.map(get_score_partial, questions_answers_docs)
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
NameError: name 'dpr_all_documents' is not defined

Hannibal046 commented 11 months ago

The dpr_all_documents is defined here and it works on my side: https://github.com/facebookresearch/DPR/blob/a31212dc0a54dfa85d8bfa01e1669f149ac832b7/dpr/data/qa_validation.py#L56C1-L57

golubovic commented 11 months ago

The issue with dpr_all_documents arises when running dense_retriever.py with a small input dataset. The log above comes from an input dataset of six questions in total. When the dataset is very small, the issue surfaces: dpr_all_documents is not available to all of the processes that try to access it.

A simple (though not optimal) workaround is to pass the variable as an additional parameter when building the partial in the calculate_matches function (please see below). The drawback is inefficient memory use, since the documents are copied to the worker processes along with the tasks.

get_score_partial = partial(check_answer, match_type=match_type, tokenizer=tokenizer, dpr_all_documents=dpr_all_documents)
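A possible explanation, not confirmed in this thread: the paths in the log point to macOS, where Python 3.8+ starts worker processes with the `spawn` method by default. Spawned workers re-import the module instead of inheriting the parent's memory, so a global that the parent assigns at runtime does not exist in the workers. Under that assumption, an alternative to passing the documents through `partial` is to set the global once per worker with a pool initializer. The sketch below is self-contained and does not use the DPR code; `init_worker`, the simplified `check_answer`, and the simplified `calculate_matches` signature are hypothetical stand-ins.

```python
import multiprocessing as mp
from functools import partial

# Stand-in for the module-level global used in dpr/data/qa_validation.py.
dpr_all_documents = None


def init_worker(all_docs):
    """Runs once in every worker process and sets the global there."""
    global dpr_all_documents
    dpr_all_documents = all_docs


def check_answer(doc_id, answer):
    """Simplified check: is the answer string contained in the document text?"""
    return answer in dpr_all_documents[doc_id]


def calculate_matches(all_docs, doc_ids, answer, workers_num=2, start_method=None):
    # start_method=None uses the platform default ("spawn" on macOS);
    # pass "spawn" explicitly to get the macOS behaviour on other platforms.
    ctx = mp.get_context(start_method)
    with ctx.Pool(
        processes=workers_num,
        initializer=init_worker,   # called once per worker, not once per task
        initargs=(all_docs,),
    ) as pool:
        return pool.map(partial(check_answer, answer=answer), doc_ids)


if __name__ == "__main__":
    docs = {
        "p:1": "Paris is the capital of France.",
        "p:2": "Berlin is the capital of Germany.",
    }
    print(calculate_matches(docs, ["p:1", "p:2"], "Paris"))
```

With this pattern the documents are sent to each worker once at startup rather than with every task, which addresses the memory concern of the `partial` workaround while still working when workers do not inherit the parent's globals.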