facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for the open-domain Q&A task.

OSError: [Errno 12] Cannot allocate memory #56

Closed szhang42 closed 4 years ago

szhang42 commented 4 years ago

Hello,

I am running dense_retriever.py to validate the retriever on nq-train. The command I used and the error I hit are below. Is this error caused by the num_shards setting in the earlier generate_dense_embeddings step, where I ran only --shard_id 0 and --shard_id 19 (both with --num_shards 20) and produced dpr_ctx_0 and dpr_ctx_19? Or is it due to the RAM on my current machine (machine information is also below)? Thanks very much!

Error:

Total encoded queries tensor torch.Size([79168, 768])
index search time: 3956.004276 sec.
Reading data from: output/data/wikipedia_split/psgs_w100.tsv
Matching answers in top docs...
Exception in thread Thread-4949:
Traceback (most recent call last):
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/pool.py", line 412, in _handle_workers
    pool._maintain_pool()
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/pool.py", line 248, in _maintain_pool
    self._repopulate_pool()
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/apps/intel19/python3/3.7.0/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Command:

python3 dense_retriever.py \
    --model_file output/checkpoint/retriever/multiset/bert-base-encoder.cp \
    --ctx_file output/data/wikipedia_split/psgs_w100.tsv \
    --qa_file output/data/retriever/qas/nq-train.csv \
    --encoded_ctx_file output/'dpr_ctx*' \
    --out_file output/dpr_retrieval/nq-train.json \
    --n-docs 100 \
    --validation_workers 32 \
    --batch_size 64

Machine information:

Accelerators: 4 NVIDIA Quadro RTX 5000 / node
CUDA Parallel Processing Cores: 3072 / card
NVIDIA Tensor Cores: 384 / card
GPU Memory: 16GB GDDR6 / card
CPUs: 2 Intel Xeon E5-2620 v4 ("Broadwell")
RAM: 128GB (2133 MT/s) DDR4
Local storage: 144GB /tmp partition on a 240GB SSD

vlad-karpukhin commented 4 years ago

Hello,

This error means you don't have enough RAM. 128GB is usually not enough for the flat index type. The dense_retriever.py process alone takes 95 GB of RAM with our flat index.
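
As a rough sanity check on the numbers above, here is a minimal sketch that estimates how much memory the raw vectors of a flat (uncompressed) index occupy. The 768 dimension comes from the query tensor shape in the log; the passage count of roughly 21 million and the float32 storage are assumptions, and `flat_index_bytes` is a hypothetical helper, not part of DPR. The estimate covers only the stored vectors, not the rest of the process's memory.

```python
def flat_index_bytes(num_vectors: int, dim: int = 768, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold num_vectors dense vectors without compression.

    dim=768 matches the query tensor shape in the log above;
    bytes_per_value=4 assumes float32 storage (an assumption).
    """
    return num_vectors * dim * bytes_per_value


# Assumption: about 21 million passages in the Wikipedia split.
NUM_PASSAGES = 21_000_000
# From the log: 79168 encoded queries.
NUM_QUERIES = 79_168

index_gb = flat_index_bytes(NUM_PASSAGES) / 1e9
queries_gb = flat_index_bytes(NUM_QUERIES) / 1e9

print(f"passage vectors: {index_gb:.1f} GB")
print(f"query vectors:   {queries_gb:.2f} GB")
```

Under these assumptions the passage vectors alone already take tens of gigabytes, so lowering --validation_workers may reduce the number of forked processes but does not shrink the index itself.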

szhang42 commented 4 years ago

Hello,

I see. I did see a previously closed issue about this too. I will try running this on a larger server. Thanks!

mug2mag commented 3 years ago

@szhang42 Have you solved the problem? Did it work on the 128GB server, or did you need a larger one?