castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Dense retrieval replication: efficiency notes #297

Closed. MXueguang closed this issue 3 years ago.

MXueguang commented 3 years ago

HNSW index (single query):

As discussed in https://github.com/castorini/pyserini/pull/292:

Replication from @lintool

macOS, Intel Xeon W CPU @ 2.20GHz, 18 cores
trial 1: 6980/6980 [05:16<00:00, 22.08it/s]
trial 2: 6980/6980 [05:02<00:00, 23.04it/s]
MRR@10: 0.33394187474416553

Replication from @justram

Machine 1 (Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 56 cores):
6980/6980 [11:52<00:00, 9.79it/s]
MRR@10: 0.33395142584254184

Machine 2 (Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, 56 cores):
6980/6980 [07:21<00:00, 15.80it/s]
MRR@10: 0.33395142584254184

Replication from @MXueguang

Ubuntu 20.04.1 LTS, Intel Core i7-8700K CPU @ 3.70GHz, 12 cores
6980/6980 [03:39<00:00, 31.86it/s] (run with 12 cores)
6980/6980 [03:13<00:00, 36.15it/s] (run with 8 cores)
6980/6980 [03:07<00:00, 37.26it/s] (run with 4 cores)
6980/6980 [02:53<00:00, 40.12it/s] (run with 1 core)
MRR@10: 0.33395142584254184

It seems that multithreading a single query for HNSW doesn't improve efficiency; if anything, throughput goes up as the number of cores goes down.

justram commented 3 years ago

FYI, here are results after changing the OMP_NUM_THREADS environment variable:

Ubuntu 20.04.1 LTS (Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 56 cores):
6980/6980 [11:52<00:00, 9.79it/s] (run with 56 cores)
6980/6980 [06:35<00:00, 17.66it/s] (run with 12 cores)
6980/6980 [06:14<00:00, 18.65it/s] (run with 8 cores)
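
For context, here is a minimal sketch of how the same thread cap can be set from Python rather than through the environment (this assumes the HNSW index is served by FAISS, as in Pyserini's dense search; the index path is hypothetical):

import faiss

# Cap the OpenMP threads FAISS uses internally; this mirrors
# setting OMP_NUM_THREADS before launching the process.
faiss.omp_set_num_threads(8)

# Hypothetical on-disk location of the HNSW index.
index = faiss.read_index("indexes/msmarco-passage-tct_colbert-hnsw/index")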
MXueguang commented 3 years ago

Maybe we want to replicate HNSW by searching in batches? i.e.,

python -m pyserini.dsearch --topics msmarco_passage_dev_subset \
                             --index msmarco-passage-tct_colbert-hnsw \
                             --encoded-queries msmarco-passage-dev-subset-tct_colbert \
                             --batch 12  \
                             --output runs/run.msmarco-passage.tct_colbert.hnsw.tsv \
                             --msmarco
582/582 [01:01<00:00,  9.45it/s]
MRR@10: 0.33395142584254184

It will make things faster, as it did for the brute-force index.

lintool commented 3 years ago

Yes, we should carefully analyze the effects of intra-query parallelism vs. inter-query parallelism.

The former is splitting up a single query across multiple threads. The latter is batching.
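
To make the contrast concrete, here is a rough sketch in FAISS terms (the index path and query embeddings are illustrative placeholders, not Pyserini's actual code):

import numpy as np
import faiss

index = faiss.read_index("path/to/hnsw/index")        # hypothetical path
queries = np.random.rand(56, 768).astype("float32")   # stand-in query embeddings

# Intra-query parallelism would mean splitting the work of each
# single-query call across threads (controlled via OMP_NUM_THREADS).
for q in queries:
    scores, ids = index.search(q.reshape(1, -1), 10)

# Inter-query parallelism: stack queries into one matrix and make a
# single batched call; FAISS parallelizes across the queries.
scores, ids = index.search(queries, 10)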

justram commented 3 years ago

It's way better to set a large batch size.

cmd:

export OMP_NUM_THREADS=56
python -m pyserini.dsearch --topics msmarco_passage_dev_subset \
                             --index msmarco-passage-tct_colbert-hnsw \
                             --encoded-queries msmarco-passage-dev-subset-tct_colbert \
                             --batch 56  \
                             --output runs/run.msmarco-passage.tct_colbert.hnsw.tsv \
                             --msmarco
Ubuntu 20.04.1 LTS (Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 56 cores):
125/125 [01:12<00:00,  1.73it/s]
lintool commented 3 years ago

Which means we should have separate --threads and --batch options?

MXueguang commented 3 years ago

Which means we should have separate --threads and --batch options?

I think so. I'll do that.
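
A minimal sketch of what the separate options might look like (flag names follow this thread; the actual implementation may differ):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--threads', type=int, default=1,
                    help='Threads used within a single search call (intra-query).')
parser.add_argument('--batch', type=int, default=1,
                    help='Queries grouped into one search call (inter-query).')
args = parser.parse_args()

# args.threads would feed something like faiss.omp_set_num_threads(...),
# while queries get chunked into groups of args.batch before searching.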

justram commented 3 years ago

Brute force replication for the record:

cmd:
python -m pyserini.dsearch --topics msmarco_passage_dev_subset \
                             --index msmarco-passage-tct_colbert-bf \
                             --encoded-queries msmarco-passage-dev-subset-tct_colbert \
                             --batch 56  \
                             --output runs/run.msmarco-passage.tct_colbert.bf.tsv \
                             --msmarco
Ubuntu 20.04.1 LTS (Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 56 cores):
125/125 [45:00<00:00, 21.61s/it]
MRR@10: 0.33444603629417247
lintool commented 3 years ago

I ran different batch sizes on my iMac Pro:

batch =  24: 291/291 [07:58<00:00,  1.65s/it]
batch =  36: 194/194 [06:20<00:00,  1.96s/it]
batch =  48: 146/146 [05:13<00:00,  2.15s/it]
batch =  60: 117/117 [05:00<00:00,  2.57s/it]
batch =  72:   97/97 [04:20<00:00,  2.69s/it]
batch =  84:   84/84 [04:14<00:00,  3.03s/it]
batch =  96:   73/73 [04:16<00:00,  3.51s/it]
batch = 108:   65/65 [03:55<00:00,  3.62s/it]
batch = 120:   59/59 [03:52<00:00,  3.94s/it]

Seems to like big batches...
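
For anyone who wants to reproduce a sweep like this outside the CLI, a rough timing harness might look as follows (the index path and query embeddings are placeholders):

import time
import numpy as np
import faiss

index = faiss.read_index("path/to/index")               # hypothetical path
queries = np.random.rand(6980, 768).astype("float32")   # stand-in embeddings

for batch in (24, 36, 48, 60, 72, 84, 96, 108, 120):
    start = time.time()
    for i in range(0, len(queries), batch):
        index.search(queries[i:i + batch], 10)
    print(f"batch = {batch:3d}: {time.time() - start:6.1f}s")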

MXueguang commented 3 years ago

I ran different batch sizes on my iMac Pro: Seems to like big batches...

With a fixed number of threads?

lintool commented 3 years ago

Yes, I didn't specify --threads, so it used whatever the default is. I ran this on the original PR.