Open trevoradriaanse opened 4 days ago
Hi @trevoradriaanse thanks for reporting this. Just to confirm, you're trying to do the PyPI install, correct? Can you confirm that you're pulling the latest version, 0.43.0?
Hi @lintool that's correct, I verified that pyserini==0.43.0
.
Hrm... I can't seem to reproduce this error. Can you try wget
on these files?
wget https://rgw.cs.uwaterloo.ca/pyserini/data/bge-base-en-v1.5-optimized.onnx
wget https://rgw.cs.uwaterloo.ca/pyserini/data/bge-base-en-v1.5-vocab.txt
The server hosting the models has been flaky of late...
Perhaps you can try again?
The Exception is thrown from this location: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/HnswDenseSearcher.java#L123
But BgeBaseEn15
encoder is defined here:
https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/encoder/dense/BgeBaseEn15Encoder.java
And seems fine...?
@lintool Thanks for this info. After running wget
, I have bge-base-en-v1.5-optimized.onnx
and bge-base-en-v1.5-vocab.txt
, but I'm still getting the same exception. I placed these in ~/.cache/pyserini/encoders
.
Are you sure there's not another anserini fatjar that's conflicting? E.g., you did a pip install -e
that's pointing to an old jar?
In pyserini/lib/python3.10/site-packages/pyserini/resources/jars
, I have just one fatjar, anserini-0.38.0-fatjar.jar
.
I just tried again from scratch and cannot repro you error... :(
What Java version are you on? Mine is:
% java --version
openjdk 21.0.5 2024-10-15 LTS
OpenJDK Runtime Environment Zulu21.38+21-CA (build 21.0.5+11-LTS)
OpenJDK 64-Bit Server VM Zulu21.38+21-CA (build 21.0.5+11-LTS, mixed mode, sharing)
I appreciate the help. I get
openjdk 21.0.5-internal 2024-10-15
OpenJDK Runtime Environment (build 21.0.5-internal-adhoc.conda.src)
OpenJDK 64-Bit Server VM (build 21.0.5-internal-adhoc.conda.src, mixed mode, sharing)
Hrm. Are you having trouble running all ONNX models? Go here: https://castorini.github.io/pyserini/2cr/msmarco-v1-passage.html
Pick another ONNX model and see if it's a BGE-specific issue...
Yeah, it seems to be a more general problem. I ran
python -m pyserini.search.lucene \
--threads 16 --batch-size 512 --dense --hnsw \
--index msmarco-v1-passage.cosdpr-distil.hnsw \
--topics dl19-passage \
--onnx-encoder CosDprDistil \
--output run.msmarco-v1-passage.cosdpr-distil.lucene-hnsw.onnx.dl19.txt \
--hits 1000 --ef-search 1000
which encounters the same issue:
jnius.JavaException: JVM exception occurred: Unable to load Encoder "CosDprDistil". java.lang.IllegalArgumentException
Interesting. Can you try with Anserini?
https://github.com/castorini/anserini?tab=readme-ov-file#-try-it
Run the two commands and let me know what you see?
I'm seeing this traceback fragment for Launching Search Threads
after running the two commands
======== Launching Search Threads ============
2024-11-24 10:57:43,281 INFO [main] search.SearchCollection (SearchCollection.java:1346) - runtag: Anserini
Exception in thread "main" java.lang.RuntimeException
at io.anserini.search.SearchCollection$SearcherThread.<init>(SearchCollection.java:835)
at io.anserini.search.SearchCollection.run(SearchCollection.java:1364)
at io.anserini.search.SearchCollection.main(SearchCollection.java:1412)
I'm perplexed... when I run those two commands, everything works fine. See below for my complete execution trace.
Can you pinpoint where you're diverging?
(pyserini-dev6) jimmylin@iMac-Pro tmp % wget https://repo1.maven.org/maven2/io/anserini/anserini/0.38.0/anserini-0.38.0-fatjar.jar
--2024-11-24 13:38:06-- https://repo1.maven.org/maven2/io/anserini/anserini/0.38.0/anserini-0.38.0-fatjar.jar
Resolving repo1.maven.org (repo1.maven.org)... 2a04:4e42:1e::209, 151.101.124.209
Connecting to repo1.maven.org (repo1.maven.org)|2a04:4e42:1e::209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 187526278 (179M) [application/java-archive]
Saving to: ‘anserini-0.38.0-fatjar.jar’
anserini-0.38.0-fatjar.jar 100%[====================================================================================================================================================================================================================>] 178.84M 51.5MB/s in 3.6s
2024-11-24 13:38:10 (49.6 MB/s) - ‘anserini-0.38.0-fatjar.jar’ saved [187526278/187526278]
(pyserini-dev6) jimmylin@iMac-Pro tmp % java -cp anserini-0.38.0-fatjar.jar io.anserini.search.SearchCollection \
-index msmarco-v1-passage.splade-pp-ed \
-topics msmarco-v1-passage.dev \
-encoder SpladePlusPlusEnsembleDistil \
-output run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt \
-impact -pretokenized
msmarco-v1-passage.splade-pp-ed 100% │███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████│ 2102230/2102230 (0:01:27 / 0:00:00) Downloading...
Decompressing index...
Index decompressed successfully!
2024-11-24 13:39:55,715 INFO [main] search.SearchCollection (SearchCollection.java:1008) - ============ Initializing Searcher ============
2024-11-24 13:39:55,717 INFO [main] search.SearchCollection (SearchCollection.java:1009) - Index: /Users/jimmylin/.cache/pyserini/indexes/lucene-inverted.msmarco-v1-passage.splade-pp-ed.20230524.a59610.2c008fc36131e27966a72292932358e6
Nov 24, 2024 1:39:55 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
2024-11-24 13:39:55,831 INFO [main] search.SearchCollection (SearchCollection.java:1012) - Threads: 4
2024-11-24 13:39:55,831 INFO [main] search.SearchCollection (SearchCollection.java:1013) - Fields: []
2024-11-24 13:39:55,831 INFO [main] search.SearchCollection (SearchCollection.java:1027) - MaxPassage: false
2024-11-24 13:39:55,831 INFO [main] search.SearchCollection (SearchCollection.java:1032) - Hits: 1000
2024-11-24 13:39:55,831 INFO [main] search.SearchCollection (SearchCollection.java:1045) - Collection class: null
2024-11-24 13:39:55,881 INFO [main] search.SearchCollection (SearchCollection.java:1345) - ============ Launching Search Threads ============
2024-11-24 13:39:55,881 INFO [main] search.SearchCollection (SearchCollection.java:1346) - runtag: Anserini
2024-11-24 13:40:06,516 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 100 queries processed
2024-11-24 13:40:13,063 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 200 queries processed
2024-11-24 13:40:20,001 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 300 queries processed
2024-11-24 13:40:27,570 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 400 queries processed
2024-11-24 13:40:38,523 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 500 queries processed
2024-11-24 13:40:50,111 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 600 queries processed
2024-11-24 13:40:57,777 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 700 queries processed
2024-11-24 13:41:05,536 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 800 queries processed
2024-11-24 13:41:13,867 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 900 queries processed
2024-11-24 13:41:21,532 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1000 queries processed
2024-11-24 13:41:27,793 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1100 queries processed
2024-11-24 13:41:34,592 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1200 queries processed
2024-11-24 13:41:41,244 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1300 queries processed
2024-11-24 13:41:50,622 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1400 queries processed
2024-11-24 13:41:58,370 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1500 queries processed
2024-11-24 13:42:05,386 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1600 queries processed
2024-11-24 13:42:14,830 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1700 queries processed
2024-11-24 13:42:23,571 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1800 queries processed
2024-11-24 13:42:32,183 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 1900 queries processed
2024-11-24 13:42:40,854 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2000 queries processed
2024-11-24 13:42:48,597 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2100 queries processed
2024-11-24 13:42:58,171 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2200 queries processed
2024-11-24 13:43:06,772 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2300 queries processed
2024-11-24 13:43:12,299 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2400 queries processed
2024-11-24 13:43:20,951 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2500 queries processed
2024-11-24 13:43:32,919 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2600 queries processed
2024-11-24 13:43:43,206 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2700 queries processed
2024-11-24 13:43:55,918 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2800 queries processed
2024-11-24 13:44:08,059 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 2900 queries processed
2024-11-24 13:44:20,297 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3000 queries processed
2024-11-24 13:44:32,301 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3100 queries processed
2024-11-24 13:44:44,405 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3200 queries processed
2024-11-24 13:44:56,016 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3300 queries processed
2024-11-24 13:45:06,230 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3400 queries processed
2024-11-24 13:45:16,711 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3500 queries processed
2024-11-24 13:45:26,642 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3600 queries processed
2024-11-24 13:45:36,334 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3700 queries processed
2024-11-24 13:45:46,849 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3800 queries processed
2024-11-24 13:45:57,352 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 3900 queries processed
2024-11-24 13:46:08,042 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4000 queries processed
2024-11-24 13:46:18,701 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4100 queries processed
2024-11-24 13:46:28,533 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4200 queries processed
2024-11-24 13:46:38,837 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4300 queries processed
2024-11-24 13:46:50,394 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4400 queries processed
2024-11-24 13:47:03,052 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4500 queries processed
2024-11-24 13:47:17,144 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4600 queries processed
2024-11-24 13:47:32,728 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4700 queries processed
2024-11-24 13:47:53,769 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4800 queries processed
2024-11-24 13:48:17,889 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 4900 queries processed
2024-11-24 13:48:45,684 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5000 queries processed
2024-11-24 13:49:14,770 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5100 queries processed
2024-11-24 13:49:42,819 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5200 queries processed
2024-11-24 13:50:15,414 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5300 queries processed
2024-11-24 13:50:48,380 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5400 queries processed
2024-11-24 13:51:24,235 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5500 queries processed
2024-11-24 13:51:53,145 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5600 queries processed
2024-11-24 13:52:22,449 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5700 queries processed
2024-11-24 13:52:40,688 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5800 queries processed
2024-11-24 13:53:00,007 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 5900 queries processed
2024-11-24 13:53:13,700 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6000 queries processed
2024-11-24 13:53:29,710 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6100 queries processed
2024-11-24 13:53:45,315 INFO [pool-4-thread-4] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6200 queries processed
2024-11-24 13:53:57,844 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6300 queries processed
2024-11-24 13:54:12,430 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6400 queries processed
2024-11-24 13:54:24,889 INFO [pool-4-thread-3] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6500 queries processed
2024-11-24 13:54:36,129 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6600 queries processed
2024-11-24 13:54:46,593 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6700 queries processed
2024-11-24 13:54:58,839 INFO [pool-4-thread-2] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6800 queries processed
2024-11-24 13:55:11,752 INFO [pool-4-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:904) - ranker: impact(), reranker: default: 6900 queries processed
2024-11-24 13:55:22,685 INFO [pool-3-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:925) - ranker: impact(), reranker: default: 6980 queries processed in 00:15:24 = ~7.55 q/s
2024-11-24 13:55:30,405 INFO [main] search.SearchCollection (SearchCollection.java:1418) - Total run time: 00:17:15
I'm seeing
(pyserini) [tadriaanse@test1 scratch]$ java -cp anserini-0.38.0-fatjar.jar io.anserini.search.SearchCollection -index msmarco-v1-passage.splade-pp-ed -topics msmarco-v1-passage.dev -encoder SpladePlusPlusEnsembleDistil -output run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt -impact -pretokenized
Index file already exists! Skip downloading.
Index folder already exists!
2024-11-26 07:12:11,789 INFO [main] search.SearchCollection (SearchCollection.java:1008) - ============ Initializing Searcher ============
2024-11-26 07:12:11,795 INFO [main] search.SearchCollection (SearchCollection.java:1009) - Index: /home/tadriaanse/.cache/pyserini/indexes/lucene-inverted.msmarco-v1-passage.splade-pp-ed.20230524.a59610.2c008fc36131e27966a72292932358e6
Nov 26, 2024 7:12:11 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
2024-11-26 07:12:12,891 INFO [main] search.SearchCollection (SearchCollection.java:1012) - Threads: 4
2024-11-26 07:12:12,892 INFO [main] search.SearchCollection (SearchCollection.java:1013) - Fields: []
2024-11-26 07:12:12,893 INFO [main] search.SearchCollection (SearchCollection.java:1027) - MaxPassage: false
2024-11-26 07:12:12,894 INFO [main] search.SearchCollection (SearchCollection.java:1032) - Hits: 1000
2024-11-26 07:12:12,894 INFO [main] search.SearchCollection (SearchCollection.java:1045) - Collection class: null
2024-11-26 07:12:13,094 INFO [main] search.SearchCollection (SearchCollection.java:1345) - ============ Launching Search Threads ============
2024-11-26 07:12:13,095 INFO [main] search.SearchCollection (SearchCollection.java:1346) - runtag: Anserini
Exception in thread "main" java.lang.RuntimeException
at io.anserini.search.SearchCollection$SearcherThread.<init>(SearchCollection.java:835)
at io.anserini.search.SearchCollection.run(SearchCollection.java:1364)
at io.anserini.search.SearchCollection.main(SearchCollection.java:1412)
Comparing the two traces, it looks like I also get through the calls to SearchCollection.java
and then ultimately crash before starting to process queries.
Hello!
I'm running the installation instructions for HNSW
and I'm getting the following traceback:
I located the encoder model in my Hugging Face cache and noticed there was a
safetensors
formatted model, but I don't see the ONNX model format. Could this be the issue?