Closed: Panizghi closed this 1 month ago.
To efficiently perform NFCorpus indexing using Safetensors, follow this setup workflow:
wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-bge-base-en-v1.5.tar -P collections/
tar xvf collections/beir-v1.0.0-bge-base-en-v1.5.tar -C collections/
cd /anserini/src/main/python/safetensors
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 -m json_to_bin
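As an aside, safetensors stores only numeric tensors, so the string docids have to be encoded somehow before they can be saved alongside the vectors. A minimal sketch of one way this could be done (`pack_docids`/`unpack_docids` are hypothetical helpers, not the script's actual API), assuming fixed-width ASCII with zero padding:

```python
import numpy as np

def pack_docids(docids):
    """Pack string docids into a fixed-width uint8 ASCII matrix (zero-padded)."""
    width = max(len(d) for d in docids)
    arr = np.zeros((len(docids), width), dtype=np.uint8)
    for i, d in enumerate(docids):
        arr[i, : len(d)] = np.frombuffer(d.encode("ascii"), dtype=np.uint8)
    return arr

def unpack_docids(arr):
    """Invert pack_docids by stripping the zero padding from each row."""
    return [bytes(row[row != 0]).decode("ascii") for row in arr]

ids = ["MED-10", "MED-2036"]
packed = pack_docids(ids)          # a (2, 8) uint8 array, safetensors-compatible
assert unpack_docids(packed) == ids
```

The resulting uint8 array can then be saved as its own tensor next to the vectors tensor.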
To build HNSWSafetensors indexes, use the following sample command:
bin/run.sh io.anserini.index.SafeTensorsIndexCollection \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-threads 9 -storePositions -storeDocvectors -storeRaw \
-vectorsDirectory target/safetensors/vectors \
-docidsDirectory target/safetensors/docids \
-docidToIdxDirectory target/safetensors/docid_to_idx \
>& logs/log.beir-v1.0.0-bge-base-en-v1.5 &
Ensure all paths and parameters are adjusted according to your setup and directory structure.
Can you make the safetensors collection go into collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/, alongside the original? So all files should go into collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/.
We also shouldn't need a new indexer. The indexing command should be similar to https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus-bge-base-en-v1.5-hnsw.md
e.g.,
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input /path/to/beir-v1.0.0-bge-base-en-v1.5 \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus-bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
>& logs/log.beir-v1.0.0-bge-base-en-v1.5 &
With the only exception being a different -generator.
python src/main/python/safetensors/json_to_bin.py
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
>& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &
python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
>& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &
Looking at this command:
bin/run.sh io.anserini.index.SafeTensorsIndexCollection \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-threads 9 -storePositions -storeDocvectors -storeRaw \
-vectorsDirectory target/safetensors/vectors \
-docidsDirectory target/safetensors/docids \
-docidToIdxDirectory target/safetensors/docid_to_idx \
>& logs/log.beir-v1.0.0-bge-base-en-v1.5 &
What are these three options doing?
-vectorsDirectory target/safetensors/vectors \
-docidsDirectory target/safetensors/docids \
-docidToIdxDirectory target/safetensors/docid_to_idx \
And why are these the same?
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
I would expect -index to specify the location of the index?
I think you are looking at the older command this is the updated one
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &
Ah, can you please update this to keep it current?
My apologies, it got lost within all the commits :) It's right here: https://github.com/castorini/anserini/pull/2515#issuecomment-2216502740
python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &
Sorry, I'm confused again:
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &
Why would -collection be JsonDenseVectorCollection now? Currently, DenseVectorDocumentGenerator reads from JSON: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/DenseVectorDocumentGenerator.java
So we'd have something like SafeTensorsDenseVectorCollection that reads from SafeTensors?
I'm not getting your logic, but I think you need to implement two classes:
SafeTensorsDenseVectorCollection
SafeTensorsDenseVectorDocumentGenerator
And your command would be something like -collection SafeTensorsDenseVectorCollection ... -generator SafeTensorsDenseVectorDocumentGenerator.
And you'd "wire everything together".
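Purely as an illustration of that division of labor (the real classes are Java, so this Python sketch is only a rough analogy, with made-up function names): the collection is responsible for iterating (docid, vector) pairs out of the converted files, while the generator turns each pair into an indexable document.

```python
import numpy as np

def iterate_collection(docids, vectors):
    """Collection role: yield one (docid, vector) pair per document."""
    assert len(docids) == vectors.shape[0], "one docid per vector row"
    for docid, vec in zip(docids, vectors):
        yield docid, vec

def generate_document(docid, vec):
    """Generator role: turn a (docid, vector) pair into an indexable record."""
    return {"id": docid, "vector": vec.tolist()}

# Wiring the two together, as the indexer driver would:
docids = ["MED-10", "MED-14"]
vectors = np.zeros((2, 768), dtype=np.float32)
docs = [generate_document(d, v) for d, v in iterate_collection(docids, vectors)]
```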
Updated command:
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &
@Panizghi if I'm reading your code correctly, you're assuming that there's only one vector file per directory, right? This is not necessarily the case.
For example, for robust04
:
$ ls robust04/
vectors.part00.jsonl.gz vectors.part01.jsonl.gz vectors.part02.jsonl.gz vectors.part03.jsonl.gz vectors.part04.jsonl.gz vectors.part05.jsonl.gz
@Panizghi on your branch, running:
$ python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl.gz \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
Works fine. However, I would like some progress indication... e.g., using tqdm?
Also, what do I do if there is more than one vector part?
However, more compact, as expected, which is good.
$ du -h collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/
22M collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/
$ du -h collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/
84M collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/
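For the progress indication, one possible sketch (assuming the `{"docid": ..., "vector": [...]}` JSONL layout above; `read_vectors` is a hypothetical helper, and it falls back to a pass-through iterator if tqdm isn't installed):

```python
import gzip
import json

try:
    from tqdm import tqdm  # progress bar, if available
except ImportError:
    def tqdm(iterable, **kwargs):  # no-op fallback
        return iterable

def read_vectors(path):
    """Yield (docid, vector) from a .jsonl or .jsonl.gz file with a progress bar."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in tqdm(f, desc=path, unit=" lines"):
            rec = json.loads(line)
            yield rec["docid"], rec["vector"]
```

Since JSONL is line-oriented, wrapping the file handle in tqdm gives a running line count without needing to know the total up front.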
Running indexing command:
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge
Something's not right... get an exception:
2024-08-23 07:40:33,960 INFO [main] index.AbstractIndexer (AbstractIndexer.java:205) - Setting log level to INFO
2024-08-23 07:40:33,963 INFO [main] index.AbstractIndexer (AbstractIndexer.java:208) - ============ Loading Index Configuration ============
2024-08-23 07:40:33,963 INFO [main] index.AbstractIndexer (AbstractIndexer.java:209) - AbstractIndexer settings:
2024-08-23 07:40:33,963 INFO [main] index.AbstractIndexer (AbstractIndexer.java:210) - + DocumentCollection path: collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
2024-08-23 07:40:33,963 INFO [main] index.AbstractIndexer (AbstractIndexer.java:211) - + CollectionClass: SafeTensorsDenseVectorCollection
2024-08-23 07:40:33,963 INFO [main] index.AbstractIndexer (AbstractIndexer.java:212) - + Index path: indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/
2024-08-23 07:40:33,964 INFO [main] index.AbstractIndexer (AbstractIndexer.java:213) - + Threads: 16
2024-08-23 07:40:33,964 INFO [main] index.AbstractIndexer (AbstractIndexer.java:214) - + Optimize (merge segments)? false
Aug 23, 2024 7:40:34 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
2024-08-23 07:40:34,217 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:149) - HnswIndexer settings:
2024-08-23 07:40:34,217 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:150) - + Generator: SafeTensorsDenseVectorDocumentGenerator
2024-08-23 07:40:34,218 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:151) - + M: 16
2024-08-23 07:40:34,218 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:152) - + efC: 100
2024-08-23 07:40:34,218 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:153) - + Store document vectors? false
2024-08-23 07:40:34,218 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:154) - + Int8 quantization? false
2024-08-23 07:40:34,218 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:155) - + Codec: Lucene99
2024-08-23 07:40:34,219 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:156) - + MemoryBuffer: 65536
2024-08-23 07:40:34,219 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:157) - + MaxThreadMemoryBeforeFlush: 2047
2024-08-23 07:40:34,219 INFO [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:160) - + MergePolicy: NoMerge
2024-08-23 07:40:34,219 INFO [main] index.AbstractIndexer (AbstractIndexer.java:238) - ============ Indexing Collection ============
2024-08-23 07:40:34,222 INFO [main] index.AbstractIndexer (AbstractIndexer.java:247) - Thread pool with 16 threads initialized.
2024-08-23 07:40:34,222 INFO [main] index.AbstractIndexer (AbstractIndexer.java:248) - 2 files found in collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
2024-08-23 07:40:34,222 INFO [main] index.AbstractIndexer (AbstractIndexer.java:249) - Starting to index...
2024-08-23 07:40:34,225 INFO [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:48) - Processing document ID: MED-10 with thread: pool-2-thread-1
2024-08-23 07:40:34,225 WARN [pool-2-thread-2] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:43) - Document ID: MED-10 is already being processed by another thread.
java.lang.NullPointerException
at java.base/java.util.Objects.requireNonNull(Objects.java:233)
at java.base/java.util.ImmutableCollections$List12.<init>(ImmutableCollections.java:563)
at java.base/java.util.List.of(List.java:937)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1476)
at io.anserini.index.AbstractIndexer$IndexerThread.run(AbstractIndexer.java:135)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
2024-08-23 07:40:34,229 ERROR [pool-2-thread-2] index.AbstractIndexer$IndexerThread (AbstractIndexer.java:179) - pool-2-thread-2: Unexpected Exception:
2024-08-23 07:40:34,235 INFO [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:56) - Vector length: 768 for document ID: MED-10
Aug 23, 2024 7:40:34 AM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
2024-08-23 07:40:34,277 INFO [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:64) - Document created for ID: MED-10
Yes, that is correct. Per the early discussion we kept it only for nfcorpus with a single file; I will update the code to handle multiple files.
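The multi-file handling could look something like this (a sketch only; `convert_file` stands in for the script's per-file conversion and is hypothetical):

```python
import glob
import os

def convert_directory(input_dir, output_dir, convert_file):
    """Convert every vectors.part*.jsonl[.gz] in input_dir, in part order."""
    parts = sorted(glob.glob(os.path.join(input_dir, "vectors.part*.jsonl*")))
    if not parts:
        raise FileNotFoundError(f"no vectors.part*.jsonl files in {input_dir}")
    os.makedirs(output_dir, exist_ok=True)
    for part in parts:
        convert_file(part, output_dir)
    return parts
```

Sorting the glob results keeps the parts in part00, part01, ... order, which matters if the docid-to-index mapping is assigned sequentially across parts.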
This should be fixed now and works with the same command.
tqdm has been added. There is also an --overwrite argument, which you can use if the output file already exists.
Command:
python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus --overwrite
Sample output:
Processing lines: 100%|█████████████████████████████████████████████████| 3633/3633 [00:01<00:00, 3347.56it/s]
2024-08-25 00:53:02,642 - INFO - Saved vectors to collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_vectors.safetensors
2024-08-25 00:53:02,643 - INFO - Saved docids to collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_docids.safetensors
2024-08-25 00:53:02,643 - INFO - Loaded vectors from collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_vectors.safetensors
2024-08-25 00:53:02,644 - INFO - Loaded document IDs (ASCII) from collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_docids.safetensors
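To sanity-check the converted files without going through Java, you can inspect the safetensors header directly: per the safetensors format, a file starts with an 8-byte little-endian header length, followed by a JSON header mapping tensor names to their dtype, shape, and data offsets. A stdlib-only sketch:

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file (tensor name -> metadata)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))
```

Checking that the vectors tensor's shape matches the number of JSONL lines (3633 above) is a quick way to confirm nothing was dropped in conversion.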
For vector parts, are we considering a case like this?
{
"docid": "MED-10",
"vector_1": [0.00344, 0.00231, ...],
"vector_2": [0.00112, 0.00456, ...]
}
Okay, I can now run these commands:
python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl.gz \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus --overwrite
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge
After I build the index, I should be able to switch to retrieval, here: https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus.bge-base-en-v1.5.hnsw.onnx.md
The retrieval command is this:
bin/run.sh io.anserini.search.SearchHnswDenseVectors \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-topics tools/topics-and-qrels/topics.beir-v1.0.0-nfcorpus.test.tsv.gz \
-topicReader TsvString \
-output runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt \
-generator VectorQueryGenerator -topicField title -removeQuery -threads 16 -hits 1000 -efSearch 1000 -encoder BgeBaseEn15
But the eval command generates errors:
$ bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-nfcorpus.test.txt runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt
WARNING: Using incubator modules: jdk.incubator.vector
trec_eval.form_res_qrels: duplicate docs MED-1000
trec_eval: Can't calculate measure 'ndcg_cut'
From here:
$ head runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt
PLAIN-1008 Q0 MED-2036 1 0.776562 Anserini
PLAIN-1008 Q0 MED-2036 2 0.776562 Anserini
PLAIN-1008 Q0 MED-5135 3 0.775252 Anserini
PLAIN-1008 Q0 MED-5135 4 0.775252 Anserini
PLAIN-1008 Q0 MED-4694 5 0.774549 Anserini
PLAIN-1008 Q0 MED-4694 6 0.774549 Anserini
PLAIN-1008 Q0 MED-3865 7 0.773869 Anserini
PLAIN-1008 Q0 MED-3865 8 0.773869 Anserini
PLAIN-1008 Q0 MED-3316 9 0.771660 Anserini
PLAIN-1008 Q0 MED-3316 10 0.771660 Anserini
I appear to be getting duplicates of docs, e.g., MED-2036. Are you somehow indexing everything twice?
That was initially the reason I swapped to a single thread and added a critical section; I'm testing the fix right now.
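A quick way to check a run file for the duplicates trec_eval complains about, before re-running eval (a hypothetical helper, not part of Anserini; assumes the standard TREC run format `topic Q0 docid rank score tag`):

```python
from collections import defaultdict

def find_duplicates(run_lines):
    """Return {topic: [docids that appear more than once]} for a TREC run."""
    seen = defaultdict(set)
    dups = defaultdict(list)
    for line in run_lines:
        topic, _q0, docid, _rank, _score, _tag = line.split()
        if docid in seen[topic]:
            dups[topic].append(docid)
        else:
            seen[topic].add(docid)
    return dict(dups)
```

On the run above this would flag MED-2036, MED-5135, etc. under PLAIN-1008, confirming each document was indexed twice rather than the scores being coincidentally tied.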
Updated command:
python src/main/python/safetensors/json_to_bin.py --input collections/robust04 --output collections/robust04.safetensors/ --overwrite
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge
Original JSONL Files (Total: 3.6 GB):
vectors.part00.jsonl.gz: 683 MB
vectors.part01.jsonl.gz: 683 MB
vectors.part02.jsonl.gz: 682 MB
vectors.part03.jsonl.gz: 683 MB
vectors.part04.jsonl.gz: 683 MB
vectors.part05.jsonl.gz: 192 MB
Converted Safetensor Files (Total: 3.1 GB):
vectors.part00_vectors.safetensors: 586 MB
vectors.part01_vectors.safetensors: 586 MB
vectors.part02_vectors.safetensors: 586 MB
vectors.part03_vectors.safetensors: 586 MB
vectors.part04_vectors.safetensors: 586 MB
vectors.part05_vectors.safetensors: 165 MB
vectors.part00_docids.safetensors: 13 MB
vectors.part01_docids.safetensors: 10 MB
vectors.part02_docids.safetensors: 10 MB
vectors.part03_docids.safetensors: 13 MB
vectors.part04_docids.safetensors: 10 MB
vectors.part05_docids.safetensors: 2.8 MB
Superseded by #2582
Linked issue: https://github.com/castorini/ura-projects/issues/31#issuecomment-2076092779 @17Melissa will provide the flow command below :)