castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

HnswDensevector SafeTensor Generator #2515

Closed Panizghi closed 1 month ago

Panizghi commented 4 months ago

Linked issue: https://github.com/castorini/ura-projects/issues/31#issuecomment-2076092779 @17Melissa will provide the workflow commands below :)

17Melissa commented 4 months ago

Setup for NFCorpus Indexing with Safetensors

To efficiently perform NFCorpus indexing using Safetensors, follow this setup workflow:

  1. Download and Unzip Collections
    • Begin by downloading the necessary collections and unzipping them. For instance:
        wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-bge-base-en-v1.5.tar -P collections/
        tar xvf collections/beir-v1.0.0-bge-base-en-v1.5.tar -C collections/
  2. Prepare the Environment
    • Navigate to the Safetensors directory within the Anserini project:
        cd /anserini/src/main/python/safetensors
    • Create and activate the virtual environment:
        python3 -m venv venv
        source venv/bin/activate
    • Install the required Python packages:
        pip install -r requirements.txt
  3. Convert JSON to Safetensors Format
    • Use the provided script to convert JSON files to Safetensors format:
        python3 -m json_to_bin
    • The script will create the following files in the target directory:
      • Saved vectors to ../../../../target/safetensors/vectors/part00_vectors.safetensors
      • Saved docids to ../../../../target/safetensors/docids/part00_docids.safetensors
      • Saved docid_to_idx mapping to ../../../../target/safetensors/docid_to_idx/part00_docid_to_idx.json
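To illustrate what the conversion step produces, the sketch below parses JSONL records into a vector list, a docid list, and a docid_to_idx mapping. It is a minimal sketch and not the actual json_to_bin script: the field names "docid" and "vector" and the function name parse_jsonl_vectors are assumptions, and writing the .safetensors files is left out.

```python
import json


def parse_jsonl_vectors(lines):
    """Split JSONL records into parallel docid/vector lists plus a lookup map.

    Assumes (hypothetically) that each record has a "docid" string and a
    "vector" list of floats; the real collection schema may differ.
    """
    docids, vectors, docid_to_idx = [], [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        docid = record["docid"]
        if docid in docid_to_idx:
            raise ValueError(f"duplicate docid: {docid}")
        docid_to_idx[docid] = len(docids)  # row index of this document's vector
        docids.append(docid)
        vectors.append([float(x) for x in record["vector"]])
    return docids, vectors, docid_to_idx


# Two made-up records, only to show the shape of the outputs.
sample = [
    '{"docid": "MED-10", "vector": [0.1, 0.2, 0.3]}',
    '{"docid": "MED-14", "vector": [0.4, 0.5, 0.6]}',
]
docids, vectors, docid_to_idx = parse_jsonl_vectors(sample)
print(docid_to_idx)
```

Keeping the docid list and the vector rows in the same order is what makes the docid_to_idx mapping meaningful: the index stored for a docid is the row of its vector.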

Indexing Procedure

To build HNSW indexes from the Safetensors files, use the following sample command:

bin/run.sh io.anserini.index.SafeTensorsIndexCollection \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -threads 9 -storePositions -storeDocvectors -storeRaw \
  -vectorsDirectory target/safetensors/vectors \
  -docidsDirectory target/safetensors/docids \
  -docidToIdxDirectory target/safetensors/docid_to_idx \
  >& logs/log.beir-v1.0.0-bge-base-en-v1.5 &

Ensure all paths and parameters are adjusted according to your setup and directory structure.

lintool commented 4 months ago

Can you make the safetensors collection go into collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/, alongside the original? So all files should go into collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/.

We also shouldn't need a new indexer. The indexing command should be similar to https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus-bge-base-en-v1.5-hnsw.md

e.g.,

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input /path/to/beir-v1.0.0-bge-base-en-v1.5 \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus-bge-base-en-v1.5/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-bge-base-en-v1.5 &

With the only exception being a different -generator.

17Melissa commented 4 months ago

Updated Workflow for Safetensors Conversion and Indexing Process

  1. Create Directory: Create the safetensors folder collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
  2. Run Conversion Script: Execute the python script json_to_bin.py from the root directory using the command: python src/main/python/safetensors/json_to_bin.py
  3. Execute Indexing Command: After the conversion script completes, run the indexing command below:
    bin/run.sh io.anserini.index.IndexHnswDenseVectors \
      -collection JsonDenseVectorCollection \
      -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
      -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
      -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
      -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
      >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

Panizghi commented 3 months ago

Updates

Updated commands

Python

python src/main/python/safetensors/json_to_bin.py --input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl --output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus

Java

bin/run.sh io.anserini.index.IndexHnswDenseVectors -collection JsonDenseVectorCollection -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &
lintool commented 3 months ago

Looking at this command:

bin/run.sh io.anserini.index.SafeTensorsIndexCollection \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus  \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -threads 9 -storePositions -storeDocvectors -storeRaw \
  -vectorsDirectory target/safetensors/vectors \
  -docidsDirectory target/safetensors/docids \
  -docidToIdxDirectory target/safetensors/docid_to_idx \
>& logs/log.beir-v1.0.0-bge-base-en-v1.5 &

What are these three options doing?

  -vectorsDirectory target/safetensors/vectors \
  -docidsDirectory target/safetensors/docids \
  -docidToIdxDirectory target/safetensors/docid_to_idx \

And why are these the same?

  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus  \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \

I would expect -index to specify the location of the index?

Panizghi commented 3 months ago

I think you are looking at the older command; this is the updated one:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &

lintool commented 3 months ago

I think you are looking at the older command; this is the updated one:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &

Ah, can you please keep the commands up to date?

Panizghi commented 3 months ago

My apologies, it got lost among all the commits :) It is right here: https://github.com/castorini/anserini/pull/2515#issuecomment-2216502740

Python

python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus

Java

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

lintool commented 3 months ago

Sorry, I'm confused again:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection JsonDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
-generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

Why would -collection be JsonDenseVectorCollection now? Currently, DenseVectorDocumentGenerator reads from Json: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/DenseVectorDocumentGenerator.java

So we'd have something like SafeTensorsDenseVectorCollection that reads from SafeTensors?

lintool commented 3 months ago

I'm not getting your logic, but I think you need to implement two classes:

And your command would be something like -collection SafeTensorsDenseVectorCollection ... -generator SafeTensorsDenseVectorDocumentGenerator.

And you'd "wire everything together".
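The division of labor between a collection class and a generator class can be sketched as follows. This is a Python analogy only: the real Anserini classes are written in Java, and every name here (VectorRecord, DenseVectorCollection, DenseVectorDocumentGenerator, build_index) is hypothetical. The collection knows how to read the storage format, the generator knows how to turn one record into an indexable document, and a small driver wires the two together.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class VectorRecord:
    """One (docid, vector) pair as read from storage."""
    docid: str
    vector: List[float]


class DenseVectorCollection:
    """Hypothetical collection: iterates over records in the stored data."""

    def __init__(self, docids: List[str], vectors: List[List[float]]):
        self._docids = docids
        self._vectors = vectors

    def __iter__(self) -> Iterator[VectorRecord]:
        for docid, vector in zip(self._docids, self._vectors):
            yield VectorRecord(docid, vector)


class DenseVectorDocumentGenerator:
    """Hypothetical generator: converts one record into an index document."""

    def create_document(self, record: VectorRecord) -> Dict[str, object]:
        return {"id": record.docid, "vector": record.vector}


def build_index(collection, generator):
    """Wire the two together: every record goes through the generator once."""
    return [generator.create_document(record) for record in collection]


collection = DenseVectorCollection(["MED-10"], [[0.1, 0.2]])
documents = build_index(collection, DenseVectorDocumentGenerator())
print(documents)
```

With this split, supporting a new storage format means writing a new collection class, while the indexer itself stays unchanged.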

Panizghi commented 3 months ago

Updated command:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

lintool commented 1 month ago

@Panizghi if I'm reading your code correctly, you're assuming that there's only one vector file per directory, right? This is not necessarily the case.

For example, for robust04:

$ ls robust04/
vectors.part00.jsonl.gz  vectors.part01.jsonl.gz  vectors.part02.jsonl.gz  vectors.part03.jsonl.gz  vectors.part04.jsonl.gz  vectors.part05.jsonl.gz
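One way a converter could cope with several parts is to enumerate every matching file in the input directory and open each one, decompressing when needed. The sketch below is an assumption about how that might look, not the code on the branch; the helper names find_vector_parts and open_part are hypothetical, and the demonstration files are created in a temporary directory.

```python
import gzip
import tempfile
from pathlib import Path


def find_vector_parts(directory):
    """Return all vectors.part*.jsonl[.gz] files in a directory, sorted by name."""
    parts = [
        p for p in Path(directory).iterdir()
        if p.name.startswith("vectors.part")
        and (p.name.endswith(".jsonl") or p.name.endswith(".jsonl.gz"))
    ]
    return sorted(parts)


def open_part(path):
    """Open a part as text, transparently handling gzip compression."""
    path = Path(path)
    if path.name.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Demonstration on a temporary directory that mimics the multi-part layout.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        part_path = Path(tmp) / f"vectors.part{i:02d}.jsonl.gz"
        with gzip.open(part_path, "wt", encoding="utf-8") as f:
            f.write('{"docid": "doc%d"}\n' % i)

    part_names = [p.name for p in find_vector_parts(tmp)]
    first_lines = []
    for part in find_vector_parts(tmp):
        with open_part(part) as f:
            first_lines.append(f.readline().strip())

print(part_names)
```

Sorting the parts keeps the processing order deterministic, which helps when comparing runs.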
lintool commented 1 month ago

@Panizghi on your branch, running:

$ python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl.gz \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus

Works fine. However, I would like some progress indication... e.g., using tqdm?

Also, what do I do if there is more than one vector part?


The output is more compact, as expected, which is good.

$ du -h collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/
22M collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/
$ du -h collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/
84M collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/
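A rough back-of-the-envelope check of these sizes, assuming the vectors are stored as 32-bit floats (an assumption; the thread does not state the dtype). The document count of 3633 and the vector length of 768 are taken from the converter progress bar and the indexing log elsewhere in this thread.

```python
NUM_DOCS = 3633          # lines processed, per the converter's progress bar
DIM = 768                # vector length, per the indexing log
BYTES_PER_FLOAT32 = 4    # assumption: vectors are stored as float32

raw_vector_bytes = NUM_DOCS * DIM * BYTES_PER_FLOAT32
raw_vector_mib = raw_vector_bytes / 2**20

json_mb, safetensors_mb = 84, 22   # du -h figures reported above
ratio = json_mb / safetensors_mb

print(f"float32 payload: {raw_vector_bytes} bytes ({raw_vector_mib:.1f} MiB)")
print(f"observed size ratio: {ratio:.1f}x")
```

Under that assumption the raw vector payload accounts for roughly half of the 22M directory; the rest would be docids and other overhead, which this sketch does not model.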
lintool commented 1 month ago

Running indexing command:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

Something's not right... get an exception:

2024-08-23 07:40:33,960 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:205) - Setting log level to INFO
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - ============ Loading Index Configuration ============
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:209) - AbstractIndexer settings:
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:210) -  + DocumentCollection path: collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:211) -  + CollectionClass: SafeTensorsDenseVectorCollection
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) -  + Index path: indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/
2024-08-23 07:40:33,964 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + Threads: 16
2024-08-23 07:40:33,964 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:214) -  + Optimize (merge segments)? false
Aug 23, 2024 7:40:34 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
2024-08-23 07:40:34,217 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:149) - HnswIndexer settings:
2024-08-23 07:40:34,217 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:150) -  + Generator: SafeTensorsDenseVectorDocumentGenerator
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:151) -  + M: 16
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:152) -  + efC: 100
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:153) -  + Store document vectors? false
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:154) -  + Int8 quantization? false
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:155) -  + Codec: Lucene99
2024-08-23 07:40:34,219 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:156) -  + MemoryBuffer: 65536
2024-08-23 07:40:34,219 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:157) -  + MaxThreadMemoryBeforeFlush: 2047
2024-08-23 07:40:34,219 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:160) -  + MergePolicy: NoMerge
2024-08-23 07:40:34,219 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:238) - ============ Indexing Collection ============
2024-08-23 07:40:34,222 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:247) - Thread pool with 16 threads initialized.
2024-08-23 07:40:34,222 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:248) - 2 files found in collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
2024-08-23 07:40:34,222 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:249) - Starting to index...
2024-08-23 07:40:34,225 INFO  [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:48) - Processing document ID: MED-10 with thread: pool-2-thread-1
2024-08-23 07:40:34,225 WARN  [pool-2-thread-2] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:43) - Document ID: MED-10 is already being processed by another thread.
java.lang.NullPointerException
    at java.base/java.util.Objects.requireNonNull(Objects.java:233)
    at java.base/java.util.ImmutableCollections$List12.<init>(ImmutableCollections.java:563)
    at java.base/java.util.List.of(List.java:937)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1476)
    at io.anserini.index.AbstractIndexer$IndexerThread.run(AbstractIndexer.java:135)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
2024-08-23 07:40:34,229 ERROR [pool-2-thread-2] index.AbstractIndexer$IndexerThread (AbstractIndexer.java:179) - pool-2-thread-2: Unexpected Exception:
2024-08-23 07:40:34,235 INFO  [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:56) - Vector length: 768 for document ID: MED-10
Aug 23, 2024 7:40:34 AM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
2024-08-23 07:40:34,277 INFO  [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:64) - Document created for ID: MED-10
20
Panizghi commented 1 month ago

@Panizghi if I'm reading your code correctly, you're assuming that there's only one vector file per directory, right? This is not necessarily the case.

For example, for robust04:

$ ls robust04/
vectors.part00.jsonl.gz  vectors.part01.jsonl.gz  vectors.part02.jsonl.gz  vectors.part03.jsonl.gz  vectors.part04.jsonl.gz  vectors.part05.jsonl.gz

Yes, that is correct. In the earlier discussion we kept it limited to nfcorpus, which has a single file. I will update the code to handle multiple files.

Panizghi commented 1 month ago

Running indexing command:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

Something's not right... get an exception:

This should be fixed now and should work with the same command.

Panizghi commented 1 month ago

@Panizghi on your branch, running:

$ python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl.gz \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus

Works fine. However, I would like some progress indication... e.g., using tqdm?

Also, what do I do if there is more than one vector part?

The output is more compact, as expected, which is good.

$ du -h collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/
22M   collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/
$ du -h collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/
84M   collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/

tqdm has been added. There is also an --overwrite argument, which you can use if the output file already exists.

Command:

python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus --overwrite

Sample output:

Processing lines: 100%|█████████████████████████████████████████████████| 3633/3633 [00:01<00:00, 3347.56it/s]
2024-08-25 00:53:02,642 - INFO - Saved vectors to collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_vectors.safetensors
2024-08-25 00:53:02,643 - INFO - Saved docids to collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_docids.safetensors
2024-08-25 00:53:02,643 - INFO - Loaded vectors from collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_vectors.safetensors
2024-08-25 00:53:02,644 - INFO - Loaded document IDs (ASCII) from collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_docids.safetensors
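The save-then-load round trip reported in this log can be illustrated with a hand-rolled writer and reader for the safetensors layout: an 8-byte little-endian header length, a JSON header giving each tensor's dtype, shape, and data offsets, then the raw bytes. This is a standard-library sketch for illustration only; the actual script presumably relies on the safetensors package, and the tensor name "vectors" and both function names are assumptions.

```python
import json
import os
import struct
import tempfile


def write_safetensors(path, name, rows):
    """Write a non-empty 2-D list of floats as one float32 tensor."""
    n, dim = len(rows), len(rows[0])
    payload = b"".join(struct.pack(f"<{dim}f", *row) for row in rows)
    header = {name: {"dtype": "F32", "shape": [n, dim],
                     "data_offsets": [0, len(payload)]}}
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)                          # JSON header
        f.write(payload)                               # raw tensor bytes


def read_safetensors(path, name):
    """Read one float32 tensor back as a 2-D list of floats."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        data = f.read()
    info = header[name]
    if info["dtype"] != "F32":
        raise ValueError("this sketch only handles float32 tensors")
    n, dim = info["shape"]
    begin, end = info["data_offsets"]  # offsets are relative to the byte buffer
    flat = struct.unpack(f"<{n * dim}f", data[begin:end])
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(n)]


# Values chosen to be exactly representable in float32.
original = [[0.5, 0.25], [-1.0, 2.0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.part00_vectors.safetensors")
    write_safetensors(path, "vectors", original)
    restored = read_safetensors(path, "vectors")

print(restored == original)
```

Because the header records the shape and offsets, a reader can locate any tensor without parsing text, which is where the size and speed advantage over JSON comes from.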

By vector parts, do you mean a case like this?

{
  "docid": "MED-10",
  "vector_1": [0.00344, 0.00231, ...],
  "vector_2": [0.00112, 0.00456, ...]
}
lintool commented 1 month ago

Okay, I can now run these commands:

python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl.gz \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus --overwrite

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

After I build the index, I should be able to switch to retrieval, here: https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus.bge-base-en-v1.5.hnsw.onnx.md

The retrieval command is this:

bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
  -topics tools/topics-and-qrels/topics.beir-v1.0.0-nfcorpus.test.tsv.gz \
  -topicReader TsvString \
  -output runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt \
  -generator VectorQueryGenerator -topicField title -removeQuery -threads 16 -hits 1000 -efSearch 1000 -encoder BgeBaseEn15

But the eval command generates errors:

$ bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-nfcorpus.test.txt runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt
WARNING: Using incubator modules: jdk.incubator.vector
trec_eval.form_res_qrels: duplicate docs MED-1000trec_eval: Can't calculate measure 'ndcg_cut'

From here:

$ head runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt
PLAIN-1008 Q0 MED-2036 1 0.776562 Anserini
PLAIN-1008 Q0 MED-2036 2 0.776562 Anserini
PLAIN-1008 Q0 MED-5135 3 0.775252 Anserini
PLAIN-1008 Q0 MED-5135 4 0.775252 Anserini
PLAIN-1008 Q0 MED-4694 5 0.774549 Anserini
PLAIN-1008 Q0 MED-4694 6 0.774549 Anserini
PLAIN-1008 Q0 MED-3865 7 0.773869 Anserini
PLAIN-1008 Q0 MED-3865 8 0.773869 Anserini
PLAIN-1008 Q0 MED-3316 9 0.771660 Anserini
PLAIN-1008 Q0 MED-3316 10 0.771660 Anserini

I appear to be getting duplicates of docs, e.g., MED-2036. Are you somehow indexing everything twice?
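A quick way to confirm the duplication is to count how often each (query, docid) pair occurs in the run file. The sketch below uses the ten lines shown above as input; the function name find_duplicate_docs is hypothetical, and it assumes the usual six-column TREC run format.

```python
from collections import Counter

RUN_LINES = """\
PLAIN-1008 Q0 MED-2036 1 0.776562 Anserini
PLAIN-1008 Q0 MED-2036 2 0.776562 Anserini
PLAIN-1008 Q0 MED-5135 3 0.775252 Anserini
PLAIN-1008 Q0 MED-5135 4 0.775252 Anserini
PLAIN-1008 Q0 MED-4694 5 0.774549 Anserini
PLAIN-1008 Q0 MED-4694 6 0.774549 Anserini
PLAIN-1008 Q0 MED-3865 7 0.773869 Anserini
PLAIN-1008 Q0 MED-3865 8 0.773869 Anserini
PLAIN-1008 Q0 MED-3316 9 0.771660 Anserini
PLAIN-1008 Q0 MED-3316 10 0.771660 Anserini
""".splitlines()


def find_duplicate_docs(run_lines):
    """Return {(qid, docid): count} for pairs that occur more than once."""
    counts = Counter()
    for line in run_lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip malformed lines
        qid, _q0, docid, _rank, _score, _tag = fields
        counts[(qid, docid)] += 1
    return {pair: n for pair, n in counts.items() if n > 1}


duplicates = find_duplicate_docs(RUN_LINES)
for (qid, docid), n in sorted(duplicates.items()):
    print(f"{qid} {docid} appears {n} times")
```

Every docid in this sample appears exactly twice with identical scores, which is consistent with each document having been indexed twice.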

Panizghi commented 1 month ago

I appear to be getting duplicates of docs, e.g., MED-2036. Are you somehow indexing everything twice?

That was initially the reason I switched to a single thread with a critical section. I am testing the fix right now.
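The critical-section idea can be sketched as a lock-protected set of claimed docids, so that a document submitted more than once is indexed only once even with several worker threads. This is an illustrative Python sketch under that assumption, not the Java fix on the branch; DocClaims and index_all are hypothetical names, and a plain list stands in for the index writer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DocClaims:
    """Tracks which docids have been claimed so each is indexed only once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = set()

    def try_claim(self, docid):
        # Critical section: the check and the add must happen atomically.
        with self._lock:
            if docid in self._claimed:
                return False
            self._claimed.add(docid)
            return True


def index_all(docids, num_threads=4):
    claims = DocClaims()
    indexed = []                     # stands in for the real index writer
    indexed_lock = threading.Lock()

    def worker(docid):
        if not claims.try_claim(docid):
            return                   # another thread already handled this docid
        with indexed_lock:
            indexed.append(docid)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(worker, docids))
    return indexed


# Every docid is submitted twice, mimicking the duplicated run-file entries.
submitted = ["MED-2036", "MED-5135", "MED-4694"] * 2
indexed = index_all(submitted)
print(sorted(indexed))
```

A worker that loses the claim simply skips the document here; in an indexer, the skipped case must also avoid handing an empty result to the writer.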

Panizghi commented 1 month ago

Updated commands:

python src/main/python/safetensors/json_to_bin.py --input collections/robust04 --output collections/robust04.safetensors/ --overwrite

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

Indexing Performance:

File Sizes:

lintool commented 1 month ago

Superseded by #2582