apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Indexing method for learned sparse retrieval #11799

Open thongnt99 opened 2 years ago

thongnt99 commented 2 years ago

Description

Recent learned sparse retrieval methods (Splade, uniCOIL) are trained to generate impact scores directly, replacing tf-idf scores.
For each document, they generate a JSON file mapping terms to weights, e.g.:

{";": 80, "the": 161, "of": 85, "and": 27, "to": 24, "was": 47, "as": 27, "their": 96, "what": 40, "over": 123, "only": 123, "important": 186, "project": 208, "success": 215, "meant": 131, "lives": 140, "presence": 180, "scientific": 200, "communication": 235, "thousands": 142, "hundreds": 144, "truly": 170, "hanging": 141, "cloud": 187, "engineers": 127, "achievement": 192, "researchers": 137, "innocent": 181, "manhattan": 244, "impressive": 191, "equally": 163, "##rated": 132, "minds": 137, "atomic": 214, "amid": 201, "##lite": 120, "intellect": 202, "ob": 140}

Could we add a new feature that indexes this type of document efficiently? The current workaround I am aware of is to create a fake document by repeating each term according to its weight, e.g. "the the the the .... of of of of of ". However, this is inefficient when the impact scores get large, and it also requires quantizing the impact scores before indexing. I think it would be very useful for many people if we could index the JSON files directly with float impact scores.
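
The repetition workaround described above can be sketched in plain Java (class and method names here are hypothetical, for illustration only); it makes the inefficiency visible, since the generated text grows linearly with the sum of all impacts:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RepetitionWorkaround {

  /**
   * Builds the "fake document" text by repeating each term according to its
   * quantized impact score, so a standard analyzer observes the desired term
   * frequency. Float impacts must be rounded to integers first, losing precision.
   */
  static String repeatTerms(Map<String, Float> impacts) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Float> e : impacts.entrySet()) {
      int quantized = Math.round(e.getValue()); // quantization step the issue complains about
      for (int i = 0; i < quantized; i++) {
        sb.append(e.getKey()).append(' ');
      }
    }
    return sb.toString().trim();
  }

  public static void main(String[] args) {
    Map<String, Float> impacts = new LinkedHashMap<>();
    impacts.put("scientific", 3.0f);
    impacts.put("cloud", 2.4f);
    // prints: scientific scientific scientific cloud cloud
    System.out.println(repeatTerms(impacts));
  }
}
```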

mocobeta commented 2 years ago

In general I'm +1 for supporting learned sparse retrieval, though I think it would not be as trivial as it looks.

For a start, perhaps we could utilize terms' payloads to tweak the weights instead of modifying the indexing chain... but there may be some overhead in score calculation.

rmuir commented 2 years ago

You can use TermFrequencyAttribute in the analysis chain to set the frequency directly.
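
A rough sketch of what this could look like: a single-use TokenStream that emits one token per (term, weight) pair and writes the weight into TermFrequencyAttribute, bypassing text analysis entirely. The class name is hypothetical, and it assumes weights have already been quantized to ints; TermFrequencyAttribute also requires the field to be indexed with IndexOptions.DOCS_AND_FREQS (no positions):

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

/** Emits each (term, impact) pair as one token with its frequency set directly. */
public final class ImpactTokenStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute tfAtt = addAttribute(TermFrequencyAttribute.class);
  private final Iterator<Map.Entry<String, Integer>> it;

  public ImpactTokenStream(Map<String, Integer> impacts) {
    this.it = impacts.entrySet().iterator();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!it.hasNext()) {
      return false;
    }
    clearAttributes();
    Map.Entry<String, Integer> e = it.next();
    termAtt.setEmpty().append(e.getKey());
    tfAtt.setTermFrequency(e.getValue()); // stored as the postings' term frequency
    return true;
  }
}
```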

jtibshirani commented 2 years ago

+1 from me too, it'd be great to think through how to support this. Could you explain how the query side would look? Are the queries also sparse vectors with custom impacts?

As a note, we have a FeatureField field type that accepts key-value pairs and stores the value in TermFrequencyAttribute. It's designed to help incorporate static scoring signals like popularity, page rank, etc. It may not be exactly what we want for this use case, but it could provide some inspiration.

msokolov commented 2 years ago

If you use TermFrequencyAttribute to customize the term frequencies, you can then create a Query in the normal way and compute BM25 with b == 0; then I think you will directly control the similarity scores. Or you might want to write a custom Similarity to be a bit more efficient.
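
One possible shape for such a custom Similarity (hypothetical class name, a sketch only): extend SimilarityBase and score a term as the query boost times the stored frequency, so the document score reduces to a dot product between query weights and indexed impacts. It would be set via IndexWriterConfig.setSimilarity and IndexSearcher.setSimilarity:

```java
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

/**
 * Scores each matching term as (query boost * stored term frequency),
 * assuming impacts were indexed as term frequencies. Summing over the
 * query's terms yields a plain dot product of query and document weights.
 */
public class RawImpactSimilarity extends SimilarityBase {

  @Override
  protected double score(BasicStats stats, double freq, double docLen) {
    // freq is the term frequency we stored as the impact score;
    // getBoost() carries the query-side weight. docLen is ignored.
    return stats.getBoost() * freq;
  }

  @Override
  public String toString() {
    return "RawImpactSimilarity";
  }
}
```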

thongnt99 commented 2 years ago

@jtibshirani The query side is the same as the document side: a dictionary of terms and weights. To make it compatible with Lucene, people just repeat each term according to its weight. This is fine because queries are usually much shorter. Yes, FeatureField is something similar, but we want a single field containing a list of key-value pairs or a JSON map. @msokolov @rmuir @mocobeta: I found this, which could somehow achieve what we want, but I think it is not very flexible: we need to turn the JSON file into a token stream formatted as [<term><delimiter><frequency>......], and that step seems redundant. Can we just load the JSON file directly? For that, I think we might have to move away from the TokenStream pipeline?

The scoring function is just a dot product, very simple.
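
For concreteness, the dot product over two sparse term-weight maps can be sketched as follows (hypothetical helper, plain Java); with the query and document from jpountz's example below it gives the same 16550 total as the explain output:

```java
import java.util.Map;

public class SparseDotProduct {

  /** Dot product of two sparse vectors represented as term -> weight maps. */
  static double dot(Map<String, ? extends Number> query, Map<String, ? extends Number> doc) {
    double sum = 0;
    // iterate the query map, which is usually much shorter than the document map
    for (Map.Entry<String, ? extends Number> e : query.entrySet()) {
      Number w = doc.get(e.getKey());
      if (w != null) {
        sum += e.getValue().doubleValue() * w.doubleValue();
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Integer> q = Map.of("scientific", 24, "communication", 50);
    Map<String, Integer> d = Map.of("scientific", 200, "communication", 235, "project", 208);
    // 24 * 200 + 50 * 235 = 16550.0
    System.out.println(dot(q, d));
  }
}
```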

What do you think? Your thoughts are very much appreciated, as I am not very familiar with Lucene.

We can form a group to do this if you are interested.

jpountz commented 2 years ago

we want a single Field containing a list of key-value pairs or a json formatted

Note that you can add one FeatureField field to your Lucene document for every key/value pair in your JSON document. The logic of converting from a high-level representation like a JSON map into a low-level representation that Lucene understands feels like something that could be managed on the application side?

Here's a code example that I think does something similar to what you are looking for:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LearnedSparseRetrieval {

  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
        {
          Document doc = new Document();
          doc.add(new FeatureField("my_feature", "scientific", 200));
          doc.add(new FeatureField("my_feature", "intellect", 202));
          doc.add(new FeatureField("my_feature", "communication", 235));
          w.addDocument(doc);
        }
        {
          Document doc = new Document();
          doc.add(new FeatureField("my_feature", "scientific", 100));
          doc.add(new FeatureField("my_feature", "communication", 350));
          doc.add(new FeatureField("my_feature", "project", 80));
          w.addDocument(doc);
        }
      }

      try (IndexReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new BooleanQuery.Builder()
            .add(FeatureField.newLinearQuery("my_feature", "scientific", 24), Occur.SHOULD)
            .add(FeatureField.newLinearQuery("my_feature", "communication", 50), Occur.SHOULD)
            .build();
        System.out.println(searcher.explain(query, 0));
        System.out.println();
        System.out.println(searcher.explain(query, 1));
      }
    }
  }

}

which outputs

16550.0 = sum of:
  4800.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from:
    24.0 = w, weight of this function
    200.0 = S, feature value
  11750.0 = Linear function on the my_feature field for the communication feature, computed as w * S from:
    50.0 = w, weight of this function
    235.0 = S, feature value

19900.0 = sum of:
  2400.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from:
    24.0 = w, weight of this function
    100.0 = S, feature value
  17500.0 = Linear function on the my_feature field for the communication feature, computed as w * S from:
    50.0 = w, weight of this function
    350.0 = S, feature value
thongnt99 commented 2 years ago

@jpountz Great, thank you very much. I will try it out and see if there is any difference in the scores.

thongnt99 commented 1 year ago

I confirmed that @jpountz's approach works. On my dataset, indexing time goes down from more than an hour to about 10 minutes. One small issue: the weight in FeatureField.newLinearQuery is constrained to the range (0, 64]. This is not desirable, but it is fine for now if there is no easy fix.
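
One application-side workaround for the (0, 64] limit, assuming all query weights are positive and only the ranking matters: since the linear score is w * S, multiplying every query weight by the same positive constant rescales all document scores uniformly without changing their order. A sketch (hypothetical helper):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightRescale {

  /**
   * Rescales query weights so the maximum becomes exactly 64, keeping them
   * inside FeatureField.newLinearQuery's (0, 64] range. Ranking is unchanged
   * because all scores are multiplied by the same factor; absolute score
   * values do change. Assumes at least one strictly positive weight.
   */
  static Map<String, Float> rescale(Map<String, Float> queryWeights) {
    float max = 0f;
    for (float w : queryWeights.values()) {
      max = Math.max(max, w);
    }
    float factor = 64f / max;
    Map<String, Float> scaled = new LinkedHashMap<>();
    for (Map.Entry<String, Float> e : queryWeights.entrySet()) {
      scaled.put(e.getKey(), e.getValue() * factor);
    }
    return scaled;
  }
}
```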

jpountz commented 1 year ago

This is a good point. The limit was introduced with the idea that FeatureField would be used to incorporate features into a BM25/TFIDF/DFR score, where weights higher than 64 would generally be a mistake, but we could lift this limit if it feels like a useful query to use on its own for learned sparse retrieval.

thongnt99 commented 1 year ago

Yes, I think it would be nicer to have dedicated classes for LSR. Though using FeatureField is efficient, I feel it is still a bit of a hack. If we replace FeatureQuery with new BoostQuery(new TermQuery(new Term(field, term)), weight), then it doesn't work, so I think there is some internal difference between the indexes created by this approach and by the repetition approach.