Open thongnt99 opened 2 years ago
In general I'm +1 for supporting learned sparse retrieval, though I think it would not be as trivial as it looks.
As a starting point, perhaps we could use terms' payloads to tweak the weights instead of modifying the indexing chain... but there may be some overhead in score calculation.
You can use TermFrequencyAttribute in the analysis chain to set the frequency directly.
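For illustration, here is a minimal sketch of a TokenFilter doing that (the class name and the "|" delimiter format are made up for this example; Lucene's analysis-common module ships a similar DelimitedTermFrequencyTokenFilter):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

// Hypothetical filter: turns tokens of the form "term|123" into "term"
// with a term frequency of 123, instead of repeating the term 123 times.
public final class ImpactTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute tfAtt = addAttribute(TermFrequencyAttribute.class);

  public ImpactTokenFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String token = termAtt.toString();
    int sep = token.lastIndexOf('|');
    if (sep >= 0) {
      tfAtt.setTermFrequency(Integer.parseInt(token.substring(sep + 1)));
      termAtt.setLength(sep); // keep only the term part of the token
    }
    return true;
  }
}
```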
+1 from me too, it'd be great to think through how to support this. Could you explain how the query side would look? Are the queries also sparse vectors with custom impacts?
As a note, we have a FeatureField field type that accepts key-value pairs and stores the value in TermFrequencyAttribute. It's designed to help incorporate other scoring signals like popularity, page rank, etc. It may not be exactly what we want for this use case, but it could provide some inspiration.
Using TermFrequencyAttribute to customize the term frequencies, you can then create a Query in the normal way and compute BM25 with b == 0; then I think you will directly control the similarity scores. Or you might want to write a custom Similarity to be a bit more efficient.
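One caveat on the b == 0 suggestion: even with length normalization disabled, BM25's k1 parameter still saturates the term frequency, so the score is not exactly proportional to the stored impact (which is one more reason a custom Similarity may be wanted). A quick JDK-only sketch of BM25's tf component (standard formula, not Lucene code) makes this visible:

```java
public class Bm25TfDemo {
  // BM25 tf component with b = 0: tf * (k1 + 1) / (tf + k1)
  static double tfComponent(double tf, double k1) {
    return tf * (k1 + 1) / (tf + k1);
  }

  public static void main(String[] args) {
    double k1 = 1.2;
    // Doubling the stored frequency far from doubles the score:
    System.out.println(tfComponent(100, k1)); // ~2.174
    System.out.println(tfComponent(200, k1)); // ~2.187
  }
}
```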
@jtibshirani The query side is the same as the document side: a dictionary of terms and weights. To make it compatible with Lucene, people just repeat each term according to its frequency. This is fine because queries are usually much shorter.
Yes, FeatureField is something similar, but we want a single Field containing a list of key-value pairs or a JSON-formatted string.
@msokolov @rmuir @mocobeta: I found this, which could somehow achieve what we want. But I think it is not so flexible: we need to turn the JSON file into a token stream formatted as [<term><delimiter><frequency>...], and I think this step is redundant. Can we just load the JSON file directly? For that I think we might have to move away from the TokenStream pipeline?
The scoring function is just a dot product, very simple.
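For concreteness, that dot product over two term-weight dictionaries can be sketched in plain Java (no Lucene; the terms and weights below are hypothetical):

```java
import java.util.Map;

public class DotProduct {
  // Score a document against a query: sum of weight products over shared terms.
  static double score(Map<String, Integer> query, Map<String, Integer> doc) {
    double s = 0;
    for (Map.Entry<String, Integer> e : query.entrySet()) {
      Integer w = doc.get(e.getKey());
      if (w != null) {
        s += e.getValue() * (double) w;
      }
    }
    return s;
  }

  public static void main(String[] args) {
    Map<String, Integer> query = Map.of("scientific", 24, "communication", 50);
    Map<String, Integer> doc = Map.of("scientific", 200, "communication", 235, "atomic", 214);
    System.out.println(score(query, doc)); // 24*200 + 50*235 = 16550.0
  }
}
```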
What do you think? Your thoughts are very much appreciated, as I am not very familiar with Lucene.
We can form a group to do this if you are interested.
we want a single Field containing a list of key-value pairs or a json formatted
Note that you can add one FeatureField to your Lucene document for every key/value pair in your JSON document. The logic of converting a high-level representation like a JSON map into a low-level representation that Lucene understands feels like something that could be managed on the application side?
Here's a code example that I think does something similar to what you are looking for:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
public class LearnedSparseRetrieval {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
        {
          Document doc = new Document();
          doc.add(new FeatureField("my_feature", "scientific", 200));
          doc.add(new FeatureField("my_feature", "intellect", 202));
          doc.add(new FeatureField("my_feature", "communication", 235));
          w.addDocument(doc);
        }
        {
          Document doc = new Document();
          doc.add(new FeatureField("my_feature", "scientific", 100));
          doc.add(new FeatureField("my_feature", "communication", 350));
          doc.add(new FeatureField("my_feature", "project", 80));
          w.addDocument(doc);
        }
      }
      try (IndexReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new BooleanQuery.Builder()
            .add(FeatureField.newLinearQuery("my_feature", "scientific", 24), Occur.SHOULD)
            .add(FeatureField.newLinearQuery("my_feature", "communication", 50), Occur.SHOULD)
            .build();
        // Explain the score of each of the two indexed documents.
        System.out.println(searcher.explain(query, 0));
        System.out.println();
        System.out.println(searcher.explain(query, 1));
      }
    }
  }
}
which outputs
16550.0 = sum of:
  4800.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from:
    24.0 = w, weight of this function
    200.0 = S, feature value
  11750.0 = Linear function on the my_feature field for the communication feature, computed as w * S from:
    50.0 = w, weight of this function
    235.0 = S, feature value

19900.0 = sum of:
  2400.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from:
    24.0 = w, weight of this function
    100.0 = S, feature value
  17500.0 = Linear function on the my_feature field for the communication feature, computed as w * S from:
    50.0 = w, weight of this function
    350.0 = S, feature value
@jpountz Great, thank you very much. I will try it out and see if there is any difference in the scores.
I confirmed that @jpountz's approach works. On my dataset, indexing time goes down from more than an hour to about 10 minutes.
A small issue: the weight in FeatureField.newLinearQuery is constrained to the range (0, 64]. This is not desirable, but it is fine for now if there is no easy fix.
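One possible workaround for the (0, 64] limit (a hypothetical helper, not a Lucene API): since uniformly rescaling all query weights multiplies every document's dot product by the same constant, the ranking is unchanged if the weights are scaled so that the largest becomes 64.

```java
import java.util.HashMap;
import java.util.Map;

public class WeightScaler {
  // Rescale query weights so the largest becomes 64, keeping all in (0, 64].
  // A uniform positive rescale of the query does not change the ranking,
  // because every document's dot product is multiplied by the same constant.
  static Map<String, Float> scaleTo64(Map<String, Integer> weights) {
    int max = weights.values().stream().mapToInt(Integer::intValue).max().orElse(1);
    Map<String, Float> out = new HashMap<>();
    weights.forEach((term, w) -> out.put(term, 64f * w / max));
    return out;
  }

  public static void main(String[] args) {
    System.out.println(scaleTo64(Map.of("scientific", 24, "communication", 50)));
    // communication maps to 64.0 and scientific to 30.72 (map order unspecified)
  }
}
```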
This is a good point. This limit was introduced with the idea that FeatureField would be used to incorporate features into a BM25/TFIDF/DFR score, where weights higher than 64 would generally be a mistake, but we could lift this limit if it feels like a useful query to use on its own for learned sparse retrieval.
Yes, I think it would be nicer to have dedicated classes for LSR. Though using FeatureField is efficient, I feel it is still a bit of a hack.
If we replace FeatureQuery with new BoostQuery(new TermQuery(new Term(field, term)), weight), it doesn't work. So I think there is some internal difference between the indexes created by this approach and the repetition approach.
Description
Recent learned sparse retrieval methods (Splade, uniCOIL) are trained to generate impact scores directly (replacing the tf-idf score).
For each document, they generate a JSON file with terms and weights, e.g.
{";": 80, "the": 161, "of": 85, "and": 27, "to": 24, "was": 47, "as": 27, "their": 96, "what": 40, "over": 123, "only": 123, "important": 186, "project": 208, "success": 215, "meant": 131, "lives": 140, "presence": 180, "scientific": 200, "communication": 235, "thousands": 142, "hundreds": 144, "truly": 170, "hanging": 141, "cloud": 187, "engineers": 127, "achievement": 192, "researchers": 137, "innocent": 181, "manhattan": 244, "impressive": 191, "equally": 163, "##rated": 132, "minds": 137, "atomic": 214, "amid": 201, "##lite": 120, "intellect": 202, "ob": 140}
Can we add a new feature that could index this type of document efficiently? The current workaround I am aware of is to create a fake document by repeating the terms, e.g. "the the the the ... of of of of of".
However, this is not very efficient when the impact scores get bigger, and it also requires quantizing the impact scores before indexing. I think it would be very useful for many people if we could index the JSON files directly with float impact scores.
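For reference, the repetition workaround described above can be sketched as follows (a hypothetical helper); the expanded text grows with the sum of the impact scores, which is exactly what makes it inefficient:

```java
import java.util.Map;

public class TermRepetition {
  // Expand a term->impact map into a fake document that repeats each term
  // 'impact' times, so Lucene's ordinary term frequency equals the impact.
  static String expand(Map<String, Integer> impacts) {
    StringBuilder sb = new StringBuilder();
    impacts.forEach((term, impact) -> sb.append((term + " ").repeat(impact)));
    return sb.toString().trim();
  }

  public static void main(String[] args) {
    System.out.println(expand(Map.of("cloud", 3)));
    // prints "cloud cloud cloud"
  }
}
```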