eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License
363 stars 164 forks source link

Setup Lucene document threshold at query time #5149

Open ate47 opened 3 weeks ago

ate47 commented 3 weeks ago

Problem description

Inside the LuceneSail, an option exists to tell what is the amount of documents to retrieve. It is set with the property maxDocuments during the sail init and can't be set at runtime. We would need an option to allow this.

Preferred solution

An option would be to add a virtual predicate during query time like search:maxDocuments with a property during init to enable or disable it, maybe allowQueryMaxDocument.

My main concern is to implement it. The maxDocuments values is used inside each index implementation.

// SolrIndex.java
public QueryResponse search(SolrQuery query) throws SolrServerException, IOException {}
// LuceneIndex.java
public synchronized TopDocs search(Query query) throws IOException {}
// ElasticsearchIndex.java
public SearchHits search(SearchRequestBuilder request, QueryBuilder query) {}

So the different methods will need to have a max documents param. I would like an opinion on it before implementing it.

Are you interested in contributing a solution yourself?

Yes

Alternatives you've considered

No response

Anything else?

https://github.com/ec-doris/kohesio-backend/issues/268

hmottestad commented 3 weeks ago

I'll have to take a look at the code to understand how things are handled at the moment.

ate47 commented 1 week ago

Any news on what to do?

hmottestad commented 1 week ago
    public QueryResponse search(SolrQuery query) throws SolrServerException, IOException {
        int nDocs;
        if (maxDocs > 0) {
            nDocs = maxDocs;
        } else {
            long docCount = client.query(query.setRows(0)).getResults().getNumFound();
            nDocs = Math.max((int) Math.min(docCount, Integer.MAX_VALUE), 1);
        }
        return client.query(query.setRows(nDocs));
    }

For this code we can probably allow the number of rows to be set on the query before this method is called. Something like if(query.getRows() != null).... to check if it is already set.

hmottestad commented 1 week ago

Looks like this here is where we extract variables from the SPARQL query: https://github.com/eclipse-rdf4j/rdf4j/blob/755a631efbc95df52e4e590984ab9b0049d4fef5/core/sail/lucene-api/src/main/java/org/eclipse/rdf4j/sail/lucene/QuerySpecBuilder.java#L450

So that can be extended to add ROWS. And then follow the code up to see where else we need to extend it so that the field is set correctly on the SolrQuery object.

hmottestad commented 1 week ago

We can't do the same with public synchronized TopDocs search(Query query) throws IOException {} since the query object doesn't have any info about the number of rows or equivalent. So there we would probably have to make a new method public synchronized TopDocs search(Query query, int nDocs) throws IOException {}.

Same for public SearchHits search(SearchRequestBuilder request, QueryBuilder query) {}.

ate47 commented 1 day ago

@hmottestad do you think you can get a look at how I have implemented it in the PR #5163?

hmottestad commented 17 hours ago

Thanks for the reminder :)