apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

Lucene query with text:prop not working in some cases ? #2094

Closed: filak closed this issue 9 months ago

filak commented 1 year ago

Version

4.9.0

Question

This query works - searching all the fields:

select * where {
?s text:query ("beer" 10) .
}

However, this query, which should search only the rdfs:label and mt:altLabel fields, returns 0 hits:

select * where {
?s text:query (mt:defQuery "beer" 10) .
}

This query also returns 0 hits:

select * where {
?s text:query (mt:includeNotes "beer" 10) .
}

Excerpt from mytest.ttl:

# Text index description
<#indexLucene> 
    a text:TextIndexLucene ;
    text:directory ".../indexes/mytest" ;
    text:entityMap <#entMap> ;
    text:storeValues true ;
    text:analyzer [
       a text:ConfigurableAnalyzer ;
       text:tokenizer text:StandardTokenizer ;
       text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
       ] ;
    text:queryParser text:AnalyzingQueryParser ;
    text:multilingualSupport true ;
    text:propLists (
        [ text:propListProp mt:defQuery ;
          text:props ( 
             rdfs:label
             mt:altLabel
             ) ;
        ]
        [ text:propListProp mt:includeNotes ;
          text:props ( 
             rdfs:label
             mt:altLabel
             mt:note
             ) ;
        ]
    ) ;
     .

<#entMap> 
    a text:EntityMap ;
    text:defaultField     "ftext" ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:graphField       "graph" ;
    text:map (
         [ text:field "ftext" ; text:predicate rdfs:label ]
         [ text:field "ftext" ; text:predicate mt:altLabel ]
         [ text:field "ftext" ; text:predicate mt:note ]
         ) .

rvesse commented 12 months ago

Could you provide a complete reproducible example, please? We're missing the sample input data that you used to build the text index that exhibits this behaviour.

filak commented 12 months ago

Yes, sure.

I have been trying to set up two queries: a default query using mt:defQuery, which does not search the note field (or several other fields), and a second query using mt:includeNotes, which includes the note field and possibly other fields.

I have simplified my use case.

Test data - drinks.nt

<http://id.example.test/1>  <http://www.w3.org/2000/01/rdf-schema#label>    "beer"@en .
<http://id.example.test/1>  <http://id.example.test/vocab/#altLabel>    "pint"@en .
<http://id.example.test/2>  <http://id.example.test/vocab/#alt_label>   "ale"@en .
<http://id.example.test/1>  <http://id.example.test/mx/#alt_label>  "pivečko"@cs .
<http://id.example.test/1>  <http://id.example.test/vocab/#note>    "Booze is a pleasure"@en .
<http://id.example.test/1>  <http://id.example.test/vocab/#note>    "Chlast je slast"@cs .
<http://id.example.test/2>  <http://www.w3.org/2000/01/rdf-schema#label>    "wine"@en .
<http://id.example.test/2>  <http://id.example.test/vocab/#altLabel>    "champagne"@en .
<http://id.example.test/2>  <http://id.example.test/vocab/#alt_label>   "burgundy"@en .
<http://id.example.test/2>  <http://id.example.test/mx/#alt_label>  "víno"@cs .
<http://id.example.test/2>  <http://id.example.test/vocab/#note>    "Red or white"@en .
<http://id.example.test/2>  <http://id.example.test/vocab/#note>    "Červené či bílé"@cs .

The config - drinks.ttl

@prefix :        <http://localhost/jena_example/#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb2:    <http://jena.apache.org/2016/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix mt:      <http://id.example.test/vocab/#> .
@prefix mx:      <http://id.example.test/mx/#> .

## Initialize text query
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
# Elasticsearch index
text:TextIndexES    rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset
    a text:TextDataset ;
    text:dataset   <#dataset> ;
    text:index     <#indexLucene> ;
    .

# A TDB dataset used for RDF storage
<#dataset> 
    a tdb2:DatasetTDB2 ;
    tdb2:location  "d:/Data/jena/databases/drinks" ;
    .

# Text index description
<#indexLucene> 
    a text:TextIndexLucene ;
    text:directory "d:/Data/jena/indexes/drinks" ;
    text:entityMap <#entMap> ;
    text:storeValues true ;
    text:analyzer [
       a text:ConfigurableAnalyzer ;
       text:tokenizer text:StandardTokenizer ;
       text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
       ] ;
    text:queryParser text:AnalyzingQueryParser ;
    text:multilingualSupport true ;
    text:propLists (
        [ text:propListProp mt:defQuery ;
          text:props ( 
             rdfs:label
               mt:altLabel
               mt:alt_label
             ) ;
        ]
        [ text:propListProp mt:includeNotes ;
          text:props ( 
             rdfs:label
               mt:altLabel
               mt:alt_label
               mt:note
             ) ;
        ]
        [ text:propListProp mt:testQuery ;
          text:props ( 
             rdfs:label
               mx:alt_label
             ) ;
        ]
    ) ;
     .

<#entMap> 
    a text:EntityMap ;
    text:defaultField     "ftext" ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:graphField       "graph" ;
    text:map (
         [ text:field "ftext" ; text:predicate rdfs:label ]
         [ text:field "ftext" ; text:predicate mt:altLabel ]
         [ text:field "ftext" ; text:predicate mt:alt_label ]
         [ text:field "ftext" ; text:predicate mt:note ]
         ) .

<#service_text_tdb> 
    a fuseki:Service ;
    rdfs:label                      "Drinks TEST" ;
    fuseki:name                     "drinks" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  :text_dataset ;
    .

Load to Jena

  tdb2_tdbloader --loc %FUSEKI_BASE%/databases/drinks _imports/drinks.nt

Index

  java -cp %FUSEKI_HOME%/fuseki-server.jar jena.textindexer --desc=configuration/drinks.ttl

The queries at http://localhost:3030/#/dataset/drinks/query

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX mt:   <http://id.example.test/vocab/#> 

# Query #1

select * where {
?s text:query ("beer white") 
}

=> 2 hits - OK

# Query #2

select * where {
?s text:query (mt:includeNotes "white") 
}

=> 1 hit - OK

# Query #3

select * where {
?s text:query (mt:defQuery "white") 
}

=> 1 hit - but shouldn't it be 0, since the string "white" is only present in the note field?

filak commented 12 months ago

Observation 1:

NOT TRUE - see https://github.com/apache/jena/issues/2094#issuecomment-1831912164. The original observation concerned a field/predicate name in the text:props definition that contains an underscore (_), e.g. alt_label:

        [ text:propListProp mt:testQuery ;
          text:props ( 
             rdfs:label
               mt:altLabel
               mx:alt_label
             ) ;
        ]

filak commented 11 months ago

Any updates on this @rvesse ? I have tried to locate in the code what might be happening with the underscored fields but so far I failed.

The underscore is not a reserved char in Lucene.

rvesse commented 11 months ago

> Any updates on this @rvesse ? I have tried to locate in the code what might be happening with the underscored fields but so far I failed.
>
> The underscore is not a reserved char in Lucene.

Sorry @filak I have no idea personally, not an area of the code base I'm familiar with.

I'd hoped that, with the extra details you provided, some of our Jena Text/Lucene experts like @OyvindLGjesdal might be able to take a look and comment on what's going on.

OyvindLGjesdal commented 11 months ago

Hi @filak and thanks for the precise examples, and thanks for the ping.

I have some problems with replicating the issues described.

One thing I notice in the test data is that the mx namespace isn't mentioned. What is the prefix mx: in mx:alt_label, is it just a typo in the example?

I copied one of the existing tests using propLists to recreate the errors, and get the 3 expected results back when using the test-data, and no items back when I tried to replicate the other example.

I first got the warning message

23:03:36 WARN  TextQueryPF :: Predicate not indexed: http://id.example.test/vocab/#alt_label
23:03:36 WARN  TextQueryPF :: objectToStruct: props are not indexed [http://www.w3.org/2004/02/skos/core#prefLabel, http://www.w3.org/2004/02/skos/core#altLabel, http://www.w3.org/2000/01/rdf-schema#label, http://id.example.test/vocab/#alt_label]

while running the test. I had to add the predicate to the text:map and rerun the test (this time without a warning) to get the expected result back.

       "    text:map (",
                    "         [ text:field \"label\" ; text:predicate rdfs:label ; text:noIndex true ]",
                    "         [ text:field \"altLabel\" ; text:predicate skos:altLabel ]",
+                   "         [ text:field \"alt_Label\" ; text:predicate mt:alt_label ]",
                    "         [ text:field \"prefLabel\" ; text:predicate skos:prefLabel ]",
                    "         [ text:field \"comment\" ; text:predicate rdfs:comment ]",
                    "         [ text:field \"workAuthorshipStatement\" ; text:predicate spec:workAuthorshipStatement ]",
                    "         [ text:field \"workEditionStatement\" ; text:predicate spec:workEditionStatement ]",
                    "         [ text:field \"workColophon\" ; text:predicate spec:workColophon ]",
                    "         ) ."

Was the "props are not indexed" warning above silent when you ran it?

I'm not sure what happens in the second step, but one thing that occurred to me from the example above is that there may have been leftover documents in the Lucene folder, if it wasn't deleted during debugging.

I don't think deleting existing Lucene documents is part of what the java reindexing command does. My information might be outdated or wrong on this, but we still delete the Lucene folder before running indexing on an offline database in our CI jobs.
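The "delete the Lucene folder before reindexing" habit mentioned above can be sketched in Python. This is only an illustration under stated assumptions: the helper names clear_index_dir and textindexer_command are hypothetical, and the command line simply mirrors the jena.textindexer invocation posted earlier in this thread.

```python
import shutil
from pathlib import Path


def clear_index_dir(index_dir: str) -> Path:
    """Delete a Lucene index directory and recreate it empty (hypothetical helper)."""
    path = Path(index_dir)
    if path.exists():
        shutil.rmtree(path)  # drop any stale documents left by earlier runs
    path.mkdir(parents=True)
    return path


def textindexer_command(fuseki_jar: str, assembler_file: str) -> list[str]:
    """Build the jena.textindexer command line used earlier in this thread."""
    return ["java", "-cp", fuseki_jar, "jena.textindexer", f"--desc={assembler_file}"]
```

After clearing the directory, the returned list could be passed to subprocess.run(..., check=True) while the Fuseki server is stopped, so the index is rebuilt from the dataset alone.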

See the two tests which pass at https://github.com/apache/jena/compare/main...OyvindLGjesdal:jena:debug-text-prop-not-working-in-some-cases

I didn't replicate your configuration in the test, so it could also be other stuff that breaks, but hope this helps.

OyvindLGjesdal commented 11 months ago

Could it be that it resolves to the same Lucene text:field "fulltext" from the text:map when it resolves the propListProp? The tests I copied from used different field names, while your example adds them all to the same fulltext field. This sounds more likely, and is maybe a bug? I'll try tonight and see whether replicating the example's text:map results in the same behavior.

filak commented 11 months ago

Thank you for looking into this @OyvindLGjesdal

I have updated the testing data and the config.

filak commented 11 months ago

Hmmm, so going back to my Observation 1...

The underscore issue is just a red herring - I apologize for the mistake.

I forgot to include a field in the text:map - mx:alt_label

  [ text:propListProp mt:testQuery ;
    text:props ( 
       rdfs:label
         mt:altLabel 
         mx:alt_label
       ) ;

<#entMap> 
    a text:EntityMap ;
    text:defaultField     "ftext" ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:graphField       "graph" ;
    text:map (
         [ text:field "ftext" ; text:predicate rdfs:label ]
         [ text:field "ftext" ; text:predicate mt:altLabel ]
         [ text:field "ftext" ; text:predicate mt:note ]
         ) .

So any query

select * where {
?s text:query (mt:testQuery "*") 
}

always returns 0 hits.

Is this the correct behaviour? There is a missing field (mx:alt_label) in the props, but also an existing field (mt:altLabel), so maybe the query should return some hits in this case?

Anyway, a prop field that is missing from the mapping seems to break things, so it should be avoided.
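A quick way to catch this kind of mismatch before indexing is to compare each property list against the entity map. The Python sketch below is not part of jena-text. It assumes the propLists and the text:map have been transcribed by hand into plain dictionaries, and the function name unmapped_props is hypothetical.

```python
def unmapped_props(prop_lists: dict[str, list[str]], text_map: dict[str, str]) -> dict[str, list[str]]:
    """Return, per property list, the predicates that have no text:map entry."""
    missing = {}
    for list_name, props in prop_lists.items():
        absent = [p for p in props if p not in text_map]
        if absent:
            missing[list_name] = absent
    return missing


# Hand-transcribed from the configuration in this comment (an assumption of this sketch).
PROP_LISTS = {"mt:testQuery": ["rdfs:label", "mt:altLabel", "mx:alt_label"]}
TEXT_MAP = {"rdfs:label": "ftext", "mt:altLabel": "ftext", "mt:note": "ftext"}
```

With the values above, unmapped_props(PROP_LISTS, TEXT_MAP) reports mx:alt_label under mt:testQuery, which is the predicate that was missing from the text:map.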

filak commented 11 months ago

The other problem - the query

# Query #3

select * where {
?s text:query (mt:defQuery "white") 
}

returns 1 hit.

I think this should return 0 hits - because the term white is contained in the mt:note field and this field is not included in text:props

        [ text:propListProp mt:defQuery ;
          text:props ( 
             rdfs:label
               mt:altLabel
             ) ;
        ]

rvesse commented 11 months ago

> The other problem - the query
>
> # Query #3
>
> select * where {
> ?s text:query (mt:defQuery "white") 
> }
>
> returning 1 hit.
>
> I think this should return 0 hits - because the term white is contained in the mt:note field and this field is not included in text:props
>
>         [ text:propListProp mt:defQuery ;
>           text:props ( 
>              rdfs:label
>                mt:altLabel
>              ) ;
>         ]

I think this one is caused by the issue identified in https://github.com/apache/jena/issues/2094#issuecomment-1831510414: you map several properties to the same field in the underlying Lucene index. Since the index doesn't record which property a piece of text originated from, a query on any of the properties that share a Lucene field can return documents that matched on the text of any of the other properties mapped to that field.

Not sure whether this is a bug or not. It appears to be a side effect of the design choices about how the data is indexed into Lucene. This should maybe be flagged with a configuration-time and/or query-time warning.

Making the query behave as you expect requires either changing your configuration to separate the properties into different fields, or changing how the jena-text code currently indexes and queries data (which would be a breaking change AFAICT).
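The effect of sharing one Lucene field can be illustrated with a toy model in Python. This is not how jena-text or Lucene is implemented. It is a simplified inverted index keyed by (field, token) that, like the behaviour described above, does not remember which predicate a token came from. The triples are the English literals from drinks.nt, and all names are made up for this sketch.

```python
# English literals from drinks.nt, abbreviated to (subject, predicate, text).
TRIPLES = [
    ("1", "rdfs:label", "beer"),
    ("1", "mt:altLabel", "pint"),
    ("1", "mt:note", "Booze is a pleasure"),
    ("2", "rdfs:label", "wine"),
    ("2", "mt:altLabel", "champagne"),
    ("2", "mt:note", "Red or white"),
]

DEF_QUERY_PROPS = ["rdfs:label", "mt:altLabel"]  # the mt:defQuery property list


def build_index(field_of):
    """Index lower-cased tokens by (field, token); the predicate is not kept."""
    index = {}
    for subject, predicate, text in TRIPLES:
        for token in text.lower().split():
            index.setdefault((field_of[predicate], token), set()).add(subject)
    return index


def search(index, field_of, props, term):
    """Resolve the props to their fields, then look the term up in each field."""
    hits = set()
    for field in {field_of[p] for p in props}:
        hits |= index.get((field, term.lower()), set())
    return hits


# Everything in one field versus notes in their own field.
SHARED = {"rdfs:label": "ftext", "mt:altLabel": "ftext", "mt:note": "ftext"}
SPLIT = {"rdfs:label": "ftext", "mt:altLabel": "ftext", "mt:note": "note"}
```

In this model, search(build_index(SHARED), SHARED, DEF_QUERY_PROPS, "white") finds subject 2 even though only labels were requested, while the same call with SPLIT finds nothing.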

filak commented 11 months ago

Maybe the docs need to be more specific about how to do the mapping...

I had started initially with the catch-all ftext field - ie

<#entMap>
...
text:map (
         [ text:field "ftext" ; text:predicate ...
         [ text:field "ftext" ; text:predicate ...
         ...

and queries like this

   ?s text:query ("whatever")

Then I realized I needed more control over the searching and I started trying to use propLists.

So should I map all the fields separately in the text:map and mix them in the propLists as needed?

What might make sense to me:

  1. Map the fields like this
    text:defaultField     "labels" ;
    ...
    text:map (
         [ text:field "labels" ; text:predicate rdfs:label ]
         [ text:field "labels" ; text:predicate mt:altLabel ]
         [ text:field "labels" ; text:predicate mt:alt_label ]
         [ text:field "labels" ; text:predicate mx:alt_label ]
         [ text:field "notes"  ; text:predicate mx:note ]
         [ text:field "notes"  ; text:predicate mt:note2 ]
         [ text:field "notes"  ; text:predicate mt:note1 ]
  2. Mix and match the labels and the notes as needed in the propLists - i.e.
    text:propLists (
        [ text:propListProp mt:defQuery ;
          text:props ( 
             labels
             ) ;
        ]
        [ text:propListProp mt:includeNotes ;
          text:props (
             labels 
             notes
             ) ;
        ]

But I have no clue if that is feasible at all or what prefix I should use in this case.

filak commented 11 months ago

I have modified the config:

    text:propLists (
        [ text:propListProp mt:defQuery ;
          text:props ( 
             rdfs:label
             mt:altLabel
             mt:alt_label
             mx:alt_label
             ) ;
        ]
        [ text:propListProp mt:includeNotes ;
          text:props (
               rdfs:label
               mt:altLabel
               mt:alt_label
               mx:alt_label           
               mt:note
             ) ;
        ]
    ) ;
     .

<#entMap> 
    a text:EntityMap ;
    text:defaultField     "ftext" ;
    text:entityField      "uri" ;
    text:uidField         "uid" ;
    text:langField        "lang" ;
    text:graphField       "graph" ;
    text:map (
         [ text:field "ftext" ; text:predicate rdfs:label ]
         [ text:field "ftext" ; text:predicate mt:altLabel ]
         [ text:field "ftext" ; text:predicate mt:alt_label ]
         [ text:field "ftext" ; text:predicate mx:alt_label ]
         [ text:field "note" ; text:predicate mt:note ]
         ) .

Now the queries work as expected!

  ?s text:query (mt:defQuery "white")  => 0 hits

And this also works:

 ?s text:query ("beer white") => 1 hit - <http://id.example.test/1>
 ?s text:query ("white") => 0 hits
 ?s text:query (mt:includeNotes "white beer") => 2 hits (ID 1 + 2)

However these are weird:

 ?s text:query (mt:includeNotes "red booze") => 1 hit - <http://id.example.test/2> ??
 ?s text:query (mt:includeNotes "booze red") => 1 hit - <http://id.example.test/1> ??

rvesse commented 11 months ago

> However these are weird:
>
>  ?s text:query (mt:includeNotes "red booze") => 1 hit - <http://id.example.test/2> ??
>  ?s text:query (mt:includeNotes "booze red") => 1 hit - <http://id.example.test/1> ??

Yeah, those still look off. Something odd seems to be happening, since the order of terms in your query affects the results returned. Glancing at your sample data, that query really should match both AFAICT.

Could you try increasing your log level to TRACE? Looking at the jena-text code, it should give a lot of detail about the Lucene query being built at that level.

filak commented 11 months ago

OK, I started the Fuseki jar with the --debug option, and here is the log after running the query:

15:29:37 INFO  Fuseki          :: [5] Query =
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX mt:   <http://id.example.test/vocab/#>

select * where {
?s text:query (mt:includeNotes "red booze")
}
15:29:37 TRACE TextQueryPF     :: exec: ?s text:query (<http://id.example.test/vocab/#includeNotes> "red booze")
15:29:37 TRACE TextQueryPF     :: objectToStruct: x.isURI(), prop: http://id.example.test/vocab/#includeNotes at idx: 0
15:29:37 TRACE TextQueryPF     :: objectToStruct: PROPERTY at 0 IS http://id.example.test/vocab/#includeNotes WITH pList: [http://www.w3.org/2000/01/rdf-schema#label, http://id.example.test/vocab/#altLabel, http://id.example.test/vocab/#alt_label, http://id.example.test/mx/#alt_label, http://id.example.test/vocab/#note]
15:29:37 TRACE TextQueryPF     :: prepareQuery with subject: ?s; params: ( properties: [http://www.w3.org/2000/01/rdf-schema#label, http://id.example.test/vocab/#altLabel, http://id.example.test/vocab/#alt_label, http://id.example.test/mx/#alt_label, http://id.example.test/vocab/#note]; query: red booze; limit: -1; lang: null; highlight: null )
15:29:37 DEBUG TextQueryPF     :: Text query: red booze <urn:x-arq:DefaultGraphNode> (-1)
15:29:37 TRACE TextQueryPF     :: Caching Text query: red booze with key: >>?s -1 [http://www.w3.org/2000/01/rdf-schema#label, http://id.example.test/vocab/#altLabel, http://id.example.test/vocab/#alt_label, http://id.example.test/mx/#alt_label, http://id.example.test/vocab/#note] red booze null urn:x-arq:DefaultGraphNode<< in cache: org.apache.jena.atlas.lib.cache.CacheCaffeine@2a457ab1
15:29:37 TRACE TextIndexLucene :: query$ PROCESSING LIST of properties: [http://www.w3.org/2000/01/rdf-schema#label, http://id.example.test/vocab/#altLabel, http://id.example.test/vocab/#alt_label, http://id.example.test/mx/#alt_label, http://id.example.test/vocab/#note]; Lucene queryString: ; textFields: [ftext, ftext, ftext, ftext, note]
15:29:37 TRACE TextIndexLucene :: query$ PROCESSED LIST of properties: [http://www.w3.org/2000/01/rdf-schema#label, http://id.example.test/vocab/#altLabel, http://id.example.test/vocab/#alt_label, http://id.example.test/mx/#alt_label, http://id.example.test/vocab/#note] with resulting qString: ftext:red booze ftext:red booze ftext:red booze ftext:red booze note:red booze
15:29:37 WARN  TextIndexLucene :: Deprecated query parser type 'AnalyzingQueryParser'. Defaulting to standard QueryParser
15:29:37 DEBUG TextIndexLucene :: query$ with LIST: [http://www.w3.org/2000/01/rdf-schema#label, http://id.example.test/vocab/#altLabel, http://id.example.test/vocab/#alt_label, http://id.example.test/mx/#alt_label, http://id.example.test/vocab/#note]; INPUT qString: (ftext:red booze ftext:red booze ftext:red booze ftext:red booze note:red booze ) AND graph:urn\:x\-arq\:DefaultGraphNode; with queryParserType: AnalyzingQueryParser; parseQuery with PerFieldAnalyzerWrapper({lang=org.apache.lucene.analysis.core.KeywordAnalyzer@5e0c4f21, uri=org.apache.lucene.analysis.core.KeywordAnalyzer@2c18a3ea, graph=org.apache.lucene.analysis.core.KeywordAnalyzer@166c2c17}, default=MultilingualAnalyzer(default=org.apache.jena.query.text.analyzer.ConfigurableAnalyzer@1df5c7e3)) YIELDS: +(ftext:red ftext:booze ftext:red ftext:booze ftext:red ftext:booze ftext:red ftext:booze note:red ftext:booze) +graph:urn:x-arq:DefaultGraphNode; parsed query: +(ftext:red ftext:booze ftext:red ftext:booze ftext:red ftext:booze ftext:red ftext:booze note:red ftext:booze) +graph:urn:x-arq:DefaultGraphNode; limit: 10000
15:29:37 TRACE TextIndexLucene :: simpleResults[8]: fields: [ftext, ftext, ftext, ftext, note] doc: Document<stored,indexed,tokenized,indexOptions=DOCS<uri:http://id.example.test/2> stored,indexed,tokenized,indexOptions=DOCS<graph:urn:x-arq:DefaultGraphNode> stored,indexed,tokenized<note:Red or white> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<lang:en> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<uid:48bc17f6921b4efff3f082a027a3e2c11037e9262ab743ed174587619543f767>>
15:29:37 TRACE TextQueryPF     :: resultsToQueryIterator CALLED with results: [TextHit{node=http://id.example.test/2 literal="Red or white"@en score=0.58286893 graph=urn:x-arq:DefaultGraphNode prop=http://id.example.test/vocab/#note}]

rvesse commented 11 months ago

ftext:red booze ftext:red booze ftext:red booze ftext:red booze note:red booze

So that looks like a bug to me.

The generated Lucene query is not properly quoting the search string when applying it to each field. Per https://lucene.apache.org/core/9_8_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Fields this means the query is only searching for red in the ftext field and booze in the default field, which does however happen to be ftext judging by the parsed query:

parsed query: +(ftext:red ftext:booze ftext:red ftext:booze ftext:red ftext:booze ftext:red ftext:booze note:red ftext:booze) +graph:urn:x-arq:DefaultGraphNode; limit: 10000

This means that only the first word in your query gets queried in the note field which is why the order of the terms in the query affects the results.

@OyvindLGjesdal does that look like a valid analysis to you?

It also looks like we generate duplicate query clauses when multiple properties map to the same Lucene field which might be unnecessary?
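The analysis above can be reproduced with a small Python toy model. It is neither Lucene's query parser nor jena-text code. It assumes, as a simplification, that a whitespace-separated term without a field prefix falls back to the default field, which is what the parsed query in the log suggests. The function names naive_query_string and toy_parse are hypothetical.

```python
def naive_query_string(fields, query):
    """Mimic the qString in the TRACE log: '<field>:<query>' repeated per field."""
    return " ".join(f"{field}:{query}" for field in fields)


def toy_parse(qstring, default_field):
    """Rough model of field assignment for unquoted, ungrouped terms.

    Only a term written as 'field:term' keeps that field; every other
    term is assigned to the default field. Quotes, operators and
    escaping are deliberately ignored in this sketch.
    """
    clauses = []
    for token in qstring.split():
        if ":" in token:
            field, term = token.split(":", 1)
        else:
            field, term = default_field, token
        clauses.append((field, term))
    return clauses


# Fields resolved for mt:includeNotes, as listed in the TRACE log.
FIELDS = ["ftext", "ftext", "ftext", "ftext", "note"]
```

In this model, toy_parse(naive_query_string(FIELDS, "red booze"), "ftext") ends with note:red and ftext:booze, whereas "booze red" ends with note:booze and ftext:red. Only the first term ever reaches the note field, which is consistent with the order-dependent hits reported above.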

rvesse commented 11 months ago

Although I'm not sure the fix is to just quote the search string, because it could itself already be a complex query, e.g. "red wine" OR "white beer", which wouldn't work if we blindly surround it with double quotes.

Maybe Field Grouping is the solution, i.e.

ftext:(Red booze) note:(Red booze)

??
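A minimal Python sketch of the field-grouping idea follows. It is only an illustration of the suggestion in this comment, not the change that was eventually made in jena-text, and the function name grouped_query_string is hypothetical. It also drops repeated fields, which addresses the duplicate clauses noted earlier.

```python
def grouped_query_string(fields, query):
    """Wrap the whole query in a field group, once per distinct field."""
    unique_fields = list(dict.fromkeys(fields))  # de-duplicate, keep first-seen order
    return " ".join(f"{field}:({query})" for field in unique_fields)
```

Because the user's query is wrapped in parentheses rather than quotes, a complex input such as "red wine" OR "white beer" is passed through unchanged inside each group.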

OyvindLGjesdal commented 11 months ago

> ftext:red booze ftext:red booze ftext:red booze ftext:red booze note:red booze
>
> So that looks like a bug to me.
>
> The generated Lucene query is not properly quoting the search string when applying it to each field. Per https://lucene.apache.org/core/9_8_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Fields this means the query is only searching for red in the ftext field and booze in the default field, which does however happen to be ftext judging by the parsed query:
>
> parsed query: +(ftext:red ftext:booze ftext:red ftext:booze ftext:red ftext:booze ftext:red ftext:booze note:red ftext:booze) +graph:urn:x-arq:DefaultGraphNode; limit: 10000
>
> This means that only the first word in your query gets queried in the note field which is why the order of the terms in the query affects the results.
>
> @OyvindLGjesdal does that look like a valid analysis to you?
>
> It also looks like we generate duplicate query clauses when multiple properties map to the same Lucene field which might be unnecessary?

@rvesse This looks like a valid analysis to me, though this part of the code is also unfamiliar to me and I'm going by the links you posted. But it does line up perfectly with the bug, and the solution looks good. I guess it would also handle inner logic in nested parentheses.

Thanks for the verbose output and experimenting @filak

I can try to create a test for this during the weekend.

This is probably only a bug in the propList, and not with the use of normal properties? I guess it would have been reported and noticed and caught by tests if this was also in rdfs:label red booze.

On a second note, we should probably update the docs to remove the examples that set a custom queryParser that is no longer present in Apache Lucene (text:queryParser text:AnalyzingQueryParser ;), which triggers this warning:

15:29:37 WARN  TextIndexLucene :: Deprecated query parser type 'AnalyzingQueryParser'. Defaulting to standard QueryParser

When searching the web, AnalyzingQueryParser only shows up in older versions of the Javadocs.

rvesse commented 11 months ago

> This is probably only a bug in the propList, and not with the use of normal properties? I guess it would have been reported and noticed and caught by tests if this was also in rdfs:label red booze.

Not sure. Like you, I'm not too familiar with these parts of the codebase. It's probably also worth concocting a test case to validate.

OyvindLGjesdal commented 11 months ago

I think I can confirm your analysis @rvesse

If I change the default field to text:defaultField "comment",

the parsed expression in the verbose output falls back to comment (the default field) for booze, and only the first word, red, is paired with its text field.

+(ftext:red comment:booze ftext:red comment:booze ftext:red comment:booze ftext:red comment:booze note:red comment:booze)

@filak a workaround could be to put () around the text query red booze. I seem to get the expected result from the query when I do that.

 "SELECT ?s",
                "WHERE {",
                "  ?s text:query ( mt:includeNotes \"(red booze)\" ) . ",
                "}"

This is the output

22:14:28 DEBUG TextIndexLucene :: query$ with LIST: [http://www.w3.org/2000/01/rdf-schema#label, http://id.example.test/vocab/#altLabel, http://id.example.test/vocab/#alt_label, http://id.example.test/mx/#alt_label, http://id.example.test/vocab/#note]; INPUT qString: (ftext:(red booze) ftext:(red booze) ftext:(red booze) ftext:(red booze) note:(red booze) ) AND graph:urn\:x\-arq\:DefaultGraphNode; with queryParserType: AnalyzingQueryParser; parseQuery with PerFieldAnalyzerWrapper({lang=org.apache.lucene.analysis.core.KeywordAnalyzer@59532566, uri=org.apache.lucene.analysis.core.KeywordAnalyzer@dca2615, graph=org.apache.lucene.analysis.core.KeywordAnalyzer@421a4ee1}, default=MultilingualAnalyzer(default=org.apache.jena.query.text.analyzer.ConfigurableAnalyzer@4f63e3c7)) YIELDS: +((ftext:red ftext:booze) (ftext:red ftext:booze) (ftext:red ftext:booze) (ftext:red ftext:booze) (note:red note:booze)) +graph:urn:x-arq:DefaultGraphNode; parsed query: +((ftext:red ftext:booze) (ftext:red ftext:booze) (ftext:red ftext:booze) (ftext:red ftext:booze) (note:red note:booze)) +graph:urn:x-arq:DefaultGraphNode; limit: 10000
22:14:28 TRACE TextIndexLucene :: simpleResults[10]: fields: [ftext, ftext, ftext, ftext, note] doc: Document<stored,indexed,tokenized,indexOptions=DOCS<uri:http://id.example.test/2> stored,indexed,tokenized,indexOptions=DOCS<graph:urn:x-arq:DefaultGraphNode> stored,indexed,tokenized<note:Red or white> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<lang:en> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<uid:48bc17f6921b4efff3f082a027a3e2c11037e9262ab743ed174587619543f767>>
22:14:28 TRACE TextIndexLucene :: simpleResults[3]: fields: [ftext, ftext, ftext, ftext, note] doc: Document<stored,indexed,tokenized,indexOptions=DOCS<uri:http://id.example.test/1> stored,indexed,tokenized,indexOptions=DOCS<graph:urn:x-arq:DefaultGraphNode> stored,indexed,tokenized<note:Booze is a pleasure> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<lang:en> stored,indexed,tokenized,omitNorms,indexOptions=DOCS<uid:8df507c91a27f4bb554f97c7b5c6b980c48012ab2e23132a879189bbce05fc18>>
22:14:28 TRACE TextQueryPF :: resultsToQueryIterator CALLED with results: [TextHit{node=http://id.example.test/2 literal="Red or white"@en score=0.58286893 graph=urn:x-arq:DefaultGraphNode prop=http://id.example.test/vocab/#note}, TextHit{node=http://id.example.test/1 literal="Booze is a pleasure"@en score=0.51788014 graph=urn:x-arq:DefaultGraphNode prop=http://id.example.test/vocab/#note}]
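The parenthesis workaround shown above could be wrapped in a small client-side helper. The Python sketch below is hypothetical (the name text_query_pattern is not part of any Jena API). It only builds the SPARQL triple pattern as a string, and it assumes the user's input is a valid Lucene query on its own.

```python
def text_query_pattern(prop, user_query, subject="?s"):
    """Build a text:query pattern with the user's query wrapped in parentheses."""
    grouped = f"({user_query})"
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = grouped.replace("\\", "\\\\").replace('"', '\\"')
    return f'{subject} text:query ( {prop} "{escaped}" ) .'
```

For example, text_query_pattern("mt:includeNotes", "red booze") yields the same pattern as the test query above, with (red booze) as the search string.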

However, this looks like a bug. The intent seems clear, also from the examples in the docs: the propList should apply to the entire quoted expression and not just the first word.

The same bug happens when using a single property:

"SELECT ?s",
"WHERE {",
"  ?s text:query ( mt:note \"booze red\" ) . ",
"}"
INPUT qString: (note:booze red ) AND graph:urn\:x\-arq\:DefaultGraphNode; with queryParserType: AnalyzingQueryParser; parseQuery with PerFieldAnalyzerWrapper({lang=org.apache.lucene.analysis.core.KeywordAnalyzer@59532566, uri=org.apache.lucene.analysis.core.KeywordAnalyzer@dca2615, graph=org.apache.lucene.analysis.core.KeywordAnalyzer@421a4ee1}, default=MultilingualAnalyzer(default=org.apache.jena.query.text.analyzer.ConfigurableAnalyzer@4f63e3c7)) YIELDS: +(note:booze comment:red) +graph:urn:x-arq:DefaultGraphNode; parsed query: +(note:booze comment:red) +graph:urn:x-arq:DefaultGraphNode; limit: 10000

and just one hit:

 [TextHit{node=http://id.example.test/1 literal="Booze is a pleasure"@en score=0.51788014 graph=urn:x-arq:DefaultGraphNode prop=http://id.example.test/vocab/#note}]

The bug also remains if the assembler config is minimized to use only the default text configuration.

https://github.com/apache/jena/compare/main...OyvindLGjesdal:debug-text-prop-not-working-in-some-cases?expand=1

filak commented 9 months ago

Any updates on this @OyvindLGjesdal ?

OyvindLGjesdal commented 9 months ago

I think the pull request is complete from my side, and is in the process of being reviewed.