buda-base / jena-issues

place to register issues with jena before raising at apache

highlight lucene-zh fails #2

Closed xristy closed 6 years ago

xristy commented 6 years ago

Enabling highlighting for lucene-zh as in:

select ?s ?lit
where {
    GRAPH bdr:testlucenezh {
        (?s ?sc ?lit) text:query ( rdfs:label "\"zhū\""@zh-latn-pinyin "highlight:" )
    }
} order by ?s ?sc ?lit

yields empty strings even though expected results are found in the search.

Here is a trace of a use of highlighting with lucene-bo that works:

lucene-bo_test1.txt

Here is a trace of lucene-zh that produces empty string rather than expected highlighting:

lucene-zh_test1.txt

Other than adding a bit of tracing to what Élie had already added, I modified the highlightResults method to use the same QueryAnalyzer as the one used by query$ in the search. This did not solve the problem, but it did appear to improve the tokenStream.

Not sure what the next step is.

eroux commented 6 years ago

I think I've found the bug: if you look at the passing test of lucene-zh highlighter, it shows that the result from the doc should be analyzed with the indexing analyzer, not the query analyzer.

So on this line, qa should be replaced by the indexing analyzer.

I tried to make the change, but I was not really sure how to access the right indexing analyzer for the field. I think it should be relatively easy for you, though. I'll be offline most of the weekend, but I hope this helps! (The same bug probably hinders the highlighting for the lenient Sanskrit too.)
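To illustrate why the choice of analyzer matters here, the following is a minimal toy sketch in plain Java, not the Jena or lucene-zh code. It assumes a hypothetical indexing analyzer that turns each Han character into a pinyin term and a hypothetical query analyzer that only normalizes pinyin input; the class name, the two-entry character table, and `countHighlightable` are all made up for the illustration.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class HighlightAnalyzerSketch {
    // Toy data (assumption): U+6731 maps to "zhu", U+71B9 maps to "xi".
    private static final Map<String, String> HAN_TO_PINYIN =
            Map.of("\u6731", "zhu", "\u71B9", "xi");

    // Toy stand-in for an indexing analyzer: one pinyin term per Han character.
    static List<String> indexAnalyze(String text) {
        List<String> terms = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            terms.add(HAN_TO_PINYIN.getOrDefault(ch, ch));
        });
        return terms;
    }

    // Toy stand-in for a query analyzer: strips tone marks, lowercases,
    // and splits on whitespace. It knows nothing about Han characters.
    static List<String> queryAnalyze(String text) {
        String plain = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "")
                .toLowerCase(Locale.ROOT);
        return Arrays.asList(plain.trim().split("\\s+"));
    }

    // Counts the terms of the stored literal that equal some query term,
    // i.e. the terms a highlighter would be able to mark.
    static long countHighlightable(String literal, String query,
                                   Function<String, List<String>> literalAnalyzer) {
        List<String> queryTerms = queryAnalyze(query);
        return literalAnalyzer.apply(literal).stream()
                .filter(queryTerms::contains)
                .count();
    }

    public static void main(String[] args) {
        String literal = "\u6731\u71B9"; // stored Han literal
        String query = "zh\u016B";       // pinyin query with a tone mark
        // Literal analyzed with the indexing analyzer: one term matches.
        System.out.println(countHighlightable(literal, query, HighlightAnalyzerSketch::indexAnalyze));
        // Literal analyzed with the query analyzer: nothing matches.
        System.out.println(countHighlightable(literal, query, HighlightAnalyzerSketch::queryAnalyze));
    }
}
```

In this toy setup the stored literal only yields terms comparable to the query when it goes through the indexing side, which is the situation described above: the search finds the document, but the highlighter has nothing to mark.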

xristy commented 6 years ago

I also have a trace from lucene-sa test2, to which I added highlighting; it shows 1 out of 4 hits being correctly highlighted:

lucene-sa_test2.txt

I got this result by using the indexAnalyzer in highlightResults.

I'm not sure what is happening, but the problem seems to be a need to map backwards from the indexed ndia terms to sa-x-deva or sa-x-iast. I just don't quite see what is wrong or missing in the TextIndexLucene TRACE highlightResults[nn].

In the case of TEST0 and TEST1, it might be that the effectiveField should be rdfsLabel_sa-aux-roman2Ndia, since that would represent the sa-x-iast in the same terms as the query string, rather than rdfsLabel_sa-x-iast, which is what the current highlightResults is computing. Right now I don't see how to accomplish that other than analyzing the text:auxIndex (if any) associated with the docLang. This would have to be retrieved from Util.getAuxIndexes(lang) and the list examined for a tag related to the docLang.

How do we know that a text:auxIndex should be used? It would seem that this has to be decided by comparing the docLang and the query lang and knowing something about the underlying index term representation. In other words, if the query is in sa-x-iast and the docLang is sa-x-deva, then presumably it will work simply because the index term representation in both cases is sa-x-slp1.
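The comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the table of language tags and their index term representations is hypothetical (the "ndia" label in particular is a placeholder), and nothing here reflects how Util or TextIndexLucene actually store this information.

```java
import java.util.Map;

final class TermRepresentationSketch {
    // Hypothetical table: language tag -> representation of the terms
    // that the tag's analyzer writes to the index.
    private static final Map<String, String> TERM_REPRESENTATION = Map.of(
            "sa-x-iast", "sa-x-slp1",
            "sa-x-deva", "sa-x-slp1",
            "sa-aux-roman2Ndia", "ndia"); // placeholder label, an assumption

    // True when both tags are known and index their terms the same way,
    // so the docLang field can be highlighted directly. Unknown tags are
    // treated as "not shared".
    static boolean sharesTermRepresentation(String queryLang, String docLang) {
        String q = TERM_REPRESENTATION.get(queryLang);
        String d = TERM_REPRESENTATION.get(docLang);
        return q != null && q.equals(d);
    }

    public static void main(String[] args) {
        System.out.println(sharesTermRepresentation("sa-x-iast", "sa-x-deva"));
        System.out.println(sharesTermRepresentation("sa-x-iast", "sa-aux-roman2Ndia"));
    }
}
```

When the two representations differ, the highlighter would presumably need the aux index field rather than the docLang field.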

So I tried the test2 request.arq but changed the language tag to sa-x-iast:

(?s ?sc ?lit) text:query ( rdfs:label "\"sa\""@sa-x-iast "highlight:" )

That produced zero hits. Why? Because the index has word entries such as saNgIti, which isn't going to match sa, which in turn gets transformed to tad by the query analyzer and parser:

[2018-10-13 20:26:26] TextIndexLucene DEBUG Lucene 
  queryString: 
    (rdfsLabel_sa-x-iast:"sa" rdfsLabel_sa-x-slp1:"sa" rdfsLabel_sa-x-iso:"sa" rdfsLabel_sa-deva:"sa" rdfsLabel_sa-alalc97:"sa" ) 
    AND 
    graph:http\:\/\/purl.bdrc.io\/resource\/testlucenesa, 
  parsed query: 
    +(rdfsLabel_sa-x-iast:tad rdfsLabel_sa-x-slp1:tad rdfsLabel_sa-x-iso:tad rdfsLabel_sa-deva:tad rdfsLabel_sa-alalc97:tad) 
    +graph:http://purl.bdrc.io/resource/testlucenesa,
    limit:10000

[2018-10-13 20:26:26] TextQueryPF TRACE resultsToQueryIterator: []

and tad matches nothing. This looks like a blocking sort of situation brought on by max-match perhaps?

I tried running lucene-zh/test1, which fails with the indexAnalyzer during highlighting. I don't understand why:

lucene-zh_test1c.txt

I've stared at this for a few hours and will have to worry about it later. Maybe you can make some headway.

eroux commented 6 years ago

The lucene-zh trace shows that the analyzer used to analyze the literal is not the right one (it should be TC2Pynyin, not TC2SC). Can you push the code you used in the Jena repo? (I didn't look at the Sanskrit; we'll see if it's fixed once the Chinese is.)

eroux commented 6 years ago

Thanks for the push. The problem now is that effectiveField is rdfsLabel_zh-hant, while it should be rdfsLabel_zh-aux-han2pinyin for the highlighter to work. It's a bit puzzling that the Lucene results don't indicate which field matched, but that can be inferred in the following manner:

String effectiveField = useDocLang ? field + "_" + Util.getEffectiveField(queryLang, docLang) : field;
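For readers without the code at hand, here is a self-contained sketch of what the expression above computes. The `getEffectiveField` method below is a hypothetical stand-in written for this example; the real Util.getEffectiveField may be implemented quite differently. The single table entry is an assumption based on the field names quoted in this thread.

```java
import java.util.Map;

final class EffectiveFieldSketch {
    // Hypothetical lookup: "queryLang|docLang" -> language suffix of the
    // field that actually matched. Only the case from this thread is listed.
    private static final Map<String, String> AUX_FIELD = Map.of(
            "zh-latn-pinyin|zh-hant", "zh-aux-han2pinyin");

    // Hypothetical stand-in for Util.getEffectiveField: use the aux field
    // when one is registered for the language pair, else the docLang itself.
    static String getEffectiveField(String queryLang, String docLang) {
        return AUX_FIELD.getOrDefault(queryLang + "|" + docLang, docLang);
    }

    // Same shape as the expression quoted above.
    static String effectiveField(String field, boolean useDocLang,
                                 String queryLang, String docLang) {
        return useDocLang ? field + "_" + getEffectiveField(queryLang, docLang) : field;
    }

    public static void main(String[] args) {
        // prints rdfsLabel_zh-aux-han2pinyin
        System.out.println(effectiveField("rdfsLabel", true, "zh-latn-pinyin", "zh-hant"));
        // prints rdfsLabel_zh-hant
        System.out.println(effectiveField("rdfsLabel", true, "zh-hant", "zh-hant"));
        // prints rdfsLabel
        System.out.println(effectiveField("rdfsLabel", false, "zh-latn-pinyin", "zh-hant"));
    }
}
```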

eroux commented 6 years ago

working now