xristy closed this issue 6 years ago
I think I've found the bug: if you look at the passing test of lucene-zh highlighter, it shows that the result from the doc should be analyzed with the indexing analyzer, not the query analyzer.
So on this line, qa should be replaced by the indexing analyzer.
I tried to make the change, but I was not really sure how to access the right indexing analyzer for the field... I think it should be relatively easy for you though... I'll be offline most of the weekend, but I hope this helps! (It probably hinders the highlighting for the lenient Sanskrit too.)
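To illustrate why the analyzer choice matters for highlighting, here is a minimal toy sketch using only the Java standard library, not the actual Jena or Lucene code. The class, the method names, and the two-character pinyin table are all hypothetical. The point it tries to show: the highlighter re-analyzes the stored literal, so the tokens it produces only line up with the query terms when the literal goes through the indexing analyzer.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class HighlightAnalyzerSketch {
    // Hypothetical toy table: U+4E2D -> "zhong", U+570B -> "guo".
    static final Map<String, String> HAN_TO_PINYIN =
            Map.of("\u4e2d", "zhong", "\u570b", "guo");

    // Toy indexing analyzer: turns a Chinese character into its pinyin term.
    public static List<String> indexAnalyze(String text) {
        return List.of(HAN_TO_PINYIN.getOrDefault(text, text));
    }

    // Toy query analyzer: lowercases and splits pinyin input on whitespace.
    public static List<String> queryAnalyze(String text) {
        return List.of(text.trim().toLowerCase().split("\\s+"));
    }

    // Re-analyzes the stored text one character at a time and wraps every
    // character whose analyzed form is one of the query terms.
    public static String highlight(String stored, Set<String> queryTerms,
                                   Function<String, List<String>> analyzer) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < stored.length(); i++) {
            String ch = String.valueOf(stored.charAt(i));
            boolean hit = analyzer.apply(ch).stream().anyMatch(queryTerms::contains);
            out.append(hit ? "<em>" + ch + "</em>" : ch);
        }
        return out.toString();
    }
}
```

With the query term "zhong", passing indexAnalyze to highlight marks the first character of the stored literal, while passing queryAnalyze leaves the literal unmarked, because the query analyzer never produces pinyin from Chinese characters.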
I also have a trace from lucene-sa test2 which I added highlight to and it shows 1 out of 4 hits being correctly highlighted:
I got this result by using the indexAnalyzer in the highlightResults
I'm not sure what is happening, but it seems the problem is the need to map backwards from the ndia terms that are indexed to the sa-x-deva or sa-x-iast, but honestly I'm not sure. I just don't quite see what is wrong or missing from the TextIndexLucene TRACE highlightResults[nn].
In the case of TEST0 and TEST1, it might be that the effectiveField should be rdfsLabel_sa-aux-roman2Ndia, since that would represent the sa-x-iast in the same terms as the query string, rather than rdfsLabel_sa-x-iast, which is what the current highlightResults is computing. Just now I don't see how to accomplish that other than by analyzing the text:auxIndex (if any) associated with the docLang. This would have to be retrieved from Util.getAuxIndexes(lang) and the list examined for a tag related to the docLang.
How to know that a text:auxIndex should be used? It would seem that has to be by comparing the docLang and the query lang and knowing something about the underlying index term representation. In other words, if the query is in sa-x-iast and the docLang is sa-x-deva, then presumably it will work simply because the index term representation in both cases is sa-x-slp1.
So I tried the test2 request.arq but changed the language tag to sa-x-iast:
(?s ?sc ?lit) text:query ( rdfs:label "\"sa\""@sa-x-iast "highlight:" )
That produced zero hits. Why? Because the index has word entries such as saNgIti, which isn't going to match sa, which gets transformed to tad by the query analyzer and parser:
[2018-10-13 20:26:26] TextIndexLucene DEBUG Lucene
queryString:
(rdfsLabel_sa-x-iast:"sa" rdfsLabel_sa-x-slp1:"sa" rdfsLabel_sa-x-iso:"sa" rdfsLabel_sa-deva:"sa" rdfsLabel_sa-alalc97:"sa" )
AND
graph:http\:\/\/purl.bdrc.io\/resource\/testlucenesa,
parsed query:
+(rdfsLabel_sa-x-iast:tad rdfsLabel_sa-x-slp1:tad rdfsLabel_sa-x-iso:tad rdfsLabel_sa-deva:tad rdfsLabel_sa-alalc97:tad)
+graph:http://purl.bdrc.io/resource/testlucenesa,
limit:10000
[2018-10-13 20:26:26] TextQueryPF TRACE resultsToQueryIterator: []
and tad matches nothing. This looks like a blocking sort of situation, perhaps brought on by max-match?
I tried running the lucene-zh/test1, which fails with the indexAnalyzer during highlighting. I have no understanding of why:
I've stared at this for a few hours and will have to worry about it later. Maybe you can make some headway.
The lucene-zh trace shows that the analyzer used to analyze the literal is not the right one (it should be TC2Pynyin, not TC2SC). Can you push the code you used in the Jena repo? (I didn't look at the Sanskrit; we'll see if it's fixed once the Chinese is.)
Thanks for the push. The problem now is that effectiveField is rdfsLabel_zh-hant while it should be rdfsLabel_zh-aux-han2pinyin for the highlighter to work... It's a bit puzzling that the lucene results don't indicate which field matched... but that can be inferred in the following manner:
- pass the lang parameter to highlightResults() as queryLang
- add a Util.getEffectiveField(queryLang, docLang) that does some magic with the configuration so that getEffectiveField("zh-latn-pinyin", "zh-hant") returns zh-aux-han2pinyin (I think that shouldn't be too complex and could be pre-computed)
- replace
  String effectiveField = useDocLang ? field + "_" + docLang : field;
  (L557) with:
  String effectiveField = useDocLang ? field + "_" + Util.getEffectiveField(queryLang, docLang) : field;
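The getEffectiveField idea above could be sketched roughly as follows. This is only an assumption about its shape, not the code that ended up in Jena: the class name, the hard-coded table, and the "queryLang|docLang" key format are hypothetical. In a real implementation the table would presumably be pre-computed from the text:auxIndex configuration rather than written by hand.

```java
import java.util.Locale;
import java.util.Map;

class EffectiveFieldSketch {
    // Hypothetical pre-computed table: "queryLang|docLang" -> tag of the
    // auxiliary index whose terms match the query's representation.
    static final Map<String, String> AUX_TAGS =
            Map.of("zh-latn-pinyin|zh-hant", "zh-aux-han2pinyin");

    // Returns the language tag to append to the field name. Falls back to
    // docLang when the languages are equal or no auxiliary index is known.
    public static String getEffectiveField(String queryLang, String docLang) {
        if (queryLang == null || docLang == null
                || queryLang.equalsIgnoreCase(docLang)) {
            return docLang;
        }
        String key = queryLang.toLowerCase(Locale.ROOT) + "|"
                + docLang.toLowerCase(Locale.ROOT);
        return AUX_TAGS.getOrDefault(key, docLang);
    }

    // Mirrors the proposed replacement for the effectiveField line.
    public static String effectiveField(String field, boolean useDocLang,
                                        String queryLang, String docLang) {
        return useDocLang
                ? field + "_" + getEffectiveField(queryLang, docLang)
                : field;
    }
}
```

Falling back to docLang keeps the current behaviour for every language pair that has no auxiliary index, so only pairs listed in the table would change which field the highlighter analyzes.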
working now
Enabling highlighting for lucene-zh as in:
yields empty strings even though expected results are found in the search.
Here is a trace of a use of highlighting with lucene-bo that works:
lucene-bo_test1.txt
Here is a trace of lucene-zh that produces empty string rather than expected highlighting:
lucene-zh_test1.txt
Other than adding a bit of tracing to what Élie has already added, I modified the highlightResults method to use the same QueryAnalyzer as that used by query$ in the search. This did not solve the problem but did appear to improve the tokenStream.
Not sure what the next step is.