buda-base / public-digital-library

http://library.bdrc.io

not enough results in some etext searches #397

Open eroux opened 3 years ago

eroux commented 3 years ago

No action required right now, but I'm tracking it here as it's a multi-faceted issue:

A search on the text rnam 'joms in gsung 'bum/_kun dga' bzang po (W11577) on tbrc.org yields 34 results, while the same search on bdrc.io is a bit random: in that case it searches across all our etexts but limits the number of results of each Lucene search to 1000. For the OpenPechas this is not an issue, as the graphs are organized differently and the Lucene search is scoped to one graph... but here it's an issue. There are 3 solutions:
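A small sketch (with invented data and names) of why this cap loses results: when the search runs across all etexts and Lucene stops at 1000 hits, matches for the target instance that rank beyond the cap are dropped before the per-instance filter is applied.

```python
# Hypothetical simulation of "global search capped at N, then filter by instance".
LUCENE_LIMIT = 1000

# Simulated ranked result list: 5000 chunks, 34 of which belong to the
# target instance W11577, scattered throughout the list.
all_hits = [{"instance": "W11577" if i % 150 == 0 else "other", "chunk": i}
            for i in range(5000)]

def search_then_filter(hits, limit):
    """Lucene truncates at `limit` first; filtering happens afterwards."""
    capped = hits[:limit]
    return [h for h in capped if h["instance"] == "W11577"]

print(len([h for h in all_hits if h["instance"] == "W11577"]))  # 34 actual matches
print(len(search_then_filter(all_hits, LUCENE_LIMIT)))          # only 7 survive the cap
```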

eroux commented 3 years ago

After a few experiments, here's a first step in that direction, even though it's just a small hack, let's call etextContentFacetGraphInInstance with a larger limit on the Lucene query, something like 4000 works well in this case: https://purl.bdrc.io/lib/etextContentFacetGraphInInstance?LG_NAME=bo-x-ewts&R_EINST=bdr%3AIE11577&L_NAME=%22rnam%20%27joms%22&LI_NAME=4000
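For reference, the workaround call can be assembled from its parameters; this sketch only reproduces the URL quoted above (the parameter names come from that URL, with LI_NAME being the raised Lucene result limit):

```python
from urllib.parse import urlencode, quote

# Parameters taken from the URL above; LI_NAME=4000 raises the Lucene limit.
params = {
    "LG_NAME": "bo-x-ewts",
    "R_EINST": "bdr:IE11577",
    "L_NAME": '"rnam \'joms"',
    "LI_NAME": 4000,
}
url = ("https://purl.bdrc.io/lib/etextContentFacetGraphInInstance?"
       + urlencode(params, quote_via=quote))  # quote_via=quote gives %20 for spaces
print(url)
```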

eroux commented 3 years ago

@xristy I'd be happy to have a little help here on how it works on tbrc.org: basically when I do the search I described above:

> search on the text rnam 'joms in gsung 'bum/_kun dga' bzang po (W11577) on tbrc.org yields 34 results

what happens behind the scenes in terms of Lucene indexes? What do the Lucene documents created by eXide look like? In other words, is W11577 written in the documents themselves (in addition to the UTxx IDs) so that the query matches them directly? Or does the Lucene query just return all possible results across all documents, with eXide then filtering on W11577?

berger-n commented 3 years ago

> After a few experiments, here's a first step in that direction, even though it's just a small hack, let's call etextContentFacetGraphInInstance with a larger limit on the Lucene query, something like 4000 works well in this case: https://purl.bdrc.io/lib/etextContentFacetGraphInInstance?LG_NAME=bo-x-ewts&R_EINST=bdr%3AIE11577&L_NAME=%22rnam%20%27joms%22&LI_NAME=4000

done: https://library.bdrc.io/search?q=%22rnam%20%27joms%22&lg=bo-x-ewts&t=Etext&r=bdr:IE11577

xristy commented 3 years ago

For W11577, tbrc indexes each page of each of the 22 volumes. Visiting W11577 and doing a "Search in eTexts" performs a search just in the volumes of W11577 or the selected volume.

I've looked a bit, and the results on tbrc.org vs library.bdrc.io appear identical except for vols 5, 19, and 20. There may be a few more hits in buda than in tbrc if counting highlighted occurrences, or a few less if counting chunks with at least one occurrence. A closer look is needed to see exactly what the diffs are.
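The two counting conventions mentioned here can give different totals for identical results; a minimal sketch with invented chunk data:

```python
# Each inner list holds the highlighted occurrences found in one chunk
# (hypothetical data, just to contrast the two conventions).
chunks = [
    ["rnam 'joms", "rnam 'joms"],  # chunk with two occurrences
    ["rnam 'joms"],                # chunk with one occurrence
    [],                            # chunk with no occurrence
]

occurrences = sum(len(c) for c in chunks)      # counting highlighted occurrences
matching_chunks = sum(1 for c in chunks if c)  # counting chunks with >= 1 occurrence

print(occurrences, matching_chunks)  # 3 vs 2 for the same underlying results
```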

It looks to me like buda is likely more accurate than tbrc.

eroux commented 3 years ago

Ah, thanks! The reason I'm asking is that I'm wondering what kind of Lucene query eXide makes to restrict the search to the volumes of W11577. On BUDA all the etexts of W11577 are in different graphs, so we can't restrict the query to W11577, which leads to two solutions:

but I'm thinking maybe the first technique could be applied better, in the same way it is in eXide... unfortunately I didn't save the SPARQL query I tried, but I should be able to reconstruct it

eroux commented 3 years ago

It must have been something along the lines of

construct {
  ?etext tmp:isMain true .
  ?etext bdo:eTextHasChunk ?chunk .
  ?R_EINST skos:prefLabel ?einstanceL .

  ?chunk bdo:chunkContents ?lit .
  ?chunk bdo:sliceStartChar ?startChar .
  ?chunk bdo:sliceEndChar ?endChar .
  ?chunk tmp:matchScore ?score .

  ?etext ?etextp ?etexto .

  ?etext tmp:maxScore ?maxScore .
  ?etext tmp:nbChunks ?nbchunks .
}
where
{
    # not openpecha
    ?etextg bdo:eTextInInstance bdr:IE1KG14 .
    ?etextadm adm:adminAbout ?etextg ;
              adm:graphId ?g .
    (?chunk ?score ?lit) text:query ( :chunkContents "\"rnam 'joms\""@bo-x-ewts 500 "highlight:" ) .
    ?etext bdo:eTextHasChunk ?chunk .
    VALUES ?etextp { skos:prefLabel bdo:eTextIsVolume bdo:eTextInVolume }
    ?etext ?etextp ?etexto .
    ?chunk bdo:sliceStartChar ?startChar .
    ?chunk bdo:sliceEndChar ?endChar .
}

note that I took bdr:IE1KG14 as it's an example with a particularly large number of etexts

xristy commented 3 years ago

In tbrc each volume is a document, ut11577_019.xml, the volumes are in a collection/dir, ut11577, which in turn is in a collection, GuruLamaWorks, which is in the top-level collection, eTextsChunked.

So it's just a matter of a for loop over all the docs in a given collection:

for $vol in collection("/db/eTextsChunked/GuruLamaWorks/UT11577")/tei:TEI
return (: do the search over $vol :)

This would be like having a list of graphIds for each volume and cycling over them:

bdr:W11577 bdo:hasEtextVolume ?graphId .
graph ?graphId {
    do search stuff
}

but I guess this sort of approach takes 8 minutes?
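The per-volume iteration sketched above can be expressed as a simple loop: fetch the graph id of each volume, then run the scoped search once per graph. `fetch_volume_graphs` and `search_in_graph` are hypothetical stand-ins for the two SPARQL queries; here they are stubbed with static data just to show the shape of the loop.

```python
def fetch_volume_graphs(instance):
    # Stand-in for: SELECT ?graphId WHERE { bdr:W11577 bdo:hasEtextVolume ?graphId }
    # W11577 has 22 volumes per the discussion above; graph names are invented.
    return [f"bdg:UT11577_{v:03d}" for v in range(1, 23)]

def search_in_graph(graph_id, query):
    # Stand-in for a text:query scoped inside GRAPH <graph_id> { ... }; stubbed.
    return []

def search_instance(instance, query):
    hits = []
    for g in fetch_volume_graphs(instance):  # one scoped query per volume
        hits.extend(search_in_graph(g, query))
    return hits

print(len(fetch_volume_graphs("bdr:W11577")))  # 22 volumes to cycle over
```

The open question above is exactly the cost of this loop: one scoped query per volume is fine for 22 volumes, but may not scale to instances with very many etexts.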

eroux commented 3 years ago

ah I see! grouping by volume should indeed work... at least when there isn't a crazy number of volumes... but the max is 380, so it's still below the number of texts in UT1KG14!

The approach that takes 8 min is the iteration over each etext, but there are many etexts per volume in the case of UT1KG14. I suppose it's something we should try. Let's see if we can do a new migration next week or early January with this and some improvements in the Lucene analyzer.