Open eroux opened 3 years ago
After a few experiments, here's a first step in that direction, even though it's just a small hack: let's call etextContentFacetGraphInInstance with a larger limit on the Lucene query; something like 4000 works well in this case: https://purl.bdrc.io/lib/etextContentFacetGraphInInstance?LG_NAME=bo-x-ewts&R_EINST=bdr%3AIE11577&L_NAME=%22rnam%20%27joms%22&LI_NAME=4000
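To see why raising the limit helps, here is a small Python sketch of the behaviour (the scores and instance IDs are synthetic, purely for illustration): the Lucene cap applies corpus-wide, so whether a given W11577 chunk survives depends on its score ranking against all other etexts.

```python
# Corpus-wide (instance, score) pairs for chunks matching the query.
# Synthetic data; real scores come from the Lucene index.
matches = [("W11577", 0.9), ("W00001", 0.8), ("W00002", 0.7), ("W11577", 0.1)]

def search_then_filter(limit):
    # global Lucene search capped at `limit`, then filtered on the instance
    top = sorted(matches, key=lambda m: m[1], reverse=True)[:limit]
    return [m for m in top if m[0] == "W11577"]

print(len(search_then_filter(3)))  # 1 -- the low-scoring W11577 chunk was cut
print(len(search_then_filter(4)))  # 2 -- a larger limit recovers it
```

With a cap of 3, the low-scoring W11577 chunk is dropped before the instance filter runs, which is why a limit of 4000 catches hits that a limit of 1000 misses.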
@xristy I'd be happy to have a little help here on how it works on tbrc.org: basically when I do the search I described above:
search on the text rnam 'joms in gsung 'bum/_kun dga' bzang po (W11577) on tbrc.org yields 34 results
What happens behind the scenes in terms of Lucene indexes? What do the Lucene documents created by eXide look like? In other words, is W11577 written in the documents themselves (in addition to the UTxx IDs), so that the query matches them directly? Or does the Lucene query return all possible results across all documents, with eXide then filtering on W11577?
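To make the two hypotheses concrete, here is a toy Python sketch; the field names and documents are invented for illustration, not BDRC's actual Lucene schema:

```python
# Hypothesis 1: the instance ID (W11577) is a field of each indexed document,
# so the query itself restricts the search. Hypothesis 2: the query matches
# across all documents and the caller filters afterwards. IDs are made up.
docs = [
    {"id": "UT29329_001#c3", "instance": "W29329", "contents": "rnam 'joms bsgom pa"},
    {"id": "UT11577_001#c1", "instance": "W11577", "contents": "rnam 'joms sgrub thabs"},
]

def restricted_query(term, instance):
    # conceptually a Lucene query like `+instance:W11577 +contents:"rnam 'joms"`
    return [d["id"] for d in docs
            if d["instance"] == instance and term in d["contents"]]

def filter_after_query(term, instance):
    # every matching document comes back first, then gets filtered
    hits = [d for d in docs if term in d["contents"]]
    return [d["id"] for d in hits if d["instance"] == instance]
```

Both return the same hits here, but the second has to pull the whole corpus's matches first, which is exactly where a result limit starts losing hits.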
done: https://library.bdrc.io/search?q=%22rnam%20%27joms%22&lg=bo-x-ewts&t=Etext&r=bdr:IE11577
For W11577, tbrc indexes each page of each of the 22 volumes. Visiting W11577 and doing a "Search in eTexts" performs a search just in the volumes of W11577 or the selected volume.
I've looked a bit, and the results on tbrc.org vs library.bdrc.io appear identical except for vols 5, 19, and 20. There may be a few more hits in BUDA than in tbrc if counting highlighted occurrences, or a few less if counting chunks with at least one occurrence. It will take a closer look to see what the diffs are.
It looks to me like BUDA is likely more accurate than tbrc.
Ah thanks! So, the reason I'm asking is that I'm wondering what kind of Lucene query eXide makes to restrict the search to the volumes of W11577. On BUDA all the etexts of W11577 are in different graphs, so we can't restrict the query to W11577, which leads to two solutions:
but I'm thinking maybe the first technique could be applied better, in the same way it is in eXide... I unfortunately didn't save the SPARQL query I tried but I should be able to reconstruct it
It must have been something along the lines of
construct {
  ?etext tmp:isMain true .
  ?etext bdo:eTextHasChunk ?chunk .
  ?R_EINST skos:prefLabel ?einstanceL .
  ?chunk bdo:chunkContents ?lit .
  ?chunk bdo:sliceStartChar ?startChar .
  ?chunk bdo:sliceEndChar ?endChar .
  ?chunk tmp:matchScore ?score .
  ?etext ?etextp ?etexto .
  ?etext tmp:maxScore ?maxScore .
  ?etext tmp:nbChunks ?nbchunks .
}
where {
  # not openpecha
  ?etextg bdo:eTextInInstance bdr:IE1KG14 .
  ?etextadm adm:adminAbout ?etextg ;
            adm:graphId ?g .
  (?chunk ?score ?lit) text:query ( :chunkContents "\"rnam 'joms\""@bo-x-ewts 500 "highlight:" ) .
  ?etext bdo:eTextHasChunk ?chunk .
  VALUES ?etextp { skos:prefLabel bdo:eTextIsVolume bdo:eTextInVolume }
  ?etext ?etextp ?etexto .
  ?chunk bdo:sliceStartChar ?startChar .
  ?chunk bdo:sliceEndChar ?endChar .
}
note that I took bdr:IE1KG14 as it's an example with a particularly large number of etexts
In tbrc each volume is a document, ut11577_019.xml, the volumes are in a collection/dir, ut11577, which in turn is in a collection, GuruLamaWorks, which is in the top-level collection, eTextsChunked.
So it's just a matter of a for loop over all the docs in a given collection:
for $vol in collection("/db/eTextsChunked/GuruLamaWorks/UT11577")/tei:TEI
do the search over $vol
This would be like having a list of graphIds for each volume and cycling over them:
bdr:W11577 bdo:hasEtextVolume ?graphId .
graph ?graphId {
  # do search stuff
}
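The graph-cycling idea can be simulated in plain Python (the graph IDs and chunk contents below are invented for the sketch; the real query would be a Lucene text:query scoped to each volume's graph):

```python
# Simulation of "a list of graphIds for each volume, cycling over them".
# Graph names and chunks are made up; search_graph() stands in for a
# Lucene query restricted to one volume's graph.
volumes = {  # graphId -> chunks stored in that volume's graph
    "bdg:UT11577_001": ["rnam 'joms sgrub thabs", "unrelated chunk"],
    "bdg:UT11577_002": ["rnam 'joms again", "more text"],
}

def search_graph(chunks, term, limit=1000):
    # the result limit now applies per volume, not to the whole corpus
    return [c for c in chunks if term in c][:limit]

results = {g: search_graph(chunks, "rnam 'joms")
           for g, chunks in volumes.items()}
```

Since the maximum is around 380 volumes, this means at most a few hundred scoped queries per search, rather than one query per etext as in the 8-minute variant.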
but I guess this sort of approach takes 8 minutes?
ah I see! grouping by volume should indeed work... at least when there isn't a crazy number of volumes... but the max is 380, so it's still below the number of texts in UT1KG14!
The approach that takes 8 min is iterating over each etext, but there are many etexts per volume in the case of UT1KG14. I suppose it's something we should try. Let's see if we can make a new migration next week or early January, with this and some improvements in the Lucene analyzer.
No action required now, but I'm tracking it here as it's a multi-faceted issue:
A search on the text rnam 'joms in gsung 'bum/_kun dga' bzang po (W11577) on tbrc.org yields 34 results, while the same on bdrc.io is a bit random: in that case it searches throughout all our etexts but limits the number of results in each Lucene search to 1000. For the OpenPechas this is not an issue, as the graphs are organized differently and the Lucene search is scoped to one graph... but here it is. There are 3 solutions: