eeditiones / tei-publisher-app

The main TEI Publisher app
https://teipublisher.com
GNU General Public License v3.0

[develop] incorrect facet counts when different namespaces create the same facet dimension #156

Closed tuurma closed 1 year ago

tuurma commented 1 year ago

How to reproduce

NB: on current master

http://localhost:8080/exist/apps/tei-publisher/index.html?query=better&language=en-GB&collection=&sort=title&field=text

This query produces 8 hits, 6 in TEI documents, 1 in DocBook (documentation.xml) and 1 in JATS (e-editiones-article.xml)

Nevertheless, the facet counts are tripled.


However, effectively the same query returns the correct facet counts:

declare namespace tei="http://www.tei-c.org/ns/1.0";
declare namespace dbk="http://docbook.org/ns/docbook";

let $hits :=
    collection('/db/apps/tei-publisher/data')//tei:text[ft:query(., 'better')] |
    collection('/db/apps/tei-publisher/data')//dbk:article[ft:query(., 'better')] |
    collection('/db/apps/tei-publisher/data')//body[ft:query(., 'better')]
return
    (count($hits), ft:facets($hits, 'genre', 100))

As far as I can see, query-metadata() in modules/query.xql retrieves exactly the same dataset, yet the result of calling ft:facets() differs: the facet counts appear to be merged. Even retrieving the facet counts in query-metadata() in query-tei.xql returns a count for the label Documentation, although no TEI document has this facet set. This happens if and only if similar calls to query-jats and query-db are also made.

In the example above the results are tripled; removing the indexes for e.g. JATS results in doubled counts.

wolfgangmm commented 1 year ago

Your query works because you are not querying a field. As soon as a field is used (e.g. "file:*"), the counts become wrong. How fields and facets should be connected here, I can't say. I'm also not sure whether this can be considered a bug in eXist's Lucene index or not. In general, fields should have a unique name, and we've been using the same name across different contexts. This might be wrong.

I changed the code to prefix the fields of each context (with e.g. 'jats.' or 'dbk.').
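
For comparison with the query in the report, the field-based variant mentioned above might look roughly like the sketch below. It is an illustration only. It assumes that a field named "file" is defined for the tei:text context in the collection configuration, and that the index still shares field names across contexts, i.e. without the prefixes. The $options map and its settings are assumptions, not taken from this thread.

```xquery
xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Assumption: leading wildcards must be enabled explicitly for a query like "file:*" :)
let $options := map { "leading-wildcard": "yes" }

(: Assumption: "file" is a field defined on tei:text in collection.xconf :)
let $hits :=
    collection('/db/apps/tei-publisher/data')//tei:text[ft:query(., 'file:*', $options)]
return
    (: Same facet call as in the report; compare the counts with count($hits) :)
    (count($hits), ft:facets($hits, 'genre', 100))
```

If the counts returned for 'genre' exceed count($hits) here but not in the field-free query from the report, that would be consistent with the explanation that shared field names are involved.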