Closed by kohsah 6 years ago
See line 35: https://github.com/gawati/gawati-data/commit/818126239d873f1c95f327fe1025b3b0b1e7128c#diff-179b204bdcd6475d855cff3218784e29R35
This query will ignore the index, since contains() is basically a substring match.
You have defined a Lucene index:
<lucene>
<analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
<analyzer id="ws" class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
<text qname="pages" analyzer="ws"/>
</lucene>
but the syntax used in the query:
document{$doc}/pdfft:pages/pdfft:page[contains(normalize-space(.),$term)]
is for a range index.
You need to use Lucene query syntax for the query processor to actually use the full-text index (see https://exist-db.org/exist/apps/doc/lucene.xml).
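As a rough sketch of what an index-backed version of the query could look like, the predicate can call `ft:query()` instead of `contains()`. The `pdfft` namespace URI, the function name `local:search-pages` and the document path are placeholders, not taken from the repository. As far as I know, eXist only consults the Lucene index when `ft:query()` is applied to the element the index is defined on, so this assumes the index is declared on `pdfft:page` (with the namespace prefix) rather than on `pages`.

```xquery
xquery version "3.1";

import module namespace ft = "http://exist-db.org/xquery/lucene";

(: Placeholder namespace URI: substitute the one used by the PDF-to-XML output. :)
declare namespace pdfft = "http://example.org/pdfft";

(: Hypothetical helper: return the pages of $doc that match $term.
   ft:query() is answered from the Lucene index, unlike contains(),
   which scans the text of every page. :)
declare function local:search-pages(
    $doc as document-node(),
    $term as xs:string
) as element()* {
    $doc/pdfft:pages/pdfft:page[ft:query(., $term)]
};

(: Placeholder path: point this at one of the generated page documents. :)
local:search-pages(doc("/db/path/to/pages.xml"), "teacher")
```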
I would also declare an ngram index on the page (see https://exist-db.org/exist/apps/doc/ngram.xml), since we have documents in non-English languages.
Updated the search query to combine results from the Lucene and ngram indexes. Lucene's whitespace analyser is case-sensitive, so I've switched to the standard analyser. The ngram index covers some of the cases that the standard analyser does not, such as substring matches (e.g. 'rs' within 'teachers'). The query also uses Lucene's phrase search.
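A minimal sketch of how such a combined query might be written, not necessarily the code that was committed. It assumes both a Lucene and an ngram index are defined on `pdfft:page`; the namespace URI and the function name are placeholders.

```xquery
xquery version "3.1";

import module namespace ft = "http://exist-db.org/xquery/lucene";
import module namespace ngram = "http://exist-db.org/xquery/ngram";

(: Placeholder namespace URI: substitute the one used by the PDF-to-XML output. :)
declare namespace pdfft = "http://example.org/pdfft";

(: Hypothetical helper: pages matching $term as a Lucene phrase
   or as a substring via the ngram index. :)
declare function local:search-pages(
    $doc as document-node(),
    $term as xs:string
) as element()* {
    let $pages := $doc/pdfft:pages/pdfft:page
    (: Double quotes make Lucene treat the input as a phrase;
       quotes inside the term are stripped so the query stays well formed. :)
    let $phrase := '"' || replace($term, '"', '') || '"'
    let $by-lucene := $pages[ft:query(., $phrase)]
    let $by-ngram := $pages[ngram:contains(., $term)]
    (: The union operator removes duplicates and returns document order. :)
    return $by-lucene | $by-ngram
};

(: Placeholder path: point this at one of the generated page documents. :)
local:search-pages(doc("/db/path/to/pages.xml"), "rs")
```

Other characters with special meaning in Lucene query syntax (for example `*`, `?` or `:`) are not escaped here and may need handling before the term is passed to `ft:query()`.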
Implemented, pending full test and merge
Currently, only the Akoma Ntoso XML metadata document is searchable in Gawati. We want the PDF to be searchable as well. Currently:
FRBRExpression/FRBRthis/@value
[1] . The XML document has info about its corresponding PDF document [2] via the componentRef element. [1] --[2] –
We want to index the PDF and connect it with the already-indexed Akoma Ntoso (AKN) XML metadata. To index the PDF we have PDF to XML, which produces a generic XML file (page by page) out of OCR-ed PDF documents; this file can be indexed and searched.
Step 1)
Iterate through each AKN document and, for each PDF associated with it, run PDF to XML. Connect the produced XML to the Akoma Ntoso XML document by introducing the
FRBRExpression/FRBRthis/@value
into it, so that it can be used as metadata to connect the two documents. Give the produced document a coherent name, in line with how the Akoma Ntoso metadata XML documents are named. The produced document has to be stored in the same collection in the XML database as the AKN XML metadata documents. Once this is done, move to Step 2.
Step 2)
Add index configurations for the produced XML documents. You will need to index the pages for full text, and the bridge metadata FRBRthis/@value with a range index (see https://exist-db.org/exist/apps/doc/indexing.xml).
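For illustration, a collection.xconf along these lines could cover the full-text, ngram and range indexes. This is a sketch under assumptions: the `pdfft` namespace URI is a placeholder, and the range index is declared on a generic `@value` attribute because the exact element carrying the bridge value in the produced XML is not specified here.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <!-- Placeholder namespace URI: substitute the one used by the PDF-to-XML output -->
    <index xmlns:pdfft="http://example.org/pdfft">
        <!-- Full-text index on each page, using the (case-insensitive) standard analyzer -->
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <text qname="pdfft:page"/>
        </lucene>
        <!-- N-gram index on the same element, for substring matches -->
        <ngram qname="pdfft:page"/>
        <!-- Range index on the bridge attribute (assumed to be named "value") -->
        <range>
            <create qname="@value" type="xs:string"/>
        </range>
    </index>
</collection>
```

The configuration only takes effect for documents stored after it is in place, so the collection would need to be reindexed if the page documents already exist.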
Step 3)
Create a search service that performs a full-text search for a particular IRI. You will need to add this service to https://github.com/gawati/gawati-data/. You can find many existing services defined in https://github.com/gawati/gawati-data/blob/dev/services/services.xql and https://github.com/gawati/gawati-data/blob/dev/services/services-json.xql. (Note: JSON and XML are just output formats in eXist; the internal format is always XML. You just set the serialization method and the output MIME type to JSON, and the service will output JSON instead of XML.)
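A rough sketch of what such a service could look like as a standalone query. It does not follow the structure of the existing services in services.xql, which would need to be checked. The request parameter names (`iri`, `term`), the collection path, the `pdfft` namespace URI and the `@iri` attribute holding the bridge value are all assumptions made for illustration.

```xquery
xquery version "3.1";

import module namespace request = "http://exist-db.org/xquery/request";
import module namespace ft = "http://exist-db.org/xquery/lucene";

declare namespace exist = "http://exist.sourceforge.net/NS/exist";
(: Placeholder namespace URI: substitute the one used by the PDF-to-XML output. :)
declare namespace pdfft = "http://example.org/pdfft";

(: The result is built as XML; this option makes eXist serialize it as JSON. :)
declare option exist:serialize "method=json media-type=application/json";

(: Hypothetical request parameters. :)
let $iri := request:get-parameter("iri", "")
let $term := request:get-parameter("term", "")

(: Assumes the bridge value is stored in an @iri attribute on pdfft:pages
   and that the page documents live under this placeholder collection. :)
let $hits :=
    collection("/db/path/to/data")/pdfft:pages[@iri = $iri]
        /pdfft:page[ft:query(., $term)]

return
    <results iri="{$iri}" term="{$term}">
    {
        for $page in $hits
        return <page>{ normalize-space($page) }</page>
    }
    </results>
```

Removing the `exist:serialize` option returns the same result as XML, which is the point made in the note above: only the serialization changes, not the query.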
Step 4)
Once the service is implemented, integrate it into the UI (https://github.com/gawati/gawati-portal-ui). Implement a search on the document page, e.g. https://alldev.gawati.org/#/doc/_lang/en/_iri/akn/ng/act/2014-09-08/hb_1302471/eng@/!main , which allows searching within the document. Add a tab called "Search" after "Metadata" that provides a search box and shows the full-text search results.