PDF full text search integrated with XML search

kohsah commented 6 years ago

Currently, only the Akoma Ntoso XML metadata document is searchable in Gawati. We want the PDF to be searchable as well. The current setup:

PDFs are stored on the file system – they are identified by the name and path.

Akoma Ntoso XML metadata for Gawati is stored in the XML database. Documents are identified by the Akoma Ntoso IRI metadata, primarily FRBRExpression/FRBRthis/@value [1]. The XML document has information about its corresponding PDF document via the componentRef element [2].

[1] –

<an:identification source="#gawati">
    <an:FRBRWork>
        <an:FRBRthis value="/akn/mr/act/1951-11-16/gn_no_214-1951/!main"/>
        <an:FRBRuri value="/akn/mr/act/1951-11-16/gn_no_214-1951"/>
        <an:FRBRdate name="Work Date" date="1951-11-16"/>
        <an:FRBRauthor href="#author"/>
        <an:FRBRcountry value="mr" showAs="Mauritania"/>
        <an:FRBRnumber value="gn_no_214-1951" showAs="GN No. 214/1951"/>
        <an:FRBRprescriptive value="false"/>
        <an:FRBRauthoritative value="false"/>
    </an:FRBRWork>
    <an:FRBRExpression>
        <an:FRBRthis value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main"/>
        <an:FRBRuri value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@"/>
        <an:FRBRdate name="Expression Date" date="1951-11-16"/>
        <an:FRBRauthor href="#author"/>
        <an:FRBRlanguage language="eng"/>
    </an:FRBRExpression>
    <an:FRBRManifestation>
        <an:FRBRthis value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main.xml"/>
        <an:FRBRuri value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/.akn"/>
        <an:FRBRdate name="Manifestation Date" date="2016-03-04"/>
        <an:FRBRauthor href="#author"/>
        <an:FRBRformat value="xml"/>
    </an:FRBRManifestation>
</an:identification>

[2] –

<an:body>
    <an:book refersTo="#mainDocument">
        <an:componentRef src="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main.pdf" alt="akn_mr_act_1951-11-16_gn_no_214-1951_eng_main.pdf" GUID="#embedded-doc-1" showAs="Electricity (Amendment) Regulations, 1951 (Amended)"/>
    </an:book>
</an:body>
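As a hedged illustration of how the two identifiers in [1] and [2] could be read programmatically, the Python sketch below pulls FRBRExpression/FRBRthis/@value and each componentRef's src and alt out of an AKN metadata document. The namespace URI bound to the `an` prefix (Akoma Ntoso 3.0), the wrapping `an:akomaNtoso` root, and the helper name `pdf_links` are assumptions; none of them is shown in this thread.

```python
import xml.etree.ElementTree as ET

# Assumption: the "an" prefix is bound to the Akoma Ntoso 3.0 namespace.
AKN_NS = {"an": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}


def pdf_links(akn_root):
    """Return (expression IRI, [(src, alt), ...]) for one AKN metadata document.

    Hypothetical helper, not part of gawati-data.
    """
    this = akn_root.find(".//an:FRBRExpression/an:FRBRthis", AKN_NS)
    if this is None:
        raise ValueError("no FRBRExpression/FRBRthis found")
    refs = [
        (ref.get("src"), ref.get("alt"))
        for ref in akn_root.iterfind(".//an:componentRef", AKN_NS)
    ]
    return this.get("value"), refs


# Trimmed-down copy of the fragments quoted above, wrapped in an assumed root.
SAMPLE = """
<an:akomaNtoso xmlns:an="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <an:identification source="#gawati">
    <an:FRBRExpression>
      <an:FRBRthis value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main"/>
    </an:FRBRExpression>
  </an:identification>
  <an:body>
    <an:book refersTo="#mainDocument">
      <an:componentRef src="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main.pdf"
                       alt="akn_mr_act_1951-11-16_gn_no_214-1951_eng_main.pdf"/>
    </an:book>
  </an:body>
</an:akomaNtoso>
"""

iri, refs = pdf_links(ET.fromstring(SAMPLE))
```

The returned IRI is the value that would later bridge the AKN metadata and the PDF full-text document, and `alt` gives the PDF's file name on disk.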

We want to index the PDF and connect it with the already indexed Akoma Ntoso (AKN) XML metadata. To index the PDF we have a PDF to XML tool, which produces a generic page-by-page XML file from OCR-ed PDF documents; that file can then be indexed and searched.

Step 1)

Iterate through each AKN document and, for each PDF associated with it, run PDF to XML. Connect the produced XML to the Akoma Ntoso XML document by adding the FRBRExpression/FRBRthis/@value to it, so that it can serve as the metadata linking the two documents. Name the produced document following a convention in line with how the Akoma Ntoso metadata XML documents are named. The produced document has to be stored in the same XML database collection as the AKN XML metadata documents. Once this is done, move to Step 2.
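A minimal sketch of the linking part of Step 1, in Python, might look like the following. It assumes the PDF to XML output has a `pages` root with `page` children (no namespace); the bridge element name `FRBRthis`, its placement as first child, and the `_fulltext.xml` suffix are all hypothetical choices for illustration, since the thread does not fix the output schema or the naming convention.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def link_to_akn(pages_root, expression_iri):
    """Insert a bridge element carrying the AKN expression IRI as first child.

    The element name and its position are hypothetical.
    """
    bridge = ET.Element("FRBRthis", {"value": expression_iri})
    pages_root.insert(0, bridge)
    return pages_root


def fulltext_name(pdf_alt):
    """Derive a file name from the componentRef @alt value.

    The "_fulltext.xml" suffix is a made-up convention for this sketch.
    """
    return PurePosixPath(pdf_alt).stem + "_fulltext.xml"


# Stand-in for the PDF to XML output of one document (assumed shape).
pages = ET.fromstring("<pages><page>Electricity Regulations</page></pages>")

linked = link_to_akn(pages, "/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main")
name = fulltext_name("akn_mr_act_1951-11-16_gn_no_214-1951_eng_main.pdf")
```

Keeping the IRI in a single attribute value is what makes it a natural target for the range index described in Step 2.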

Step 2)

Add index configurations for the produced XML documents. You will need to index the pages for full text, and the bridge metadata FRBRthis/@value with a range index (see https://exist-db.org/exist/apps/doc/indexing.xml).

Step 3)

Create a search service that runs a full-text search for a particular IRI. You will need to add this service to https://github.com/gawati/gawati-data/ . You can find many existing services defined in https://github.com/gawati/gawati-data/blob/dev/services/services.xql and https://github.com/gawati/gawati-data/blob/dev/services/services-json.xql (Note: JSON and XML are just output formats in eXist; the internal format is always XML. You just set the output method to json and the output mimetype to json, and the service will output JSON instead of XML.)

Step 4)

Once the service is implemented, integrate it into the UI (https://github.com/gawati/gawati-portal-ui ). Implement a search on the document page (e.g. https://alldev.gawati.org/#/doc/_lang/en/_iri/akn/ng/act/2014-09-08/hb_1302471/eng@/!main ) which allows searching within the document. Add a tab called "Search" after "Metadata" which provides a search box and shows the full-text search results in the tab.

kohsah commented 6 years ago

See line 35 : https://github.com/gawati/gawati-data/commit/818126239d873f1c95f327fe1025b3b0b1e7128c#diff-179b204bdcd6475d855cff3218784e29R35

This query will ignore the index, since contains() is basically a substring match.

You have defined a lucene index:

     <lucene>
        <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
        <analyzer id="ws" class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
        <text qname="pages" analyzer="ws"/>
     </lucene>

but the syntax used in that query:

document{$doc}/pdfft:pages/pdfft:page[contains(normalize-space(.),$term)]

is for a range index.

You need to use Lucene query syntax for the query processor to actually use the full-text index (see https://exist-db.org/exist/apps/doc/lucene.xml ).

I would also declare an ngram index on the page (see https://exist-db.org/exist/apps/doc/ngram.xml ), since we have documents in non-English languages.

ashwinibm commented 6 years ago

Updated the search query to combine results from the Lucene and ngram indexes. Lucene's whitespace analyser is case-sensitive, so I've switched to the standard analyser. Ngram covers some of the partial-word cases that the standard analyser does not, e.g. 'rs' within 'teachers'. The query also uses Lucene's phrase search.
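To make the difference concrete, here is a toy Python illustration of the three matching behaviours mentioned above. It is not how Lucene or eXist implement their analysers or the ngram index; the function names are hypothetical and the "standard" tokenizer is reduced to lowercasing plus splitting on non-word characters.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only; case is preserved."""
    return text.split()


def lowercase_tokens(text):
    """Rough stand-in for a standard analyser: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def ngram_contains(text, term):
    """Substring match expressed as a lookup in the set of len(term)-grams."""
    n = len(term)
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return term in grams


page = "The Teachers met on Monday"

# Whole-token lookup, case preserved: "teachers" does not equal "Teachers".
ws_hit = "teachers" in whitespace_tokens(page)

# Whole-token lookup after lowercasing: the word is found.
std_hit = "teachers" in lowercase_tokens(page)

# A fragment of a word is not a token, so token lookup misses it.
std_partial = "rs" in lowercase_tokens(page)

# The n-gram lookup finds the fragment inside "Teachers".
ngram_partial = ngram_contains(page, "rs")
```

Combining token-based and n-gram results, as the updated query does, covers both whole-word and partial-word matches.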

kohsah commented 6 years ago

Implemented, pending full test and merge
