
Experiment with supporting eXist-db #1570

Open benjamingeer opened 4 years ago

benjamingeer commented 4 years ago

Find paragraphs containing the word "paphlagonian" marked as a noun:

xquery version "3.1";

for $par in collection("/db/books")//p[.//noun[starts-with(lower-case(.), "paphlagonian")]]
return <result doc="{util:document-name(root($par))}">{$par}</result>

But this returns two <result> elements for the same document. How can I return one <result> (possibly containing multiple matching paragraphs) per matching document?

tobiasschweizer commented 4 years ago

I had a similar query:

<regions>
    <h1>{ count(collection("...?select=Med_*.xml")//region) } regions defined </h1>
    <ul> {
        for $med in collection("...?select=Med_*.xml")/meditatio
        order by number($med/@id)
        return
            <li> Meditatio {data($med/@id)} <ul> {
                for $reg in $med//region
                return <li id="{data($reg/@id)}"> {data($reg/@name)} </li>
            } </ul></li>
    } </ul>
</regions>

I just iterate over the documents' root elements (one root element per document). However, within each root element there could still be several occurrences of the same element.

tobiasschweizer commented 4 years ago

What about group by?

https://stackoverflow.com/questions/14030255/xquery-group-by-and-count

benjamingeer commented 4 years ago

Thanks, group by does it!

xquery version "3.1";

for $par in collection("/db/books")//p[.//noun[starts-with(lower-case(.), "paphlagonian")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

benjamingeer commented 4 years ago

Search for paphlagonian as an adjective and soul as a noun, in the same paragraph (https://github.com/dhlab-basel/knora-large-texts/issues/2#issuecomment-541031095):

xquery version "3.1";

for $par in collection("/db/books")//p[.//adj[starts-with(lower-case(.), "paphlagonian")] and .//noun[starts-with(lower-case(.), "soul")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

benjamingeer commented 4 years ago

Uploading books to eXist-db:

[info] Uploaded wealth-of-nations (6135800 bytes, 7885 ms)
[info] Uploaded the-count-of-monte-cristo (8071112 bytes, 10667 ms)
[info] Uploaded sherlock-holmes (1793512 bytes, 2358 ms)
[info] Uploaded federalist-papers (3189160 bytes, 5162 ms)
[info] Uploaded pride-and-prejudice (1933853 bytes, 2134 ms)
[info] Uploaded the-idiot (4063305 bytes, 5206 ms)
[info] Uploaded jane-eyre (2974229 bytes, 3780 ms)
[info] Uploaded king-james-bible (12533982 bytes, 15746 ms)
[info] Uploaded anna-karenina (5898594 bytes, 7536 ms)
[info] Uploaded annals-of-the-turkish-empire (3204177 bytes, 4245 ms)
[info] Uploaded madame-bovary (1978766 bytes, 2373 ms)
[info] Uploaded hard-times (1807815 bytes, 2160 ms)
[info] Uploaded a-tale-of-two-cities (2291963 bytes, 2671 ms)
[info] Uploaded the-city-of-god-vol-1 (3716365 bytes, 4989 ms)
[info] Uploaded ulysses (4662436 bytes, 6224 ms)
[info] Uploaded complete-works-of-shakespeare (16305050 bytes, 21421 ms)
[info] Uploaded the-canterbury-tales (4565606 bytes, 6269 ms)
[info] Uploaded the-city-of-god-vol-2 (3928595 bytes, 5195 ms)
[info] Uploaded mysterious-island (3362922 bytes, 4662 ms)
[info] Uploaded the-adventures-of-huckleberry-finn (1786403 bytes, 2184 ms)
[info] Uploaded notre-dame-de-paris (3239650 bytes, 4552 ms)
[info] Uploaded maupassant-stories (7934518 bytes, 10532 ms)
[info] Uploaded twenty-years-after (4393927 bytes, 5841 ms)
[info] Uploaded war-and-peace (9535752 bytes, 12614 ms)
[info] Uploaded don-quixote (6812100 bytes, 8834 ms)
[info] Uploaded ivanhoe (3327831 bytes, 4356 ms)
[info] Uploaded gullivers-travels (1671263 bytes, 1966 ms)
[info] Uploaded rizal (2940565 bytes, 3873 ms)
[info] Uploaded plato-republic (3387799 bytes, 4281 ms)
[info] Uploaded from-the-earth-to-the-moon (1599378 bytes, 1823 ms)
[info] Uploaded plutarch-lives (11621868 bytes, 14582 ms)
[info] Uploaded our-mutual-friend (5671376 bytes, 7301 ms)
[info] Uploaded little-dorrit (5675385 bytes, 7285 ms)
[info] Uploaded moby-dick (3645883 bytes, 4744 ms)
[info] Uploaded great-expectations (2980590 bytes, 3832 ms)
[info] Uploaded les-miserables (9701154 bytes, 12816 ms)
[info] Uploaded swanns-way (3029412 bytes, 4114 ms)
[info] Uploaded the-iliad (3366140 bytes, 4433 ms)
[info] Uploaded mahabarata-vol-3 (13768426 bytes, 18576 ms)
[info] Uploaded dracula (2469943 bytes, 2851 ms)
[info] Uploaded mahabarata-vol-2 (11663472 bytes, 15427 ms)
[info] Uploaded little-women (2977054 bytes, 3855 ms)
[info] Uploaded mahabarata-vol-1 (10869925 bytes, 14494 ms)
[info] Uploaded emma (2568815 bytes, 2827 ms)
[info] Uploaded grimms-fairy-tales (1648818 bytes, 1928 ms)
[info] Uploaded mahabarata-vol-4 (7486990 bytes, 10005 ms)
[info] Uploaded the-scarlet-letter (1437007 bytes, 1619 ms)
[info] Uploaded wuthering-heights (1855275 bytes, 1994 ms)
[info] Uploaded the-brothers-karamazov (5869995 bytes, 7730 ms)

benjamingeer commented 4 years ago

With all the books uploaded, the query in https://github.com/dasch-swiss/knora-api/issues/1570#issuecomment-571480042 takes 8 seconds. Knora did it in 1 second, using Lucene to optimise the query. I'm going to see if I can do a similar optimisation in eXist-db.

benjamingeer commented 4 years ago

It looks like eXist-db can use Lucene to optimise the query, but there's a limitation: you have to configure the Lucene index (in a configuration file), specifying the names of the XML elements whose content you want to index:

It is important to make sure to choose the right context for an index, which has to be the same as in your query.

http://exist-db.org/exist/apps/doc/lucene.xml

So for example, I could create an index on all <p> elements. This would mean that we would need to update eXist's Lucene configuration for each project.
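
For example, a minimal collection.xconf sketch for /db/books, indexing the content of <p> elements (and the <noun>, <verb>, and <adj> elements used below), could look like this. In eXist, such a configuration document is stored under /db/system/config/db/books; the analyzer choice here is just an assumption:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <!-- Assumed analyzer; any Lucene analyzer class can be configured. -->
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <!-- Index the text content of these elements. -->
            <text qname="p"/>
            <text qname="noun"/>
            <text qname="verb"/>
            <text qname="adj"/>
        </lucene>
    </index>
</collection>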

benjamingeer commented 4 years ago

Or I guess I could just make an index on the root XML element, which is in effect what Knora does.

benjamingeer commented 4 years ago

Ah, wait a minute, these configuration "files" are actually XML documents. So it wouldn't be a problem to create one per project.

http://exist-db.org/exist/apps/doc/indexing#idxconf

benjamingeer commented 4 years ago

With Lucene indexes on <noun>, <verb>, and <adj>, uploading is somewhat slower:

[info] Uploaded wealth-of-nations (6135800 bytes, 9534 ms)
[info] Uploaded the-count-of-monte-cristo (8071112 bytes, 13063 ms)
[info] Uploaded sherlock-holmes (1793512 bytes, 2314 ms)
[info] Uploaded federalist-papers (3189160 bytes, 4280 ms)
[info] Uploaded pride-and-prejudice (1933853 bytes, 2348 ms)
[info] Uploaded the-idiot (4063305 bytes, 6440 ms)
[info] Uploaded jane-eyre (2974229 bytes, 4022 ms)
[info] Uploaded king-james-bible (12533982 bytes, 17571 ms)
[info] Uploaded anna-karenina (5898594 bytes, 8371 ms)
[info] Uploaded annals-of-the-turkish-empire (3204177 bytes, 5129 ms)
[info] Uploaded madame-bovary (1978766 bytes, 2417 ms)
[info] Uploaded hard-times (1807815 bytes, 2229 ms)
[info] Uploaded a-tale-of-two-cities (2291963 bytes, 2831 ms)
[info] Uploaded the-city-of-god-vol-1 (3716365 bytes, 5511 ms)
[info] Uploaded ulysses (4662436 bytes, 6785 ms)
[info] Uploaded complete-works-of-shakespeare (16305050 bytes, 26064 ms)
[info] Uploaded the-canterbury-tales (4565606 bytes, 6487 ms)
[info] Uploaded the-city-of-god-vol-2 (3928595 bytes, 5798 ms)
[info] Uploaded mysterious-island (3362922 bytes, 4635 ms)
[info] Uploaded the-adventures-of-huckleberry-finn (1786403 bytes, 2410 ms)
[info] Uploaded notre-dame-de-paris (3239650 bytes, 4527 ms)
[info] Uploaded maupassant-stories (7934518 bytes, 11952 ms)
[info] Uploaded twenty-years-after (4393927 bytes, 6188 ms)
[info] Uploaded war-and-peace (9535752 bytes, 14286 ms)
[info] Uploaded don-quixote (6812100 bytes, 10358 ms)
[info] Uploaded ivanhoe (3327831 bytes, 5505 ms)
[info] Uploaded gullivers-travels (1671263 bytes, 2096 ms)
[info] Uploaded rizal (2940565 bytes, 4087 ms)
[info] Uploaded plato-republic (3387799 bytes, 4579 ms)
[info] Uploaded from-the-earth-to-the-moon (1599378 bytes, 2489 ms)
[info] Uploaded plutarch-lives (11621868 bytes, 16800 ms)
[info] Uploaded our-mutual-friend (5671376 bytes, 8805 ms)
[info] Uploaded little-dorrit (5675385 bytes, 9305 ms)
[info] Uploaded moby-dick (3645883 bytes, 5393 ms)
[info] Uploaded great-expectations (2980590 bytes, 4152 ms)
[info] Uploaded les-miserables (9701154 bytes, 14922 ms)
[info] Uploaded swanns-way (3029412 bytes, 4086 ms)
[info] Uploaded the-iliad (3366140 bytes, 4902 ms)
[info] Uploaded mahabarata-vol-3 (13768426 bytes, 21575 ms)
[info] Uploaded dracula (2469943 bytes, 3168 ms)
[info] Uploaded mahabarata-vol-2 (11663472 bytes, 17800 ms)
[info] Uploaded little-women (2977054 bytes, 4300 ms)
[info] Uploaded mahabarata-vol-1 (10869925 bytes, 16502 ms)
[info] Uploaded emma (2568815 bytes, 3275 ms)
[info] Uploaded grimms-fairy-tales (1648818 bytes, 2074 ms)
[info] Uploaded mahabarata-vol-4 (7486990 bytes, 11513 ms)
[info] Uploaded the-scarlet-letter (1437007 bytes, 1790 ms)
[info] Uploaded wuthering-heights (1855275 bytes, 2255 ms)
[info] Uploaded the-brothers-karamazov (5869995 bytes, 8914 ms)

benjamingeer commented 4 years ago

Searches optimised with project-specific Lucene indexes

Search for <adj>Paphlagonian</adj>: 1 result in 0.05 seconds:

for $par in collection("/db/books")//p[.//adj[ft:query(., "paphlagonian")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

Search for <noun>soul</noun>: 48 results in 1 second:

for $par in collection("/db/books")//p[.//noun[ft:query(., "soul")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

Search for <adj>Paphlagonian</adj> and <noun>soul</noun> in the same paragraph: 1 result in 1 second:

for $par in collection("/db/books")//p[.//adj[ft:query(., "paphlagonian")] and .//noun[ft:query(., "soul")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

Search for <adj>full</adj>: 49 results in 800 ms:

for $par in collection("/db/books")//p[.//adj[ft:query(., "full")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

tobiasschweizer commented 4 years ago

So is it more efficient than Knora?

benjamingeer commented 4 years ago

Search for <adj>full</adj> and <noun>Euchenor</noun> in the same paragraph: query never terminates, uses 100% CPU, and results in an OutOfMemoryError:

for $par in collection("/db/books")//p[.//adj[ft:query(., "full")] and ..//noun[ft:query(., "Euchenor")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "db.exist.scheduler.quartz-scheduler_QuartzSchedulerThread"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Scanner-0"

The shutdown.sh script and Control-C have no effect. After stopping the process with kill -9, and restarting it, eXist displays a progress meter that stays stuck like this for a while:

Redo  [================================================= ] (98 %)

benjamingeer commented 4 years ago

eXist then eventually restarts. Aha, now I see that there is a bug in my query (.. instead of .).

benjamingeer commented 4 years ago

Search for full and Euchenor in the same paragraph (fixed query): 1 result in 1.6 seconds:

for $par in collection("/db/books")//p[.//adj[ft:query(., "full")] and .//noun[ft:query(., "Euchenor")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>

This is the query that was very slow in Knora here: https://github.com/dhlab-basel/knora-large-texts/issues/2#issuecomment-541052045

benjamingeer commented 4 years ago

So is it more efficient than Knora?

Yes, in this use case:

To make this efficient in eXist, you have to configure a Lucene index specifically for each tag that you want to search for. Even so, it's not lightning-fast: a typical query takes 1-2 seconds. But this seems OK to me.

Probably if I split up each book into many small fragments (a few pages each) and stored the texts in the triplestore that way, Gravsearch would be able to do these kinds of queries more efficiently. In practice, I think there are other good reasons for doing that (e.g. it makes it easier to display and edit the texts).

benjamingeer commented 4 years ago

If I split each book into small fragments (e.g. 1000 words each), I can make a structure where each Book resource has hundreds of links to BookFragment resources (via a hasFragment property), each with a seqnum. But then the API would need to provide a way to page through the values of hasFragment (sorted by seqnum) when getting the Book resource.
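
At the SPARQL level, paging through a book's fragments might look something like this sketch (the property names ex:hasFragment and ex:seqnum and the IRIs are illustrative, not the real ontology):

# Hypothetical paging query; the prefix, properties, and book IRI are illustrative.
PREFIX ex: <http://example.org/ontology#>

SELECT ?fragment ?seqnum
WHERE {
  <http://example.org/books/the-iliad> ex:hasFragment ?fragment .
  ?fragment ex:seqnum ?seqnum .
}
ORDER BY ?seqnum
LIMIT 25
OFFSET 0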

benjamingeer commented 4 years ago

Also, it takes a lot longer to import all this data into Knora than to import it into eXist. I'm importing the same books as in https://github.com/dasch-swiss/knora-api/issues/1570#issuecomment-571533227 into Knora (fragmented), and it looks like it's going to take about 3 hours.

benjamingeer commented 4 years ago

To make the import faster, I added a config option to prevent Knora from verifying the data after it's written.

benjamingeer commented 4 years ago

I did some tests with the books split into 1000-word fragments, and it doesn't make knora-api:matchInStandoff any faster, because if you search for a common word, the query still has to check all occurrences, whether they're in a single text value or in many text values. I think the only way to make it faster would be to create a custom Lucene index for each tag, as you can in eXist.

tobiasschweizer commented 4 years ago

Would the Lucene connector allow for more flexibility?

benjamingeer commented 4 years ago

Looking at that now.

benjamingeer commented 4 years ago

It looks like the GraphDB Lucene connector can't do this. It just indexes whatever strings you give it, but in this case we would want to index substrings at specific positions, and it doesn't seem to have that feature.

http://graphdb.ontotext.com/documentation/standard/lucene-graphdb-connector.html#adding-updating-and-deleting-data

The only way I can see to do this would be to store the substrings themselves in the standoff tags. But this would mean duplicating a lot of text in the triplestore, and would make importing data even slower.

benjamingeer commented 4 years ago

A drawback of XQuery is that I don't see any way to search within overlapping hierarchies. For example, given this document from one of our tests:

<?xml version="1.0" encoding="UTF-8"?>
<lg xmlns="http://www.example.org/ns1" xmlns:ns2="http://www.example.org/ns2">
 <l>
  <seg foo="x" ns2:bar="y">Scorn not the sonnet;</seg>
  <ns2:s sID="s02"/>critic, you have frowned,</l>
 <l>Mindless of its just honours;<ns2:s eID="s02"/>
  <ns2:s sID="s03"/>with this key</l>
 <l>Shakespeare unlocked his heart;<ns2:s eID="s03"/>
  <ns2:s sID="s04"/>the melody</l>
 <l>Of this small lute gave ease to Petrarch's wound.<ns2:s eID="s04"/>
 </l>
</lg>

I can't figure out any way to find the ns2:s element containing "key" or "Shakespeare".
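
The best idea so far is a brute-force scan using the node-order operators << and >>, with no index support at all (a sketch; the document path is an assumption, and the milestones are assumed to pair up via sID/eID as in the example):

xquery version "3.1";

declare namespace ns2 = "http://www.example.org/ns2";

(: For each start milestone, find its matching end milestone, collect the
   text nodes between them in document order, and test the concatenation.
   The document path is an assumption. :)
for $start in doc("/db/books/sonnet.xml")//ns2:s[@sID]
let $end := root($start)//ns2:s[@eID = $start/@sID]
let $text := string-join(root($start)//text()[. >> $start and . << $end], "")
where contains($text, "Shakespeare")
return <s id="{$start/@sID}">{$text}</s>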

benjamingeer commented 4 years ago

I think I found an eXist-db function for this:

http://exist-db.org/exist/apps/fundocs/view.html?uri=http://exist-db.org/xquery/util&location=java:org.exist.xquery.functions.util.UtilModule&details=true#get-fragment-between.4
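
Based on the linked signature, usage would presumably look like this (a sketch; the document path and milestone IDs are assumptions):

xquery version "3.1";

declare namespace ns2 = "http://www.example.org/ns2";

(: Extract the fragment between the paired s03 milestones; the two boolean
   arguments ask for a wrapped fragment and the root namespace. :)
let $doc := doc("/db/books/sonnet.xml")
return util:get-fragment-between(
    $doc//ns2:s[@sID = "s03"],
    $doc//ns2:s[@eID = "s03"],
    true(),
    true()
)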

benjamingeer commented 4 years ago

That function seems to be buggy:

https://github.com/eXist-db/exist/issues/2316

benjamingeer commented 4 years ago

More implementations here:

https://wiki.tei-c.org/index.php/Milestone-chunk.xquery

benjamingeer commented 4 years ago

In any case, I would expect this to be very slow, because you can't make a Lucene index for the content between two milestones, only for the content of an ordinary XML element.

benjamingeer commented 4 years ago

Full-text search in different open-source XML databases:

XML database   Full-text search
eXist-db       implementation-specific full-text search feature based on Lucene
BaseX          W3C XQuery and XPath Full Text 1.0
Zorba          W3C XQuery and XPath Full Text 1.0

This means that to support multiple XML databases, we would need to develop something like Gravsearch for XQuery, and generate the appropriate syntax for each database.
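
For example, the same search needs engine-specific syntax (a sketch; the eXist form matches the queries above, the other uses the W3C Full Text syntax):

(: eXist-db: implementation-specific Lucene extension function :)
for $par in collection("/db/books")//p[.//noun[ft:query(., "soul")]]
return $par

(: BaseX or Zorba: W3C XQuery and XPath Full Text 1.0 syntax :)
for $par in collection("/db/books")//p[.//noun[. contains text "soul"]]
return $par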

There is also an XQuery and XPath Full Text 3.0 spec, but I haven't found any implementations.

benjamingeer commented 4 years ago

MarkLogic is a commercial database server that supports both RDF and XML. You can mix SPARQL and XQuery in a single query:

https://docs.marklogic.com/guide/semantics/semantic-searches#id_77935
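
A sketch of what that mixing looks like (the predicate IRI and literal are illustrative; sem:sparql returns each SELECT solution as a map):

xquery version "1.0-ml";

import module namespace sem = "http://marklogic.com/semantics"
    at "/MarkLogic/semantics.xqy";

(: Run a SPARQL SELECT from XQuery and consume the bindings in XQuery.
   The predicate and literal are illustrative, not real data. :)
for $solution in sem:sparql('
    SELECT ?book
    WHERE { ?book <http://example.org/hasTitle> "The Iliad" }
')
return map:get($solution, "book")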

adamretter commented 4 years ago

@benjamingeer there is also an RDF+SPARQL plugin for eXist-db that integrates Apache Jena: https://github.com/ljo/exist-sparql

You might also be interested in FusionDB (https://www.fusiondb.com) as an alternative to eXist-db; it is 100% API compatible with eXist-db.

lrosenth commented 4 years ago

Very interesting!

Thanks!
