xsl "document" function hangs when given an internal path to a moderately sized XML file

paulmer commented 5 years ago

What is the problem

The transform:transform function, called with an XSL transform that reads a document using the "document" function, appears to bypass any eXist indexes if the document is referenced with a database path ("/db/.....") instead of an xmldb URL ("xmldb:exist:///db/...."). This causes very poor performance and makes the database appear to hang if the document is even moderately sized (5000-6000 nodes, 200KB).

What did you expect

I expected roughly the same response time when evaluating the same XSL transformation on document("/db/a.xml") and document("xmldb:exist:///db/a.xml"), but instead I see response times of 22 seconds vs. 85 milliseconds.

Describe how to reproduce or add a test

Start with an empty database (I deleted the data directory completely.)
Start the eXist client and connect to an embedded database.
Load the attached documents (test.zip) to create the collection /db/test containing data.xml and query.xqy.
Open query.xqy and edit the value of $xmldbPrefixOpt. Setting the value to 1 prepends "xmldb:exist://" to the file path that's used in the XSL transformation; setting it to any other value omits that prefix.
Run the script. It performs a simple query on data.xml to find an element with a specific xml:id attribute. The query is run twice, once through an XSL transform passed to transform:transform and once in native XQuery, and the number of nodes found as well as the path used for the document are returned.

On my system, running with $xmldbPrefixOpt set to 1 completes in 0.1 to 1 second (depending on the cache state), while running with $xmldbPrefixOpt set to 2 takes about 22 seconds.

Context information

eXist-db version: 5.0.0
Java version: 1.8.0_191
Operating system: macOS Sierra, Catalina, and RHEL 7 64 bit
How is eXist-db installed? JAR installer
Any custom changes in e.g. conf.xml: Altered sync time to 3 seconds, shutdown time to 4 seconds, added preserve-whitespace-mixed-content, disabled transformer caching and enabled full text indexing on attributes.

adamretter commented 5 years ago

Indexes are never used within XSLT. eXist-db uses the Saxon XSLT engine for XSLT, and Saxon is not aware of the eXist-db indexes. I imagine what you are seeing here is some sort of URI resolution problem.

paulmer commented 5 years ago

Wouldn't that still be an eXist (or a Saxon-eXist integration) problem? I'll look at my actual use case more carefully (cross-referencing between two documents in an XSL transformation that formats search results), but this came to my attention due to the extraordinarily poor performance in comparison to the much older version of eXist (2.2) I'm still running in production.

adamretter commented 5 years ago

@paulmer Yes, I think it is still a bug. I just wanted to point out that it's not related to indexes AFAIK.

paulmer commented 5 years ago

Got it, sorry, my misunderstanding. I pulled out a debugger and I see now that there's a linear traversal over the document that's excruciating. It's pulling only 25 - 50 nodes per second in my real-life case where there are about 130,000 nodes in the document, and it looks like about 90% of the time (based on logging timestamps, not any real profiling) is within the StoredNode.getNextSibling call chain. I guess there's an optimization opportunity there. :-)
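To illustrate the traversal pattern described above, here is a minimal sketch in plain JDK DOM (not eXist's StoredNode, whose internals are not shown in this thread). It visits every node through getFirstChild/getNextSibling, so the call count grows with the number of nodes; if each getNextSibling call has to do non-trivial work, as the timings above suggest for the persistent DOM, that cost is paid once per node. The class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SiblingWalkSketch {

    /**
     * Counts the nodes in the subtree rooted at the given node (the node itself
     * included, attributes excluded) by following child and sibling links.
     */
    static int countNodes(Node node) {
        int count = 1;
        // One getNextSibling call per child: this is the step the thread
        // reports as dominating the run time on the stored document.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    /** Parses an XML string into an in-memory DOM using the JDK parser. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc><a>x</a><b/><c>y</c></doc>");
        System.out.println("nodes visited: " + countNodes(doc));
    }
}
```

With an in-memory DOM each step is cheap, so this only shows the shape of the walk, not the slowdown itself.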

joewiz commented 5 years ago

Sounds a bit like https://gitlab.existsolutions.com/tei-publisher/tei-publisher-app/issues/92#note_3687

paulmer commented 5 years ago

I have an idea to address this issue I'd like to run by all the experienced eXist developers.

I see that eXist is handing Saxon a DOMSource object with the document as the source root when it resolves an internal XML document (org.exist.xslt.EXistURIResolver.databaseSource(...)), and ultimately Saxon uses DOM calls to traverse the node graph. Somewhere in that call tree is where the performance penalty is arising.

But I'm finding that processing the same document as the main XML document in a transformation is many orders of magnitude faster, because the main document is read using a serializer's toSAX method. What would be the drawback of returning a customized SAXSource which creates an XMLReader that uses a serializer to process the document?

I hastily cobbled together a couple of test classes and it seems to work very well in terms of speed (my "hanging" test case is down to 300ms), but beyond the basics of creating a serializer and calling toSAX in the XMLReader's parse method, I suspect I'm missing something. Are there locking concerns, concurrency issues or other concerns that would make this method impractical?
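For discussion, the general shape of that idea can be sketched with JDK classes only. This is not the test code mentioned above and it does not use eXist's serializer API: SaxEmitter is a hypothetical stand-in for "something with a toSAX method", the demo emitter produces a tiny hard-coded document, and locking and broker handling (the open questions) are deliberately left out.

```java
import java.io.IOException;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class SaxSourceSketch {

    /** Hypothetical stand-in for a serializer: pushes one document as SAX events. */
    interface SaxEmitter {
        void toSAX(ContentHandler handler) throws SAXException;
    }

    /** XMLReader whose parse() ignores its input and replays the emitter's events. */
    static class EmitterReader extends XMLFilterImpl {
        private final SaxEmitter emitter;

        EmitterReader(SaxEmitter emitter) {
            this.emitter = emitter;
        }

        @Override
        public void parse(InputSource input) throws SAXException, IOException {
            // XMLFilterImpl forwards ContentHandler callbacks to whatever handler
            // the consumer (the XSLT processor) registered via setContentHandler.
            emitter.toSAX(this);
        }
    }

    /** Demo emitter; a real one would stream the stored document instead. */
    static SaxEmitter demoEmitter() {
        return handler -> {
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "id", "id", "CDATA", "a1");
            char[] text = "hello".toCharArray();
            handler.startDocument();
            handler.startElement("", "doc", "doc", new AttributesImpl());
            handler.startElement("", "item", "item", attrs);
            handler.characters(text, 0, text.length);
            handler.endElement("", "item", "item");
            handler.endElement("", "doc", "doc");
            handler.endDocument();
        };
    }

    /** Resolver that returns a SAXSource instead of a DOMSource. */
    static URIResolver resolver() {
        return (href, base) -> new SAXSource(new EmitterReader(demoEmitter()), new InputSource(href));
    }

    /** Resolves href and copies the result through the JDK identity transformer. */
    static String render(String href) throws Exception {
        Source source = resolver().resolve(href, null);
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("/db/test/data.xml"));
    }
}
```

The consumer never walks a DOM here; it just receives events, which is the same path the main input document already takes according to the observation above.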

adamretter commented 5 years ago

@paulmer Where do you see the DOMSource stuff? Perhaps it is easier to discuss on Slack? https://join.slack.com/t/exist-db/shared_invite/enQtNjQ4MzUyNTE4MDY3LWNkYjZjMmZkNWQ5MDBjODQ3OTljNjMyODkwNmY1MzQwNjUwZjMzZTY1MGJkMjY5NDFhOWZjMDZiMDdhMzY4NGY

paulmer commented 5 years ago

@adamretter The DOMSource is created and returned on lines 211-213 of exist-core/src/main/java/org/exist/xslt/EXistURIResolver.java. I've joined the eXist slack workspace, so feel free to message me there.
