Open paulmer opened 5 years ago
Indexes are never used within XSLT. eXist-db uses the Saxon XSLT engine for XSLT, and Saxon is not aware of the eXist-db indexes. I imagine what you are seeing here is some sort of URI resolution problem.
Wouldn't that still be an eXist (or a Saxon-eXist integration) problem? I'll look at my actual use case more carefully (cross-referencing between two documents in an XSL transformation that formats search results), but this came to my attention due to the extraordinarily poor performance in comparison to the much older version of eXist (2.2) I'm still running in production.
@paulmer Yes, I think it is still a bug. I just wanted to point out that it's not related to indexes AFAIK.
Got it, sorry, my misunderstanding. I pulled out a debugger and I see now that there's a linear traversal over the document that's excruciating. It's pulling only 25 - 50 nodes per second in my real-life case where there are about 130,000 nodes in the document, and it looks like about 90% of the time (based on logging timestamps, not any real profiling) is within the StoredNode.getNextSibling call chain. I guess there's an optimization opportunity there. :-)
I have an idea to address this issue I'd like to run by all the experienced eXist developers.
I see that eXist is handing Saxon a DOMSource object with the document as the source root when it resolves an internal XML document (org.exist.xslt.EXistURIResolver.databaseSource(...)), and ultimately Saxon uses DOM calls to traverse the node graph. Somewhere in that call tree is where the performance penalty is arising.
But I'm finding that processing the same document as the main XML document in a transformation is many orders of magnitude faster, because the main document is read using a serializer's toSAX method. What would be the drawback of returning a customized SAXSource which creates an XMLReader that uses a serializer to process the document?
I hastily cobbled together a couple of test classes and it seems to work very well in terms of speed (my "hanging" test case is down to 300ms), but beyond the basics of creating a serializer and calling toSAX in the XMLReader's parse method, I suspect I'm missing something. Are there locking concerns, concurrency issues or other concerns that would make this method impractical?
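To make the shape of that proposal concrete, here is a minimal sketch that uses only JDK JAXP and SAX classes. It is not eXist's code and not the test classes mentioned above: SaxEmitter is a hypothetical stand-in for whatever serializer call streams the stored document as SAX events, EmitterReader and SaxSourceSketch are made-up names, and locking and transaction handling (the open questions above) are deliberately left out.

```java
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

class SaxSourceSketch {

    /** Hypothetical stand-in for "a serializer that can stream a document as SAX events". */
    interface SaxEmitter {
        void toSAX(ContentHandler handler) throws SAXException;
    }

    /** XMLReader whose parse() ignores the InputSource and replays the emitter's events. */
    static class EmitterReader extends XMLFilterImpl {
        private final SaxEmitter emitter;

        EmitterReader(SaxEmitter emitter) {
            this.emitter = emitter;
        }

        @Override
        public void parse(InputSource input) throws SAXException {
            // The content handler was installed by the XSLT engine before parse() is called.
            emitter.toSAX(getContentHandler());
        }

        @Override
        public void parse(String systemId) throws SAXException {
            parse(new InputSource(systemId));
        }

        // There is no parent parser, so accept and ignore feature/property requests.
        @Override
        public void setFeature(String name, boolean value) { }

        @Override
        public void setProperty(String name, Object value) { }
    }

    /** Feeds a tiny hand-written event stream through an identity transform. */
    static String render() {
        SaxEmitter emitter = handler -> {
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "id", "id", "CDATA", "a1");
            char[] text = "hello".toCharArray();
            handler.startDocument();
            handler.startElement("", "doc", "doc", new AttributesImpl());
            handler.startElement("", "item", "item", atts);
            handler.characters(text, 0, text.length);
            handler.endElement("", "item", "item");
            handler.endElement("", "doc", "doc");
            handler.endDocument();
        };
        try {
            // This SAXSource is what a URIResolver would return instead of a DOMSource.
            SAXSource source = new SAXSource(new EmitterReader(emitter), new InputSource());
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            StringWriter out = new StringWriter();
            identity.transform(source, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render());
    }
}
```

The point of the design is that the XSLT engine builds its own internal tree from a single forward pass of SAX events, so it never has to walk the stored document sibling by sibling through DOM calls. Whether a real serializer can safely be driven this way inside a resolver is exactly the locking and concurrency question raised above.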
@paulmer Where do you see the DOMSource stuff? Perhaps it is easier to discuss on Slack? https://join.slack.com/t/exist-db/shared_invite/enQtNjQ4MzUyNTE4MDY3LWNkYjZjMmZkNWQ5MDBjODQ3OTljNjMyODkwNmY1MzQwNjUwZjMzZTY1MGJkMjY5NDFhOWZjMDZiMDdhMzY4NGY
@adamretter The DOMSource is created and returned on lines 211-213 of exist-core/src/main/java/org/exist/xslt/EXistURIResolver.java. I've joined the eXist slack workspace, so feel free to message me there.
What is the problem
The transform:transform function called with an XSL transform that reads a document using the "document" function appears to bypass any eXist indexes if the document is referenced with a database path ("/db/.....") instead of an xmldb URL ("xmldb:exist:///db/...."). This causes very poor performance and makes the database appear to hang if the document is even moderately sized (5000-6000 nodes, 200KB).
What did you expect
I expected roughly the same response time evaluating the same XSL transformation on document("/db/a.xml") and document("xmldb:exist:///db/a.xml"), but instead I see response times of 22 seconds vs. 85 milliseconds.
Describe how to reproduce or add a test
Start with an empty database (I deleted the data directory completely.)
Start the eXist client and connect to an embedded database.
Load the attached documents to create the collection /db/test containing data.xml and query.xqy. test.zip
Open query.xqy and edit the value of $xmldbPrefixOpt. Setting the value to 1 prepends "xmldb:exist://" to the file path that's used in the XSL transformation; setting it to any other value omits that prefix.
Run the script. It performs a simple query on data.xml to find an element with a specific xml:id attribute. The query is run twice, once through an XSL transform passed to transform:transform and once in native XQuery, and the number of nodes found as well as the path used for the document are returned.
On my system, running with $xmldbPrefixOpt set to 1 completes in 0.1 to 1 second (depending on the cache state), while running with $xmldbPrefixOpt set to 2 takes about 22 seconds.
Context information
eXist 5.0.0