exasol / exasol-virtual-schema

Virtual Schema from Exasol to Exasol
MIT License
1 stars 2 forks source link

Fix documentation in dialects/exasol.md on ORDER BY #104

Closed PeggySchmidtMittenzwei closed 1 year ago

PeggySchmidtMittenzwei commented 1 year ago

Currently VSEXA User guide contains wrong information:

SELECT * FROM (<virtual-schema-query>) ORDER BY <criteria> [, ...]

Because Exasol SQL engine might push down the keyword ORDER BY to the foreign database accessed by the virtual-schema sub-query. To prevent the push-down users can add an additional clause ORDER BY FALSE inside the sub-query:

SELECT * FROM (<virtual-schema-query> ORDER BY FALSE) ORDER BY <criteria> [, ...]

The outer clause ORDER BY will then be evaluated in the Exasol database domain.

Please note that clause ORDER BY FALSE needs to be used exactly as specified, while variants like ORDER BY TRUE or ORDER BY 'a' will have no effect.

As @narmion says: alternatively in statement CREATE VIRTUAL SCHEMA you can add parameter EXCLUDED_CAPABILITIES='ORDER_BY_COLUMN, ORDER_BY_EXPRESSION', see Exasol documentation.

However both approaches still come with some limitations regarding collations and order by limit clauses.

Collations

A collation is a set of rules that defines how to compare and sort character strings.

The foreign database accessed by the virtual-schema-sub-query might support different collations while Exasol database only supports collations UTF-8 and ASCII.

When clause ORDER BY is not pushed-down but executed in the Exasol database domain then only the Exasol collation will be applied. This can change the order of the rows in the result set.

Order by limit

With clause ORDER BY LIMIT <n> the SQL statement returns max. only the first <n> rows ignoring all following rows of the result set.

Both approaches described above prevent the push-down of clause ORDER BY LIMIT inside the virtual-schema sub-query. In effect the sub-query then returns the complete result set which might be much larger than the specified limit <n> which can have significant effects on the volume of data transfer and the runtime of the query.

Acceptance Criteria

  1. Wrong passage in VSEXA dialect guide is replaced by a link to known limitations in official Exasol documentation on virtual schemas.
narmion commented 1 year ago

Another workaround would be to set EXCLUDED_CAPABILITIES='ORDER_BY_COLUMN, ORDER_BY_EXPRESSION' if the adapter supports it.

ckunki commented 1 year ago

Thanks to @PeggySchmidtMittenzwei for additional background.

What we could clarify is that there definitely are cases in which ORDER BY does not work as expected, i.e.

As far as I understand we do not have the confirmation, yet, there are other use cases in which ORDER BY works as expected.

Regarding the expected resolution a proper documentation is the minimum but preferred is to have an implementation that signals out of the box if ORDER BY is supported maybe accompanied with a proper explanation about the preconditions or covered use cases and potential exceptions, limitations, and restrictions.

ckunki commented 1 year ago

Additional information from @narmion:

ckunki commented 1 year ago

Marked as blocked until alignment with @redcatbear

ckunki commented 1 year ago

Based on the results in other tickets I propose to update VSEXA documentation as follows:

The protocol used by ExaLoader (the part that runs the IMPORT) and Exasol core database only supports unordered transfer. Therefore the outermost order of the imported result rows is not guaranteed.

If you need ordering, please wrap your query into an extra SELECT * FROM (<virtual-schema-query> ORDER BY FALSE) ORDER BY <criteria> [, ...] which will then be executed on the target Exasol database instead of the source of the virtual schema.

Please note that clause ORDER BY FALSE needs to be used exactly as specified, while variants like ORDER BY TRUE or ORDER BY 'a' will have no effect.

Additionally I want to state my current understanding of the impact of the described limitation of the protocol:

redcatbear commented 1 year ago

Okay from my side, if you verify the third statement in the code. Please check the implementation and remove "probably" if true.

ckunki commented 1 year ago

Current state of knowledge is that document-based virtual schemas do not support capability ORDER_BY, see VSD, VSDYN, hence clause ORDER BY can only be applied on top-level outside the virtual schema in the target Exasol database.

In general we should document this as known limitation in the official Exasol documentation, see proposal for DOC ticket below.

For VSEXA we should update file VSEXA dialects/exasol.md to point to known limitations in the official documentation

Other VS should also point to known limitations in the official documentation but as development capacity is limited as well, this is subject to prioritization.


DOC ticket proposal

Dear documentation team, triggered by a customer request we came across a limitation affecting most virtual schemas. In the current ticket therefore we want to request a change to our official documentation in order to inform our customers.

We propose to add the limitations in a prominent place (so that it's easy to find on the first glance), i.e. either on top-level page "Virtual schemas" or in a separate bullet item labeled "Known limitations" on top-level.

Here is a content proposal, please feel free to rephrase and Integration team is happy to review the final proposal.


Known limitations Clause ORDER BY

Clause ORDER BY can be used without limitations in all virtual schemas that do not use IMPORT (= ExaLoader):

Document-based virtual schemas do not support capability ORDER_BY_COLUMN at all.

For other virtual schemas using IMPORT, please note that the protocol used by ExaLoader (the component that runs the IMPORT) and Exasol core database only supports unordered transfer. Therefore the outermost order of the imported result rows is not guaranteed.

If you need ordering, please wrap your query into an extra SELECT * FROM (<virtual-schema-query> ORDER BY FALSE) ORDER BY <criteria> [, ...] which will then be executed on the target Exasol database instead of the source of the virtual schema.

ckunki commented 1 year ago

Created doc ticket, see Int-Ref.

ckunki commented 1 year ago

Strategy found, described and aligned - removing blocker, starting implementation.

ckunki commented 1 year ago

Updated documentation and dependencies, also removed redundant dependency to jakarta.json, ready for review