SQL syntax: `UnsupportedFeatureException[Cannot ORDER BY '<expression>': invalid return type 'undefined'.]`

amotl commented 1 year ago

About

The test case test_search_runs_keep_all_runs_when_sorting fails when invoking search_runs with an order_by parameter.

run_results = self.store.search_runs(
    [experiment_id], None, ViewType.ALL, max_results=1000, order_by=["tag.t1"]
)

Details

The root cause, which generates the SQL statement outlined below, is this elaborate code within MLflow's mlflow.store.tracking.sqlalchemy_store._get_orderby_clauses:

# sqlite does not support NULLS LAST expression, so we sort first by
# presence of the field (and is_nan for metrics), then by actual value
# As the subqueries are created independently and used later in the
# same main query, the CASE WHEN columns need to have unique names to
# avoid ambiguity
if SearchUtils.is_metric(key_type, "="):
    case = sql.case(
        # Ideally the use of "IS" is preferred here but owing to sqlalchemy
        # translation in MSSQL we are forced to use "=" instead.
        # These 2 options are functionally identical / unchanged because
        # the column (is_nan) is not nullable. However it could become an issue
        # if this precondition changes in the future.
        (subquery.c.is_nan == sqlalchemy.true(), 1),
        (order_value.is_(None), 2),
        else_=0,
    ).label(f"clause_{clause_id}")

else:  # other entities do not have an 'is_nan' field
    case = sql.case((order_value.is_(None), 1), else_=0).label(f"clause_{clause_id}")
clauses.append(case.name)
select_clauses.append(case)
select_clauses.append(order_value)

Exception

UnsupportedFeatureException[Cannot ORDER BY 'case(true, $2, (value IS NULL), $1)': invalid return type 'undefined'.]
E               [SQL: SELECT DISTINCT runs.run_uuid, runs.name, runs.source_type, runs.source_name, runs.entry_point_name, runs.user_id, runs.status, runs.start_time, runs.end_time, runs.deleted_time, runs.source_version, runs.lifecycle_stage, runs.artifact_uri, runs.experiment_id, CASE WHEN (anon_1.value IS NULL) THEN ? ELSE ? END AS clause_1, anon_1.value
E               FROM runs LEFT OUTER JOIN (SELECT tags.key AS key, tags.value AS value, tags.run_uuid AS run_uuid
E               FROM tags
E               WHERE tags.key = ?) AS anon_1 ON runs.run_uuid = anon_1.run_uuid
E               WHERE runs.experiment_id IN (?) AND runs.lifecycle_stage IN (?, ?) ORDER BY clause_1, anon_1.value, runs.start_time DESC, runs.run_uuid
E                LIMIT ? OFFSET ?]
E               [parameters: (1, 0, 't1', '147893387621389', 'active', 'deleted', 1000, 0)]
E               (Background on this error at: https://sqlalche.me/e/20/f405)

Intermediate query

This is the query in intermediate form while being created and processed by SQLAlchemy.

SELECT DISTINCT runs.run_uuid, runs.name, runs.source_type, runs.source_name, runs.entry_point_name, runs.user_id, runs.status, runs.start_time, runs.end_time, runs.deleted_time, runs.source_version, runs.lifecycle_stage, runs.artifact_uri, runs.experiment_id, CASE WHEN (anon_1.value IS NULL) THEN :param_1 ELSE :param_2 END AS clause_1, anon_1.value
FROM runs LEFT OUTER JOIN (SELECT tags.key AS key, tags.value AS value, tags.run_uuid AS run_uuid
FROM tags
WHERE tags.key = :key_1) AS anon_1 ON runs.run_uuid = anon_1.run_uuid
WHERE runs.experiment_id IN (__[POSTCOMPILE_experiment_id_1]) AND runs.lifecycle_stage IN (__[POSTCOMPILE_lifecycle_stage_1]) ORDER BY clause_1, anon_1.value, runs.start_time DESC, runs.run_uuid
 LIMIT :param_3 OFFSET :param_4

Remarks

_No workaround for this has been found so far, so, most probably, the test case will be skipped for now. However, depending on needs, using MLflow's search_runs feature, together with sorting, may be an important feature we would not like to skip. So, I will be happy for any support here._

amotl commented 1 year ago

Analysis

I think we need to find a different solution for this particular line of code, which generates an SQL statement CrateDB does not understand.

SQLAlchemy

case = sql.case((order_value.is_(None), 1), else_=0).label(f"clause_{clause_id}")

SQL

CASE WHEN (anon_1.value IS NULL) THEN :param_1 ELSE :param_2 END AS clause_1

Thoughts

We will need to evaluate if we need to patch MLflow, or if we could enhance the CrateDB SQLAlchemy dialect in some way that it would skip producing such SQL clauses, when applicable.
Actually, I am not yet sure if the CASE WHEN is the culprit here, or just using that kind of statement within an ORDER BY clause on behalf of clause_1.

/cc @hlcianfagna, @seut

amotl commented 1 year ago

I found an application-based monkeypatch solution for this, essentially intercepting the return values of _get_orderby_clauses, and removing the corresponding CASE WHEN statement clauses. The test case test_search_runs_keep_all_runs_when_sorting succeeds now, this is all I was aiming at here.

First, I tried to use a generic solution by overwriting SQL Compiler's visit_case visitor method. While this was able to remove the CASE WHEN clause completely, so this is in general a feasible approach, the MLflow application logic at this place defies that attempt, because it adds the corresponding statement clause item to two different lists, coupling its bookkeeping stronger to subsequent application logic.

clauses.append(case.name)
select_clauses.append(case)

Because of this, there was no way to work around this pitfall by exclusively using SQLAlchemy-based workarounds. We need to patch MLflow instead.

amotl commented 1 year ago

Removing that clause from the query generator obviously had to have negative side effects. Why would the code otherwise have been there on the first hand?

It is failing the test_order_by_attributes test case, at this spot where the result order changed, with respect to None values.

E       AssertionError: assert ['None', '789', '456', '234', '123', '-123'] == ['789', '456', '234', '123', '-123', 'None']
E         At index 0 diff: 'None' != '789'
E         Full diff:
E         - ['789', '456', '234', '123', '-123', 'None']
E         ?                                    --------
E         + ['None', '789', '456', '234', '123', '-123']
E         ?  ++++++++

tests/test_tracking.py:1568: AssertionError

We are working around this by adding code-based sorting logic, in order to compensate for the missing SQL clause which is responsible for properly sorting None and NaN values to the end of the list.

amotl commented 1 year ago

The corresponding patch to omit the offending SQL clause looks like this.

https://github.com/crate-workbench/mlflow-cratedb/blob/34a2b32f4ce9e02d36bf22938cb69e12de9eb5d3/mlflow_cratedb/patch/mlflow/tracking.py#L35-L63

This one is needed to compensate that by code instead of SQL.

https://github.com/crate-workbench/mlflow-cratedb/blob/34a2b32f4ce9e02d36bf22938cb69e12de9eb5d3/mlflow_cratedb/patch/mlflow/tracking.py#L66-L157

amotl commented 1 year ago

Maybe @hammerhead or @hlcianfagna can find a workaround how to formulate the query outlined in the original post using CrateDB?

hlcianfagna commented 1 year ago

@amotl I tested with 5.4.2 and I am not seeing the UnsupportedFeatureException with this query, were you testing with an earlier or newer version by any chance?

amotl commented 1 year ago

Dear Hernan,

thanks for looking into this. I think I used CrateDB Nightly. Indeed, CrateDB supports CASE WHEN ... THEN ... END, I must have overlooked this detail.

Hm, maybe the error message indicates that, while it works in general, the engine can not provide sorting on an alias of that type of clause?

CASE WHEN (anon_1.value IS NULL) THEN ? ELSE ? END AS clause_1

ORDER BY clause_1

With kind regards, Andreas.

seut commented 1 year ago

@amotl CrateDB does support this:

cr> select CASE WHEN (1 IS NULL) THEN 'yes' ELSE 'no' END AS clause_1 order by clause_1;                                                                                                                                                                 
+----------+
| clause_1 |
+----------+
| no       |
+----------+
SELECT 1 row in set (0.015 sec)

I highly recommend to always try the SQL out to verify that it's not the query in general ;)

amotl commented 1 year ago

Interesting, thanks. Then, some other combination of the SQL makes this statement fail somehow. I will re-evaluate the situation by using the plain SQL statement, as you suggested, in order to find out further details.

amotl commented 1 year ago

Is it eventually related to https://github.com/crate/crate/issues/15029?

crate / mlflow-cratedb