`.eq_any([["foo"]])` on `SqlType = Array<Text>` causes runtime error on Postgres

Ten0 commented 6 days ago

Setup

Versions

Diesel: ce6fa0bd93877391665254d9bdb23a9c895902ef
Database: Postgres

Feature Flags

diesel: postgres

Problem Description

array.eq_any(array_of_array) doesn't work on Postgres, because Postgres doesn't have such a thing as "array of array". It has 2D array but "2-D array is a different concept". This causes an error, so we should probably use the non-specialized ANSI IN (...) when dealing with arrays.

What are you trying to accomplish?

table! {
    posts (id) {
        id -> Int4,
        // ...
        tags -> Array<Text>,
    }
}

#[test]
#[cfg(feature = "postgres")]
fn filter_array_by_in() {
    use crate::schema::posts::dsl::*;

    let connection: &mut PgConnection = &mut connection();
    let tag_combinations_to_look_for: &[&[&str]] = &[&["foo"], &["foo", "bar"], &["baz"]];
    let result: Vec<i32> = posts
        .filter(tags.eq_any(tag_combinations_to_look_for))
        .select(id)
        .load(connection)
        .unwrap();
    assert_eq!(result, &[] as &[i32]);
}

What is the actual output?

called `Result::unwrap()` on an `Err` value: DatabaseError(Unknown, "could not find array type for data type text[]")

What is the expected output?

Either of:

test passes, serializing as IN ($1, $2, $3) with $1 = ["foo"]...
Doesn't compile, requiring an explicit writing (e.g. .eq_any(abc).ansi()) to tell it to use ANSI serialization

ATM AFAICT there's no way to even tell it to use ANSI serialization because of bounds here: https://github.com/diesel-rs/diesel/blob/ce6fa0bd93877391665254d9bdb23a9c895902ef/diesel/src/expression/array_comparison.rs#L100-L103 which forbids the use of the ANSI implementation on Pg Backend (which seems weird since ANSI should always work, no?)

(NB: All of this is useful only to the extent to which indexing will work and this behaves reasonably even if there are 500 different values in there. I recall that we did tend to prefer = ANY maybe because of this but I may be mistaken here, was it only for statement caching ?)

Steps to reproduce

https://github.com/Ten0/diesel/commit/d6c93fd2a8af99df9bc8dc362144e6f599c6030c

or dedicated working branch to fixing this:

cd diesel
git remote add Ten0 https://github.com/Ten0/diesel.git
git checkout array_eq_any
cd diesel_tests
DATABASE_URL="..." cargo test --no-default-features --features "postgres" -- filter_array_by_in --nocapture

Ten0 commented 6 days ago

= ANY (VALUES (ARRAY['foo']), (ARRAY['bar'])) works, so it looks like as a VALUES subquery in the Many specialization would also work (and maybe generate query plans that match better the = ANY (ARRAY) case but not sure)... The main question is: how to detect that you actually have an Array of Array. I thought about:

Using the type metadata (array_oid of array is zero): we can't ATM because that requires having a connection (for MetadataLookup), which we don't have for the ToSql AST pass (notably for debug reasons)
Adding that information to the SqlType trait. a. We could have that as a type like is done for IsNullable but that may not be worth the additional complexity (and/or breaking changes), considering that "associated type defaults are unstable (https://github.com/rust-lang/rust/issues/29661)"... but maybe that's fine because that is mostly (always?) derived anyway? b. It doesn't necessarily need to be a type though: even if it's just an associated const we could use it in the QueryFragment implementation. If it's just an associated const we can't use that to restrict compilation of additional array nesting though, because "associated const equality is incomplete (https://github.com/rust-lang/rust/issues/92827)", so if it's meant to actually be a type then we probably shouldn't do this...

Ten0 commented 6 days ago

A workaround that uses the VALUES (...), (...) writing:


use {
    diesel::{
        backend::Backend,
        expression::{
            array_comparison::{AsInExpression, MaybeEmpty},
            AsExpression, Expression, TypedExpressionType, ValidGrouping,
        },
        prelude::*,
        query_builder::{AstPass, QueryFragment, QueryId},
        serialize::ToSql,
        sql_types::{HasSqlType, SingleValue, SqlType},
    },
    std::marker::PhantomData,
};

/// Fixes `eq_any` usage with arrays of arrays:
/// https://github.com/diesel-rs/diesel/issues/4349
///
/// Usage: `column.eq_any(AsSingleColumnValueSubselect(iterator_of_arrays))`
pub struct AsSingleColumnValueSubselect<I>(pub I);
impl<I, T, ST> AsInExpression<ST> for AsSingleColumnValueSubselect<I>
where
    I: IntoIterator<Item = T>,
    T: AsExpression<ST>,
    ST: SqlType + TypedExpressionType,
{
    type InExpression = SingleColumnValuesSubselect<ST, T>;

    fn as_in_expression(self) -> Self::InExpression {
        SingleColumnValuesSubselect {
            values: self.0.into_iter().collect(),
            p: PhantomData,
        }
    }
}

#[derive(Debug, Clone)]
pub struct SingleColumnValuesSubselect<ST, I> {
    /// The values contained in the `= ANY ()` clause
    pub values: Vec<I>,
    p: PhantomData<ST>,
}

impl<ST, I, GB> ValidGrouping<GB> for SingleColumnValuesSubselect<ST, I>
where
    ST: SingleValue,
    I: AsExpression<ST>,
    I::Expression: ValidGrouping<GB>,
{
    type IsAggregate = <I::Expression as ValidGrouping<GB>>::IsAggregate;
}

// This implementation is fake, it's not an expression, but it's used to check consistency between the SQL types
// of both sides of the `= ANY ()` clause.
impl<ST, I> Expression for SingleColumnValuesSubselect<ST, I>
where
    ST: TypedExpressionType,
{
    type SqlType = ST;
}

impl<ST, I> MaybeEmpty for SingleColumnValuesSubselect<ST, I> {
    fn is_empty(&self) -> bool {
        self.values.is_empty()
    }
}

impl<ST, I, QS> SelectableExpression<QS> for SingleColumnValuesSubselect<ST, I>
where
    SingleColumnValuesSubselect<ST, I>: AppearsOnTable<QS>,
    ST: SingleValue,
    I: AsExpression<ST>,
    <I as AsExpression<ST>>::Expression: SelectableExpression<QS>,
{
}

impl<ST, I, QS> AppearsOnTable<QS> for SingleColumnValuesSubselect<ST, I>
where
    SingleColumnValuesSubselect<ST, I>: Expression,
    I: AsExpression<ST>,
    ST: SingleValue,
    <I as AsExpression<ST>>::Expression: SelectableExpression<QS>,
{
}

impl<ST, I, DB> QueryFragment<DB> for SingleColumnValuesSubselect<ST, I>
where
    DB: Backend + HasSqlType<ST>,
    ST: SingleValue,
    I: ToSql<ST, DB>,
{
    fn walk_ast<'b>(&'b self, mut out: AstPass<'_, 'b, DB>) -> QueryResult<()> {
        if self.values.is_empty() {
            return Err(diesel::result::Error::QueryBuilderError(
                "Cannot generate an empty `= ANY (VALUES (...), (...))` clause - avoid making a query in that case"
                    .into(),
            ));
        }
        out.unsafe_to_cache_prepared();
        out.push_sql("VALUES ");
        let mut first = true;
        for value in &self.values {
            if first {
                first = false;
            } else {
                out.push_sql(", ");
            }
            out.push_sql("(");
            out.push_bind_param(value)?;
            out.push_sql(")");
        }
        Ok(())
    }
}

impl<ST, I> QueryId for SingleColumnValuesSubselect<ST, I> {
    type QueryId = ();

    const HAS_STATIC_QUERY_ID: bool = false;
}

weiznich commented 6 days ago

I've pushed #4350 with another approach to fix that problem, but I'm open to different solutions.

As a more general note: I wonder if it would be meaningful (and possible) to have some sort of fuzzer that takes the documentation and keeps generating queries via the DSL and then checks if they compile and if they do not return an syntax error while executing them.

Ten0 commented 5 days ago

As a more general note: I wonder if it would be meaningful (and possible) to have some sort of fuzzer that takes the documentation and keeps generating queries via the DSL and then checks if they compile and if they do not return an syntax error while executing them.

This looks interesting. However I feel like it's probably fine if Diesel doesn't cover for mistakes people never make, and there ought to be some of those that the type system allows but the complexity of preventing isn't worth the probability of occurrence... If this is true, then I'm not sure what would be a strategy to create a general test system that would test only "relevant" cases but still be able to infer them automatically.

diesel-rs / diesel