elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.3k stars 24.54k forks source link

SQL array handling is inconsistent #33204

Open paulcarey opened 6 years ago

paulcarey commented 6 years ago

The following applies to Elasticsearch 6.3.

Selecting array fields is not handled consistently and should be revised.

If a document contains an array of primitive (e.g. string, number), an attempt to project that field will fail with an error like Arrays (returned by [values]) are not supported. This behaviour is unnecessarily restrictive. There may be a misconception over SQL or JDBC array support being non-standard.

Wikipedia states that SQL support for arrays was standardized in 1999. JDBC support for Array has existed since JDK 1.2 (released 1998).

For reference, here's an example of using Arrays of Structs with JDBC and PostgreSQL. The Java tutorial has an example showing JDBC usage of Array of String.

A workaround for the array limitation in ES-SQL exists by wrapping the primitive values in an object containing just that value. When this is done then a select can be used to 'explode' each element in the array into its own row. While useful, this should arguably not happen by default.

This exploding behaviour is also supported in Hive, Spark SQL and some other data stores where it must be explicitly triggered by invoking lateral view explode.

If ES-SQL supported arrays, then:

The current situation feels a bit ill-defined.

Showing the current limitation

DELETE array_test_2

POST array_test_2/_doc
{
  "name": "foo",
  "values": [1, 2, 3]
}

POST _xpack/sql?format=txt
{
  "query": "select name, values from array_test_2"
}

{
  "error": {
    "root_cause": [
      {
        "type": "sql_illegal_argument_exception",
        "reason": "Arrays (returned by [values]) are not supported"
      }
    ],
    "type": "sql_illegal_argument_exception",
    "reason": "Arrays (returned by [values]) are not supported"
  },
  "status": 500
}

Workaround for the array limitation by wrapping primitives in an object

DELETE array_test_1

PUT array_test_1
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text"
        },
        "values": {
          "type": "nested",
          "properties": {
            "v": {
              "type": "long"
            }
          }
        }
      }
    }
  }
}

POST array_test_1/_doc
{
  "name": "foo",
  "values": [{"v": 1}, {"v": 2}, {"v": 3}]
}

POST _xpack/sql?format=txt
{
  "query": "select name, values.v as v from array_test_1"
}

     name      |       v       
---------------+---------------
foo            |3              
foo            |2              
foo            |1              
elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

costin commented 6 years ago

Thanks for the detailed (and formatted) issue - it's really appreciated.

The root issue with arrays comes from ES - the mapping doesn't indicate whether a field is an array (has multiple values) or not so without reading all the data, ES-SQL cannot know whether a field is a single or multi value.

As you pointed, one way to handle that is through explicit statements though each (including LATERAL VIEW EXPLORE) comes with its own semantics that require some thinking through.

Further more, arrays and struct have their own set of problems; I'm aware of the standard and JDBC support however de-facto, arrays are not widely available (see mysql and msserver) and not all consumers (a large part as far as I could tell) support them. The fact that it's a composite value breaks the flat table structure and challenges it - are the arrays enforced to have the same type, length? Are offsets supported (for display or querying)? One could argue that without introspection of an array, one can just treat it as a blob.

Getting the semantics right (and the implications of supporting such a type) requires a significant amount of time so it will take a while for us to get to it and sort it out.

paulcarey commented 5 years ago

Many thanks for the throughtful response.

While exploring ES-SQL support for nested documents, I came across a few other issues:

I think the dot notation used to query and project nested docs, while elegant, is simply not expressive enough. I'm sure there are a number of ways of addressing this. For example, https://github.com/json-path/JsonPath would allow multiple predicates to be applied to the same nested doc.

But given that an implicit mapping exists from a given document to its relational equivalent, perhaps the most predictable approach to supporting arbitrarily complex SQL is to require that such queries are expressed using 'normalized' queries.

For example, this document (that I referred to in this post)

{
  "groupName" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

has the following implicit schema

create table groups (
  group_id integer primary key,
  group_name varchar(20)
);

create table users (
  user_id integer primary key,
  group_id integer,
  first_name varchar(20),
  last_name varchar(20)
);

A query allowing multiple predicates to be expressed against a single nested doc would be trivially expressed as follows

select g.group_name from groups g
natural join users u
where u.first_name = 'alice'
and u.last_name = 'white';

This would be translated into the following Query DSL.

{                                                                                                                                                                                                                                        
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "alice" }},
            { "match": { "user.last":  "white" }}
          ]
        }
      }
    }
  }
}

There are multiple advantages to this approach:

What's notably missing from this approach relative to ES-SQL today is a way to specify the index against which the query should run, currently supported with from <tablename>. But this could be easily resolved by specifying the target database in the JDBC connection string, or as an extra param alongside query when submitting SQL queries over HTTP.

I'll add a follow-on comment with a few more examples that appear tricky to express with ES-SQL, but where the translation from 'normalized' SQL to Query DSL seems fairly mechanical.

paulcarey commented 5 years ago

Sample queries referred to in the comment above.

DELETE countries

PUT countries
{
  "mappings": {
    "_doc": {
      "properties": {
        "cities": {
          "type": "nested",
          "properties": {
            "districts": {
              "type": "nested",
              "properties": {
                "name": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "schools": {
                  "type": "nested",
                  "properties": {
                    "name": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    },
                    "students": {
                      "type": "long"
                    }
                  }
                }
              }
            },
            "name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

POST countries/_doc
{
  "name": "ireland",
    "cities": [
    {
      "name": "dublin",
      "districts": [
        {
          "name": "dun laoghaire",
          "schools": [
            {
              "name": "tao nan school",
              "students": 101
            },
            {
              "name": "colaiste eoin",
              "students": 202
            }
          ]
        },
        {
          "name": "howth",
          "schools": [
            {
              "name": "colaiste eoin",
              "students": 30
            }
          ]
        }
      ]
    },
    {
      "name": "cork",
      "districts": [
        {
          "name": "ballincollig",
          "schools": [
            {
              "name": "scoil san treasa",
              "students": 50
            }
          ]
        }
      ]
    }
  ]
}

# select d.districtName from districts d
# natural join schools s
# where s.schoolName = 'colaiste eoin'
# and s.students = 202;

# No way to require that the predicates refer to the same school
POST _xpack/sql?format=txt
{
  "query": "select countries.cities.districts.name from countries where countries.cities.districts.schools.name = 'colaiste eoin' and countries.cities.districts.schools.students = 202"
}

GET countries/_search
{
  "query": {
    "nested": {
      "path": "cities.districts",
      "query": {
        "nested": {
          "path": "cities.districts.schools",
          "query": {
            "bool": {
              "must": [
                {"match" : {"cities.districts.schools.name": "colaiste eoin"}},
                {"match" : {"cities.districts.schools.students": 202}}        
              ]
            }
          },
          "inner_hits": {}
        }
      },
      "inner_hits": {}
    }
  }
}

# select d.districtName, s.students from districts d
# natural join schools s
# where s.schoolName = 'colaiste eoin'
# and d.districtName = 'howth';

# Throws an NPE (possibly fixed in https://github.com/elastic/elasticsearch/issues/33170)
POST _xpack/sql?format=txt
{
    "query": "select countries.cities.districts.name, countries.cities.districts.schools.students from countries where countries.cities.districts.schools.name = 'colaiste eoin' and countries.cities.districts.name = 'howth'"
}

GET countries/_search
{
  "query": {
    "nested": {
      "path": "cities.districts",
      "query": {
        "bool": {
          "must": [
            {"match_phrase" : {"cities.districts.name": "howth"}},
            {
              "nested": {
                "path": "cities.districts.schools",
                "query": {
                  "match": {
                    "cities.districts.schools.name": "colaiste eoin"
                  }
                },
                "inner_hits": {}
              }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}

# select c.cityName from cities c
# natural join districts d
# natural join schools s
# where s.schoolName = 'colaiste eoin'
# and s.students = 30;

# Returns cork and dublin rather than just dublin, but possibly this is due to a bug in the way that inner_hits is generated
POST _xpack/sql?format=txt
{
    "query": "select countries.cities.name from countries where countries.cities.districts.schools.name = 'colaiste eoin'"
}

GET countries/_search
{
  "query": {
    "nested": {
      "path": "cities",
      "query": {
        "nested": {
          "path": "cities.districts",
          "query": {
            "nested": {
              "path": "cities.districts.schools",
              "query": {
                "bool": {
                  "must": [
                    {
                      "match": 
                        {"cities.districts.schools.name": "colaiste eoin"}
                    },
                    {
                      "match": 
                        {"cities.districts.schools.students": 30}
                    }
                  ]
                }
              },
              "inner_hits": {}
            }
          },
          "inner_hits": {}
        }
      },
      "inner_hits": {}
    }
  }
}

Incidentally, I've tested all the SQL above using PostgreSQL 9.6 on http://sqlfiddle.com/ using the following setup.

create table countries (
  country_id integer primary key,
  countryName varchar(20)
);

create table cities (
  city_id integer primary key,
  country_id integer,
  cityName varchar(20)
);

create table districts (
  district_id integer primary key,
  city_id integer,
  districtName varchar(20)
);

create table schools (
  school_id integer primary key,
  district_id integer,
  schoolName varchar(20),
  students integer
);

insert into countries values ( 1, 'ireland');
insert into cities    values ( 1, 1, 'dublin');
insert into cities    values ( 2, 2, 'cork');
insert into districts values ( 1, 1, 'dun laoghaire');
insert into districts values ( 2, 1, 'howth');
insert into districts values ( 3, 2, 'ballincollig');
insert into schools   values ( 1, 1, 'tao nan school', 101);
insert into schools   values ( 2, 1, 'colaiste eoin', 202);
insert into schools   values ( 3, 3, 'scoil san treasa', 50);
insert into schools   values ( 4, 2, 'colaiste eoin', 30);
costin commented 5 years ago

@paulcarey, thank you for providing valuable input and being generous with your ideas - it is appreciated!

I agree with you that mapping nested docs (or parent/join) relationship, which is a 1:N, as two separate tables brings a number of benefits since the association between parent/nested becomes obvious and a lot of concepts fall into place. The concern with this approach though, is that only a small subset of JOIN functionality is used and just on a particular set of data, which would be misleading and unexpected to a large number of users. It might be also interpreted as if nested/children docs exist by themselves (are a separate table) when in fact they aren't. This is further accentuated by the fact that ES currently lacks the ability to easily paginate or return nested docs (inner_hits for example requires one to specify how many docs to return instead of allowing all docs (which vary in size across docs) to be returned). This complicates the implementation and in some cases, what SQL can effectively do.

Unfortunately at this stage, due to lack of time mainly, I don't have a clear answer on how we'll move forward with nested docs. Note this is a separate issue from arrays - it does conceptually stems from the same abstraction (mapping one-to-many relationships) however the data type/container is significantly different.

Back to arrays, a simple solution for them is introducing either a FLATTEN or ARRAY function - the former as a way to unwind an array (need to sort out the table structure), the latter to return a field as an array without changing the table structure (as in number of columns/rows).

paulcarey commented 5 years ago

Apologies for taking the discussion a little off-topic.

I take your point about not wanting to surprise users. Representing nested object as fully-fledged tables would indeed be misleading, but perhaps this could be mitigated with syntax e.g. by fully qualifying table names (select ... from countries.cities.districts d) while still allowing natural joins against such tables.

I'm new to ES so don't really have a view on the difficulty of effectively retrieving nested docs via inner_hits, or of supporting this via the search & scroll API.

Anyway, I think I'm done with suggestions :smile: But I will watch the development of ES-SQL with interest as I'll almost certainly be adopting this in my company. Our users will need to query on nested documents, and in the near time I think this will be done with a visual query builder. I note that Kibana doesn't support such query building for nested documents, (possibly because it's difficult in the general case) but I think we have enough information about our document structure to make this feasible.

The proposal for FLATTEN and ARRAY functions sounds sensible. Many thanks.

paulcarey commented 5 years ago

Hi, I was wondering if there's been any progress on this issue? Thanks.

astefan commented 5 years ago

Hi @paulcarey, There is no "visible" progress on the nested documents front. We did have an internal discussion about the technical limitations around fully supporting the nested documents and some new ideas have arisen, but probably in the near future there won't be any work on this.

paulcarey commented 5 years ago

That's a bit disappointing as it's about nine months since this issue was first raised.

At this point I'm heavily committed to both nested documents and ES SQL. I realise no commitments had been provided, but stating that there probably won't be any work on this in the near future is not reassuring.

Regarding the new ideas that arose in your internal discussions, are these something you could share?

Are there any workarounds that could be adopted? For example, by having an intermediate query layer invoke translate with the client SQL and then perform some extra translation on the resulting Query DSL, provided that the query layer has some understanding of the document structure. Even if a workaround is not truly general, it may well be better than simply saying that ES SQL and nested docs are largely incompatible. Thanks.

FreCap commented 3 years ago

@paulcarey I'm wondering (I know it has been a while), if any workaround was found.

@astefan I saw that some work on this has started, which is great! Is it possible to share any public issue on the expected results?

I'm new to x-pack SQL, so I might be missing something, but I've been working quite a while with opendistro SQL so I might bring a different perspective.

Opendistro worked part of this out with a similar approach to Postgres JSONB --> just made it a field and print JSONs without exploding it https://github.com/opendistro-for-elasticsearch/sql

1) SELECT myNested --> return the JSON without flatting it out (since flatting an array is not very ideal)

2) SELECT nested(myNested) FROM table-xyz WHERE nested(myNested, myNested.field="firstChoice") --> will show a an array that refers to the nested(myNested) as a field in one single row

If we think in SQL this would probably look like (supported in opendistro) SELECT AGGREGATE(m) FROM table-xyz, table-xyz.myNested m WHERE m.field="firstChoice" GROUP BY table-xyz._id

In Postgres with jsonb it would be

SELECT t.id, array_to_json(array_agg(myNested))
FROM table-xyz t, json_array_elements(myNested.jsonColumn) myNested
WHERE myNested->>'field' = 'firstChoice'
GROUP BY id;

3) We could also have multiple joins such as (multiple FROM tables ARE NOT supported in opendistro, but multiple nested are, except for the fact that you can use only one of the two nested in the select): SELECT m1, m2 FROM table-xyz, table-xyz.myNested m1, table-xyz.myNested m2 WHERE m1.field="firstChoice" AND m2.field="secondChoice"

or

SELECT AGGREGATE(m1), AGGREGATE(m2) FROM table-xyz JOIN myNested m1 JOIN myNested m2 WHERE m1.field="firstChoice" AND m2.field="secondChoice" GROUP BY table-xyz._id

costin commented 3 years ago

@FreCap thanks for the detailed comment and perspective.

There are two topics on this ticket that are being mixed which makes it hard to follow.

Which is what this ticket is mainly about and something that we're investigating. The main challenge on this front has been around filtering and the semantics associated with multi-value fields in ES which different than SQL arrays (which are fixed in size).

I realize it's disappointing to folks (in particular @paulcarey) there hasn't been a lot of update on this front. As I've explained in my previous comments, a number of technical challenges impose limitations on handling nested data, such as paging through results with non-trivial amount of docs.

The team hasn't forgotten about this issue and as we're making progress on the array front, hopefully we'll have some updates to share here as well.

Thanks everyone for the patience.

serak commented 3 years ago

how is this problem solved in mongodb ?. in mongodb sql like select * from customer where address.city='Berlin' works where it works like a self join with address table id 1 | address.region1 1 | address.region2 1 | address.region3 ...

emgreen33 commented 1 year ago

👋🏼 hello, read the thread/s on this issue, but saw this one was still open. Running into the same issue (sorry, broken record). It's a bit awkward as the field on the index can return a string, null or an array (its own issue), but, trying to implement a safeguard for these. I have a SQL hack to get around it, but, curious if anyone else found an organic workaround?

AydinChavez commented 1 year ago

@FreCap thanks for the detailed comment and perspective.

There are two topics on this ticket that are being mixed which makes it hard to follow.

  • Arrays (aka multi-value fields)

Which is what this ticket is mainly about and something that we're investigating. The main challenge on this front has been around filtering and the semantics associated with multi-value fields in ES which different than SQL arrays (which are fixed in size).

  • Nested fields

I realize it's disappointing to folks (in particular @paulcarey) there hasn't been a lot of update on this front. As I've explained in my previous comments, a number of technical challenges impose limitations on handling nested data, such as paging through results with non-trivial amount of docs.

The team hasn't forgotten about this issue and as we're making progress on the array front, hopefully we'll have some updates to share here as well.

Thanks everyone for the patience.

Patience is an interpretable term, I do really appreciate all your work but on the other side @paulcarey has initiated this quite essential need (from elastic/analytics context) five years ago.

Question (maybe I am overlooking something obvious here): What kind of options do we have in order to query elastic indices (e.g. via sql) having array fields (in case excluding the array fields from the statement is not an option)?

I am aware about the field_multi_value_leniency flag, which is not even a full workaround because according to the doc it simply shows only the first array entry. Is there even no possibility to do explicitly casting an array field simply to a varchar/string so arrays are just getting printed out (for analytics reasons)?

elasticsearchmachine commented 7 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

sarcoex commented 1 month ago

Hello @emgreen33 I am curious to know this SQL hack, I still can't find how to access the array fields and have not option to split the index or change the mapping