elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.61k stars 24.64k forks source link

SQL: Default configuration and whether it is mutable #29842

Closed elasticmachine closed 6 months ago

elasticmachine commented 6 years ago

Original comment by @costin:

While working on the mapping I've encountered a number of situations where, depending on the user preference, we could take different actions. Through this ticket I'd like to track the options and their values and also discuss whether we want/can/should introduce the ability to read/write these options for alpha or not (LINK REDACTED).

Currently we have the following client params:

  1. query.timeout - 90s How long to wait for the initial query to return results
  2. page.timeout - 45s How long to wait for the next query page to return
  3. page.size - 1000 How many results to be returned at once/in one page

I'd like to also introduce (naming tbd - I like having some namespace yet I'd like to keep them small):

  1. mapping.ignore.unsupported= true Whether unsupported mappings are ignored or an exception is thrown. Think geo and such. I prefer ignore since it's lenient and provides a nicer experience since it doesn't force a reindex just to use it.

  2. mapping.array.leniency = true How should we deal with arrays? Throw an exception or be lenient and return the first value (which is what we currently do).

  3. query.max.nested.docs = 99 how many nested docs to be returned per parent. After talking to MvG it seems that currently there's no way to tell ES to automatically return all docs and we'll have to settle for a number until this feature makes its way in. This translates to the size of a priority queue so we should be somewhat conservative.

  4. run.as = See LINK REDACTED

Ideally these options would be modifiable but for alpha we can just lock them in and postpone the associated work post release.

Thoughts?

elasticmachine commented 6 years ago

Original comment by @costin:

Pinging @nik9000 @imotov @dakrone

elasticmachine commented 6 years ago

Original comment by @dakrone:

Whether unsupported mappings are ignored or an exception is thrown. Think geo and such.

Ignored at what level? For MATCH() query? Returning as part of SELECT?

How should we deal with arrays? Throw an exception or be lenient and return the first value (which is what we currently do).

How do other SQL servers handle things like this? (Just asking out of curiosity)

query.max.nested.docs = 99

Does this map to an inner_hits call? Also, why 99 and not 100?

elasticmachine commented 6 years ago

Original comment by @costin:

By the way, we can try and sort this on a call depending on everyone's availability.

Ignored at what level?

Ignored at mapping level - meaning how we understand the index regardless of the actions applied to it. The moment we try to resolve the index and read its mapping, either we blow up and chose to ignore the field as if it doesn't exist. This affects all operations on that index.

How do other SQL servers handle things like this?

Arrays are not standard SQL; some DBs support (Postgres) them some (Mysql, SQL Server) do not. The main reason being that SQL is already tabular and an array is basically a subset of a table.

As far as I'm aware, an array is just a white-box struct, where one can indicate an element from: SELECT array FROM foo WHERE array[1]=bar Also in general arrays are discouraged as one can simply break it down into different columns or keep it as a whole and use a function (e.g. split) to work with it. I believe arrays also cannot be indexed (in RDBMs).

This also implies that arrays are fixed and of the same type. The issue we're having is not only it's fairly hard for us to support the array concept, we'd have to use it for all fields since basically any primitive field in ES can be an array.

Does this map to an inner_hits call?

Yes, it does.

Also, why 99 and not 100?

Just an arbitrary number. As for 99 vs 100 - same philosophy with discount prices, it's shorter and seems lower :)

elasticmachine commented 6 years ago

Original comment by @dakrone:

The moment we try to resolve the index and read its mapping, either we blow up and chose to ignore the field as if it doesn't exist.

I think my preference would be to read it in the mapping (without blowing up), but blow up if the field is ever referenced with something like "we don't support fields of type [geo]"

The issue we're having is not only it's fairly hard for us to support the array concept, we'd have to use it for all fields since basically any primitive field in ES can be an array.

I'd prefer to by strict and be able to opt-in to leniency (even if it were a dynamic setting) for arrays, so default to throwing an exception. I think otherwise we may end up hiding information useful to our users without them knowing it.

Just an arbitrary number. As for 99 vs 100 - same philosophy with discount prices, it's shorter and seems lower :)

100 seems like a reasonable limit to me

elasticmachine commented 6 years ago

Original comment by @costin:

I think my preference would be to read it in the mapping (without blowing up), but blow up if the field is ever referenced with something like "we don't support fields of type [geo]"

How would we handle DESCRIBE TABLE in this case? Show the field (as unknown or type string) but if the user refers to it we blow up? Also how about * - do we filter the unknown types in this case or blow up? I like the proposal since it gives an option for the middle but also inconsistency in the edge cases.

I'd prefer to by strict and be able to opt-in to leniency Noted

100 seems like a reasonable limit to me 100 is it.

@nik9000 @imotov any opinions? Considering how close alpha is I'm leaning towards, potentially, opening up these settings in JDBC (through properties) and afterwards opening up the configuration in the grammar.

elasticmachine commented 6 years ago

Original comment by @dakrone:

How would we handle DESCRIBE TABLE in this case? Show the field (as unknown or type string) but if the user refers to it we blow up? Also how about * - do we filter the unknown types in this case or blow up?

I think for describing the table it should show as "UNSUPPORTED" type (or if there's a SQL equivalent, maybe not).

For *, I think filter/ignore the unsupported types, especially because I suspect SELECT * FROM ... is going to be quite popular

elasticmachine commented 6 years ago

Original comment by @imotov:

query.timeout - 90s page.timeout - 45s page.size - 1000

I think these settings depend on the types of mappings and queries that users are going to be making. I think we need to choose good defaults but make these dynamic and configurable. To be honest I don't think 90s is a good default. We don't really have a reliable way to timeout queries in Elasticsearch, and some of the queries that we write might run for a very long time. So, my preference would be to have no timeout here by default.

mapping.ignore.unsupported= true mapping.array.leniency = true

I don't think we need to make these configurable in the initial realize. Let's just make the decision. I am personally fine with making them lenient. We are bridging the gap here, so I think we need to be more lenient on SQL side than we are traditionally on Elasticsearch side, because we don't control the protocol.

I'd prefer to by strict and be able to opt-in to leniency (even if it were a dynamic setting) for arrays, so default to throwing an exception. I think otherwise we may end up hiding information useful to our users without them knowing it.

Can we do something in the middle, like showing unsupported fields in the schema, but giving them a special type and not supporting them in the queries? So, we don't hide information, but at the same time make the working with Elasticsearch easy,

query.max.nested.docs = 99

We might need to make this configurable.

run.as = See LINK REDACTED

Seems like we need this to be configurable, but I think this should go on connection string or SQL statement. I don't think this should be a parameter.

elasticsearchmachine commented 8 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

wchaparro commented 6 months ago

superceded by ES|QL