
NULL value support for all data types #4230

Closed Jackie-Jiang closed 1 year ago

Jackie-Jiang commented 5 years ago

Currently Pinot does not have real NULL value support; instead, it substitutes special default values for NULL. For dimensions, the default value is the minimum value for numeric types, "null" for STRING, and an empty byte array for BYTES; for metrics, the default value is 0 for numeric types and an empty byte array for BYTES. #4214 adds support for treating an empty byte array as NULL for BYTES, but that is not a general way to support NULL for all data types. We should add another type of index that marks whether each value is null, and consult it while filtering or fetching values so that NULL values can be skipped.
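
To make this concrete, here is a minimal sketch of what such a null index could look like on the read path, assuming a RoaringBitmap keyed by docId; the class and names are hypothetical, not Pinot's actual code:

```java
import org.roaringbitmap.RoaringBitmap;

// Hypothetical reader, not Pinot's actual code: the forward index keeps the
// per-type default value in NULL slots, and a separate null bitmap records
// which docIds are really NULL.
final class NullAwareColumnReader {
  private final int[] _forwardIndex;
  private final RoaringBitmap _nullBitmap;

  NullAwareColumnReader(int[] forwardIndex, RoaringBitmap nullBitmap) {
    _forwardIndex = forwardIndex;
    _nullBitmap = nullBitmap;
  }

  // Returns the stored value, or null when the docId is marked NULL,
  // instead of leaking the default value to the caller.
  Integer getInt(int docId) {
    return _nullBitmap.contains(docId) ? null : _forwardIndex[docId];
  }
}
```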

kishoreg commented 5 years ago

Here is the high-level idea: maintain a per-column presence vector, a bitmap with one bit per docId that marks whether the row's value is present (non-null).

One open question is: should we restrict presence vector generation to the segment creation phase, or allow generating it on the fly when we load the segment (similar to the inverted index)? Generating it on the fly means deriving the presence vector from the forward index. We can't do that directly, because we represent NULL using the defaultValue in the fieldSpec, so the forward index alone cannot distinguish a NULL from a genuine default value. One workaround is to treat the default value in the forward index as NULL (see the sketch after this comment). I think this might work, but I'm not 100% sure. Does anyone see any issues with treating a default value in the forward index as implying NULL?

For the first phase, I think restricting it to the segment generation phase should be good enough.

Thoughts?
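
For reference, a minimal sketch of the workaround above, assuming an INT column and a RoaringBitmap presence vector; names are illustrative, not actual Pinot code:

```java
import org.roaringbitmap.RoaringBitmap;

public class PresenceVectorBuilder {
  // Treat the fieldSpec defaultValue stored in the forward index as NULL.
  // Caveat (raised in later comments): a genuine value equal to the default,
  // e.g. a metric that is really 0, would be wrongly treated as NULL.
  static RoaringBitmap buildPresenceVector(int[] forwardIndex, int defaultValue) {
    RoaringBitmap presence = new RoaringBitmap();
    for (int docId = 0; docId < forwardIndex.length; docId++) {
      if (forwardIndex[docId] != defaultValue) {
        presence.add(docId); // bit set => value present (non-null)
      }
    }
    return presence;
  }
}
```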

sunithabeeram commented 5 years ago

A few thoughts:

  1. How would defaultValue and isNullable in the FieldSpec work together? Does one take precedence over the other?
  2. Although I like the idea of a presence vector, would it be better to make it a bit more dynamic by adding a metadata field that says whether the bitmap encodes presence or absence, depending on the data? This is an interesting consideration from a storage point of view (we can use whichever gives the better storage representation), but it would make execution a bit less straightforward.
  3. Do we want to generate the bit vector for metric fields?
  4. Regarding the open question: if we don't allow on-the-fly generation, how would this work with older segments? Interpreting existing default values to generate the bit vector sounds OK to me.

kishoreg commented 5 years ago

@sunithabeeram Changed the bullets to a numbered list.

  1. When an input column value is NULL, there are two things to consider: (a) what we store in the forward index, and (b) how we support x != NULL in the query without generating an inverted index for every unique value. We need defaultValue to support (a); this issue is about supporting (b).
  2. Not sure the saving in storage space is worth the complexity; the presence bit vector should be small. This is equivalent to storing one additional posting list.
  3. Yes, metric columns will benefit a lot from this feature.
  4. Older segments will continue to use the default values during aggregation and assume that nothing is NULL. If the user is OK with treating defaultValue as NULL, we can support old segments as well by generating the NULL vector on the fly.

mayankshriv commented 5 years ago

A few questions:

  1. How do we return NULL in the response for selection queries?
  2. If we don't change what is stored in the forward index, where/how do we tie the default null value stored in the forward index to the NULL keyword in (x != NULL) in the query? For new indices (generated after this feature is implemented), is there any value in supporting a default null value?
  3. How will the presence vector work for MV columns, where one of the MV values is NULL?
  4. From the proposal I take it that the idea is to filter out NULLs during the filtering phase of query execution? What about the case where the filtered columns don't have NULLs, but the column being aggregated does? When/where will those NULLs be filtered so that individual aggregation functions don't need to filter them out explicitly?

kishoreg commented 5 years ago

Good questions.

  1. We will still return the defaultValues.
  2. The only reason we need defaultValue is to avoid storing NULL in the forward index. Typically, column stores skip storing null values in the forward index, but that would impact random lookup speed. My goal is to support NULL without compromising on latency.
  3. I thought we would just skip NULL values in MV columns. The presence bit for the row will be set if there is at least one non-null value.
  4. This is exactly what we want to avoid. Once we support this feature, one can add x != NULL for any column (whether it appears in a filter, aggregation, group-by, etc.). E.g., for select sum(m1) from T, the user will have to add m1 != NULL explicitly (see the sketch below). Eventually, we can automatically rewrite the query to select sum(m1) from T where m1 != NULL.
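
A sketch of how an explicit x != NULL could be evaluated, assuming both the filter result and the presence vector are RoaringBitmaps (illustrative values, not Pinot's operator code):

```java
import org.roaringbitmap.RoaringBitmap;

public class PresenceFilterSketch {
  public static void main(String[] args) {
    // docIds matching the rest of the filter tree (illustrative values)
    RoaringBitmap matchingDocIds = RoaringBitmap.bitmapOf(0, 1, 2, 3, 4);
    // presence vector for column x: bits set where x is non-null
    RoaringBitmap presence = RoaringBitmap.bitmapOf(0, 2, 4);
    // "x != NULL" then reduces to an intersection with the presence vector
    RoaringBitmap result = RoaringBitmap.and(matchingDocIds, presence);
    System.out.println(result); // {0,2,4}
  }
}
```
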
mayankshriv commented 5 years ago

For 4: Is there a reason users would not want to filter out NULL values for metrics? With a default value of 0 for metrics, sum works. However, min/max/avg may give incorrect answers in the presence of NULLs. And for specialized structures such as TDigest/HLL, the question becomes what value to store in the index such that it is deserializable into memory and the result is not polluted by the presence of NULLs.

Also, +1 to building the presence vector in the segment generation phase.

kishoreg commented 5 years ago

Can't think of a case where users would NOT want to filter out NULLs for metrics. I feel our first goal should be to give users a way to get the behavior they want without compromising much on performance. So for HyperLogLog, one can do one of the following:

  1. Set defaultValue to EmptyHyperloglog().bytes(), which is some large byte array.
  2. Or set defaultNullValue as a byte[] but add x != NULL during the query.

We support (1) already, but it's expensive in terms of storage and performance (if there are too many NULLs). This issue will allow them to do (2).

But, as you mentioned, for aggregation functions such as min, max, and avg, the user can't really get the behavior they want without support for NULL.

sunithabeeram commented 5 years ago

Interacting through the issue is a bit limiting, as it's hard to reference comments appropriately. Will try my best.

For #1 in the questions I asked, as I think more through this: isNullable seems to be just an optimization hint, correct? All columns are currently nullable, so adding it explicitly could be confusing. What does it mean when a user sets isNullable=false and we encounter null values? Should an exception be thrown? (I.e., if the column is not set as nullable OR a defaultValue is not specified, throw an exception. This is probably the better thing to do in my mind, because our current default defaultNullValues can really backfire for certain aggregations; but it introduces a behavior change, and I'm not sure if you had that in mind.) If it's purely an optimization hint, we could probably build the presence vector at all times and avoid the extra field in the schema.

mayankshriv commented 5 years ago

@kishoreg for 4: If it is safe to assume that users would always want to filter out NULLs for metrics, why not implicitly filter out NULLs for metrics in the first implementation (as opposed to eventually)?

@sunithabeeram Reiterating our offline discussion (to share with all): unless there's a functional reason for adding isNullable to the fieldSpec, I'd recommend not adding it. We should auto-derive it (and put it in the metadata) during segment generation. The only issue would be consuming segments, and in that case, we can always build the presence vector for all columns. I see the following issues with requiring this information in the fieldSpec:

  1. It is not always possible for clients to know whether a column may contain nulls, especially when they don't own the upstream data pipeline.
  2. It is yet another config to manage.

kishoreg commented 5 years ago

Good suggestions @mayankshriv @sunithabeeram.

Yes, we can filter out NULLs for metrics automatically if we see the presence vector.
Also, agreed on not adding isNullable to the fieldSpec. We can generate the presence vector for all columns.

@mayankshriv what is the issue with consuming segments?

mcvsubbu commented 5 years ago

Consuming segments will need a mutable presence vector. While committing the segment to the persistent store, the presence vector should also be translated. This is where I thought the isNullable option would be used: if the user sets isNullable, then we use the presence vector; otherwise we use the default value in the forward index (or dictionary, as the case may be). Perhaps call it handleNullValues instead of isNullable. The field is always nullable; the question is just whether we treat a missing value as null or as the default value.

The default value for a metric is 0. A metric can very likely have a genuine value of 0, so translating 0 to NULL (i.e., building the presence vector on the server while loading) is not an option, I would think.

sunithabeeram commented 5 years ago

@mcvsubbu is there an issue if the presence vector is always built?

If I understand things correctly, I believe the current state of handling nullables is a bit unfortunate. The scenario you mentioned is not just an issue for building the presence vector; as @mayankshriv pointed out, it can actually result in incorrect aggregations.

Is the defaultNullValue of 0 just for metrics? Or is it based on the data type? In either case, should the actual behavior be corrected? We should not allow default "defaultNullValues"; instead, we should elevate the presence vector to a first-class citizen, something that is always present (no pun intended). If users provide a "defaultValue", we can use that when the column is missing for a row (and not filter it out at any stage?).

kishoreg commented 5 years ago

@mcvsubbu The presence vector is nothing but a single roaring bitmap per column. When we ingest a row, we set the corresponding bit to 1 (not null) or leave it at 0 (null). In the real-time case it will be a mutable roaring bitmap, and in the offline case an immutable roaring bitmap.
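
A minimal sketch of that mutable-to-immutable flow using the org.roaringbitmap buffer classes; the commit step here (serialize to bytes, wrap as immutable) is illustrative, not Pinot's actual segment-commit code:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.roaringbitmap.buffer.ImmutableRoaringBitmap;
import org.roaringbitmap.buffer.MutableRoaringBitmap;

public class PresenceVectorCommitSketch {
  public static void main(String[] args) throws IOException {
    // Consuming (real-time) segment: set the presence bit as rows arrive.
    MutableRoaringBitmap mutable = new MutableRoaringBitmap();
    mutable.add(0); // docId 0 has a non-null value
    mutable.add(2); // docId 2 has a non-null value; docId 1 stays null

    // On segment commit: serialize and reload as an immutable bitmap.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    mutable.serialize(new DataOutputStream(bytes));
    ImmutableRoaringBitmap immutable =
        new ImmutableRoaringBitmap(ByteBuffer.wrap(bytes.toByteArray()));

    System.out.println(immutable.contains(2)); // true
    System.out.println(immutable.contains(1)); // false => null
  }
}
```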

mcvsubbu commented 5 years ago

Exactly what I was thinking. However, don't we need a switch to decide whether to use the presence vector or retain the current behavior? Or are you thinking that the current behavior will be obsoleted (i.e., null becomes the default value)?

mcvsubbu commented 5 years ago

> @mcvsubbu is there an issue if the presence vector is always built?

No issue, just that we change the behavior from what people are used to. I am guessing there is at least someone out there who relies on the fact that a null will be auto-populated with the default null value. Many applications also set custom default null values in their schema.

> If I understand things correctly, I believe the current state of handling nullables is a bit unfortunate. The scenario you mentioned is not just an issue for building the presence vector; as @mayankshriv pointed out, it can actually result in incorrect aggregations.

> Is the defaultNullValue of 0 just for metrics? Or is it based on the data type?

0 for metrics, and Long.MIN or Int.MIN or "null" for dimensions.

> In either case, should the actual behavior be corrected? We should not allow default "defaultNullValues"; instead, we should elevate the presence vector to a first-class citizen, something that is always present (no pun intended). If users provide a "defaultValue", we can use that when the column is missing for a row (and not filter it out at any stage?).

kishoreg commented 5 years ago

@mcvsubbu There is absolutely no change in the existing behavior of what gets stored in the forward index in the segment: null values will still be auto-populated with the default value. The only change I am proposing is to have an additional presence vector that will allow users to skip rows where x is NULL by adding a predicate x != NULL. This makes sure there is no change in existing behavior (even though it's incorrect in some cases). Mayank is suggesting that we add this predicate automatically if x is a metric; that would fix the incorrect behavior. We can control this via a feature flag if needed.

Jackie-Jiang commented 5 years ago

Just adding the predicate x != NULL is not good enough. We need to plug the NULL handling logic into all predicates as well. E.g., for a NULL value, x = NULL, x != n (where n is not NULL), and x NOT IN (a, b, c) (where none of a, b, c is NULL) should return true; all other predicates should return false.
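
A sketch of those semantics (the enum and method are hypothetical, not Pinot's predicate evaluator; x = NULL is modeled as IS_NULL here):

```java
import java.util.Arrays;

public class NullAwarePredicateSketch {
  enum Op { EQ, NOT_EQ, IN, NOT_IN, IS_NULL, IS_NOT_NULL }

  // For a NULL value, only IS_NULL (covering "x = NULL"), NOT_EQ, and NOT_IN
  // (with non-NULL operands) match; every other predicate evaluates to false.
  static boolean matches(Op op, Integer value, int... operands) {
    if (value == null) {
      return op == Op.IS_NULL || op == Op.NOT_EQ || op == Op.NOT_IN;
    }
    switch (op) {
      case EQ:      return value == operands[0];
      case NOT_EQ:  return value != operands[0];
      case IN:      return Arrays.stream(operands).anyMatch(o -> o == value);
      case NOT_IN:  return Arrays.stream(operands).noneMatch(o -> o == value);
      case IS_NULL: return false;
      default:      return true; // IS_NOT_NULL
    }
  }

  public static void main(String[] args) {
    System.out.println(matches(Op.NOT_EQ, null, 5)); // true: NULL matches "x != 5"
    System.out.println(matches(Op.EQ, null, 5));     // false
    System.out.println(matches(Op.IS_NULL, null));   // true
  }
}
```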

kishoreg commented 5 years ago

@Jackie-Jiang yes.

Also, you logged an issue to document the behavior of default values (#4188). Having that will definitely help. Is anyone working on it?

sunithabeeram commented 5 years ago

@kishoreg bumping up one point which I would like to get explicit acknowledgement/agreement on.

Currently, all columns are nullable. If the user doesn't specify a defaultNullValue, we use "default" values (as described in this issue's description and confirmed by @mcvsubbu in one of the comments). The problem with the "default" defaultNullValues is that they can result in incorrect behavior for metrics or dimensions. I think it is important that this be corrected as we build presence vectors. Here is the behavior I would like considered going forward:

  1. Do not use default defaultNullValues. Use presence vectors to guide decisions in the rest of the execution phases as needed.
  2. If users specify a defaultNullValue, it is used when the data for the column is missing; the presence vector in this case does not indicate that the value is "missing".
  3. To maintain backward compatibility, existing schemas are updated to explicitly add the "default" defaultNullValues to the schema. (We may need some versioning info in the schema config to direct this correctly.)

If the above is not acceptable, it would be good to clarify why we think it's OK to continue with the existing default defaultNullValues in the face of incorrectness. And we should update the user-facing documentation so it's clear to users what they can expect.

Let me know what you think.

kishoreg commented 5 years ago

We should avoid asking users to upgrade schemas to maintain backward compatibility; it would be challenging operationally.

If I understand you correctly, what you are suggesting is basically the same as my original proposal of adding an isNullable field, except that instead of adding isNullable explicitly to the fieldSpec, you propose deriving it from the presence/absence of defaultNullValue. I prefer having isNullable explicit rather than derived from defaultNullValue.

It might be easier if we break this into two parts and look at the options on the table: segment generation and query execution.

Segment Generation (Creating Presence Vector): Options

  1. Create the presence vector by default for all columns. If the column is present and not null in the row, set the presence bit.
  2. Create the presence vector only when isNullable = true or createPresenceVector = true. By default, createPresenceVector = false.
  3. Same as 2, but derive it from the presence/absence of defaultNullValue.

Query Execution: Options

  1. Let users use x != NULL explicitly in the query to filter out rows with NULL values
  2. Pinot automatically uses the presence vector in the query execution as needed.

Note: we are not going to change what's stored in the forward index. We will always use the defaultNullValue (either the default defaultNullValue based on dataType or something the user specifies in the schema).

I prefer starting with option 2 for Segment Generation and option 1 for Query Execution. This will be backward compatible with current behavior and provide the flexibility to the user to achieve the behavior they want by changing the fieldSpec and query as needed.

We can always move to option 1 for Segment Generation by changing the default of createPresenceVector to true in phase 2. We can also move toward option 2 for Query Execution over time.

Jackie-Jiang commented 5 years ago

I would prefer to always assume all columns are nullable unless users explicitly mark them nonNull (which is very rare, but might be useful for the boolean type; a default null value would be required). For existing segments, leave them as is. For default columns, generate an empty presence vector if the column is nullable. For query execution, we should automatically filter out NULLs as needed, and also support explicit filters on NULLs.

siddharthteotia commented 5 years ago

In the long term, maybe we can consider leveraging Apache Arrow's in-memory columnar and wire format for our execution engine; then we would get support for nullability as part of moving to Arrow (not suggesting that we should move to Arrow just to support nullability, but bringing it up as something we may want to consider).

An Arrow vector is an off-heap data structure for a column of any particular type (it supports both primitive and complex types). The vector has two buffers in direct memory: a data buffer storing the column values, and a nullability buffer (a bitmap) indicating whether the corresponding column value is NULL.
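
For illustration, a small example of that layout using Arrow's Java library (arrow-vector), assuming an IntVector; setNull clears the validity bit while setSafe writes the value and sets it:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class ArrowNullabilitySketch {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator();
         IntVector vector = new IntVector("m1", allocator)) {
      vector.allocateNew(3);
      vector.setSafe(0, 110); // writes the value and sets the validity bit
      vector.setNull(1);      // clears the validity bit; the data slot is undefined
      vector.setSafe(2, 114);
      vector.setValueCount(3);
      System.out.println(vector.isNull(1)); // true
    }
  }
}
```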

kishoreg commented 5 years ago

@siddharthteotia Yes, at some point we would love to support various formats such as Parquet, ORC, etc., and Arrow will definitely be something we can consider. For now, Pinot's execution code is tightly coupled with the storage format, and bringing in Arrow would require a lot of changes.

icefury71 commented 5 years ago

@kishoreg One potential issue with Query Execution Option 1 (explicitly adding x != NULL) is when there are multiple aggregations, e.g. select sum(col1), avg(col2) where ... col1 != NULL AND/OR ... col2 != NULL.

Here we need to come up with the right filtering condition. Do you think making it implicit might be better?

kishoreg commented 5 years ago

@icefury71 Good point. I think the challenge is to come up with the right filtering condition. Whether the filtering is explicit or implicit is just a syntactic convenience.

icefury71 commented 4 years ago

Updating this thread with some of the offline discussions I had with a few folks (Devesh and @fx19880617). The proposal is to filter out null values in the form of predicates (at the leaf nodes of query execution). This has two potential issues:

  1. Incorrect results when transforms are used in aggregation functions. E.g., for MAX(ADD(A, B)): if either A or B is null, ADD should return NULL. However, if we filter out nulls very early, then we will return a non-null value here, which is not consistent with ANSI SQL.
  2. Multiple aggregation functions in the same query: this also might be inconsistent if we filter out the nulls very early in query execution.

It looks like a better approach might be to "tag" docIds as null or non-null and let the aggregation function (or another intermediate node) process them in the right manner (see the sketch below). However, this would affect the entire query execution flow and is a much bigger effort than what's proposed in this issue.

Thoughts ?
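
For illustration, a minimal sketch of what such null tagging could mean for a transform plus an aggregation, with boxed Java types standing in for tagged values; this is not Pinot's transform API:

```java
// Plain Java, boxed Integer standing in for a "tagged" nullable value.
final class NullAwareOps {
  // ADD propagates NULL (ANSI behavior): if either input is NULL, the result is NULL.
  static Integer add(Integer a, Integer b) {
    return (a == null || b == null) ? null : a + b;
  }

  // MAX skips NULL inputs instead of aggregating a default value.
  static Integer max(Integer... values) {
    Integer result = null;
    for (Integer v : values) {
      if (v != null && (result == null || v > result)) {
        result = v;
      }
    }
    return result;
  }
}
```

With early filtering on A alone, MAX(ADD(A, B)) would see ADD computed from a default value when B is null; with tagging, add returns null and max skips it.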

kishoreg commented 4 years ago

@icefury71 Good questions. I don't have a good answer. Let's create a separate issue on how to use presence vectors in the query execution layer. At a high level, there are 3 possible options.

For now, let us focus on supporting option 1; we can start the design discussion for options 2 and 3.

icefury71 commented 4 years ago

Opened PR: https://github.com/apache/incubator-pinot/pull/4585

CurriedFloat commented 3 years ago

I'm trying to plot a timeseries with gaps showing null/missing data.

Trying to avoid doing:

  • Caller-side mask on the data and the NULL value vector.
  • Caller-side mask using some defaultNullValue.

Can this be done in the query?

icefury71 commented 3 years ago

> I'm trying to plot a timeseries with gaps showing null/missing data.
>
> Trying to avoid doing:
>
>   • Caller-side mask on the data and the NULL value vector.
>   • Caller-side mask using some defaultNullValue.
>
> Can this be done in the query?

I'm not sure I understand your question. I'm assuming you want to filter out null values in your query response; does column IS NOT NULL not work for you?

Or do you want this to happen automatically without specifying such predicates?

CurriedFloat commented 3 years ago

No, I want to include null values, but as null rather than as the defaultNullValue, so I can distinguish the two cases.

Say defaultNullValue is zero, nullHandling is enabled, and some points are actually zero:

t: 1  v: 110
t: 2  v: 0     <- actually zero
t: 3  v: 114
t: 4  v: null  <- null
t: 5  v: 113

So I can make a plot with gaps: https://codepen.io/curriedfloat/pen/WNprwWP

icefury71 commented 3 years ago

Unfortunately, that's not possible with the current implementation. The current implementation will ignore nulls or treat them as default values in the filter operator; everything else in query processing currently does not care about nulls. It is definitely in the scope of this issue to ensure all query processing layers can handle null values; however, that hasn't been done yet.

monicaluodialpad commented 1 year ago

Hi (just to summarize this): in this case, if a column is of INT type and we would like to check for non-NULL values, does it mean we need to do

column IS NOT NULL AND column != 0

Jackie-Jiang commented 1 year ago

@monicaluodialpad If the purpose is just to exclude the null values, then with null handling enabled, column IS NOT NULL is enough; it won't filter out the actual 0s.