antonmks / Alenka

GPU database engine
Other
1.17k stars 120 forks source link

Is there in AlenkaDB such a thing as NULL? #21

Open AlexeyAB opened 11 years ago

AlexeyAB commented 11 years ago

Hello Anton. Is there in AlenkaDB such a thing as NULL? I get an exception, when I load the data from the pipe-delimited files:

1388582970|152470|2025211056|14.09.2012|0|0|0|0|0|0|0|0|0|0|196407|ABC||||||

This variant works fine:

1388582970|152470|2025211056|14.09.2012|0|0|0|0|0|0|0|0|0|0|196407|ABC|6|5|4|3|2|1

And if NULL is not implemented, then the quickest solution would be to use the maximum value possible for this type of field, when downloading from the pipe-delimited file.

antonmks commented 11 years ago

Hi Alexey ! So far there are no NULLs in Alenka. I agree that it would be a good idea to have NULLs in database and that storing NULLs as maximum values would be the best way to implement it. However it would complicate operations on numbers - for arithmetic operations we would have to check for NULL values before adding or multiplying vectors. It also would be necessary to implement functions (like NVL in Oracle ) to convert NULLS to some values. So probably it is not critical to implement NULLS right now - for now NULL values become 0's and empty strings. I would rather focus now on improving performance and may be getting some real projects to use Alenka. Btw, the exception that you got from loading files is a bug, so I fixed it.

Best regards,

Anton

AlexeyAB commented 11 years ago

Hi Anton! Big thanks for your fix! Yes, I am now testing on a small part of the real data from the customer's project and faced with such difficulties. For testing right now on a real project hampered by the absence of the free GPU at the customer at this moment. But I can download a small part of data, that will skipped the automatic security system to test the project on my computer.

You are right, that it would complicate operations on numbers. But when joining tables the maximum value will be much less likely to coincide with the real values, ​​than it will happen with 0. And in this case we can't solve this problem.

Please, could you implement it, an example through #define - by switching between 0 and a maximum values, for integers and decimals? And NULL strings also always remain empty - all chars are 0.

In this case, for arithmetic operations on numbers, the maximum value converts to zero through easily implementing functions Oracle decode/nvl:

SQL: select sum(nvl(field, 0)) from table;

C++: nvl(field, 0);

template <typename T>
__host__ __device__ inline T decode(T &field, T const& compare_val, T const& then_val, T const& else_val)
{
    return (field == compare_val)? then_val: else_val;
}

template<typename T>
__host__ __device__ inline T nvl(T &field, T const& then_val)
{
    decode(field, NULL_VAL, then_val, field);
}
antonmks commented 11 years ago

Hi Alexey ! You are right about joining on NULL values - they can conflict with existing valid 0 values. I will have to modify a bit the compression routines which check min and max values of vectors and also the data load procedure and arithmetic calculations as well. I suppose that you still want NULLS to be converted to 0's in calculations ? That is, 1 + 3 + NULL = 4 ? Or the Oracle way, 1 + 3 + NULL = NULL ?

Anton

AlexeyAB commented 11 years ago

Hi Anton! If this is not difficult, it is better as in Oracle: 1 + 3 + NULL = NULL

  1. In expression behaves as maximum value, in summation behaves as overflow and saves its maximum value, i.e. 1 + 3 + NULL = NULL.
  2. In the aggregate functions behaves as 0. An example sum(field) equal to: sum(nvl(field, 0)). And count(field) equal to: sum(decode(field, NULL, 0, 1)).
  3. In JOIN behaves as maximum value. But NULL must not match in join. (An example, if using sort-merge-join and ordered-group-by, then before this, sorted vectors must reduce size to exclude maximum values.)

Behavior as in Oracle in the examples:

select 2 + NULL from dual;

// result: NULL

with t1 as (
    select NULL as f1, 'a' as f2 from dual 
    union all
    select 2 as f1, 'b' as f2 from dual
)
select sum(f1) from t1;

// result: 2

with t1 as (
    select NULL as f1, 'a' as f2 from dual 
    union all
    select 2 as f1, 'b' as f2 from dual
 ), t2 as (
     select NULL as f1, 'a' as f2 from dual 
    union all
    select 2 as f1, 'b' as f2 from dual
)
select * from t1 join t2 on t1.f1 = t2.f1;

// result: 2 b 2 b

Thanks for your future fixes!

andrebaaij commented 10 years ago

Hi,

I came across the discussion, and I love to see how this project will turn out. AlexeyAB correctly explains how nulls should be handled. I just wanted to add that wikipedia has a great article on how nulls are handled: http://en.wikipedia.org/wiki/Null_(SQL)#Arithmetic_operations it references to the ISO/IEC (2003). ISO/IEC 9075-2:2003.