antonmks / Alenka

GPU database engine

Exceptions arise when joining or filtering tables with a table size of 1000000 rows #25

Closed AlexeyAB closed 11 years ago

AlexeyAB commented 11 years ago

Hi Anton! Exceptions very often arise when joining or filtering tables. This happens for any value of the process count. For example, with a table size of 1000000 rows, there is an uncaught exception and the program exits:

E := LOAD 'table3' BINARY AS (event_type{1}:varchar(50));

EF := FILTER E BY event_type = "Change1" OR event_type = "Change2" 
    OR event_type = "Change3";

And a second example: with 1000000 rows in each of the two tables, when I attempt a JOIN I see this error:

D := LOAD 'table1' BINARY AS (f1{1}:varchar(100), d_id{10}:int );
B := LOAD 'table2' BINARY AS (b_id{6}:int);

EDCB := SELECT
         f1 AS f1
     FROM B JOIN D
        ON b_id = d_id;

F:\GPU_test_DWH>AlenkaDB.exe select2.sql
Process count = 6200000
BINARY LOAD: D table1
Reading 990980 records
BINARY LOAD: B table2
Reading 983510 records
JOIN EDCB b_id d_id
Cuda error in file 'C:/VSProject/AlenkaDB/cm.cu' in line 922 : unknown error.

Best regards, Alexey

antonmks commented 11 years ago

Do you have string lengths of 50 and 100 defined in strings.cu? And UNROLL COUNT?

Anton

AlexeyAB commented 11 years ago

Hi Anton. Yes, I have string lengths 50 and 100 defined. Sorry about the second example, that was my mistake in the field numbers.

And in the first example the bug happens in the following simple case.

Take the simplest file e.txt with 4 lines, each containing the string abc.

Try to run this SQL:

E := LOAD 'e.txt' USING ('|') AS (event_type{1}:varchar(20));

EF := FILTER E BY event_type == "abc";

STORE EF INTO 'res.txt' USING ('|') LIMIT 10;

I get an exception.

And this SQL works:

E := LOAD 'e.txt' USING ('|') AS (event_type{1}:varchar(20));

STORE E INTO 'res.txt' USING ('|') LIMIT 10;
But res.txt contains 4 lines like this:

Additionally, in the TPC-H file q2m.sql you can change this line:

RF := FILTER R BY r_name == "EUROPE";

to this:

RF := FILTER R BY r_name == "EUROPE.";

or to this:

RF := FILTER R BY r_name == "EUROP";

And you get an exception in the JOIN: J_PS := SELECT ... This happens for the 10GB and 100GB TPC-H data sets.

Alexey.

antonmks commented 11 years ago

Seems like this is caused by a bug in the join operation - it incorrectly handles empty datasets. I fixed it, so q2m.sql works now.

For now, Alenka doesn't work directly with text files - the only thing you can do with a text file is load it into a binary file. So please use the binary data files to run the SQL commands. It wouldn't be difficult to support text files, but it doesn't make sense from a performance standpoint because the speed would always be bound by text-file parsing.

AlexeyAB commented 11 years ago

Thanks for the fix. You are right, working directly with text files is not a priority feature. But currently I get an empty result even when working with binary files. For example, again we have the simplest file e.txt with 4 lines:

abc
abc
abc
abc

Run from the command line:

AlenkaDB.exe load.sql
AlenkaDB.exe filter.sql

load.sql

E := LOAD 'e.txt' USING ('|') AS (event_type{1}:varchar(20));
STORE E INTO 'e' BINARY;

filter.sql

E := LOAD 'e' BINARY AS (event_type{1}:varchar(20));
EF := FILTER E BY event_type == "abc";
STORE EF INTO 'res.txt' USING ('|') LIMIT 10;

res.txt is empty. File size = 0.

Now if I change filter.sql to this:

E := LOAD 'e' BINARY AS (event_type{1}:varchar(20));
STORE E INTO 'res.txt' USING ('|') LIMIT 10;

res.txt now contains:

|
|
|
|
antonmks commented 11 years ago

Fixed. Let me know if it is not working for you.

Regards, Anton

AlexeyAB commented 11 years ago

Thanks, it's working. But now I cannot save intermediate data in binary form. Take the previous example, but this time at the end we want to store the result as BINARY instead of pipe-delimited text. Again we have the simplest file e.txt with 4 lines:

abc
abc
abc
abc

Run from the command line:

AlenkaDB.exe load.sql
AlenkaDB.exe filter.sql

load.sql

E := LOAD 'e.txt' USING ('|') AS (event_type{1}:varchar(20));
STORE E INTO 'e' BINARY;

filter.sql

E := LOAD 'e' BINARY AS (event_type{1}:varchar(20));
EF := FILTER E BY event_type == "abc";
STORE EF INTO 'ef' BINARY;

I get these messages:

Process count = 6200000
BINARY LOAD: E e
Reading 4 records
FILTER EF E 583483392
MAP CHECK segment 0 R
filter is finished 4 582434816
filter time 0.798 583483392
STORE: EF ef
LOADING
SQL scan parse worked
cycle time 0.802

But no files are created and the EF table is not saved.

Why is this critical for me? When working with real data I'm trying to work around the lack of GPU memory by using intermediate temporary binary tables, because in 90% of my test queries the program crashes with an exception.

Best regards, Alexey

antonmks commented 11 years ago

This is the part that is missing right now. I understand that being able to store results as binary data is very important, so I'll get on it as soon as possible. Basically my schedule right now is as follows:

1. Change the join implementation to a sorted merge - it is much more convenient for me and there is no need to use CUDPP.
2. Add support for nulls.
3. Add support for storing binary results.
4. Fix a lot of other bugs.
5. Add insert/update/delete operation support.

Best regards,

Anton

AlexeyAB commented 11 years ago

Ok. But I still have a big request: please solve the problem of insufficient GPU RAM if you can. I cannot use AlenkaDB on any real queries for that reason :)

If you are going to use a sort-merge join, and if you move the sorting for all data types into separate files, as is already done for strings (strings_sort_host.cu & strings_sort_device.cu), then I could implement a hybrid sort: sort chunks on the GPU while merging them (std::inplace_merge / thrust::merge) in parallel on the CPU. This is needed because of the lack of GPU RAM. It would solve both the problem of insufficient GPU RAM and the problem of the slow CPU.
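Roughly what I have in mind, as a self-contained sketch (plain Thrust on int data; the function name and structure are only illustrative, not Alenka code):

// Sketch: sort GPU-sized chunks on the device, then merge the sorted chunks on the host.
// The chunk size is whatever fits in GPU memory; the host merges could run in parallel
// with the remaining device sorts.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstddef>
#include <vector>

void hybrid_sort(std::vector<int>& data, size_t chunk_size)
{
    // 1. Sort each chunk on the GPU.
    for (size_t off = 0; off < data.size(); off += chunk_size) {
        size_t len = std::min(chunk_size, data.size() - off);
        thrust::device_vector<int> d(data.begin() + off, data.begin() + off + len);
        thrust::sort(d.begin(), d.end());
        thrust::copy(d.begin(), d.end(), data.begin() + off);
    }
    // 2. Merge the already-sorted chunks on the CPU.
    for (size_t merged = chunk_size; merged < data.size(); merged += chunk_size) {
        size_t end = std::min(merged + chunk_size, data.size());
        std::inplace_merge(data.begin(), data.begin() + merged, data.begin() + end);
    }
}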

The same could be done for GROUP BY and JOIN, for all types and for GPU or hybrid (CPU/GPU) execution. At the moment, as I understand it, only the sorting of strings can be performed on the host - or can GROUP BY and JOIN for all types also run on the host?

Are you going to take a break from working on the project for the summer?

antonmks commented 11 years ago

GPU memory limits shouldn't be an issue because all the processing is done on pieces small enough to fit in GPU memory. It is just that I have 3GB of GPU memory and the partitioning logic is probably off somewhere, so it fails on a GPU with less memory. Can you give me a couple of queries that fail specifically because the GPU is out of memory? I would try to run them and see what I can do. Oh, by the way, all types can be sorted on the host, not just strings.

Regards, Anton

AlexeyAB commented 11 years ago

Here is why I think the problem is a lack of GPU RAM. I watch GPU-Z 0.5.5, Sensors -> Memory Used (GPU-Z can be downloaded for free from the Internet). For example, if I have many tabs open in Chrome, MS Outlook and other programs running, it shows Memory Used: 450 MB of GPU RAM - and q1.sql of TPC-H 10GB fails with an error on the D := SELECT ... GROUP BY. At that moment 550 MB of GPU RAM and 4 GB of CPU RAM are free. If I close some programs, Memory Used drops to 193 MB, and the same exception occurs. Only when I close all programs and Memory Used is 129 MB does the query run well. At that moment 871 MB of GPU RAM and 5.5 GB of CPU RAM are free.

GPU-Z also shows a graph of used GPU RAM (though in a small window), where peaks of 100% Memory Used are visible.

Of course, on a real server the GPU Memory Used will be near 0. But if this problem occurs now with a small volume of data, then on the server it will occur with a slightly larger volume.

Unfortunately, for my real data even the best case of Memory Used at 129 MB (871 MB free) is not enough. And I cannot send the data, because it is real financial data belonging to the customer. But the problem is reproducible with q1.sql of TPC-H 10GB, as described above.

Regards, Alexey

antonmks commented 11 years ago

Well, I can suggest creating a database with smaller segments:

alenka.exe -l 3000000 load_lineitems.sql

This will create data files with 3000000 records per segment, as opposed to the default value of 6000000 records. You can play with this parameter and see what the optimal segment size is for your GPU.

Regards, Anton

AlexeyAB commented 11 years ago

Thanks. This helped in the case of the TPC-H lineitem table, but did not help with my real data. Best regards, Alexey

AlexeyAB commented 11 years ago

Some additional questions. Does the -l option affect only the loading of data from a text file into a binary file, or other cases too? How is the data stored in RAM after, for example, a JOIN of tables with different segment sizes? And when that result is grouped, what is the size of the pieces that are processed in GPU RAM?

Best regards, Alexey

antonmks commented 11 years ago

-l affects only the loading of data from a text file into a binary file. After a join, the data is stored in host memory, uncompressed, in one big piece. When grouping, Alenka estimates how much GPU memory it needs to aggregate the data, divides the source data into pieces, and then processes every piece. The number of pieces is calculated in the setSegments function in the cm.cu file. You can change the current value to a bigger one if you think this might be the problem.
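Just to illustrate the idea (this is a simplified sketch, not the actual code in cm.cu; the names and constants are made up): the number of pieces is essentially the estimated working-set size divided by the GPU memory that is currently free.

// Illustrative sketch only: choose a piece count so that each piece fits into
// a fraction of the free GPU memory reported by the CUDA runtime.
#include <cuda_runtime.h>
#include <cstddef>

size_t estimate_piece_count(size_t record_count, size_t bytes_per_record)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);        // GPU memory free right now

    size_t budget = free_bytes / 2;                   // leave headroom for temporaries
    if (budget == 0) budget = 1;
    size_t needed = record_count * bytes_per_record;  // rough working-set estimate

    size_t pieces = (needed + budget - 1) / budget;   // ceil(needed / budget)
    return pieces == 0 ? 1 : pieces;
}

Forcing a larger piece count (a smaller per-piece budget) is the kind of change I meant by making the current value bigger.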

Best regards, Anton

AlexeyAB commented 11 years ago

Hi Anton! I sent an email to antonmks@gmail.com with an example of a script where the error occurs during grouping.

The -l parameter does not solve the problem. Also, if it is set below 500000, an error occurs even when loading data from a text file into a binary file.

Best regards, Alexey

antonmks commented 11 years ago

I didn't receive the email ...

Regards, Anton

AlexeyAB commented 11 years ago

Ok. I re-sent it. Best regards, Alexey

antonmks commented 11 years ago

I sent you an email.

Best regards, Anton