antonmks / Alenka

GPU database engine
1.17k stars 120 forks

Problems with loading data #3

Closed bmanola closed 11 years ago

bmanola commented 11 years ago

Dear Anton,

First, I want to say this is a remarkable project. Analyzing large datasets using GPUs is a great idea.

I have a problem with a SELECT statement. I load a file (around 30M rows) into a BINARY structure. When I load this binary structure and run the basic SQL D := SELECT sccode AS sccode1, product AS product1, SUM(sale) AS sale_sum FROM A GROUP BY sccode, product; I get fewer rows than the same SQL run on the database: 76K instead of 500K. Some products that I am sure exist in bigtable in the DB are missing.

Stranger still, when I use FILTER product <= 100000 (the exact number is not important; the maximum product code is 40000) I get around 160K.

Can you tell me what is wrong with my SQL statement?

For loading data i use A := LOAD 'bigtable.csv' USING (',') AS (uniqueid{1}:int, ccode{2}:varchar(10), acode{3}:varchar(10), sccode{4}:varchar(10), supplier{5}:int, product{6}:int, sale{15}:decimal); STORE A INTO 'bigtable' BINARY;
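To cross-check whether the GPU result is dropping groups, a small CPU-side reference of the same GROUP BY / SUM semantics can help. The sketch below is hypothetical (it is not part of Alenka); it just computes the number of distinct (sccode, product) groups and the per-group sale totals that the query above should return.

```python
from collections import defaultdict

def group_sum(rows):
    """Aggregate sale by (sccode, product), mirroring
    SELECT ... SUM(sale) ... GROUP BY sccode, product."""
    sums = defaultdict(float)
    for sccode, product, sale in rows:
        sums[(sccode, product)] += sale
    return sums

# Toy data: two groups, so the result should have exactly 2 rows.
rows = [("A", 1, 10.0), ("A", 1, 5.0), ("B", 2, 3.0)]
result = group_sum(rows)
print(len(result))          # 2
print(result[("A", 1)])     # 15.0
```

Running the real CSV through a reference like this gives the expected row count (the ~500K figure) to compare against the GPU output.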

Best Regards,

antonmks commented 11 years ago

Hello! Sorry for the bugs; for the last few days I have been working on fixing the problem. I will post the update on GitHub today; let me know if it works for you. Also, I should probably add to the manual that you always need a COUNT(field) in a GROUP BY statement. So try it like this: D := SELECT sccode AS sccode1, product AS product1, SUM(sale) AS sale_sum, COUNT(product) AS cnt FROM A GROUP BY area, product;

If it doesn't work let me know and I will try to reproduce the problem.
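The semantics of the suggested workaround (carrying both SUM(sale) and COUNT(product) per group) can be sketched on the CPU like this; the helper below is illustrative only, not Alenka code.

```python
from collections import defaultdict

def group_sum_count(rows):
    """Per (sccode, product) group, accumulate both the sale total
    and the row count, mirroring SUM(sale) plus COUNT(product)."""
    agg = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for sccode, product, sale in rows:
        entry = agg[(sccode, product)]
        entry[0] += sale
        entry[1] += 1
    return agg

rows = [("A", 1, 10.0), ("A", 1, 5.0), ("B", 2, 3.0)]
agg = group_sum_count(rows)
print(agg[("A", 1)])  # [15.0, 2]
```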

Regards,

Anton


bmanola commented 11 years ago

Hi, Anton,

I am sending you the output in case it helps:

Process count = 6200000
BINARY LOAD: A bigtable
SELECT D A
cycle 0 select mem 851771392
final select 81233
select time 3.22
STORE: D mytest.txt
SQL scan parse worked
cycle time 3.534

I noticed that the process makes only one cycle instead of 4 (or 5).

The SQL for selecting is the same as you suggested.
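The expected cycle count is simple arithmetic, assuming Alenka streams the table in fixed-size chunks of "Process count" rows (an assumption based on the log above, not on the source code): roughly 30M rows at 6,200,000 rows per cycle should take 5 cycles, not 1.

```python
import math

total_rows = 30_000_000  # approximate table size from the report
chunk_rows = 6_200_000   # "Process count" from the log
cycles = math.ceil(total_rows / chunk_rows)
print(cycles)  # 5
```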

Regards,

antonmks commented 11 years ago

I think I have fixed it now. Sorry for the bug; just not enough testing on my part! I generated 30 million records and the query takes exactly 2 seconds on my GTX 580.