antonmks / Alenka

GPU database engine

Load Append data or Union Data #18

Closed georgezhlw closed 10 years ago

georgezhlw commented 11 years ago

Anton, we generally need to append/insert new data into an existing table. For this to work, we either need to append the new data to an existing file when storing it, or store it in a new file and UNION the files together when loading. Does Alenka have such a feature? Thanks, George
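To illustrate the second alternative (one file per load, unioned at read time), here is a minimal sketch. It is not Alenka's storage format or API: the file naming scheme `<column>.<batch_id>`, the 64-bit integer column type, and the helper names `store_batch` and `load_union` are all assumptions made for the example.

```python
from array import array
from pathlib import Path


def store_batch(directory, column, batch_id, values):
    """Write one batch of a column to its own file (hypothetical layout)."""
    path = Path(directory) / f"{column}.{batch_id}"
    with open(path, "wb") as f:
        array("q", values).tofile(f)  # "q" = signed 64-bit integers
    return path


def load_union(directory, column):
    """Load every batch file of a column and concatenate them in batch order."""
    batches = sorted(
        Path(directory).glob(f"{column}.*"),
        key=lambda p: int(p.suffix[1:]),  # numeric order: 0, 1, ..., 10
    )
    result = array("q")
    for path in batches:
        with open(path, "rb") as f:
            result.frombytes(f.read())
    return result
```

With this layout a daily load only writes a new file, and existing files are never rewritten; the cost moves to load time, where all batches are read and concatenated.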

antonmks commented 11 years ago

Hi George, this is the next feature I plan to add to Alenka: support for insert/update/delete operations.

georgezhlw commented 11 years ago

Anton, if we can implement UNION before the IUD operations, it'll help a lot, as the former is easier. Regards, George

antonmks commented 11 years ago

George, do you have a project that requires a union?

georgezhlw commented 11 years ago

Anton, since we need to add data every day, if there is no append support, then we need union as an alternative. Thanks, George


antonmks commented 11 years ago

I really don't know, guys. I would prefer to implement solid support for IUD operations, and that would take a couple of months. So it should be ready in October-November, because I usually take the summer off. Can you tell me more about your project?

georgezhlw commented 11 years ago

Anton, Alenka is great software and I believe it will continue to improve. For my project, I need better performance. For example: joining a 500-million-row table with a 12-million-row table on the id column and selecting the top 10 counts of another column took 15 s on an Nvidia Titan. For now, I am developing my own GPU solution. The key is to have a server daemon running, so that I can pre-allocate all required GPU memory. I also don't use Thrust, to keep management simple. Regards, George


antonmks commented 11 years ago

Hi George! I looked at your SQL script and tried running it on similarly sized tables. The majority of the time is spent copying the result of the join to the host and then back to the GPU (for the group operation). 500 million records * 40 bytes = 20 GB. Using non-pinned host memory, copying the records to the host and back would take at least 12-15 seconds. One obvious way to improve performance would be to implement a combined join and group-by operator: select ... from A JOIN B on ... GROUP BY .... This way we can avoid copying the data back and forth.

Regards,

Anton
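The estimate and the proposed operator can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Alenka's implementation: the function names are hypothetical, the 3 GB/s transfer rate used in the usage note is an assumed figure for pageable host memory (the thread gives only the 12-15 second result), and the keys of table B are assumed to be unique.

```python
from collections import Counter


def round_trip_seconds(rows, bytes_per_row, gb_per_second):
    """Time to copy a join result to the host and back to the GPU."""
    total_bytes = 2 * rows * bytes_per_row  # device -> host, then host -> device
    return total_bytes / (gb_per_second * 1e9)


def join_group_count(big_rows, small_keys, top_n=10):
    """Fused JOIN + GROUP BY: count group values of matching rows
    without ever materializing the joined rows.

    big_rows   -- iterable of (join_key, group_value) pairs (table A)
    small_keys -- iterable of join keys from table B (assumed unique)
    """
    key_set = set(small_keys)       # build side of the join
    counts = Counter()
    for key, value in big_rows:     # probe side, aggregated on the fly
        if key in key_set:
            counts[value] += 1
    return counts.most_common(top_n)
```

Under the assumed 3 GB/s rate, `round_trip_seconds(500_000_000, 40, 3.0)` comes to roughly 13 seconds, which falls inside the 12-15 second range quoted above. The fused function only ever holds the key set and the per-group counters, so the 20 GB intermediate result never has to leave the device in the first place.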

georgezhlw commented 11 years ago

Anton, thanks for the testing and analysis. I totally agree with your solution. I tried to implement a similar join/group-by: prepare a mask of 500 million entries, join it with the 12 million keys to get a mask, then join that with the 500 million keys to get the final mask, and based on this mask do the group-by on the value column directly. If all 3 columns are int, it finishes in about 2 seconds, including copying the data. Cheers, George
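One possible reading of the mask approach described above, sketched on the CPU in plain Python. The details are assumptions: the join keys are taken to be dense non-negative integers below `key_space`, and the function names are made up for the example. George's actual GPU code is not shown in the thread.

```python
from collections import Counter


def build_key_mask(small_keys, key_space):
    """Mark every key that occurs in the small table (one byte per possible key)."""
    mask = bytearray(key_space)
    for k in small_keys:
        mask[k] = 1
    return mask


def row_mask(big_keys, key_mask):
    """Final mask: one flag per row of the big table, set if its key matched."""
    return bytearray(key_mask[k] for k in big_keys)


def masked_group_count(values, mask, top_n=10):
    """Group-by count on the value column, restricted to masked rows."""
    counts = Counter(v for v, m in zip(values, mask) if m)
    return counts.most_common(top_n)
```

For example, with small-table keys `[1, 3]`, big-table keys `[1, 2, 3, 1, 4]`, and values `[7, 8, 7, 7, 9]`, the row mask selects rows 0, 2, and 3, and the group count is `[(7, 3)]`. Each step is a simple lookup or scatter over integer arrays, which is the kind of operation that maps well to a GPU kernel.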


antonmks commented 10 years ago

Done.