antonmks / Alenka

GPU database engine

Data partitioning architecture #106

Open archenroot opened 7 years ago

archenroot commented 7 years ago

First of all good stuff here!

I have a question about how Alenka is designed to partition data across multiple devices. Suppose I have a big GPU cluster. Does it work by making some "super" partitions, each assigned to one device, and then, while running parallel queries, each "super" partition is "mini" partitioned for parallel execution within a specific GPU?

I am more or less experimenting with this code and with CUDA in general at the moment, so sorry if the idea is not accurate.

Thank you.

antonmks commented 7 years ago

No, for now Alenka can use only a single GPU. But it shouldn't be difficult to modify it to partition the data and process it on multiple GPUs!
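A minimal sketch of such a modification, assuming nothing about Alenka's internals: one host thread per device, with segments dealt out round-robin. `process_segment` is a hypothetical stand-in for the per-segment query work.

```cpp
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical stand-in: in a real engine this would copy the segment's
// columns to the current device and launch the query kernels.
void process_segment(int segment_id) { (void)segment_id; }

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    const int total_segments = 64;  // assumed segment count
    std::vector<std::thread> workers;

    for (int dev = 0; dev < device_count; ++dev) {
        workers.emplace_back([=] {
            cudaSetDevice(dev);  // bind this host thread to one GPU
            // Round-robin: device d owns segments d, d+N, d+2N, ...
            for (int seg = dev; seg < total_segments; seg += device_count)
                process_segment(seg);
        });
    }
    for (auto& t : workers) t.join();
    // A real implementation would still need a final merge/reduce step
    // (e.g. combining partial GROUP BY aggregates) on the host or one device.
}
```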


archenroot commented 7 years ago

Ok, clear. Thank you very much.

archenroot commented 7 years ago

Can you please suggest where and how to implement the partitioning? Another idea that came to mind is data sharding, or a global cluster configuration.

If you can point out not only ideas but the actual places in the code where the functionality could be implemented, that would be helpful.

I see big potential in this application, and I am interested in doing some comparisons between Alenka and in-memory engines (Redis, etc.).

Thanks a lot!

antonmks commented 7 years ago

Keep in mind that Alenka is pretty experimental; it is not suitable for production use at this point. Concerning partitioning, you could write a program that runs modified Alenka instances on different nodes and collects the results at a master node. Or something like that.
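One way to picture that "modified instances plus a master collector" idea is an MPI-style scatter/gather. This is only a sketch under that assumption, with `run_local_query` as a hypothetical hook into a node-local Alenka instance:

```cpp
#include <mpi.h>
#include <cstdio>

// Hypothetical: run the query plan against this node's data shard and
// return a partial aggregate (e.g. a local SUM).
double run_local_query() { return 42.0; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double partial = run_local_query();
    double total = 0.0;
    // Combine partial aggregates at the master (rank 0); for non-aggregate
    // queries you would gather the result rows instead (MPI_Gatherv).
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("global result: %f\n", total);
    MPI_Finalize();
}
```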

archenroot commented 7 years ago

I understand that it will require effort to implement this feature; I am more or less asking for your opinion.


hurdad commented 7 years ago

I was working on using Hadoop HDFS as the data storage; this would allow multiple nodes to query the same dataset.

archenroot commented 7 years ago

Thanks Alexander,

well, what I am interested in is dynamic partitioning of the data, which could be seen as a grouping of segments.

Imagine I have a table DATA with 2 columns:

Type  Value
A     something
A     something else
B     something else
B     something else

Imagine each type has 1 billion records. What I am interested in is creating 2 groups of segments, one for type A and another for type B.

And when a query comes, like select * from DATA where type = 'A', I will pick up only the segments registered with A. This will of course require more fine-grained handling of segments than happens at the moment. I would also like to store each partition on a different GPU cluster (just as an ability to tune performance).
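A minimal sketch of that segment-grouping idea (hypothetical, not Alenka's actual segment format): keep per-segment metadata keyed by the partition value, so a filter like type = 'A' only ever touches segments registered under 'A'.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct SegmentMeta {
    int id;
    int gpu_node;  // which GPU node/cluster holds this segment
};

// Partition value -> segments containing only rows with that value.
std::map<std::string, std::vector<SegmentMeta>> segment_index;

void register_segment(const std::string& type, int id, int node) {
    segment_index[type].push_back({id, node});
}

int main() {
    register_segment("A", 0, 0);
    register_segment("A", 1, 0);
    register_segment("B", 2, 1);

    // select * from DATA where type = 'A': prune everything but the 'A' group.
    for (const auto& seg : segment_index["A"])
        std::printf("scan segment %d on node %d\n", seg.id, seg.gpu_node);
}
```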

Ladislav


hurdad commented 7 years ago

So, like you could partition by date and have data from each day on different servers?

archenroot commented 7 years ago

Exactly, though this needs more brainstorming. In the case of monthly partitions, I would like to have the whole year stored on one node. Just a dummy example of a generic approach.

Sent from mobile, Ladislav


hurdad commented 7 years ago

One issue I see is that you would need to pre-sort the data you're loading by your partition field when creating the segments.

Anyway, the main reason I was pursuing a clustered/distributed approach was to allow for concurrent queries on the dataset. Currently Alenka only supports a single query per GPU.
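A sketch of that pre-sorting/bucketing step, under the assumption that the loader can buffer rows per partition value and emit a segment whenever a buffer fills, so every segment holds rows for exactly one value (the row type and segment size here are placeholders):

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

constexpr size_t kSegmentRows = 4;  // assumed segment size; real segments hold millions of rows

std::map<std::string, std::vector<std::string>> buffers;  // partition value -> pending rows

void flush_segment(const std::string& key, std::vector<std::string>& rows) {
    std::printf("writing segment for key '%s' with %zu rows\n", key.c_str(), rows.size());
    rows.clear();
}

void load_row(const std::string& key, const std::string& row) {
    auto& buf = buffers[key];
    buf.push_back(row);
    if (buf.size() == kSegmentRows)  // segment is full: write it out
        flush_segment(key, buf);
}

int main() {
    const char* keys[] = {"A", "B", "A", "A", "B", "A", "A"};
    for (const char* k : keys) load_row(k, "row");
    for (auto& [key, buf] : buffers)  // flush partial segments at end of load
        if (!buf.empty()) flush_segment(key, buf);
}
```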

archenroot commented 7 years ago

Exactly, good point, but these are 2 different things we need to target when we jump into clustering:

  1. Create a network stack around the GPU database so clients can reach the single process on each GPU node. I am considering using Netty (Java) to build the network layer around gpudb, or going pure C++ with Facebook's Netty-like library, Wangle (https://github.com/facebook/wangle). You then need a load balancer and cluster control, which requires monitoring which GPU is working on which type of task; in particular you need to know how much memory is available, etc. There are 2 frameworks I am looking into for multi-GPU CUDA: CUDA-aware MPI and rCUDA.

  2. Make multiple queries execute in parallel. The current design is completely unaware of this. Also, segments are usually created based on the size of the extracted data to be processed (to limit the number of segments and therefore the number of offloads), so when one query executes, the VRAM of one GPU is used to its limit. Maybe an Nvidia Tesla with 32 GB of RAM (16 per GPU core) could give more room to create segments smaller than VRAM. Then we would need to implement CUDA streams, whose concurrent scheduling Hyper-Q improved from the Kepler architecture onward (correct me if I am wrong here); see the streams sketch after this list.
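A hedged sketch of point 2: if segments are sized well below VRAM, the scans of two independent queries can share one GPU via CUDA streams, letting the device overlap their work. The filter kernel and the managed-memory setup are illustrative, not Alenka code.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Toy filter: count rows where the column equals `key`.
__global__ void scan_kernel(const int* col, int n, int key, int* hits) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && col[i] == key) atomicAdd(hits, 1);
}

int main() {
    const int n = 1 << 20;
    int *col, *hits;  // unified memory keeps the sketch short
    cudaMallocManaged(&col, n * sizeof(int));
    cudaMallocManaged(&hits, 2 * sizeof(int));
    for (int i = 0; i < n; ++i) col[i] = i % 4;
    hits[0] = hits[1] = 0;

    cudaStream_t s[2];
    for (int q = 0; q < 2; ++q) cudaStreamCreate(&s[q]);

    // Each "query" runs in its own stream, so the GPU may overlap them.
    scan_kernel<<<(n + 255) / 256, 256, 0, s[0]>>>(col, n, 1, &hits[0]);
    scan_kernel<<<(n + 255) / 256, 256, 0, s[1]>>>(col, n, 2, &hits[1]);
    cudaDeviceSynchronize();

    std::printf("query 1 hits: %d, query 2 hits: %d\n", hits[0], hits[1]);
    for (int q = 0; q < 2; ++q) cudaStreamDestroy(s[q]);
    cudaFree(col);
    cudaFree(hits);
}
```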

I was doing some research on multi-node interconnects, and a PCIe network could be an option (I saw Dolphin adapters doing 50 Gbit/s with larger 4K messages), or going with InfiniBand, which can already reach over 100 Gbit/s.

Regards,

Ladislav


dkourilov commented 7 years ago

Hey guys,

I think this thread is worth mentioning http://www.bitfusion.io/ and the usage example at http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-ec2-mapd.html

I'm not affiliated with either of them; Bitfusion does GPU virtualisation. Just my 2 cents.