chrislusf / glow

Glow is an easy-to-use distributed computation system written in Go, similar to Hadoop MapReduce, Spark, Flink, Storm, etc. I am also working on another similar pure Go system, https://github.com/chrislusf/gleam , which is more flexible and more performant.

managing clusters and all that #11

Closed joeblew99 closed 8 years ago

joeblew99 commented 8 years ago

Just use Consul: http://blog.scottlowe.org/2015/02/06/quick-intro-to-consul/

It incorporates a lot of what you need and is written in Go. I use it for microservices, and for this project it's great because it will tell you the address of all the things you depend on too, like the file server or MongoDB or any other data source or sink.
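For example, here is a minimal lookup against Consul's catalog HTTP API; the local agent address and the "mongodb" service name are just placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// catalogService matches the fields we need from Consul's
// /v1/catalog/service/<name> response.
type catalogService struct {
	ServiceAddress string
	ServicePort    int
}

func main() {
	// Ask the local Consul agent where "mongodb" instances are running.
	resp, err := http.Get("http://localhost:8500/v1/catalog/service/mongodb")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var services []catalogService
	if err := json.NewDecoder(resp.Body).Decode(&services); err != nil {
		panic(err)
	}
	for _, s := range services {
		fmt.Printf("mongodb at %s:%d\n", s.ServiceAddress, s.ServicePort)
	}
}
```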

chrislusf commented 8 years ago

Consul is nice. It seems to handle service discovery and key-value storage, kind of like ZooKeeper.

The current "leader" component also does resource allocation for each driver program. If we used Consul or something similar, resource allocation would still need its own separate component.

One of the goals is to keep the number of components minimal and put the logic where it belongs.

In the future, Glow will integrate with Consul, ZooKeeper, Mesos, YARN, etc. But for now, let's build the system the way it should be, instead of trying to fit into other projects' existing APIs.

joeblew99 commented 8 years ago

I have used ZooKeeper and Mesos before on other ML projects. Consul is Go and lightweight. Please consider it going forward.

I am getting bogged down with reading data ranges from data sources. For example, I have 4 TB of data on a SAN, so the whole cluster can see it as a data source. Each compute node needs to request a data range out of the whole range. You know what I mean? Any thoughts?

Are you planning to build a driver for each data source type? S3 is easy, because it just uses HTTP range requests. SQL too, because it's just pagination.
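For the S3 case, the idea is just a standard HTTP Range header; a minimal sketch, where the URL and byte bounds are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchRange reads bytes [start, end] of an object via an HTTP Range header,
// the same mechanism S3 exposes for partial reads.
func fetchRange(url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// 206 Partial Content means the server honored the range.
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("range not honored: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```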

chrislusf commented 8 years ago

One executor can divide the data into blocks and map to


chrislusf commented 8 years ago

(last email was sent by mistake)

To read the data source, you can give the data location to a mapper. The mapper can divide the data location into data ranges where possible and partition those ranges, and the following mapper can then fetch one range on one executor.
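As a rough sketch of that partitioning step (byteRange and splitIntoRanges are hypothetical helpers, not Glow API):

```go
package main

import "fmt"

// byteRange is one contiguous slice of the source, destined for one executor.
type byteRange struct{ Start, End int64 }

// splitIntoRanges divides totalSize bytes into n contiguous ranges.
func splitIntoRanges(totalSize int64, n int) []byteRange {
	ranges := make([]byteRange, 0, n)
	step := totalSize / int64(n)
	for i := 0; i < n; i++ {
		start := int64(i) * step
		end := start + step - 1
		if i == n-1 {
			end = totalSize - 1 // the last range absorbs the remainder
		}
		ranges = append(ranges, byteRange{start, end})
	}
	return ranges
}

func main() {
	// e.g. 4 TB split across 16 executors
	for _, r := range splitIntoRanges(4<<40, 16) {
		fmt.Printf("bytes=%d-%d\n", r.Start, r.End)
	}
}
```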

The adapters for data sources would vary a lot. I am thinking of putting them into an external package. Ideas/pull requests are welcome!

Chris


joeblew99 commented 8 years ago

OK, that makes sense. I just need to find it in the code. The reflection abstraction still makes it a bit tough. Any examples or links to code would be awesome. I am still playing with the current examples.

I am happy to contribute drivers. For file-based storage, maybe SeaweedFS? It's your baby and would make a great filesystem for this.

For a DB, maybe CockroachDB or a simple KV store. CockroachDB used to have a KV API, but they removed it. For me, I would like a very simple DB store, because I want to run it on mobile; I already run other Go code on iOS and Android.

Again, links into the code would really help, so I can see where to start.

chrislusf commented 8 years ago

You should not need to write any reflection code.

Here is some pseudo code for HDFS.

```go
// Pseudo code: some_func is a placeholder for listing the block files
// under hdfsLocation.
func AddHdfsFile(f flow.Flow, hdfsLocation string) flow.Dataset {
	blockList := some_func(hdfsLocation)
	return f.Slice(blockList).Map(func(blockLocation string, lines chan string) {
		// read the data at blockLocation and feed each line into the "lines" channel
	})
}
```
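A hypothetical call site for that sketch, assuming flow.New() and the usual Run() from the existing examples; the HDFS path is a placeholder and none of this is a confirmed adapter API:

```go
func main() {
	f := flow.New()
	AddHdfsFile(f, "hdfs://namenode/data/input").Map(func(line string) {
		println(line) // consume each line
	}).Run()
}
```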


chrislusf commented 8 years ago

Added an HDFS example.