Running ATM on a cluster

beevabeeva commented 5 years ago

Hi. Sorry if I missed it in the docs or README, but I can't find any details about running ATM on a local cluster. Do I have to implement this myself using something like Apache Spark?

Thanks

csala commented 5 years ago

Hi @beevabeeva

The current ATM version is already prepared to run as a cluster, but setting it up is currently the user's responsibility.

All you have to do to get a cluster running is start multiple worker instances. These worker instances can all be on the same machine or spread across different machines; the only requirements are:

All the machines need to have access to the database being used.
All the machines need to access the data in the same way, either through a shared filesystem mounted at the same path on every machine or by using an S3 bucket as the dataset source.
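
For the shared-filesystem case, one way to check the second requirement is to run a small pre-flight check on every machine before starting any workers. This is only a sketch: `check_dataset_path` is a hypothetical helper, not part of ATM, and it does not cover the S3 or database requirements.

```shell
# Hypothetical pre-flight check (not part of ATM): confirm that the dataset
# is readable at the expected path on the machine where this runs.
check_dataset_path() {
    if [ -r "$1" ]; then
        echo "ok: $1 is readable on $(uname -n)"
    else
        echo "missing: $1 is not readable on $(uname -n)" >&2
        return 1
    fi
}
```

Calling it with the same dataset path on each machine (for example `check_dataset_path /path/to/your/dataset.csv`, where the path is a placeholder) should print an `ok:` line everywhere; a `missing:` line points to a mount that is absent or mounted at a different path.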

For example, if you just want to start a cluster with 4 workers on your local machine, all you need to do is run the following two commands:

atm enter_data ...your enter_data options here..
for i in {1..4}; do atm worker ..your worker options here.. > /dev/null & done

The first command enters your data as usual, and the second starts 4 workers as background processes. Their output is redirected to /dev/null to avoid cluttering your console; you can still find their logs in the logs/{your hostname}.txt file.
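
If you also want to stop the workers later, the one-line loop can be wrapped in a pair of small helpers that remember the background PIDs. This is a minimal sketch, not something ATM provides: `start_workers`, `stop_workers` and `WORKER_PIDS` are hypothetical names, and unlike the command above it discards stderr as well as stdout.

```shell
# Hypothetical helpers (not part of ATM): start COUNT copies of a command
# in the background and remember their PIDs so they can be stopped later.
WORKER_PIDS=""

# Usage: start_workers COUNT COMMAND [ARGS...]
start_workers() {
    _count=$1
    shift
    _i=0
    while [ "$_i" -lt "$_count" ]; do
        # Discard both stdout and stderr of each background worker.
        "$@" > /dev/null 2>&1 &
        WORKER_PIDS="$WORKER_PIDS $!"
        _i=$((_i + 1))
    done
}

# Send SIGTERM to every tracked worker, then wait for each one to exit.
stop_workers() {
    for _pid in $WORKER_PIDS; do
        kill "$_pid" 2> /dev/null || true
    done
    for _pid in $WORKER_PIDS; do
        wait "$_pid" 2> /dev/null || true
    done
    WORKER_PIDS=""
}
```

After sourcing these functions, `start_workers 4 atm worker ..your worker options here..` would launch the same 4 workers as the loop above, and `stop_workers` would terminate them from the same shell session.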

I hope this helps!

csala commented 5 years ago

Also see #130, which will make cluster management much easier once done.

csala commented 5 years ago

Closed via #133

HDI-Project / ATM

Running ATM on a cluster #128